Linkage disequilibrium of single-nucleotide polymorphism data: how sampling methods affect estimates of linkage disequilibrium

Linkage disequilibrium (LD) is an important measure used in the analysis of single-nucleotide polymorphism (SNP) data. We used the Genetic Analysis Workshop 16 (GAW16) Framingham Heart Study 500 k SNP data to explore the effect of sampling methods on the estimation of LD for SNP data.
Method and data: We identified 332 trios in the GAW16 Framingham SNP data. Repeated random samples of trios and of independent individuals, of different sizes, were drawn without replacement from these 332 trios. For each sample, LD was calculated for the chromosome 1 SNP data using the Haploview program. The percentages of SNP pairs with D' > 0.8 and with r² > 0.8 were calculated for different distance bins from the Haploview output. The results were summarized by sample size and sampling method to give an overall view of how both affect LD estimation.
Results: The trio design gave stable estimates. A sample of 30 to 40 trios gave estimates of the percentage of LD > 0.8 very close to those from all 332 trios. When independent individuals were used, the estimates were less stable and differed from those obtained from the 332 trios for both D' and r², with larger differences for D'.
Conclusion: Our results suggest that the trio design gives a stable estimate of LD and may therefore be more suitable for LD analysis than a design using independent individuals. Caution is needed when comparing LD estimates from trios with those from independent individuals.


Background
Linkage disequilibrium (LD) is an important measure in analyzing single-nucleotide polymorphism (SNP) data. It helps us to understand the evolutionary history of humans and other organisms. A thorough understanding of LD also helps us to better design and analyze studies of SNP-disease associations.
Two major sampling methods are used in the estimation of LD. One uses trios, each consisting of a pair of parents and one child; the other uses independent individuals. Both methods use an expectation-maximization algorithm to estimate haplotypes and then use the estimated haplotypes in the LD calculation. However, the trio design also uses the parents' SNP genotypes to infer the haplotypes of the child, and this additional information may make the estimated haplotypes more accurate. Fallin and Schork [1] examined the accuracy of haplotype frequency estimation from unphased diploid genotype data collected from individuals and concluded that the estimated haplotype frequencies are quite accurate. When summarizing their findings, however, they pointed out that "any statistical-inference procedures that make use of haplotype frequency estimates demands independent attention."
In November 2004, the HapMap Project [2] released whole-genome SNP genotype data for four major populations: 30 Caucasian trios, 30 African trios, 45 independent Chinese individuals, and 45 independent Japanese individuals. These data are widely used in the genetic research community to assess LD in these four populations and for other purposes. For example, Bonnen et al. [3] used these four HapMap populations and a sample of 30 trios from Kosrae to compare the LD patterns of these populations and to assess the feasibility of whole-genome association studies in the Kosrae population. This year, the HapMap Project released its HapMap 3 SNP data, which include 90 to 180 persons in each of 11 populations. Some populations have both trios and independent individuals, some have only trios, and the Chinese and Japanese populations have only independent individuals.
In the near future, HapMap 3 will be one of the major SNP data sources used by the genetic research community to assess LD patterns in different populations. This raises questions about the use of different designs (trios or independent individuals) and sample sizes in LD estimation. For example, do trios and independent individuals give compatible estimates of LD? How does sample size affect the estimate of LD? We found no answers to these questions in the literature. Therefore, we used the Genetic Analysis Workshop 16 (GAW16) Framingham Heart Study 500 k SNP data to examine the effects of sampling method and sample size on the estimation of LD for SNP data.

GAW16 Framingham Heart Study data
The GAW16 Framingham Heart Study 500 k SNP data set has genotype data for more than 6500 individuals from three cohorts of the Framingham Heart Study. Based on the family structure provided by the workshop, we identified 332 trios with SNP genotype data. These 332 trios come either from different families or from the same family without sharing common ancestors. Therefore, we can assume that they are independent.

Analysis
From these 332 trios, 80 random samples of trios were drawn without replacement for each of the sample sizes 25, 30, 40, 50, 100, 200, and 300. We also used the offspring in these trios to form a pool of 332 independent individuals. Eighty random samples of independent individuals were drawn without replacement from this pool for each of the same sample sizes. The Haploview program [4] was used to calculate the LD measures D' and r² for each random sample, for all SNP pairs within a distance of 1000 kb on chromosome 1. We also calculated D' and r² for the SNP pairs in all 332 trios in the same way. The percentages of SNP pairs with D' > 0.8 and with r² > 0.8 were then calculated for different distance bins for each sample. The mean, standard deviation, and range of these percentages among the repeated random samples were calculated for each sample size, separately for trios and for independent individuals. These results were compared with those obtained from the 332 trios. This gives us a picture of how the sample size and sampling design (trios or independent individuals) affect the estimate of LD.
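As a rough illustration of the summary described above, the following Python sketch computes |D'| and r² for one SNP pair from haplotype and allele frequencies, tabulates the percentage of pairs above 0.8 by distance bin, and draws repeated samples without replacement. It is not the Haploview implementation. The function names, the 100-kb bin width, the random seed, and the placeholder identifiers are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def ld_measures(p_ab, p_a, p_b):
    """Return (|D'|, r^2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2
    p_a, p_b: frequencies of alleles A and B
    """
    d = p_ab - p_a * p_b
    # D' scales D by its largest attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2


def percent_above_by_bin(pairs, bin_width_kb=100, threshold=0.8):
    """Percentage of SNP pairs whose LD value exceeds `threshold`, per distance bin.

    pairs: iterable of (distance_kb, ld_value)
    Returns {bin_start_kb: percent}. The bin width is an assumed value.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for distance_kb, value in pairs:
        bin_start = int(distance_kb // bin_width_kb) * bin_width_kb
        totals[bin_start] += 1
        if value > threshold:
            hits[bin_start] += 1
    return {b: 100.0 * hits[b] / totals[b] for b in sorted(totals)}


def draw_samples(pool, size, n_rep=80, seed=0):
    """Draw n_rep random samples, each without replacement, from the pool."""
    rng = random.Random(seed)
    return [rng.sample(pool, size) for _ in range(n_rep)]


# Placeholder identifiers standing in for the 332 trios.
trio_ids = list(range(332))
samples = draw_samples(trio_ids, size=25)
```

In such a workflow, each sample would be passed to the LD software, and the per-bin percentages would then be summarized by mean, standard deviation, and range across the 80 replicates.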

Discussion
The two major methods of estimating LD are based on samples of trios and on samples of independent individuals. HapMap collected both types of data and used them to estimate LD and to establish LD maps for different populations. The LD maps help us to plan effective SNP association studies, to locate disease genes efficiently, and to estimate some parameters of population genetics. These data are also used to compare LD patterns across populations.
However, our results show that the LD estimated from independent individuals is not as stable as that estimated from trios. In addition, the estimated percentage of LD > 0.8 based on independent individuals differs from that based on trios. Therefore, we conclude that the estimates from these two designs are not completely comparable. When comparing LD estimates from these two sampling methods, we must take this difference into consideration.
There may be some fundamental issues about how to estimate haplotypes and how to use the estimated haplotypes to calculate LD, especially when independent individuals are used. Because of the limited information, we can only estimate the probability of each possible haplotype for each individual, and there are two approaches to using these estimates. One approach is to assign to each individual the haplotype with the highest probability. The other approach is to keep the estimated probability of each possible haplotype and use these probabilities in the subsequent calculation. The two approaches will give different answers when estimating the LD pattern, especially because the first approach results in a loss of information. There is no information on how the haplotypes are used in the calculation of LD in Haploview, so we cannot directly address this issue.
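The difference between the two approaches can be illustrated with a minimal two-SNP sketch in Python. It estimates haplotype frequencies from unphased genotypes with a simple expectation-maximization loop, then computes r² either from the probability-weighted frequencies or after assigning every double heterozygote its most probable phase. This is a textbook-style simplification, not the algorithm used by Haploview; the genotype counts are invented toy data, and all function names are assumptions.

```python
def em_two_snp(genotypes, n_iter=200):
    """EM estimate of two-SNP haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), each the count (0, 1, or 2) of allele 1.
    Returns ([f00, f01, f10, f11], w), where w is the estimated probability
    that a double heterozygote carries haplotypes 00 and 11 (rather than 01 and 10).
    """
    base = [0.0] * 4  # haplotype counts from unambiguous genotypes
    n_dh = 0          # number of double heterozygotes (phase unknown)
    for g1, g2 in genotypes:
        if (g1, g2) == (1, 1):
            n_dh += 1
        else:
            base[2 * (g1 // 2) + g2 // 2] += 1
            base[2 * ((g1 + 1) // 2) + (g2 + 1) // 2] += 1
    total = 2 * len(genotypes)
    freqs, w = [0.25] * 4, 0.5
    for _ in range(n_iter):
        cis, trans = freqs[0] * freqs[3], freqs[1] * freqs[2]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5   # E-step
        counts = [base[0] + n_dh * w, base[1] + n_dh * (1 - w),
                  base[2] + n_dh * (1 - w), base[3] + n_dh * w]
        freqs = [c / total for c in counts]                   # M-step
    return freqs, w


def best_guess_freqs(genotypes, w):
    """Assign every double heterozygote its most probable phase, then count."""
    hard_w = 1.0 if w >= 0.5 else 0.0
    counts = [0.0] * 4
    for g1, g2 in genotypes:
        if (g1, g2) == (1, 1):
            for idx, share in ((0, hard_w), (3, hard_w), (1, 1 - hard_w), (2, 1 - hard_w)):
                counts[idx] += share
        else:
            counts[2 * (g1 // 2) + g2 // 2] += 1
            counts[2 * ((g1 + 1) // 2) + (g2 + 1) // 2] += 1
    total = 2 * len(genotypes)
    return [c / total for c in counts]


def r_squared(freqs):
    """r^2 from haplotype frequencies [f00, f01, f10, f11]."""
    f00, f01, f10, f11 = freqs
    p_a, p_b = f10 + f11, f01 + f11
    d = f00 * f11 - f01 * f10
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else 0.0


# Invented toy data: 20 individuals, 6 of them double heterozygotes.
toy = [(0, 0)] * 6 + [(2, 2)] * 4 + [(0, 1)] * 2 + [(1, 0)] * 2 + [(1, 1)] * 6
weighted, w = em_two_snp(toy)
best = best_guess_freqs(toy, w)
r2_weighted, r2_best = r_squared(weighted), r_squared(best)
```

On this toy data the best-guess assignment gives a larger r² than the probability-weighted calculation, because rounding the phase probability pushes all ambiguous individuals toward the more common configuration.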
Generally speaking, the trio design, which uses the parents' genotypes to infer the offspring's haplotypes, should give more accurate results. One possible cause of the unstable estimates based on independent individuals is the approach used to estimate and use haplotypes in the LD calculation. Further investigation should be conducted to examine the underlying causes.
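To show why parental genotypes can help, the following Python sketch enumerates the haplotype pairs a child could carry at two SNPs given the unphased genotypes of the trio, assuming no genotyping error and no recombination between the SNPs. It is a simplified illustration only and does not describe how Haploview handles trios; the function names and genotype coding are assumptions.

```python
from itertools import product


def transmittable(genotype):
    """Haplotypes a parent with this unphased genotype could pass on.

    genotype: tuple of allele-1 counts (0, 1, or 2), one per SNP.
    """
    per_locus = [{0: (0,), 1: (0, 1), 2: (1,)}[g] for g in genotype]
    return set(product(*per_locus))


def child_diplotypes(father, mother, child):
    """Unordered haplotype pairs for the child consistent with the trio genotypes."""
    consistent = set()
    for hp in transmittable(father):
        for hm in transmittable(mother):
            if tuple(a + b for a, b in zip(hp, hm)) == tuple(child):
                consistent.add(tuple(sorted((hp, hm))))
    return consistent


# A doubly heterozygous child is ambiguous on its own, but a homozygous
# parent at both SNPs resolves the phase.
resolved = child_diplotypes(father=(2, 2), mother=(0, 1), child=(1, 1))
# If both parents are also doubly heterozygous, two phases remain possible.
ambiguous = child_diplotypes(father=(1, 1), mother=(1, 1), child=(1, 1))
```

When only one haplotype pair remains, the child's phase is known without any statistical estimation; when several remain, the ambiguity still has to be resolved by an estimation step.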

Conclusion
Our results suggest that a trio design is more suitable than a design using independent individuals for estimating LD. When independent individuals are used, the estimated percentages of D' > 0.8 and r² > 0.8 are not stable. The estimates from the trio design and those from independent individuals are not fully compatible. Caution should be used when comparing LD patterns between a group of independent individuals and a group of trios.