Volume 3 Supplement 7

Genetic Analysis Workshop 16

Linkage disequilibrium of single-nucleotide polymorphism data: how sampling methods affect estimates of linkage disequilibrium

Abstract

Linkage disequilibrium (LD) is an important measure used in the analysis of single-nucleotide polymorphism (SNP) data. We used the Genetic Analysis Workshop 16 (GAW16) Framingham Heart Study 500 k SNP data to explore the effect of sampling methods on the estimation of LD for SNP data.

Method and data

We found 332 trios in the GAW16 Framingham SNP data. Repeated random samples of trios and of independent individuals, at several sample sizes, were drawn without replacement from these 332 trios. For each sample, LD was calculated with the Haploview program for the chromosome 1 SNP data. The percentages of SNP pairs with D' > 0.8 and with r2 > 0.8 were calculated for different distance bins from the Haploview output. The results are summarized by sample size and sampling method to give an overall view of how both affect LD estimation.

Results

The trio design gave stable estimates. A sample of 30 to 40 trios gave estimates of the percent of LD > 0.8 very close to those from all 332 trios. When independent individuals were used, the estimates were less stable and differed from those obtained from the 332 trios for both D' and r2, with larger differences for D'.

Conclusion

Our results suggest that the trio design gives a stable estimate of LD. Therefore, it may be more suitable for LD analysis than a design using independent individuals. We must be cautious when comparing LD estimates from trios with those from independent individuals.

Background

Linkage disequilibrium (LD) is an important measure in analyzing single-nucleotide polymorphism (SNP) data. It helps us to understand the evolutionary history of humans and other organisms. A thorough understanding of LD also helps us to better design and analyze studies of SNP-disease associations.

Two major sampling methods are used to estimate LD. One uses trios, each consisting of a pair of parents and one child. The other uses independent individuals. Both methods use an expectation-maximization algorithm to estimate haplotypes and then use the estimated haplotypes in the LD calculation. However, the trio design also uses the parents' SNP genotypes to infer the haplotypes of the child. It is possible that this additional information makes the estimated haplotypes more accurate. Fallin and Schork [1] examined the accuracy of haplotype frequency estimation from unphased diploid genotype data collected from individuals and concluded that the estimated haplotype frequencies are quite accurate. When summarizing their findings, they pointed out that "any statistical-inference procedures that make use of haplotype frequency estimates demands independent attention."
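The expectation-maximization step for independent individuals can be illustrated for the simplest case of two biallelic SNPs. The following is a minimal sketch, not the Haploview implementation: the function names, the 0/1/2 genotype coding, and the convention of returning 0 for monomorphic SNPs are all assumptions made for illustration. Only double heterozygotes have ambiguous phase, so the E-step splits them between the two possible haplotype pairs, and D' and r2 are then computed from the estimated haplotype frequencies.

```python
def em_haplotype_freqs(genotypes, max_iter=200, tol=1e-12):
    """Two-SNP haplotype frequencies from unphased genotypes via EM.

    genotypes: list of (g1, g2); each value is the number of copies (0, 1, 2)
    of the allele coded '1' at SNP 1 and SNP 2 (hypothetical coding).
    Returns {(a1, a2): frequency} for the four possible haplotypes.
    """
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    known = dict.fromkeys(haps, 0.0)  # haplotype counts with unambiguous phase
    n_double_het = 0                  # only (1, 1) genotypes have unknown phase
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_double_het += 1
            continue
        # At least one SNP is homozygous, so both haplotypes are determined.
        known[(int(g1 >= 1), int(g2 >= 1))] += 1
        known[(int(g1 == 2), int(g2 == 2))] += 1

    n_haps = 2 * len(genotypes)
    freq = dict.fromkeys(haps, 0.25)
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is 11/00 rather than 10/01.
        cis = freq[(1, 1)] * freq[(0, 0)]
        trans = freq[(1, 0)] * freq[(0, 1)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        expected = dict(known)
        expected[(1, 1)] += n_double_het * w
        expected[(0, 0)] += n_double_het * w
        expected[(1, 0)] += n_double_het * (1 - w)
        expected[(0, 1)] += n_double_het * (1 - w)
        # M-step: renormalize expected counts into frequencies.
        new_freq = {h: expected[h] / n_haps for h in haps}
        change = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if change < tol:
            break
    return freq


def ld_measures(freq):
    """Return (|D'|, r2) from two-SNP haplotype frequencies."""
    p_a = freq[(1, 1)] + freq[(1, 0)]  # frequency of allele '1' at SNP 1
    p_b = freq[(1, 1)] + freq[(0, 1)]  # frequency of allele '1' at SNP 2
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom <= 0:
        return 0.0, 0.0  # assumed convention when a SNP is monomorphic
    d = freq[(1, 1)] - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max, d * d / denom
```

In a trio design, the parental genotypes resolve the phase of many double-heterozygous children directly, which leaves fewer genotypes to the E-step.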

In November 2004, the HapMap Project [2] released whole-genome SNP genotype data for four major populations: 30 trios for Caucasian, 30 trios for African, 45 independent individuals for Chinese, and 45 independent individuals for Japanese. These data are widely used in the genetic research communities to assess LD in these four populations and for other purposes. For example, Bonnen et al. [3] used these four HapMap populations and a sample of 30 trios from Kosrae to compare the LD patterns of these populations and to assess the feasibility of whole-genome association studies using the Kosrae population. This year, the HapMap Project released its HapMap 3 SNP data, which include 90 to 180 persons in each of 11 populations. Some populations have both trios and independent individuals, some have only trios, and the Chinese and Japanese populations have only independent individuals.

In the near future, HapMap 3 will be one of the major SNP data sources for the genetic research communities to assess the LD pattern in different populations. We therefore have to ask questions concerning the use of different designs (trios or independent individuals) and sample sizes in LD estimation. For example, do trios and independent individuals offer comparable estimates of LD? How does sample size affect the estimate of LD? There are no answers to these questions in the literature. Therefore, we used the Genetic Analysis Workshop 16 (GAW16) Framingham Heart Study 500 k SNP data to examine the effects of sampling method and sample size on the estimate of LD for SNP data.

Methods

GAW16 Framingham Heart Study data

The GAW16 Framingham Heart Study 500 k SNP data set has genotype data for more than 6500 individuals from three cohorts of the Framingham Heart Study. Based on the family structure offered by the workshop, we identified 332 trios with SNP genotype data. All of these 332 trios are either from different families, or from the same family but share no common ancestors. Therefore, we can assume that they are independent.

Analysis

Based on these 332 trios, 80 random samples of trios were drawn without replacement for sample sizes of 25, 30, 40, 50, 100, 200, and 300. We also used the offspring in these trios to form a pool of 332 independent individuals. Eighty random samples of independent individuals for sample sizes of 25, 30, 40, 50, 100, 200, and 300 were drawn without replacement from this pool. The Haploview program [4] was used to calculate the LD measures D' and r2 for each random sample for all SNP pairs within a distance of 1000 kb on chromosome 1. We also calculated D' and r2 for SNP pairs for the 332 trios in the same way. The percent of SNP pairs with D' > 0.8 and with r2 > 0.8 was then calculated for different distance bins for each sample. The mean, standard deviation, and range of the percent of LD measures greater than 0.8 among the repeated random samples were calculated for each sample size for trios and for independent individuals. These results were compared with those obtained from the 332 trios. This gives us a picture of how the sample size and sampling design (trios or independent individuals) affect the estimate of LD.
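The resampling and summary steps described above can be sketched as follows. This is an illustrative outline only, not the code used in the study: the function names, the fixed random seed, and the assumed input format of (distance in kb, LD value) pairs are hypothetical, and the LD values themselves would come from the Haploview output for each sample.

```python
import random
import statistics


def draw_samples(unit_ids, sample_size, n_samples=80, seed=0):
    """Draw repeated random samples; each sample is drawn without replacement."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    return [rng.sample(unit_ids, sample_size) for _ in range(n_samples)]


def percent_above_by_bin(pairs, threshold=0.8, bin_kb=5, max_kb=1000):
    """Percent of SNP pairs with LD above the threshold, per distance bin.

    pairs: iterable of (distance_kb, ld_value), where ld_value is D' or r2.
    Returns {bin_start_kb: percent} for pairs closer than max_kb.
    """
    totals, hits = {}, {}
    for distance_kb, ld_value in pairs:
        if distance_kb >= max_kb:
            continue
        bin_start = int(distance_kb // bin_kb) * bin_kb
        totals[bin_start] = totals.get(bin_start, 0) + 1
        if ld_value > threshold:
            hits[bin_start] = hits.get(bin_start, 0) + 1
    return {b: 100.0 * hits.get(b, 0) / totals[b] for b in sorted(totals)}


def summarize(percents):
    """Mean, standard deviation, and range of one bin's percent across samples."""
    return {
        "mean": statistics.mean(percents),
        "sd": statistics.stdev(percents),
        "range": (min(percents), max(percents)),
    }


# Example: 80 samples of 30 trios from a pool of 332 trio identifiers.
trio_samples = draw_samples(list(range(332)), 30)
```

Each sampled set of trios (or offspring) would be passed to the LD calculation, and `summarize` would then be applied per distance bin across the 80 resulting percentages.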

Results

The results are presented in four tables. Table 1 presents the percent of r2 > 0.8 for different distance bins for the trio design. Table 2 presents the percent of r2 > 0.8 for different distance bins for the independent individuals design. Tables 3 and 4 present the percent of D' > 0.8 for different distance bins for the trio design and the independent individuals design, respectively.

Table 1 Percent of r2 > 0.8 of 80 random samples of trios by sample size, compared to percent of r2 > 0.8 for the 332 trios
Table 2 Percent of r2 > 0.8 of the 80 random samples of independent individuals by sample size, compared to percent of r2 > 0.8 for the 332 trios
Table 3 Percent of D' > 0.8 of 80 random samples of trios by sample size, compared to percent of D' > 0.8 for the 332 trios
Table 4 Percent of D' > 0.8 of 80 random samples of independent individuals by sample size, compared to percent of D' > 0.8 for the 332 trios

The length of chromosome 1 is about 247 Mb and there are 39,936 SNPs in this dataset on this chromosome. The numbers of SNP pairs in each distance bin, which would be used to calculate LD, are very large. For every distance bin of 5 kb, there are more than 30,000 SNP pairs used in the calculation for the percent of LD > 0.8, which ensures that the results from our analysis are not due to the effects of small sample sizes.
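The figure of more than 30,000 SNP pairs per 5-kb bin is consistent with a rough back-of-the-envelope check. The sketch below assumes that SNPs are spaced uniformly along the chromosome and ignores edge effects, neither of which holds exactly for real data; the function name is hypothetical.

```python
def expected_pairs_per_bin(n_snps, chrom_length_bp, bin_width_bp):
    """Approximate number of SNP pairs whose distance falls in one bin.

    Assumes uniform SNP spacing: each SNP then has about
    bin_width / spacing partners at any given distance range.
    """
    spacing_bp = chrom_length_bp / n_snps  # about 6.2 kb for these data
    return n_snps * bin_width_bp / spacing_bp


# Values from the text: 39,936 SNPs on about 247 Mb, bins of 5 kb.
approx_pairs = expected_pairs_per_bin(39936, 247e6, 5000)
```

Under these assumptions the estimate is on the order of 32,000 pairs per bin, in line with the count reported above.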

Our results show that trios gave quite stable results. In this setting, a sample size of 30 or 40 would give estimates of the percent of both D' > 0.8 and r2 > 0.8 for different distance bins very close to those obtained from the 332 trios. However, when independent individuals were used, the estimates were not so stable. When sample size increases, the difference between the estimates from the independent individuals and those from the 332 trios increases. Figure 1 demonstrates this fact. This figure shows the percent of D' > 0.8 estimated in data sets with 40, 100, or 200 independent individuals, and 332 trios. A similar phenomenon exists for the estimated percent of r2 > 0.8, but on a smaller scale.

Figure 1 Mean of estimated percent of D' > 0.8 for different distance bins from 80 samples of independent individuals and estimated percent of D' > 0.8 from the 332 trios.

Discussion

The two major methods to estimate LD are based on samples of trios and samples of independent individuals. HapMap collected both types of data, used them to estimate LD, and established LD maps for different populations. The LD maps help us to plan effective SNP association studies, to locate disease genes efficiently, and to estimate some parameters of population genetics. These data are also used to compare LD patterns across populations.

However, our results show that LD estimates from independent individuals are not as stable as those from trios. Also, the estimated percent of LD > 0.8 based on independent individuals differs from that based on trios. Therefore, we conclude that the estimates from these two designs are not completely comparable under such circumstances. When comparing LD estimates from these two sampling methods, we must take this difference into consideration.

There may be some fundamental issues about how to estimate haplotypes and how to use the estimated haplotypes to calculate LD, especially when independent individuals are used. Because of the limited information, we can only estimate the probability of each possible haplotype for each individual. There are two approaches to using these estimates. One approach is to assign the haplotype with the highest probability to the individual. The other approach is to keep the estimated probability of each possible haplotype and use these probabilities in the subsequent calculation. These two approaches will give different answers in estimating the LD pattern, especially considering that the first approach results in a loss of information. There is no information on how the haplotypes are used in the calculation of LD in Haploview, so we cannot directly address this issue.
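The difference between the two approaches can be made concrete with a small sketch. It assumes that each individual's haplotype estimate is available as a set of posterior probabilities over possible haplotype pairs; the function names, the data layout, and the toy probabilities are hypothetical, and nothing here reflects how Haploview handles this step.

```python
from collections import Counter


def _normalize(counts):
    """Convert haplotype counts into frequencies."""
    total = sum(counts.values())
    return {hap: count / total for hap, count in counts.items()}


def freqs_best_guess(posteriors):
    """Approach 1: keep only each individual's most probable haplotype pair."""
    counts = Counter()
    for posterior in posteriors:
        best_pair = max(posterior, key=posterior.get)
        for hap in best_pair:
            counts[hap] += 1.0
    return _normalize(counts)


def freqs_weighted(posteriors):
    """Approach 2: weight every possible haplotype pair by its probability."""
    counts = Counter()
    for posterior in posteriors:
        for pair, prob in posterior.items():
            for hap in pair:
                counts[hap] += prob
    return _normalize(counts)


# Toy data: one double heterozygote with uncertain phase, one unambiguous homozygote.
toy_posteriors = [
    {("AB", "ab"): 0.7, ("Ab", "aB"): 0.3},
    {("AB", "AB"): 1.0},
]
best_guess = freqs_best_guess(toy_posteriors)
weighted = freqs_weighted(toy_posteriors)
```

In this toy example the best-guess approach assigns no frequency at all to the Ab and aB haplotypes, whereas the weighted approach retains them at low frequency, so any LD measure computed from the two sets of frequencies will differ.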

Generally speaking, the trio design, which uses the parents' genotypes to infer the offspring haplotypes, should give more accurate results. One possible cause of the unstable estimates based on independent individuals is the approach used to estimate haplotypes in the LD calculation. Further investigation should be conducted to examine the underlying causes.

Conclusion

Our results suggest that a trio design is more suitable than a design using independent individuals for estimating LD. When independent individuals are used, the estimated percents of D' > 0.8 and r2 > 0.8 are not stable. The estimates from the trio design and those from independent individuals are not fully comparable. Caution should be used when comparing LD patterns between a group of independent individuals and a group of trios.

Abbreviations

GAW16: Genetic Analysis Workshop 16

LD: Linkage disequilibrium

SNP: Single-nucleotide polymorphism

References

  1. Fallin D, Schork NJ: Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet. 2000, 67: 947-959. 10.1086/303069.

  2. Thorisson GA, Smith AV, Krishnan L, Stein LD: The International HapMap Project web site. Genome Res. 2005, 15: 1591-1593. 10.1101/gr.4413105.

  3. Bonnen PE, Pe'er I, Plenge RM, Salit J, Lowe JK, Shapero MH, Lifton RP, Breslow JL, Daly MJ, Reich DE, Jones KW, Stoffel M, Altshuler D, Friedman JM: Evaluating potential for whole-genome studies in Kosrae, an isolated population in Micronesia. Nat Genet. 2006, 38: 214-217. 10.1038/ng1712.

  4. Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21: 263-265. 10.1093/bioinformatics/bth457.

Acknowledgements

The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. QH and BJW are supported by NIH grant 5R01AG027060-04 from the National Institute on Aging.

This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.

Author information

Corresponding author

Correspondence to Qimei He.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

QH and BJW played an equal role in designing the study and writing the manuscript. QH conducted the analysis.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

He, Q., Willcox, B.J. Linkage disequilibrium of single-nucleotide polymorphism data: how sampling methods affect estimates of linkage disequilibrium. BMC Proc 3 (Suppl 7), S105 (2009). https://doi.org/10.1186/1753-6561-3-S7-S105
