Power of association tests in the presence of multiple causal variants
© Di et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
We show that the statistical power of a single single-nucleotide polymorphism (SNP) score test for genetic association reflects the cumulative effect of all causal SNPs that are correlated with the test SNP. Statistical significance of a score test can sometimes be explained by the collective effect of weak correlations between the test SNP and multiple causal SNPs. In a finite population, weak but significant correlations between the test SNP and the causal SNPs can arise by chance alone. As a consequence, when a single-SNP score test shows significance, the causal SNPs contributing to the power of the test are not necessarily located near the test SNP, nor do they have to be in linkage disequilibrium with the test SNP. These findings are confirmed with the Genetic Analysis Workshop 17 mini-exome data. The findings of this study highlight the often overlooked importance of long-range and weak linkage disequilibrium in genetic association studies.
In a typical genome-wide association study, single single-nucleotide polymorphism (SNP) association tests, such as score tests [1, 2], are used to scan the genome for possible genotype-phenotype associations. When an association test shows significance, it is commonly expected that the detected association is due to either the direct genetic effect of the test SNP or linkage disequilibrium (LD) with causal SNPs located at nearby genomic positions [3].
In this study, we show that the causal SNPs contributing to the power of a single-SNP score test are not necessarily located near the test SNP, nor do they have to be in genuine LD with the test SNP. We term this phenomenon the hyper-LD effect. The hyper-LD effect is a consequence of weak correlations between the test SNP and multiple causal SNPs. Score tests performed at rare SNPs are particularly prone to this hyper-LD effect. In this study we highlight the often overlooked importance of weak correlations between distant SNPs in association studies.
We first derive formulas for computing the power of the score tests in the presence of multiple causal SNPs. We then give theoretical explanations of the hyper-LD effect. We focus our discussion on quantitative trait models, but, by using an argument similar to that in Appendix 1 of Chapman et al. [4], our results can be viewed as a reasonable approximation under logistic regression models for binary traits. For clarity, in this article, the term correlation refers to sample correlation in a finite population, and the term LD refers to the expected value of the sample correlation between alleles at different SNPs.
Quantitative trait model
We consider the following quantitative trait model:

$$Y = \mu + \sum_{k=1}^{K} \alpha_k Z_k + \sum_{j=1}^{J} \beta_j X_j + \varepsilon, \qquad (1)$$

where $Y$ is the vector of trait values; $\mu$ is a constant vector of baseline mean trait values; the vectors $Z_k$, $k = 1, \ldots, K$, represent measured covariates, such as age, sex, and an indicator of whether an individual is a smoker; the coefficients $\alpha_k$ represent the effects of the covariates on trait values; the $X_j$, $j = 1, \ldots, J$, are vectors of genotypes; the coefficients $\beta_j$ represent the allele effect sizes; and $\varepsilon$ is a vector of random errors. To model an additive genetic effect, the genotypes are coded as 0, 1, or 2 according to the number of minor alleles present. Furthermore, we assume that there are no gene-gene or gene-environment interactions and that all individuals in the study are truly unrelated.
Score tests of genetic association
The score test at a test SNP $\tau$ is based on a score statistic $u$ and its variance estimate $v$. Both are computed using $Z = (1, Z_1, \ldots, Z_K)$, where $1$ is a vector of ones, and $s_{YY}$, the sample variance of the residual trait values. The score statistic $u$ measures the covariance between the genotype vector $X_\tau$ and the trait value vector $Y$ after adjusting for measured covariates. If a covariate $Z_k$ is correlated with $X_\tau$, then the covariance between $X_\tau$ and $Y$ will decrease after adjusting for $Z_k$. The effect of covariate adjustment is also reflected in the variance estimate $v$ (see Section 6.3.2 of Bickel and Doksum [5] for more details). To evaluate the statistical evidence for genetic association, $u^2/v$ is compared to a chi-square distribution with 1 degree of freedom.
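The covariate adjustment described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes that $u$ is the inner product of the residualized genotype and trait vectors, that $v$ is $s_{YY}$ times the residual sum of squares of the genotypes, and that $s_{YY}$ uses the denominator $N - 1$. The function name `score_test` and the simulated data are hypothetical.

```python
import numpy as np

def score_test(y, x_test, covariates):
    """Single-SNP score test with covariate adjustment (illustrative sketch).

    Assumed forms: u = x_res' y_res and v = s_yy * x_res' x_res, where the
    residuals come from regressing on Z = (1, Z_1, ..., Z_K).
    """
    n = len(y)
    # Design matrix: a column of ones followed by the measured covariates.
    Z = np.column_stack([np.ones(n), covariates])
    # Residuals of the trait and of the test genotypes after regression on Z.
    y_res = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    x_res = x_test - Z @ np.linalg.lstsq(Z, x_test, rcond=None)[0]
    s_yy = (y_res @ y_res) / (n - 1)  # assumed denominator N - 1
    u = x_res @ y_res
    v = s_yy * (x_res @ x_res)
    return u, v, u ** 2 / v

# Hypothetical data: two covariates and one SNP with minor allele frequency 0.3.
rng = np.random.default_rng(0)
n = 200
covariates = rng.normal(size=(n, 2))
x = rng.binomial(2, 0.3, size=n).astype(float)
y = 0.5 * x + covariates @ np.array([1.0, -1.0]) + rng.normal(size=n)

u, v, chi2 = score_test(y, x, covariates)
print(u, v, chi2)  # chi2 is referred to a chi-square distribution with 1 df
```

Under these assumptions, $u^2/v$ equals $(N - 1)$ times the squared partial correlation between genotype and trait, which is why the factor $N - 1$ appears in the noncentrality parameter.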
Power of score tests in the presence of multiple causal SNPs
The noncentrality parameter of the score test at the test SNP $\tau$ is

$$(N - 1)\Big(\sum_{j} r_{\tau j} h_j\Big)^2, \qquad (6)$$

where $N$ is the number of individuals and $r_{\tau j}$ is the correlation coefficient between (the genotype vectors at) the test SNP $\tau$ and the causal SNP $j$, adjusted for the measured covariates. We refer to $h_j$ as the direct effect of SNP $j$; $h_j^2$ measures the proportion of the residual trait variance explained by the causal SNP $j$. The term $\sum_j r_{\tau j} h_j$ reflects the cumulative effect of all causal SNPs. Equation (6) extends a corresponding equation in Clayton et al. [2] to the case of multiple causal SNPs. Equation (6) can be extended to tests based on collapsing rare variants at multiple SNPs [6, 7] by letting $X_\tau$ be the sum or weighted sum of the genotype vectors at the collapsed SNPs.
2. In the presence of multiple causal SNPs, individually weak correlations (e.g., $r_{\tau j} = 0.1$) between the test SNP and the causal SNPs can collectively give rise to significant power in a score test. For example, 10 causal SNPs, each having a direct genetic effect $h_j = h$ and a correlation coefficient $r_{\tau j} = 0.1$ with the test SNP, would result in a noncentrality parameter of $(N - 1)h^2$, which is not 10 but 100 times greater than what it would be if there were only a single causal SNP with the same effect and correlation.
3. In a finite population, observed correlations between the test SNP and the causal SNPs can be due to either LD or random fluctuations. Between SNPs in complete LD, $r_{\tau j} = 1$. Between common SNPs that are in linkage equilibrium, $r_{\tau j}$ has an approximate normal distribution with mean 0 and variance $1/N$:

$$r_{\tau j} \sim \mathcal{N}(0, 1/N). \qquad (10)$$

This implies that even between SNPs in linkage equilibrium, about 1% of the $r_{\tau j}$ values will be greater than $2.33/\sqrt{N}$ (≈ 0.088 when $N = 697$) by chance. In a genome-wide association study, 1% corresponds to a large number of SNPs.
4. When either the causal SNP or the test SNP is a rare SNP (e.g., when fewer than 30 copies of the minor allele are present in the study population), the correlation coefficient $r_{\tau j}$ will have a positively skewed distribution and thus will be more prone to random fluctuations.
5. For a pair consisting of a common SNP 1 and a rare SNP 2 with the same allele effect size ($\beta_1 = \beta_2$), the common SNP will contribute more to the power of the score test performed at the rare SNP than vice versa ($r_{21} h_1 > r_{12} h_2$): the correlation coefficient is symmetric ($r_{12} = r_{21}$), but the common SNP can explain more trait variation ($h_1 > h_2$).
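The arithmetic behind the 100-fold increase can be checked with a short sketch. It assumes the noncentrality parameter has the form $(N - 1)(\sum_j r_{\tau j} h_j)^2$, which is the form consistent with the worked example in the text. The value $h = 0.1$, the significance level 0.05, and the function names are illustrative assumptions, not values from the study. The power formula uses the fact that a chi-square variable with 1 degree of freedom and noncentrality $\lambda$ is the square of a normal variable with mean $\sqrt{\lambda}$ and unit variance.

```python
from math import sqrt
from statistics import NormalDist

def noncentrality(n, r, h):
    """(N - 1) * (sum_j r_j * h_j)^2 for correlations r and direct effects h."""
    cumulative = sum(r_j * h_j for r_j, h_j in zip(r, h))
    return (n - 1) * cumulative ** 2

def power_1df(ncp, alpha=0.05):
    """Power of a 1-df chi-square test with noncentrality parameter ncp."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    root = sqrt(ncp)
    phi = NormalDist().cdf
    return phi(root - z) + phi(-root - z)

N = 697   # number of unrelated individuals in the GAW17 data
h = 0.1   # hypothetical direct effect, chosen only for illustration

one = noncentrality(N, [0.1], [h])            # a single causal SNP
ten = noncentrality(N, [0.1] * 10, [h] * 10)  # ten causal SNPs

print(ten / one)                         # close to 100
print(power_1df(one), power_1df(ten))    # power rises sharply with ten SNPs
```

The ratio does not depend on $N$ or $h$: the sum inside the square grows 10-fold, so the noncentrality parameter grows 100-fold.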
Findings 1 to 3 explain the causes of the hyper-LD effect: the phenomenon where the statistical significance of a score test can be explained by the collective effect of weak correlations between the test SNP and distant causal SNPs. Findings 4 and 5 explain why score tests performed at rare SNPs are particularly prone to the hyper-LD effect.
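The chance threshold quoted in finding 3 can be reproduced from the normal approximation for common SNPs in linkage equilibrium (mean 0, variance $1/N$). The sketch below is illustrative only; the function name `chance_threshold` is hypothetical, and the approximation is not expected to hold for rare SNPs (finding 4).

```python
from math import sqrt
from statistics import NormalDist

def chance_threshold(n, tail=0.01):
    """Correlation exceeded by chance with probability `tail`
    when r is approximately N(0, 1/n)."""
    z = NormalDist().inv_cdf(1 - tail)  # about 2.33 for tail = 0.01
    return z / sqrt(n)

threshold = chance_threshold(697)
print(round(threshold, 3))  # 0.088

# The threshold shrinks only with the square root of the sample size.
for n in (697, 5000, 50000):
    print(n, round(chance_threshold(n), 4))
```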
In this section, we confirm our theoretical findings with the Genetic Analysis Workshop 17 (GAW17) mini-exome data. We performed a power analysis and examined the correlations between SNPs in the GAW17 data. We focused exclusively on the unrelated individuals data set, which consists of 697 individuals from seven populations and their genotypes and phenotypes. At each of the 24,487 SNPs, we used the software GenABEL [8, 9] to test the null hypothesis of Hardy-Weinberg equilibrium in each population separately. We removed 1,730 SNPs that yielded a p-value smaller than $10^{-4}$ in any of the populations, leaving 22,757 SNPs for our analyses.
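The filtering rule (drop a SNP if its Hardy-Weinberg p-value falls below $10^{-4}$ in any population) can be sketched as follows. This is not the GenABEL procedure used in the study: for brevity the sketch substitutes the asymptotic 1-df chi-square test for an exact test, which is a poor approximation for rare variants. The function names and the genotype counts are hypothetical.

```python
from math import erfc, sqrt

def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Asymptotic chi-square test of Hardy-Weinberg equilibrium (1 df).

    Swapped-in simplification: an exact test would be preferable for
    SNPs with few minor allele copies.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:           # monomorphic SNP: nothing to test
        return 1.0
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))

def keep_snp(counts_by_population, threshold=1e-4):
    """Keep a SNP only if no population yields a p-value below threshold."""
    return all(hwe_chisq_pvalue(*c) >= threshold for c in counts_by_population)

# Hypothetical genotype counts (AA, AB, BB) in two populations.
print(keep_snp([(25, 50, 25), (36, 48, 16)]))  # True: both are in HWE
print(keep_snp([(25, 50, 25), (50, 0, 50)]))   # False: no heterozygotes
```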
Power of score tests
The power analysis was performed using the quantitative risk factor Q1. The true simulation model [10] was known to our group. In the power analysis, the factors Age, Sex, Smoke, and Population were considered as covariates (the $Z_k$ in Eq. (1)).
We next present a concrete instance of the hyper-LD effect. In this instance, none of the causal SNPs that contribute to the power of the score test are located near the test SNP. The score tests show high power (simulated power = 1) at a cluster of SNPs (C12S704, …, C12S709) on chromosome 12. These SNPs are not causal SNPs in the simulation model. In fact, none of the causal SNPs are located on chromosome 12. Note that the power of the score tests at this cluster of SNPs is still well explained by Eq. (6). For instance, 13 causal SNPs have correlations $r_{\tau j} > 0.1$ with SNP C12S706. At SNP C12S706, the cumulative effect of the causal SNPs resulting from correlations is greater than the direct effect $h_j$ of all but one causal SNP (the one exception being SNP C13S522 with $h_j = 0.23$). The power of the score test reflects this cumulative effect. The fact that single-SNP score tests can show high power at SNPs not located near any causal SNP makes single-SNP tests unreliable as a tool for mapping trait genes.
Correlations between SNP genotypes
To demonstrate that weak but significant correlations between SNPs can arise by chance alone, we simulated another set of 22,757 SNPs that are in linkage equilibrium with each other: The SNP genotypes were simulated to be independent and in Hardy-Weinberg equilibrium, with MAFs matching those of the SNPs in the GAW17 data. Again, for each simulated SNP, we counted the number of SNPs having correlation coefficients greater than 0.1 with it. The results are shown in Figure 2b. Figure 2b confirms that modest correlations between SNPs can arise by chance and that rarer SNPs are more prone to the effects of random fluctuations. The overall correlation level in the GAW17 data set is higher than that in the simulated data set. This is expected because genuine LD does exist in the GAW17 data.
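A scaled-down version of this simulation can be sketched with NumPy. The sketch makes several assumptions that differ from the study: it uses 600 SNPs rather than 22,757, and two hypothetical MAF classes (0.3 and 0.01) rather than MAFs matched to the GAW17 data. The function names are also hypothetical.

```python
import numpy as np

def simulate_genotypes(mafs, n, rng):
    """Independent SNPs in Hardy-Weinberg equilibrium:
    each genotype is Binomial(2, MAF), drawn independently."""
    return rng.binomial(2, mafs, size=(n, len(mafs)))

def count_correlated(genotypes, threshold=0.1):
    """For each SNP, count the other SNPs whose sample correlation
    with it exceeds the threshold."""
    centered = genotypes - genotypes.mean(axis=0)
    norms = np.sqrt((centered ** 2).sum(axis=0))
    norms[norms == 0] = np.inf        # monomorphic SNPs get correlation 0
    z = centered / norms
    r = z.T @ z                       # matrix of pairwise correlations
    np.fill_diagonal(r, 0.0)          # ignore each SNP's self-correlation
    return (r > threshold).sum(axis=1)

rng = np.random.default_rng(17)
n_individuals = 697
# Hypothetical MAFs: 300 common SNPs followed by 300 rare SNPs.
mafs = np.concatenate([np.full(300, 0.3), np.full(300, 0.01)])

genotypes = simulate_genotypes(mafs, n_individuals, rng)
counts = count_correlated(genotypes)

print("mean count, common SNPs:", counts[:300].mean())
print("mean count, rare SNPs:  ", counts[300:].mean())
```

Although every SNP here is simulated independently, each one still shows correlations above 0.1 with some others, and the rare SNPs do so more often than the common ones.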
[Table: Distributions of the number of causal SNPs significantly correlated ($r_{\tau j} > 0.1$) with each SNP. Column: number of correlated ($r_{\tau j} > 0.1$) causal SNPs.]
Discussion and conclusions
We have demonstrated a phenomenon that we call the hyper-LD effect in which the statistical significance of a score test can be explained by the collective effect of weak correlations between the test SNP and distant causal SNPs. Tests performed at rare SNPs are particularly prone to this hyper-LD effect. In the presence of multiple causal SNPs, the results of single-SNP score tests can be dominated by the hyper-LD effect and thus can provide misleading information for mapping trait genes if they are misinterpreted.
We emphasize the importance of weak and long-range correlations between SNPs in association studies. These long-range correlations can be due to genuine LD, random fluctuation, or both. The magnitude of the random correlations arising by chance will decrease as the population size increases (Eq. (10)), but genuine LD between distant SNPs resulting from processes that reflect population history will persist. We speculate that more causal SNPs will be present in a larger population. If the number of causal SNPs is larger, even weaker correlations will contribute appreciably to the power of the association tests. So even in large populations, the hyper-LD effect will still be of concern.
Possible approaches to alleviating the hyper-LD effect include increasing the study population size, effectively increasing the MAFs by collapsing rare variants, using gene-set or pathway analysis, and combining information from family-based linkage or association analysis. The effectiveness of these approaches needs to be investigated in future studies.
We thank the GAW17 organizers. YD thanks Thomas Lumley for helpful discussions.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
1. Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 2002, 70: 425-434. 10.1086/338688.
2. Clayton D, Chapman J, Cooper J: Use of unphased multilocus genotype data in indirect association studies. Genet Epidemiol. 2004, 27: 415-428. 10.1002/gepi.20032.
3. Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, et al: The structure of haplotype blocks in the human genome. Science. 2002, 296: 2225-2229. 10.1126/science.1069424.
4. Chapman JM, Cooper JD, Todd JA, Clayton DG: Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum Hered. 2003, 56: 18-31. 10.1159/000073729.
5. Bickel PJ, Doksum KA: Mathematical Statistics: Basic Ideas and Selected Topics. Volume 1. 2nd edition. Upper Saddle River, NJ: Prentice Hall; 2001.
6. Li B, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83: 311-321. 10.1016/j.ajhg.2008.06.024.
7. Madsen BE, Browning SR: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009, 5: e1000384. 10.1371/journal.pgen.1000384.
8. Aulchenko YS, Ripke S, Isaacs A, van Duijn CM: GenABEL: an R library for genome-wide association analysis. Bioinformatics. 2007, 23: 1294-1296. 10.1093/bioinformatics/btm108.
9. Wigginton JE, Cutler DJ, Abecasis GR: A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet. 2005, 76: 887-893. 10.1086/429864.
10. Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2. 10.1186/1753-6561-5-S9-S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.