Methods for detecting associations between phenotype and aggregations of rare variants
- Fan Yang†^{1},
- Chul Joo Kang†^{1} and
- Paul Marjoram^{1}Email author
https://doi.org/10.1186/1753-6561-5-S9-S51
© Yang et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Abstract
Although genome-wide association studies have uncovered variants associated with more than 150 traits, the percentage of phenotypic variation explained by these associations remains small. This has led to the search for the dark matter that explains this missing genetic component of heritability. One potential explanation for dark matter is rare variants, and several statistics have been devised to detect associations resulting from aggregations of rare variants in relatively short regions of interest, such as candidate genes. In this paper we investigate the feasibility of extending this approach in an agnostic way, in which we consider all variants within a much broader region of interest, such as an entire chromosome or even the entire exome. Our method searches for subsets of variant sites using either Markov chain Monte Carlo or genetic algorithms. The analysis was performed with knowledge of the Genetic Analysis Workshop 17 answers.
Background
The modern genomics era holds forth promise of great advances in terms of our understanding of the genetic causes of disease. The ability to interrogate orders of magnitude more data than what was previously available has indeed led to a rapid increase in the rate at which we uncover variants, such as single-nucleotide polymorphisms (SNPs), that are associated with phenotypes. For example, a meta-analysis of height has led to the discovery of about 50 associated variants (see [1] and refs therein). However, the total variance explained by those variants is 5%, whereas from empirical observation we know that heritability of height is about 80%. This somewhat depressing scenario has been repeated in many disease studies (in which the sample sizes are typically much smaller than was the case for height) and has led to the adoption of the term dark matter to describe this phenomenon [2].
Dark matter is the purported explanation for the missing genetic heritability of phenotype. One potential form of dark matter is rare variants [3]. Modern so-called SNP-chip platforms (e.g., Illumina or Affymetrix technologies) were designed, with good reason, to use only common SNPs. This design is likely to give the greatest cost efficiency in terms of gathering as much data as possible given the physical and/or financial restraints in play. However, it also means that if disease phenotypes are in large part due to the presence of collections of rare variants, the so-called multiple rare variants (MRV) hypothesis, then such platforms are likely to have low utility for detecting these associations. Of course, it is also possible that dark matter is in fact just lack of power (e.g., as a result of our poor ability to detect interactions), but in this paper we focus on the first possible explanation by analyzing the Genetic Analysis Workshop 17 (GAW17) data [4] and searching for MRVs.
We focus on methods in which scores are calculated for aggregations of SNPs. Several score statistics have been proposed (see Methods section), but all have been applied in contexts in which a relatively small region of interest (say, a gene) is under consideration. In this application we seek to extend the application of such statistics to larger contexts. We do this by applying optimization algorithms to detect appropriate subsets of the set of all possible SNPs. We then apply these methods to the GAW17 data, for the quantitative trait data, assessing significance of results by means of permutation tests.
Methods
Score statistics
We focus on three score statistics. First, we consider the statistic σ_{LL}, motivated by Li and Leal [5], in which the score of a set of SNPs S = {S_{1}, …, S_{ n }} for individual I is defined to be the total count of mutant alleles summed across those SNPs for I. Second, we consider the statistic σ_{MB}, which follows Madsen and Browning [6] in that the contribution to the score from each SNP is weighted by the inverse of the frequency of its minor allele in the sample. Third, we consider a novel statistic, σ_{ME}, which weights the count of mutant alleles for a SNP by the marginal effect size of that SNP, estimated either by log odds ratio for a binary trait or by the coefficient in the univariate linear regression for a quantitative trait.
Optimization algorithms
Traditionally, score statistics are applied to predetermined small regions of interest, and all SNPs contained in that region are included in the evaluation of the score statistic. It would be useful to be able to extend the region of search so that the analysis might be more agnostic regarding what is being searched for. For example, we might consider a region defined as all genes in a given pathway. To explore the computational limits to this approach, we use a chromosome as a unit corresponding to a “large region.” Unfortunately, the performance of score statistic methods, such as those mentioned, is known to deteriorate as the fraction of nonassociated statistics in the region increases. For that reason, to extend these methods to wider regions, we use an optimization method to choose a subset of statistics for which the score is calculated. For the purposes of the present study and after some experimentation, we choose to use subsets of fixed size: 20 SNPs. We compare results for two optimization methods: the Markov chain Monte Carlo (MCMC) method and genetic algorithms (GAs).
In our MCMC implementation we run five chains in parallel (Metropolis coupled MCMC). The MCMC process depends crucially on its proposal kernel, the rule by which SNPs are added or removed from the set of SNPs currently under consideration. Here, we use a kernel that samples SNPs with a probability proportional to their association with the phenotype of interest. We use the square of the Pearson correlation coefficient as the measure of association between score statistic and phenotype. For mathematical convenience, we assume that the residuals of the three quantitative trait loci (Q1, Q2, and Q4) are normally distributed after model fitting and that there is a linear relationship between the value of the score statistic and the mean trait value. This allows us to approximate the likelihood term contained within the Hastings ratio as a function of the correlation coefficient between score and trait; otherwise, the likelihood term would be intractable.
Each run of the MCMC process results in a posterior distribution for the inclusion of each SNP in the subset of SNPs that is used to calculate the score. Therefore these SNPs may be associated with the phenotype of interest. To make this more directly comparable with GAs, which search for a single, optimal subset, we also record the single best solution discovered in each run. This also allows for simple compilation of results across replicates when we report summaries of our results.
For GAs, populations of “chromosomes” evolve in a context in which their fitness is measured by their ability to predict the phenotype of interest. Here, the chromosome simply defines a subset of SNPs, and fitness is measured as a function of the Pearson correlation coefficient between the score and the quantitative trait for those SNPs. The chromosomes evolve subject to mutation and recombination; the identity of the SNPs contained in each chromosome (and hence used when the score is calculated) changes. Probability of reproduction is proportional to fitness, thus encouraging the more successful chromosomes to have more “offspring.” A full description of GAs is not possible given the space constraints here, but see Mitchell [7] for a more comprehensive introduction.
Assessment of significance
In all cases we assess significance of overall fit by creating 1,000 data sets in which the phenotype is randomly permuted to determine the null distribution of the best-performing score statistic at the end of the analysis. To assess whether we are finding the functional SNPs or regions, we also report the frequency with which SNPs were included in the optimal solution, accumulated across all 200 phenotype replicates.
Results
Inflated type I error rates
MCMC analysis: chromosome-wide p-values
Q1 | Q2 | Q4 | ||||
---|---|---|---|---|---|---|
Chromosome | Unadjusted | Adjusted | Unadjusted | Adjusted | Unadjusted | Adjusted |
1 | 0.020 | 0.148 | 0.688 | 0.594 | 0.061 | 0.213 |
2 | 0.001 | 0.285 | 0.217 | 0.348 | 0.542 | 0.403 |
3 | 0.026 | 0.240 | 0.117 | 0.811 | 0.810 | 0.593 |
4 | 0.002 | 0.414 | 0.191 | 0.183 | 0.202 | 0.641 |
5 | 0.008 | 0.841 | 0.050 | 0.658 | 0.544 | 0.038 |
6 | 0.002 | 0.014 | 0.187 | 0.050 | 0.002 | 0.022 |
7 | 0.001 | 0.017 | 0.003 | 0.459 | 0.507 | 0.333 |
8 | 0.044 | 0.164 | 0.068 | 0.939 | 0.351 | 0.875 |
9 | 0.022 | 0.232 | 0.174 | 0.869 | 0.001 | 0.124 |
10 | 0.002 | 0.844 | 0.183 | 0.390 | 0.586 | 0.532 |
11 | 0.002 | 0.223 | 0.601 | 0.874 | 0.266 | 0.357 |
12 | 0.010 | 0.267 | 0.310 | 0.692 | 0.551 | 0.235 |
13 | 0.001 | 0.001 | 0.097 | 0.047 | 0.266 | 0.707 |
14 | 0.001 | 0.006 | 0.64 | 0.832 | 0.551 | 0.808 |
15 | 0.140 | 0.563 | 0.923 | 0.663 | 0.006 | 0.571 |
16 | 0.016 | 0.229 | 0.198 | 0.190 | 0.269 | 0.245 |
17 | 0.008 | 0.464 | 0.499 | 0.694 | 0.355 | 0.680 |
18 | 0.004 | 0.037 | 0.945 | 0.958 | 0.033 | 0.046 |
19 | 0.001 | 0.034 | 0.519 | 0.368 | 0.514 | 0.570 |
20 | 0.904 | 0.003 | 0.053 | 0.586 | 0.034 | 0.224 |
21 | 0.597 | 0.611 | 0.076 | 0.552 | 0.015 | 0.064 |
22 | 0.002 | 0.633 | 0.170 | 0.831 | 0.008 | 0.628 |
GA analysis: chromosome-wide p-values
Q1 | Q2 | Q4 | ||||
---|---|---|---|---|---|---|
Chromosome | Unadjusted | Adjusted | Unadjusted | Adjusted | Unadjusted | Adjusted |
1 | 0.018 | 0.506 | 0.385 | 0.632 | 0.149 | 0.015 |
2 | 0.001 | 0.001 | 0.311 | 0.401 | 0.137 | 0.769 |
3 | 0.017 | 0.912 | 0.235 | 0.958 | 0.364 | 0.371 |
4 | 0.006 | 0.021 | 0.054 | 0.726 | 0.136 | 0.089 |
5 | 0.002 | 0.128 | 0.041 | 0.947 | 0.082 | 0.067 |
6 | 0.001 | 0.003 | 0.169 | 0.209 | 0.019 | 0.013 |
7 | 0.002 | 0.070 | 0.067 | 0.665 | 0.017 | 0.422 |
8 | 0.258 | 0.140 | 0.646 | 0.473 | 0.078 | 0.220 |
9 | 0.017 | 0.124 | 0.124 | 0.103 | 0.065 | 0.061 |
10 | 0.196 | 0.003 | 0.174 | 0.434 | 0.042 | 0.362 |
11 | 0.006 | 0.032 | 0.609 | 0.869 | 0.190 | 0.702 |
12 | 0.017 | 0.073 | 0.393 | 0.730 | 0.167 | 0.178 |
13 | 0.001 | 0.002 | 0.482 | 0.568 | 0.427 | 0.060 |
14 | 0.002 | 0.001 | 0.958 | 0.404 | 0.295 | 0.160 |
15 | 0.541 | 0.266 | 0.422 | 0.269 | 0.254 | 0.579 |
16 | 0.156 | 0.401 | 0.280 | 0.364 | 0.235 | 0.643 |
17 | 0.588 | 0.403 | 0.619 | 0.966 | 0.509 | 0.598 |
18 | 0.012 | 0.056 | 0.417 | 0.840 | 0.464 | 0.670 |
19 | 0.067 | 0.014 | 0.262 | 0.554 | 0.041 | 0.479 |
20 | 0.152 | 0.006 | 0.048 | 0.204 | 0.054 | 0.136 |
21 | 0.663 | 0.073 | 0.006 | 0.250 | 0.088 | 0.065 |
22 | 0.004 | 0.062 | 0.127 | 0.108 | 0.072 | 0.095 |
What SNPs are detected?
MCMC analysis: chi-square test for detection of causal SNPs
Q1 | Q2 | ||
---|---|---|---|
Chromosome | p-value | Chromosome | p-value |
1 | 1.1 × 10^{−2} | 2 | 0 |
4 | 1.5 × 10^{−8} | 3 | 3.5 × 10^{−3} |
5 | 1.7 × 10^{−1} | 6 | 0 |
6 | 0 | 7 | 4.2 × 10^{−2} |
13 | 0 | 8 | 7.7 × 10^{−7} |
14 | 2.0 × 10^{−1} | 9 | 4.6 × 10^{−1} |
19 | 8.5 × 10^{−1} | 10 | 1.2 × 10^{−2} |
12 | 2.4 × 10^{−1} | ||
17 | 4.2 × 10^{−1} |
GA analysis: chi-square test for detection of causal SNPs
Q1 | Q2 | ||
---|---|---|---|
Chromosome | p-value | Chromosome | p-value |
1 | 6.2 × 10^{−2} | 2 | 3.4 × 10^{−1} |
4 | 0 | 3 | 0 |
5 | 0 | 6 | 0 |
6 | 9.3 × 10^{−2} | 7 | 2.8 × 10^{−3} |
13 | 0 | 8 | 0 |
14 | 0^{a} | 9 | 3.1 × 10^{−1} |
19 | 2.5 × 10^{−4} | 10 | 0 |
12 | 5.2 × 10^{−2} | ||
17 | 1.3 × 10^{−1} |
Illustrative results
Discussion and conclusions
Recently, many methods have been developed that exploit score statistics to detect associations between aggregations of (rare) variants and phenotype in an attempt to uncover at least some of the causes of the phenomenon of dark matter. However, such methods have thus far been applied in contexts in which only a relatively narrow region of interest is being investigated and, typically, all SNPs contribute to the score (subject to inclusion criteria such as thresholds based on minor allele frequency, for example). It would be of great use to have similar approaches that are more agnostic in nature and that explore entire chromosomes, or ideally genomes, to detect subsets of SNPs for which a significant association between phenotype and score is obtained when a score is calculated in a manner analogous to the existing methods for narrow regions.
The problem, of course, is the sheer magnitude of the number of possible subsets one might consider. It is impossible to explore the space of all possible subsets exhaustively. For that reason, some sort of guidance is needed. Here, we use two search algorithms: GAs, which are designed to find optimal solutions in a complex state space; and MCMC, which is designed to calculate posterior distributions defined over complex state spaces.
The GAW17 data set proved troublesome for all groups to analyze, in large part because of the extremely high type I errors. To some degree this remained an issue even after adjustment for population structure, as is the case in our own analysis. As such, power is low across the board, and this means that it is hard to draw meaningful conclusions regarding the relative merits of alternative methods. This is the case with our own analysis. Optimization with the MCMC method appears to perform better than that with GAs in general, but it is not clear whether this might be due to the particular way in which the GA was implemented (perhaps longer runs or use of different evolutionary parameters might improve performance) or whether it is a consequence of particular features of the data.
Our experience in the current setting is that attempting to search across the entire genome or the portion of the exome included in the GAW17 data has proved unsuccessful. Even with the reduced size of the GAW17 data compared to what would be obtained in a full exome- or genome-wide next-generation sequencing study and given the consequent lower dimension of the state space over which the selection of SNPs needs to be optimized, in general we could not find meaningful optima [results not shown]. For that reason we focus in this paper on results for chromosome-wide analysis.
When we use the chromosome as the unit of analysis, the results are encouraging in the sense that we tend to obtain more significant results on chromosomes on which associated SNPs are found. However, once a chromosome has been identified as containing an association with phenotype, the key question becomes which (set of) SNPs are driving the association? As we have shown in the illustrative results in this paper, although sometimes clear signals are found (e.g., for Q1 on chromosome 13), in other cases there is no clear signal to show which SNPs are associated (e.g., for Q2 on chromosome 17). This problem was likely worsened by the problems with type I error, which drastically curtailed overall power. However, in Tables 3 and 4 we show a significant tendency to pick out the truly associated SNPs in cases in which the chromosome-wide p-value is small and sometimes even when significance was not obtained at the chromosomal level.
As the size of region being considered grows larger, it becomes imperative to find some other way of guiding the search for optimal subsets of SNPs. One potential approach is to include external information regarding which SNPs are most likely to be informative. For example, one might include functional information from external databases. This should improve the performance of the search algorithms and thereby allow for more successful optimization over larger regions.
Another issue with methods that search across a large state space of potentially informative variation is the issue of overfitting. In this paper we guard against this in two ways: by using fixed-size subsets of SNPs and by assessing significance by means of a permutation test. A further response to this issue would be to use shrinkage methods, such as penalized regression, to reduce the number of variables, in our case SNPs, that might be included in the optimal subset. We intend to explore this approach in future work.
Finally, we see some evidence that the novel score statistic we propose in this paper may prove to be more powerful than existing statistics (see Figure 8). Rather than concluding anything regarding the relative merits of the statistics considered here, which is impossible given the focus on the GAW17 data in particular, we note that optimal construction of a score statistic is still an open question and is worthy of further investigation.
In summary, although our results are somewhat preliminary, in that the conclusions are based on a single data set, at least in terms of the genotype data, we believe that they do show the potential of using optimization approaches to extend existing score-based methods over wider regions than those currently considered. Selection of a subset of SNPs within a region should lessen the existing tendency for the performance of such methods to deteriorate as the number of nonassociated SNPs in the region of interest increases. In traditional analyses this occurs because such SNPs are always included in the calculation of the score statistic, whereas by choosing a subset of SNPs, such statistics will be excluded, if the method is working well, leading to an expected increase in power.
Notes
Declarations
Acknowledgments
We gratefully acknowledge funding from National Institutes of Health (NIH) grant MH084678, helpful comments from two reviewers, and helpful discussions with Duncan Thomas, Gary Chen, and Alex Stram. Genetic Analysis Workshop 17 is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
Authors’ Affiliations
References
- Visscher P: Sizing up human height variation. Nat Genet. 2008, 40: 489-490. 10.1038/ng0508-489.View ArticlePubMedGoogle Scholar
- Manolio T, Collins F, Cox N, Goldstein D, Hindroff L, Hunter D, McCarthy M, Ramos E, Cardon L, Chakravarti A, et al: Finding the missing heritability of complex disease. Nature. 2009, 461: 747-753. 10.1038/nature08494.PubMed CentralView ArticlePubMedGoogle Scholar
- International Schizophrenia Consortium: Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009, 460: 748-752.PubMed CentralGoogle Scholar
- Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2-10.1186/1753-6561-5-S9-S2.PubMed CentralView ArticlePubMedGoogle Scholar
- Li B, Leal S: Method for detecting associations with rare variants for common disease. Am J Hum Genet. 2008, 83: 311-321. 10.1016/j.ajhg.2008.06.024.PubMed CentralView ArticlePubMedGoogle Scholar
- Madsen BE, Browning S: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009, 5: e1000384-10.1371/journal.pgen.1000384.PubMed CentralView ArticlePubMedGoogle Scholar
- Mitchell M: An Introduction to Genetic Algorithms. 1996, Cambridge, MA, MIT PressGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.