Methods for detecting associations between phenotype and aggregations of rare variants

Although genome-wide association studies have uncovered variants associated with more than 150 traits, the percentage of phenotypic variation explained by these associations remains small. This has led to the search for the dark matter that explains this missing genetic component of heritability. One potential explanation for dark matter is rare variants, and several statistics have been devised to detect associations resulting from aggregations of rare variants in relatively short regions of interest, such as candidate genes. In this paper we investigate the feasibility of extending this approach in an agnostic way, in which we consider all variants within a much broader region of interest, such as an entire chromosome or even the entire exome. Our method searches for subsets of variant sites using either Markov chain Monte Carlo or genetic algorithms. The analysis was performed with knowledge of the Genetic Analysis Workshop 17 answers.


Background
The modern genomics era holds forth promise of great advances in terms of our understanding of the genetic causes of disease. The ability to interrogate orders of magnitude more data than what was previously available has indeed led to a rapid increase in the rate at which we uncover variants, such as single-nucleotide polymorphisms (SNPs), that are associated with phenotypes. For example, a meta-analysis of height has led to the discovery of about 50 associated variants (see [1] and refs therein). However, the total variance explained by those variants is 5%, whereas from empirical observation we know that heritability of height is about 80%. This somewhat depressing scenario has been repeated in many disease studies (in which the sample sizes are typically much smaller than was the case for height) and has led to the adoption of the term dark matter to describe this phenomenon [2].
Dark matter is the purported explanation for the missing genetic heritability of phenotype. One potential form of dark matter is rare variants [3]. Modern so-called SNP-chip platforms (e.g., Illumina or Affymetrix technologies) were designed, with good reason, to use only common SNPs. This design is likely to give the greatest cost efficiency in terms of gathering as much data as possible given the physical and/or financial restraints in play. However, it also means that if disease phenotypes are in large part due to the presence of collections of rare variants, the so-called multiple rare variants (MRV) hypothesis, then such platforms are likely to have low utility for detecting these associations. Of course, it is also possible that dark matter is in fact just lack of power (e.g., as a result of our poor ability to detect interactions), but in this paper we focus on the first possible explanation by analyzing the Genetic Analysis Workshop 17 (GAW17) data [4] and searching for MRVs.
We focus on methods in which scores are calculated for aggregations of SNPs. Several score statistics have been proposed (see Methods section), but all have been applied in contexts in which a relatively small region of interest (say, a gene) is under consideration. In this application we seek to extend the application of such statistics to larger contexts. We do this by applying optimization algorithms to detect appropriate subsets of the set of all possible SNPs. We then apply these methods to the GAW17 data, for the quantitative trait data, assessing significance of results by means of permutation tests.

Score statistics
We focus on three score statistics. First, we consider the statistic s LL , motivated by Li and Leal [5], in which the score of a set of SNPs S = {S 1 , …, S n } for individual I is defined to be the total count of mutant alleles summed across those SNPs for I. Second, we consider the statistic s MB , which follows Madsen and Browning [6] in that the contribution to the score from each SNP is weighted by the inverse of the frequency of its minor allele in the sample. Third, we consider a novel statistic, s ME , which weights the count of mutant alleles for a SNP by the marginal effect size of that SNP, estimated either by log odds ratio for a binary trait or by the coefficient in the univariate linear regression for a quantitative trait.

Optimization algorithms
Traditionally, score statistics are applied to predetermined small regions of interest, and all SNPs contained in that region are included in the evaluation of the score statistic. It would be useful to be able to extend the region of search so that the analysis might be more agnostic regarding what is being searched for. For example, we might consider a region defined as all genes in a given pathway. To explore the computational limits to this approach, we use a chromosome as a unit corresponding to a "large region." Unfortunately, the performance of score statistic methods, such as those mentioned, is known to deteriorate as the fraction of nonassociated statistics in the region increases. For that reason, to extend these methods to wider regions, we use an optimization method to choose a subset of statistics for which the score is calculated. For the purposes of the present study and after some experimentation, we choose to use subsets of fixed size: 20 SNPs. We compare results for two optimization methods: the Markov chain Monte Carlo (MCMC) method and genetic algorithms (GAs).
In our MCMC implementation we run five chains in parallel (Metropolis coupled MCMC). The MCMC process depends crucially on its proposal kernel, the rule by which SNPs are added or removed from the set of SNPs currently under consideration. Here, we use a kernel that samples SNPs with a probability proportional to their association with the phenotype of interest. We use the square of the Pearson correlation coefficient as the measure of association between score statistic and phenotype. For mathematical convenience, we assume that the residuals of the three quantitative trait loci (Q1, Q2, and Q4) are normally distributed after model fitting and that there is a linear relationship between the value of the score statistic and the mean trait value. This allows us to approximate the likelihood term contained within the Hastings ratio as a function of the correlation coefficient between score and trait; otherwise, the likelihood term would be intractable.
Each run of the MCMC process results in a posterior distribution for the inclusion of each SNP in the subset of SNPs that is used to calculate the score. Therefore these SNPs may be associated with the phenotype of interest. To make this more directly comparable with GAs, which search for a single, optimal subset, we also record the single best solution discovered in each run. This also allows for simple compilation of results across replicates when we report summaries of our results.
For GAs, populations of "chromosomes" evolve in a context in which their fitness is measured by their ability to predict the phenotype of interest. Here, the chromosome simply defines a subset of SNPs, and fitness is measured as a function of the Pearson correlation coefficient between the score and the quantitative trait for those SNPs. The chromosomes evolve subject to mutation and recombination; the identity of the SNPs contained in each chromosome (and hence used when the score is calculated) changes. Probability of reproduction is proportional to fitness, thus encouraging the more successful chromosomes to have more "offspring." A full description of GAs is not possible given the space constraints here, but see Mitchell [7] for a more comprehensive introduction.

Assessment of significance
In all cases we assess significance of overall fit by creating 1,000 data sets in which the phenotype is randomly permuted to determine the null distribution of the bestperforming score statistic at the end of the analysis. To assess whether we are finding the functional SNPs or regions, we also report the frequency with which SNPs were included in the optimal solution, accumulated across all 200 phenotype replicates.

Inflated type I error rates
We begin by running our analyses on entire chromosomes, one chromosome at a time. Empirical p-values for overall model fit are obtained as described in the "Assessment of Significance" subsection and are shown in Table 1 for the MCMC analysis (see the column for unadjusted results), and in Table 2 for the GA analysis (again, see the unadjusted column in the table). For convenience, results for chromosomes containing SNPs that are directly associated with the corresponding trait (per the GAW17 answers) are shown in boldfaced text. The results are similar for all three statistics we considered, so here and throughout the manuscript where results for just one statistic are presented, we show results for the Madsen-Browning statistic (s MB ) only.
The most striking feature of the results is a grossly inflated type I error rate. To further demonstrate this issue and to demonstrate the degree to which it is reproducible across replicates, we present quantile-quantile (Q-Q) plots for replicates 1 and 2 for results from the MCMC analysis in Figures 1 and 2. (Results from the GA analysis were similar and so are not shown.) We see a clear inflation of type I error across both replicates, most strikingly for Q1 but also for Q2 and, in replicate 1 only, Q4, which has no direct association with any SNP in this data set. The difference between the results for replicates 1 and 2 for Q4 is striking because this is essentially stochastic noise (the replicates are independent, identically distributed realizations of the same disease model simulation with exactly the same SNP data).
These results clearly indicate the need for correction for inflated type I error, which is likely partly the result of population stratification. For this reason, we adjust for population structure using principal components (PCs). After exploring a range of the possible number of PCs to include in the analysis, we determined that 10 PCs perform well in terms of reducing the inflation of type I error, and so all further results in this paper show results obtained after regressing out the first 10 PCs. Overall p-values for model fit across each chromosome are shown in Table 1 (see the adjusted column), and we show Q-Q plots for the PC-adjusted MCMC results in Figure 3. Although the Q-Q plots for replicate 1 now look more satisfactory, with the null trait, Q4, looking suitably uniform (although Q1 is still highly inflated), we note that these results vary a good deal across replicates. For example, Q2 shows some inflation of −log p-values after PC correction in replicate 2 (see Figure 2), and because Q2 should show association with some of the SNPs in the data, this may well prove to be a better outcome than what we see in replicate 1, which most likely indicates a loss of power. However, because we wish to avoid mining the data-choosing a different number of PCs for each replicate (this is computationally intractable in the time available anyway)-for the results shown we analyze each of the 200 replicates after adjusting for population stratification using 10 PCs.
What SNPs are detected?
Tables 1 and 2 report p-values for the overall fit of the model on each chromosome, but they do not indicate whether we are actually finding the causal SNPs. To assess this, in Tables 3 and 4 we consider the results of an independent analysis of each of the 200 replicates of the GAW17 data and report the p-values resulting from a chi-square test for nonrandom association between those SNPs that were included in the best aggregation of SNPs found by our MCMC (Table 3) and GA (Table 4) analysis (20 SNPs for each replicate) and those that were actually causal (assessed by referencing the GAW17 answers). Analysis was performed after correction for population stratification using 10 PCs. Of course, we cannot expect to necessarily detect a causal SNP, because any other SNP in perfect linkage disequilibrium will serve just as well in our score statistic and other SNPs in high linkage disequilibrium might also be detected instead. Despite these practical realities, we see that,  although results do vary from chromosome to chromosome, we do frequently (but not always) observe a significant tendency to preferentially include causal SNPs for analyses in which significance was observed at the chromosomal level (cf. p-values in Tables 1 and 2).

Illustrative results
Having investigated the issue of type I error rates and having made some attempt to mitigate its effects, we now show illustrative results for particular chromosomes of interest. Space prohibits including a large number of examples, so we instead provide results for analysis of Q1 on three chromosomes: 1, 13, and 21. Chromosome 13 contains a region with large effect, so we might hope to see a strong signal. Chromosome 1 contains two SNPs affecting Q1, and chromosome 21 contains no associated SNPs. Results for the MCMC analysis are shown in Figure 4 and those for the GA analysis are shown in Figure 5. We see qualitatively different behavior, with a clear peak being seen on chromosome 13 but not on chromosome 1 or 21. Furthermore, the signal appears to be stronger in the MCMC analysis than in the GA analysis. This trend is observed across most analyses. However, it is important to note that in the analysis in which we correct for population structure, chromosome 1 does not attain a significant p-value after the permutation test, whereas without PC correction for structure, significance is obtained in most cases. This illustrates a peril of correcting for population structure: If phenotype is also correlated with structure, then loss of power can result.
As a further example we show the results of the analysis of Q2 on chromosomes 6, 17, and 21. Again these chromosomes are chosen because they should have strong, medium, and no signal, respectively. The results are shown in Figures 6 and 7. Again, power appears to be lost because of the correction for stratification, although the MCMC analysis does appear to find a clear signal for chromosome 6. Interestingly the GA analysis results for chromosome 6 before PC correction also show a clear peak (results not shown).  Finally, we present an illustrative but representative example of the differences observed between the performances of the three statistics. This is shown in Figure 8, where we see results of the GA analysis of Q2 on chromosome 6. We comment further on this in the Discussion and conclusions section.

Discussion and conclusions
Recently, many methods have been developed that exploit score statistics to detect associations between aggregations of (rare) variants and phenotype in an attempt to uncover at least some of the causes of the phenomenon of dark matter. However, such methods have thus far been applied in contexts in which only a relatively narrow region of interest is being investigated and, typically, all SNPs contribute to the score (subject to inclusion criteria such as thresholds based on minor allele frequency, for example). It would be of great use to have similar approaches that are more agnostic in nature and that explore entire chromosomes, or ideally genomes, to detect subsets of SNPs for which a significant association between phenotype and score is obtained when a score is calculated in a manner analogous to the existing methods for narrow regions.
The problem, of course, is the sheer magnitude of the number of possible subsets one might consider. It is impossible to explore the space of all possible subsets exhaustively. For that reason, some sort of guidance is needed. Here, we use two search algorithms: GAs, which are designed to find optimal solutions in a complex state space; and MCMC, which is designed to calculate posterior distributions defined over complex state spaces.
The GAW17 data set proved troublesome for all groups to analyze, in large part because of the extremely high type I errors. To some degree this remained an issue even after adjustment for population structure, as is the case in our own analysis. As such, power is low across the board, and this means that it is hard to draw meaningful conclusions regarding the relative merits of alternative methods. This is the case with our own analysis. Optimization with the MCMC method appears to perform better than that with GAs in general, but it is not clear whether this might be due to the particular way in which the GA was implemented (perhaps longer runs or use of different evolutionary parameters might improve performance) or whether it is a consequence of particular features of the data.
Our experience in the current setting is that attempting to search across the entire genome or the portion of the exome included in the GAW17 data has proved unsuccessful. Even with the reduced size of the GAW17 data compared to what would be obtained in a full exome-or genome-wide next-generation sequencing study and given the consequent lower dimension of the state space over which the selection of SNPs needs to be optimized, in general we could not find meaningful optima [results not shown]. For that reason we focus in this paper on results for chromosome-wide analysis.
When we use the chromosome as the unit of analysis, the results are encouraging in the sense that we tend to obtain more significant results on chromosomes on which associated SNPs are found. However, once a chromosome has been identified as containing an association with phenotype, the key question becomes which (set of) SNPs are driving the association? As we have shown in the illustrative results in this paper, although sometimes clear signals are found (e.g., for Q1 on chromosome 13), in other cases there is no clear signal to show which SNPs are associated (e.g., for Q2 on chromosome 17). We report p-values resulting from a chi-square test for nonrandom association, across the 200 replicates of the GAW17 data, between those SNPs that were included in the best aggregation of SNPs found by our MCMC analysis and those that were actually causal (assessed by referencing the GAW17 answers). Analysis was performed after correction for population stratification using 10 PCs. We report p-values resulting from a chi-square test for nonrandom association, across the 200 replicates of the GAW17 data, between those SNPs that were included in the best aggregation of SNPs found by our GA analysis and those that were actually causal (assessed by referencing the GAW17 answers). Analysis was performed after correction for population stratification using 10 PCs. a For the Q1-chromosome 14 combination, there was a significant tendency to miss the causal SNPs.
This problem was likely worsened by the problems with type I error, which drastically curtailed overall power. However, in Tables 3 and 4 we show a significant tendency to pick out the truly associated SNPs in cases in which the chromosome-wide p-value is small and sometimes even when significance was not obtained at the chromosomal level.
As the size of region being considered grows larger, it becomes imperative to find some other way of guiding the search for optimal subsets of SNPs. One potential approach is to include external information regarding which SNPs are most likely to be informative. For example, one might include functional information from external databases. This should improve the performance of the search algorithms and thereby allow for more successful optimization over larger regions.
Another issue with methods that search across a large state space of potentially informative variation is the issue of overfitting. In this paper we guard against this in two ways: by using fixed-size subsets of SNPs and by assessing significance by means of a permutation test. A further response to this issue would be to use shrinkage methods, such as penalized regression, to reduce the number of variables, in our case SNPs, that might be included in the optimal subset. We intend to explore this approach in future work.
Finally, we see some evidence that the novel score statistic we propose in this paper may prove to be more . The x-axis indexes the SNPs on the given chromosome (each bar corresponds to a single SNP; bars are equally spaced for clarity). The y-axis shows the frequency with which the SNP was included in the optimal solution. Red x's indicate positions of SNPs that are associated with the trait (according to GAW17 answers).
powerful than existing statistics (see Figure 8). Rather than concluding anything regarding the relative merits of the statistics considered here, which is impossible given the focus on the GAW17 data in particular, we note that optimal construction of a score statistic is still an open question and is worthy of further investigation.
In summary, although our results are somewhat preliminary, in that the conclusions are based on a single Figure 7 GA results for analysis of quantitative trait Q2. Histogram showing the frequency with which each SNP was included in the subset of SNPs on which the optimal score statistic was calculated, aggregated across all 200 replicates. Results are for chromosome 6 (top), chromosome 17 (middle), and chromosome 21 (bottom). The x-axis indexes the SNPs on the given chromosome (each bar corresponds to a single SNP; bars are equally spaced for clarity). The y-axis shows the frequency with which the SNP was included in the optimal solution. Red x's indicate positions of SNPs that are associated with the trait (according to GAW17 answers).

Figure 8
Example of differing results from the three statistics. Histograms showing the frequency with which each SNP was included in the subset of SNPs on which the optimal score statistic was calculated, aggregated across all 200 replicates. Results are for the GA analysis of Q2 on chromosome 6. Results are for s LL (top), s MB (middle), and s ME (bottom). The x-axis indexes the SNPs on chromosome 6 (each bar corresponds to a single SNP; bars are equally spaced for clarity). The y-axis shows the frequency with which the SNP was included in the optimal solution. Red x's indicate positions of SNPs that are associated with the trait (according to GAW17 answers). data set, at least in terms of the genotype data, we believe that they do show the potential of using optimization approaches to extend existing score-based methods over wider regions than those currently considered. Selection of a subset of SNPs within a region should lessen the existing tendency for the performance of such methods to deteriorate as the number of nonassociated SNPs in the region of interest increases. In traditional analyses this occurs because such SNPs are always included in the calculation of the score statistic, whereas by choosing a subset of SNPs, such statistics will be excluded, if the method is working well, leading to an expected increase in power.