Enhancing the discovery of rare disease variants through hierarchical modeling
© Chen; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Skip to main content
Volume 5 Supplement 9
© Chen; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Advances in next-generation sequencing technology are enabling researchers to capture a comprehensive picture of genomic variation across large numbers of individuals with unprecedented levels of efficiency. The main analytic challenge in disease mapping is how to mine the data for rare causal variants among a sea of neutral variation. To achieve this goal, investigators have proposed a number of methods that exploit biological knowledge. In this paper, I propose applying a Bayesian stochastic search variable selection algorithm in this context. My multivariate method is inspired by the combined multivariate and collapsing method. In this proposed method, however, I allow an arbitrary number of different sources of biological knowledge to inform the model as prior distributions in a two-level hierarchical model. This allows rare variants with similar prior distributions to share evidence of association. Using the 1000 Genomes Project single-nucleotide polymorphism data provided by Genetic Analysis Workshop 17, I show that through biologically informative prior distributions, some power can be gained over noninformative prior distributions.
Genome-wide association studies (GWAS) have been a powerful method for revealing common variants that confer a modest increase in disease risk in carriers. In general, the single-nucleotide polymorphisms (SNPs) that show the strongest evidence for association in GWAS do not perfectly tag the putative causal variant(s) nearby because of ancestral recombination events; therefore resequencing in these regions is necessary to resolve the precise location of the causal variant(s). Dickson et al.  postulated one possible explanation for why many fine-mapping efforts have failed to map a single causal SNP in the region tagged by the original genome-wide association signal: multiple rare variants (MRVs) residing on multiple haplotypes at the region of the genome-wide association signal are generating a “synthetic” association when these haplotypes share a common allele that is observed more in case subjects than in control subjects. In support of the MRV hypothesis, several investigators have recently developed a number of popular burden-type methods [2–4]. These methods are predicated on the notion that presence of or an increase in the number of mutations for a person at a particular pathway, region, gene, or any other biological unit can serve as a reasonable proxy for his/her risk of developing disease. The common theme among these methods is that the genotypes for MRVs that map to these biological units, called bins, are collapsed into a single vector of scores, a technique that can potentially improve statistical power to detect disease association. For example, in the combined multivariate and collapsing (CMC) method of Li and Leal , a score for an individual is assigned 1 if at least one mutation is observed across all SNPs within a bin, or 0 otherwise. The significance of a gene, for example, can then be tested by jointly modeling all bins that map within the gene using a multivariate method such as Hotelling’s multivariate T-test, logistic regression, or linear regression.
In this paper, I describe how I adapted the concept of the CMC method into a Bayesian variable selection algorithm with the notion that common SNPs may also contribute valuable information to nearby causal rare variants, assuming that the shared haplotype model  is true. The exon resequencing data set provided by the organizers of Genetic Analysis Workshop 17 (GAW17) provides an ideal opportunity for evaluating the performance of this new approach.
Details of the simulated GAW17 data set can be found in this same issue . I defined variants that had a minor allele frequency (MAF) less than 0.01 to be rare but potentially the most biologically interesting, because extremely rare mutations are expected to have the greatest deleterious effects on phenotype. Of all the SNPs provided in the data set, 73% (18,131) fall within this MAF range. For each gene, I applied the collapsing procedure, as described in the CMC method , by grouping rare SNPs into one of two bins defined by their predicted impact on protein (i.e., synonymous or nonsynonymous variant). Any bin with a MAF less than 0.01 after the collapsing procedure was not included for further analysis. Common SNPs, defined as those having a MAF ≥ 0.01, were not collapsed with any other SNPs in the gene. For conciseness, I use the term variable to define either a single common SNP or a SNP bin. The final marker panel included 7,385 variables: 1,029 bins containing collapsed rare variants and 6,356 bins containing common SNPs. I experimented with higher threshold values for bin definition (e.g., MAF = 0.05), but this strategy did not recover an appreciable number of bins from the filtering step because most genes in the data set were small and harbored private mutations. True log relative risks (denoted β) for each SNP are provided in the simulation answer key, which quantifies each SNP’s effect on the quantitative traits Q1 and Q2. Thus, to assess how accurately my method can recover the true values of β at each SNP, I constrained the analyses only to models where the outcome phenotype was either trait Q1 or Q2.
The Z matrix stores external knowledge about each of the m variables currently in the model. To encode my belief that deleterious mutations would have higher or lower values of β relative to other types of mutations, I assigned a value of 1 to the nonsynonymous mutation in the second column (after reserving the first column as the intercept) of the m × 2 design matrix Z and a value of 0 for any other SNP category. The term π, estimated using ordinary linear regression, relates the magnitude of β in Eq. (1) to values in Z. Furthermore, to encode my belief that mutations within the same gene should have similar effects on disease, I specified an indicator encoding whether predictor k and any other model variables are in the same gene by means of a k × k adjacency matrix A. Specifically, the parameter stores the mean of the MLE from the first level, taken across neighbors of variable k (i.e., all other variables that are in the same gene) defined by means of A. The variance term τ2 is inversely scaled by v k , the number of neighbors of k to weight the uncertainty about τ2. Finally, θ k soaks up any remaining variation in the second level of the model through the variance term σ2.
A posterior density is defined on the basis of the likelihood and normal density function corresponding to the first (Eq. (1)) and second (Eq. (2)) levels of the hierarchical model. I use the product of this density function and a model transition function as the objective function of a reversible jump Monte Carlo Markov chain (MCMC) algorithm to stochastically explore the search space, fitting all possible sets of model variables to the data. The model transition kernel itself is informed through empirical Bayes estimates of the hyperparameters (e.g., π), so that regions of the search space that have strong empirical support and prior evidence are prioritized. Further details on how the variable selection algorithm works can be found in Chen and Thomas .
In the next section I present results between a more conventional method and my proposed method. The first method is an ordinary least-squares regression between the quantitative trait (i.e., Q1 or Q2) and each vector of variable scores, which I denote as the MLE method. This approach is equivalent to a conventional genome-wide association scan, testing for marginal effects. I compared this to four variations of the multivariate MCMC method. Specifically, I varied the degree of informativeness in the prior distribution by modifying the definition of the matrices A and Z. The most informative prior distribution (denoted FULL) stores both gene membership and SNP mutation type information in the A and Z matrices, respectively. In the second variant of the prior distribution (denoted Z only), I removed gene membership information so that matrix A was simply the identity matrix. Conversely, in the third variant of the prior distribution (denoted A only), I removed mutation class information so that the Z matrix included only the intercept. The last variant of the prior distribution (denoted UNINF) includes both the uninformative Z and A matrices and is equivalent to a the ridge style prior distribution (i.e., β ~ N(0, σ2 )).
For each of the MCMC analyses, I sampled 2 million realizations from the posterior distribution, retaining statistics on only the last million realizations to minimize any correlation to the initial parameter values. Run time on a 2-GHz Xeon processor was approximately 8 h. I verified that the retained statistics reached convergence by comparing their distributions across multiple chains using a nonsignificant p-value extracted from the Kolmogorov-Smirnov test.
To quantify evidence for any specific variable (either common SNP or SNP bin), I empirically estimated Bayes factors (BFs) for each variable by dividing the posterior odds by the prior odds, as described by Chen and Thomas . BFs quantify the increase in evidence for a hypothesis (in this case, inclusion of a variable into the model) in light of observed data relative to a prior hypothesis .
Posterior estimates of hyperparameters
Accuracy of estimates of β for trait Q1 between maximum-likelihood estimate (MLE) and hierarchical modeling (HM) estimates
Mean square error
Mean coverage ratea
Relative power (in relation to the maximum-likelihood estimate) of hierarchical modeling method at FDR = 0.05
Bayes factors for each causal variable under the Q1 trait model
Bayes factors for each causal variable under the Q2 trait model
In response to the missing heritability mystery plaguing the field of complex trait genetics, there is understandably massive interest in developing methods that can effectively investigate the relationship between rare variants and disease. In the methods described by Madsen and Browning  and a more recent refinement described by Price et al. , common SNPs are down-weighted on the assumption that their effect sizes are expected to be smaller than their rarer neighbors. Details on these approaches are found in Dering et al. . A one degree of freedom test is carried out at the gene level or other biological unit rather than at the SNP level. These methods are appealing because power can be increased as a result of fewer multiple hypotheses to adjust for. I took a somewhat different approach that was closer in spirit to the CMC method . Like the CMC approach, my method operates within a multivariate framework so that multiple bins within a gene can be considered; this allows one to test multiple hypotheses and to refine the signal, albeit at a statistical cost resulting from multiple comparisons. In contrast to the Madsen and Browning  and Price et al.  methods, I do not down-weight SNPs of higher frequency. In fact, I believe that if there is a shared haplotype effect among case subjects, then these common SNPs can aid in discovery of rarer neighbors through an appropriate prior specification (e.g., the A matrix in the hierarchical model). With any type of collapsing strategy, including mine, the choice of how bins are defined is arbitrary and some type of permutation procedure is necessary to alleviate an increase in type I error from overfitting the data. My Bayesian method, while also computationally expensive, does not involve permutation. Through Bayes model averaging and reporting of BFs, the problem of model overfitting is handled naturally. I previously demonstrated through simulations that the model is robust in light of multiple comparisons within the context of discovering interactions .
The results from the analyses show that in certain cases, such as when Q1 is modeled as the outcome, rare variants can make accurate estimates of effect size difficult when operating under a conventional MLE framework. Hierarchical modeling can be particularly helpful here, even if the prior distributions are not particularly informative. However, I must provide an important caveat that the method, which still operates under a standard multivariate regression framework at the first level of the model, does not appear to work particularly well when rare variants (i.e., omitting a collapsing strategy), such as singleton mutations, are directly tested; convergence problems usually emerge when the design matrix becomes numerically singular. Thus I was unable to directly evaluate the method’s performance on any one specific SNP among the extremely rare causal variants. The LASSO (least absolute shrinkage and selection operator) method , another flavor of penalized regression that provides variable selection, has recently been extended to allow one to directly test any rare variant by defining bins (e.g., genes) that relax the global penalization parameter . Although my approach is more limited in this sense, my model allows the investigator to include an arbitrary number of prior knowledge sources through columns in a Z matrix, as demonstrated in the sensitivity analyses presented in Figures 1 and 2. I found that defining a richer prior distribution on β based on biology could indeed improve power to detect variants. On closer inspection, I learned that mutation type (synonymous vs. nonsynonymous) information was more beneficial than gene-membership information for most of the SNP bins, but the opposite was true for the FLT1 gene. Thus I recommend providing as much external knowledge as possible in the model (e.g., adding additional columns in Z). Because my method is based on empirical Bayes estimates, it is robust to poor specification of the prior distribution, because this only leads to increased uncertainty (modeled in the prior variances τ2 and σ2), asymptotically reducing the prior distribution on β to a ridge prior distribution.
Clearly, there is a need to develop methods to effectively mine the data for rare variants that confer disease risk. I am optimistic that my approach is more effective than other methods in many cases, but it does have the same limitations as those shared by collapsing-style methods, particularly the strong assumption that effect sizes will point in the same direction among SNPs inside a bin. I am considering other variations of the hierarchical model that might more flexibly accommodate this type of heterogeneity. One appealing idea is to include a new stochastic layer into the algorithm that randomly groups SNPs into bins (and consequently compresses the A and Z matrices accordingly). My method currently permits one to perform a global test of association (i.e., are any rare variants associated?) by testing fixed bins. An important property of enabling flexibility in bin assignment is that one can additionally perform local tests of association (i.e., how often does this SNP appear in any bin?).
I have presented a computationally efficient Bayesian method that simultaneously provides additional power to discover rare disease variants and enhances estimation of true effect sizes. Users interested in the algorithm can download C++ source code and binaries from my website (http://www-hsc.usc.edu/~garykche/).
I would like to thank the organizers of Genetic Analysis Workshop 17, the anonymous manuscript reviewers, and the copy editor Mimi Braverman for improving the paper. The Workshop is supported by National Institutes of Health grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.