Volume 5 Supplement 9
Genetic Analysis Workshop 17: Unraveling Human Exome Data
Identification of genes and variants associated with quantitative traits using Bayesian factor screening
- Kith Pradhan^{1},
- Seungtai Chris Yoon^{2},
- Tao Wang^{1} and
- Kenny Ye^{1}Email author
DOI: 10.1186/1753-6561-5-S9-S4
© Pradhan et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Abstract
We propose a factor-screening method based on a Bayesian model selection framework and apply it to Genetic Analysis Workshop 17 simulated data with unrelated individuals to identify genes and SNP variants associated with the quantitative trait Q1. A Metropolis-Hasting algorithm is implemented to generate a posterior distribution in a restricted model space and thus the marginal posterior distribution of each variant. Our framework provides flexibility to make inferences on either individual variants or genes. We obtained results for 10 simulated data sets. Our methods are able to identify FTP1 and KDR, two genes that are associated with Q1 in a majority of replicates.
Background
Here, we apply this method to the Genetic Analysis Workshop 17 (GAW17) simulated data, of which the genotypes of 697 individuals at 3,205 genes are based on exome sequencing data of the 1000 Genomes Project [2]. In principle, the general framework described applies to any class of parametric or nonparametric models, so long as the model posterior probability is well defined and can be computed. However, here we consider only linear models, and we apply our method to analyze quantitative trait Q1 of the GAW17 data set. This trait was simulated to be associated with 39 single-nucleotide polymorphisms (SNPs) in 9 genes. In the first part of our analysis, we treat each SNP as a factor to evaluate the association at the SNP level. In the second part of our analysis, we treat each gene as a factor and evaluate the association at the gene level. Because each gene contains a different number of SNPs, unequal prior probabilities are assigned to candidate models to penalize models with more parameters based on the Bayesian information criterion (BIC) [3].
Methods
Use SNPs as factors
Note that is the sum residual squared when we regress ỹ on ; it can be computed as , and the other part of the posterior distribution can be computed as , where is the QR decomposition of . The details of using QR decomposition for least-squares estimates of linear models are given by Seber [5].
However, even with efficient evaluation of the model posterior distribution, the model space contains p choose m models and is usually too big to be exhaustively evaluated. For example, with p = 24,478 variants, there are almost 2.5 trillion models when m = 3. If 1 million models are evaluated in a second, it will take almost a month to evaluate all 2.5 trillion models. Therefore a Monte Carlo Markov chain (MCMC) method is used to obtain the posterior probability. We used a simple Metropolis-Hasting algorithm, which we briefly described in what follows. Let γ^{(}^{ j }^{)} be the model at the jth step. At the (j + 1)th step, a model γ* is randomly selected by replacing a random active factor (i.e., γ_{ i }^{(}^{ j }^{)} = 1) in γ^{(}^{ j }^{)} with a random inactive factor (i.e., γ_{ i }^{(}^{ j }^{)} = 0) in γ^{(}^{ j }^{)}. Set γ^{(}^{ j }^{+1)} = γ* with probability min{p(γ* | y) / p(γ^{(}^{ j }^{)} | y),1}; otherwise, set γ^{(}^{ j }^{+1)} = γ^{(}^{ j }^{)}. To estimate the marginal posterior distribution of each factor, we simply count the number of times the factor appears in the chain. To determine if the MCMCs have reached convergence, we usually compare two chains run with different seeds given to the pseudo–random number generator.
Given a vector of active factors (genotypes variants) γ, each column of X is the number of minor alleles of each (active) variant in the 697 subjects, centralized to mean 0. Note that all models with a fixed number of active factors have the same number of parameters. This avoids the complication of comparing models with different numbers of unknown parameters; thus all candidate models have the same prior distribution. Our framework can consider factor-factor interactions by including interaction terms in the model matrix X, but we do not include them here.
The effects of the three covariates, Smoke, Age, and Sex, are removed by taking a regression of the quantitative traits to the covariates. The residuals are then used as the quantitative response to be associated with genetic variants. Because Q1 is simulated with interaction effects between variants in the KDR gene and smoking, by ignoring such interactions, our power for detecting the KDR gene would be lower.
Use genes as factors
Using SNP variants as factors does not take into consideration genes on which variants are sitting. Statistical inference on a gene can be made indirectly by averaging the marginal posterior distribution of all variants within it. Alternatively, we can treat each gene as a factor in our factor screen framework by forcing all variants in a gene in and out of a linear regression model together. The inferences are then made directly on each gene, but no inference is made on individual SNPs.
More specifically, in the linear model y = Xβ + ε, X is a matrix derived from genotypes of all SNP variants of m “active” genes. The number of columns of X is the total number of SNP variants of the m genes. A binary vector γ = (γ_{1}, γ_{2}, …, γ_{ p }) now represents the active or inactive state of the 3,205 genes, but the computation of the posterior distribution follows the same formula as above. However, each gene has a different number of SNP variants; hence the column of X varies even though the number of active genes is fixed. Therefore the total number of parameters of each candidate model is not fixed. If we still assign each candidate model with equal prior probability, the model selection procedure tends to bias toward genes with a greater number of variants. In an effort to correct for such bias, we assign prior probabilities of candidate models based on the BIC. That is, the prior probability of a model is proportional to e^{–}^{ k }^{/2}, where k is the total number of parameters.
Results
Running a MCMC procedure is computationally expensive. Therefore we run our analysis on only the first 10 replicates of the 200 replicates simulated by the GAW17 data set. However, we believe that the results of these 10 replicates are sufficient to evaluate our method.
Use SNP as factors
SNPs with high marginal posterior probabilities
Appearances in top 20 SNPs among 10 replicates | SNPs (genes), m=10 |
---|---|
9 | C13S523 (FLT1) |
8 | C13S522 (FLT1) |
5 | C13S431 (FLT1) |
4 | C4S1884 (KDR) |
2 | C4S1878 (KDR), C6S2981 (VEGFA), 8 SNPs not associated with Q1 |
1 | C13S524 (FLT1), C1S6533 (ARNT), C4S1890 (KDR), 151 SNPs not associated with Q1 |
SNPs with high marginal posterior probabilities
SNP | Gene | Posterior probability > 0.5 | Posterior probability > 0.1 |
---|---|---|---|
C13S431 | FLT1 | 1 | 4 |
C13S522 | FLT1 | 6 | 8 |
C13S523 | FLT1 | 9 | 9 |
C13S534 | FLT1 | 0 | 1 |
C4S1878 | KDR | 0 | 2 |
C4S1884 | KDR | 2 | 4 |
C6S2981 | VEGFA | 0 | 1 |
Other Q1 | 0 | 0 | |
Non-Q1 | 3 | 36 | |
False-positive discovery rate | 3/21 | 36/65 |
Average marginal posterior/prior probability ratio per variants for Q1-associated genes
Gene | Posterior/prior ratio (mean of 10 replicates) |
---|---|
ARBT | 1.79 |
ELAVL4 | 1.31 |
FLT1 | 113 |
FLT4 | 0.649 |
HIF1A | 0.970 |
H1F3A | 0.483 |
KDR | 50.5 |
VEGFA | 11.6 |
VEGFC | 2.50 |
Use genes as factors
For the first 10 replicates, we analyze Q1 with m = 3 and m = 6. We run 10,000 iterations of MCMCs after a burn-in period of 1,000. A comparison between the two independent chains for the first replicate suggests that the marginal posterior probabilities of the individual factors converge at such length. For the prior probability of β, we set λ = 1, as previously, and set Σ as an identity matrix. We also run our analysis with a Σ that imposes a slight correlation (at 0.1) among the effects of variants within a gene. The results are similar and hence are not presented here.
Genes with high marginal posterior probabilities
Appearances in the top 10 genes among 10 replicates | m = 3 | m = 6 |
---|---|---|
10 | FLT1 | FLT1 |
6 | − | KDR |
5 | KDR | − |
2 | Two genes not associated with Q1 | Four genes not associated with Q1 |
1 | ARNT, HIF1A, 79 genes not associated with Q1 | HIF1A, VEGFA, 74 genes not associated with Q1 |
Marginal posterior probability for Q1-associated genes
Gene | Posterior probability (mean of 10 replicates) | Posterior probability > 0.5 | Posterior probability > 0.1 | |||
---|---|---|---|---|---|---|
m = 3 | m = 6 | m = 3 | m = 6 | m = 3 | m = 6 | |
ARBT | 0.00521 | 0.00175 | 0 | 0 | 0 | 0 |
ELAVL4 | 0.00001 | 0.00065 | 0 | 0 | 0 | 0 |
FLT1 | 0.98967 | 0.96252 | 10 | 10 | 10 | 10 |
FLT4 | 0 | 0.00053 | 0 | 0 | 0 | 0 |
HIF1A | 0.00596 | 0.00704 | 0 | 0 | 0 | 0 |
H1F3A | 0 | 0 | 0 | 0 | 0 | 0 |
KDR | 0.32353 | 0.40953 | 4 | 3 | 4 | 6 |
VEGFA | 0.0006 | 0.0434 | 0 | 0 | 0 | 1 |
VEGFC | 0.00129 | 0.00256 | 0 | 0 | 0 | 0 |
Non-Q1 | 5 | 7 | 18 | 51 | ||
False-positive discovery rate | 5/19 | 7/20 | 18/32 | 51/68 |
Conclusions
We presented a simple method of Bayesian factor screening and applied it to analyze Q1, a quantitative trait simulated to be associated with 39 SNP variants in 9 genes. We applied two implementations: One treats each SNP as a factor; the other treats each gene as a factor. Our computational framework is simple, straightforward, and efficient. The prior probabilities require few assumptions and are as noninformative as possible. Based on our experience, the results are not sensitive to the choice of λ and Σ in the prior probability and are not sensitive to the choice of m when SNPs are treated as factors. No biological information was used, except in the second implementation, where we grouped the SNPs of the same gene together.
Our method is quite effective. In the gene-level analysis, when all SNPs in a gene are treated as a group, we are able to identify FTL1 consistently as the top candidate and we find KDR about half of the time. Beyond these two genes, we are not able to identify other Q1 genes without dramatically increasing our false-positive discovery rate. Our prior probability assignment penalizes models with more parameters and tends to overly favor genes with fewer variants; it requires further adjustment. The false-positive discovery rate tends to increase as m increases. In the SNP-level analysis, we have a good chance of identifying most Q1-associated SNPs with MAF > 0.01, at a reasonable false-positive discovery rate. The rarer variants are difficult to identify, even allowing for a high false-positive discovery rate. It might be unreasonable to expect variants with a low MAF to be identified in a sample size of 697 and a low false-positive discovery rate when there are almost 25,000 candidate variants, unless some degree of biological knowledge is used. The inference on genes derived from SNP-level analysis seems reasonable effective, finding FLT1 and KDR frequently. When we take a fixed number of top factors for further investigations, the false discovery rate tends to be high. However, multistage designs can be used to gradually weed out false positives that pass the early rounds, because the chance for them to luck out twice is very low.
Declarations
Acknowledgments
KP is supported by the National Institutes of Health (NIH) grant 5U01HG005209. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575. The authors would like to thank all participants of Group 1 of GAW17 for their constructive comments and discussions.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
Authors’ Affiliations
References
- Yoon S: Bayesian factor screening. 2006, Ph.D. dissertation, State University of New York at Stony BrookGoogle Scholar
- 1000 Genomes Project Consortium, Durbin RM, Abecasis GR, Altshuler DL, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA, et al: A map of human genome variation from population-scale sequencing. Nature. 2010, 467: 1061-1073. 10.1038/nature09534.View ArticleGoogle Scholar
- Schwarz , Gideon E: Estimating the dimension of a model. Ann Stat. 1978, 6: 461-464. 10.1214/aos/1176344136.View ArticleGoogle Scholar
- Brown PJ, Vannuci M: Multivariate Bayesian variable selection and prediction. J R Stat Soc B. 1998, 60: 627-641. 10.1111/1467-9868.00144.View ArticleGoogle Scholar
- Seber GAF: Multivariate observations. 2004, Wiley, New York, 497-498.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.