Application of Bayesian regression with singular value decomposition method in association studies for sequence data
© Kwon et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Genetic association studies usually involve a large number of single-nucleotide polymorphisms (SNPs) (k) and a relatively small sample size (n), so that k is much greater than n. Because conventional statistical approaches cannot handle multiple SNPs simultaneously when k is much greater than n, single-SNP association tests have been used to identify genes involved in a disease's pathophysiology, which creates a multiple testing problem. To evaluate the contribution of multiple SNPs simultaneously to disease traits when k is much greater than n, we developed the Bayesian regression with singular value decomposition (BRSVD) method. The method reduces the dimension of the design matrix from k to n by applying singular value decomposition to the design matrix. We fitted the model using Markov chain Monte Carlo simulation with a Gibbs sampler constructed from the posterior densities derived from conjugate prior densities. Permutation was incorporated to generate empirical p-values. We applied the BRSVD method to the sequence data provided by Genetic Analysis Workshop 17 and compared it with the single-SNP association test and the penalized regression method. The comparison indicates that the BRSVD method is a practical method for analyzing sequence data.
Association studies usually involve a large number of single-nucleotide polymorphisms (SNPs) (k) and a relatively small number of samples (n). To avoid multiple testing problems and to consider the effects of multiple SNPs jointly, investigators need statistical models that test multiple SNPs simultaneously. Because standard statistical methods are unable to analyze multiple SNPs simultaneously when k is much greater than n, Tibshirani introduced the penalized regression (PR) method as an alternative. The method shrinks the SNP coefficients and sets those with little effect to zero. In other words, only those SNPs that substantially improve prediction are kept in the model. A potential drawback of this method is that a SNP with a strong marginal effect might be removed from the model if other SNPs can explain its effect. A second drawback is that the number of SNPs retained in the model is controlled by the chosen penalization parameter. Even though the PR method does evaluate multiple SNPs simultaneously when k is much greater than n, the maximum number of SNPs that can be evaluated in the model is limited by the sample size; that is, the method usually cannot test all SNPs simultaneously in large-scale genetic association studies, such as genome-wide association studies.
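The way an L1 penalty sets small coefficients to zero can be illustrated with the soft-thresholding rule, which gives the penalized solution in the special case of an orthonormal design matrix. The sketch below is only an illustration under that assumption; the coefficient values and the penalty `lam` are made up and are not taken from the GAW17 analysis.

```python
def soft_threshold(b, lam):
    """L1-penalized estimate of one coefficient under an orthonormal design.

    b   : least-squares estimate of the coefficient
    lam : penalization parameter (larger values zero out more coefficients)
    """
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # coefficients with |b| <= lam are removed from the model


# Hypothetical least-squares estimates for four SNPs
ols = [2.5, -0.3, 0.1, -1.2]
lam = 0.5
lasso = [soft_threshold(b, lam) for b in ols]
print(lasso)  # the second and third coefficients are exactly zero
```

Raising `lam` removes more SNPs from the model, which is why the number of SNPs retained depends on the chosen penalization parameter.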
To evaluate all SNPs simultaneously in one statistical model, we introduced the Bayesian classification with singular value decomposition (BCSVD) method. The BCSVD method can be applied to a dichotomous response variable when k is much greater than n. The method achieves a massive dimension reduction by applying singular value decomposition to the design matrix in a binary probit model; it estimates the effect of SNPs through the reduced model. Selection of significant SNPs can be achieved by using the empirical p-values obtained from permutation. The BCSVD method handles small sample sizes quite well.
To analyze quantitative traits when k is much greater than n, we further developed the Bayesian regression with singular value decomposition (BRSVD) method. We applied the BRSVD method to the sequence data provided by Genetic Analysis Workshop 17 (GAW17). By comparing it with the single-SNP association test and the PR method, we show that the BRSVD method is a practical approach for analyzing sequence data.
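The dimension reduction from k to n can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes the thin decomposition X = U D Vᵀ, a reduced n × n design matrix Z = U D, and the mapping β = V θ between reduced and per-SNP coefficients. The genotype matrix, the toy sizes, and the coefficient vector `theta` are made up, and `theta` merely stands in for values that the Gibbs sampler would estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 500                     # toy sizes; the GAW17 data have n = 697, k = 24,487
X = rng.integers(0, 3, size=(n, k)).astype(float)  # made-up genotype codes 0/1/2
X -= X.mean(axis=0)                # center each SNP column

# Thin SVD: X = U @ diag(d) @ Vt with U (n x n), d (n,), Vt (n x k)
U, d, Vt = np.linalg.svd(X, full_matrices=False)

Z = U * d                          # reduced n x n design matrix, Z = U diag(d)
theta = rng.normal(size=n)         # placeholder for coefficients of the reduced model
beta = Vt.T @ theta                # map back to k per-SNP effects

# The reduced model and the full model give the same fitted values
assert np.allclose(Z @ theta, X @ beta)
print(Z.shape, beta.shape)
```

Because the rows of `Vt` are orthonormal, regressing the trait on the n columns of `Z` carries the same fitted values as regressing on all k SNPs, so only n coefficients need to be sampled.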
Study sample and association analysis
We used the unrelated individuals data distributed by GAW17, which includes 697 individuals, 24,487 SNPs, and 3 covariates (sex, age, and smoking status). We analyzed the first 10 replicates of phenotypes for quantitative risk factor Q1. We first performed the single-SNP association test using the simple linear regression model option in PLINK. Second, we applied the PR method with L1 penalty introduced by Tibshirani using the R package monomvn. We evaluated SNP association with Q1 within the maximum number of SNPs allowed by the package in each step, which is min(k, n − intercept). Because the package does not provide p-values, we used the same permutation technique as in the BRSVD method to obtain empirical p-values. Third, we implemented the BRSVD method. To define significant SNPs for each method, we considered the following statistical models: quantitative risk factor Q1 versus the single SNP and the three covariates for the single-SNP association test; quantitative risk factor Q1 versus the maximum number of SNPs allowed by the package plus the three covariates for the PR method; and quantitative risk factor Q1 versus all SNPs (24,487) and the three covariates for the BRSVD method. All SNPs identified as significant for each model were compared to the 39 SNPs listed in the answer sheet distributed by GAW17. The analyses were run for each of the first 10 replicates, and the average of the 10 replicates was summarized (see Results section).
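An empirical p-value from permutation can be sketched as follows. The text does not specify the test statistic or the number of permutations, so both are assumptions here: the absolute Pearson correlation is used as a stand-in statistic, `n_perm=999` is arbitrary, and the function names and simulated data are hypothetical.

```python
import random


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in test statistic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def empirical_p(x, y, n_perm=999, seed=0):
    """Share of permuted statistics at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs_corr(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any genotype-trait link
        if abs_corr(x, y_perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # +1 keeps the estimate above zero


# Simulated example: one SNP that affects the trait
sim = random.Random(1)
genotype = [sim.choice([0, 1, 2]) for _ in range(60)]
trait = [g + sim.gauss(0, 0.5) for g in genotype]
p_assoc = empirical_p(genotype, trait)
print(p_assoc)
```

Shuffling the trait values keeps the genotype data intact while removing any true association, so the permuted statistics approximate the null distribution without distributional assumptions.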
Results and discussion
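Summary of validation of the three methods

| Method | E′ | IE′ | TP | FN | FP | TN | Sen | Spe | PPV | NPV |
|---|---|---|---|---|---|---|---|---|---|---|
| Single-SNP association test | 50 | 24,437 | 2 | 37 | 48 | 24,400 | 0.051 | 0.998 | 0.04 | 0.9984 |
| PR | 16 | 24,471 | 3 | 36 | 13 | 24,435 | 0.077 | 0.9995 | 0.187 | 0.9985 |
| BRSVD | 45 | 24,442 | 9 | 30 | 36 | 24,412 | 0.231 | 0.9985 | 0.2 | 0.9988 |

E (= 39), SNPs listed in the GAW17 answer sheet; IE (= 24,448), SNPs not listed in the answer sheet; E′, SNPs identified as significant by the method; IE′, SNPs not identified as significant by the method; TP, true positives; FN, false negatives; FP, false positives; TN, true negatives; Sen, sensitivity; Spe, specificity; PPV, positive predictive value; NPV, negative predictive value.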
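The summary measures follow directly from the four counts reported for each method. The short sketch below recomputes them using the standard definitions; the function name is hypothetical, and the assignment of counts to methods assumes they are reported in the order single-SNP, PR, BRSVD, which is the order in which the text names the methods.

```python
def validation_metrics(tp, fn, fp, tn):
    """Standard summary measures computed from a 2 x 2 validation table."""
    return {
        "Sen": tp / (tp + fn),  # sensitivity: share of true SNPs detected
        "Spe": tn / (tn + fp),  # specificity: share of null SNPs not flagged
        "PPV": tp / (tp + fp),  # share of flagged SNPs that are true
        "NPV": tn / (tn + fn),  # share of unflagged SNPs that are null
    }


# (TP, FN, FP, TN) as reported for each method
counts = {
    "single-SNP": (2, 37, 48, 24400),
    "PR": (3, 36, 13, 24435),
    "BRSVD": (9, 30, 36, 24412),
}

for method, c in counts.items():
    metrics = validation_metrics(*c)
    print(method, {name: round(value, 4) for name, value in metrics.items()})
```

With only 39 true SNPs among 24,487, specificity and NPV are close to 1 for every method, so sensitivity and PPV are the measures that separate the three approaches.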
We used three analysis methods (the single-SNP association analysis method implemented in PLINK, as widely used in genome-wide association studies; the PR method; and the BRSVD method) to identify SNPs that significantly influence the quantitative trait Q1 in the unrelated-individuals sample provided by GAW17. Both the PR and BRSVD methods outperformed the single-SNP association analysis method, suggesting that evaluating multiple SNPs simultaneously not only reduces the multiple testing problem but also provides more power than single-SNP association testing in genetic association studies. The BRSVD method had a sensitivity about three times as high as that of the PR method (0.231 versus 0.077), suggesting that the BRSVD method performs better than the PR method. Another advantage of the BRSVD method is that it requires no tuning parameters, whereas the PR method requires specification of the penalization parameter that controls the number of variables selected. Moreover, the BRSVD method takes less computing time than the PR method: for the association analysis of Q1 in the GAW17 unrelated individuals data, the PR method took about 1.5 times as long to run as the BRSVD method. With all factors considered, we believe that the BRSVD method is a good choice for large-scale genetic association studies of quantitative traits.
This research was partly supported by the Inflammatory Bowel Disease Program Project Grant DK046763, General Clinical Research Center (GCRC) grant RR00425-30, and Diabetes Endocrinology Research Center (DERC) grant NIDDK DK 063459. The Genetic Analysis Workshop is supported by National Institutes of Health grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
- Tibshirani R: Regression shrinkage and selection via the lasso. J Roy Stat Soc B. 1996, 58: 267-288.
- Kwon S, Cui J, Rhodes S, Tsiang D, Rotter J, Guo X: Application of Bayesian classification with singular value decomposition method in genome-wide association studies. BMC Proc. 2009, 3 (suppl 7): S9. 10.1186/1753-6561-3-s7-s9.
- Graybill F: Theory and application of the linear model. 1976, Belmont, CA, Duxbury Press
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, et al: PLINK: a tool set for whole-genome association and population-based linkage analysis. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
- Gramacy RB: Monomvn: estimation for multivariate normal and student-t data with monotone missingness. R package version 1.8-3. 2010, [http://CRAN.R-project.org/package=monomvn]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.