Volume 5 Supplement 9
Genetic Analysis Workshop 17: Unraveling Human Exome Data
Evaluation of a LASSO regression approach on the unrelated samples of Genetic Analysis Workshop 17
- Wei Guo^{1, 2},
- Robert C Elston^{1} and
- Xiaofeng Zhu^{1}
https://doi.org/10.1186/1753-6561-5-S9-S12
© Guo et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Abstract
The Genetic Analysis Workshop 17 data we used comprise 697 unrelated individuals genotyped at 24,487 single-nucleotide polymorphisms (SNPs) from a mini-exome scan, using real sequence data for 3,205 genes annotated by the 1000 Genomes Project and simulated phenotypes. We studied 200 sets of simulated phenotypes of trait Q2. An important feature of this data set is that most SNPs are rare, with 87% of the SNPs having a minor allele frequency less than 0.05. For rare SNP detection, in this study we performed a least absolute shrinkage and selection operator (LASSO) regression and F tests at the gene level and calculated the generalized degrees of freedom to avoid any selection bias. For comparison, we also carried out linear regression and the collapsing method, which sums the rare SNPs, modified for a quantitative trait and with two different allele frequency thresholds. The aim of this paper is to evaluate these four approaches in this mini-exome data and compare their performance in terms of power and false positive rates. In most situations the LASSO approach is more powerful than linear regression and collapsing methods. We also note the difficulty in determining the optimal threshold for the collapsing method and the significant role that linkage disequilibrium plays in detecting rare causal SNPs. If a rare causal SNP is in strong linkage disequilibrium with a common marker in the same gene, power will be much improved.
Background
With the rapid development of sequencing technologies, an increasing number of single-nucleotide polymorphisms (SNPs) have become available; in particular, most rare variants can now be identified by next-generation sequencing. However, detecting the rare variants that contribute to phenotypic variation remains a major challenge. Current approaches for testing rare variants include grouping the rare variants based on a threshold of the minor allele frequency (MAF) [1], summing the rare variants weighted by the allele frequencies in control subjects [2, 3], and clustering rare haplotypes using family data [4]. Another approach is to use a penalized regression, which can avoid the singular design matrix that may result from rare variants by adding a penalty, such as the least absolute shrinkage and selection operator (LASSO) and ridge penalties [5, 6]. In this analysis, we evaluated the LASSO regression, linear regression, and the collapsing methods by comparing their power and false-positive rates. Based on the results, we recommend the LASSO approach to detect rare SNPs.
Methods
Data checking
In the Genetic Analysis Workshop 17 (GAW17) simulated data set, there are no missing genotype data. Among all the 24,487 SNPs, 91% have a MAF less than 0.1, 87% have a MAF less than 0.05, and 75% have a MAF less than 0.01. Moreover, 39% of the SNPs have a MAF less than 0.001, which leads to 9,433 SNPs being singletons among 697 unrelated individuals. Owing to the rareness of the variants, we do not examine Hardy-Weinberg disequilibrium as a quality control procedure in this study. Thus we include all SNPs and all individuals for the association analysis.
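As a quick check on these figures, the following sketch computes the MAF of a singleton among 697 diploid individuals and the fraction of SNPs that are singletons. The function name `minor_allele_frequency` and the 0/1/2 genotype coding are assumptions made for illustration; they are not taken from the GAW17 files.

```python
def minor_allele_frequency(genotypes):
    """MAF of one SNP from genotypes coded as 0, 1, or 2 copies of an allele."""
    n_alleles = 2 * len(genotypes)        # diploid: two alleles per person
    freq = sum(genotypes) / n_alleles     # frequency of the counted allele
    return min(freq, 1.0 - freq)          # report the rarer of the two alleles


N_INDIVIDUALS = 697
N_SNPS = 24_487
N_SINGLETONS = 9_433

# A singleton: exactly one copy of the allele in the whole sample.
singleton = [1] + [0] * (N_INDIVIDUALS - 1)
singleton_maf = minor_allele_frequency(singleton)   # 1 / (2 * 697)
singleton_fraction = N_SINGLETONS / N_SNPS

print(f"singleton MAF      = {singleton_maf:.6f}")
print(f"singleton fraction = {singleton_fraction:.3f}")
```

A singleton has a MAF of 1/1,394, which is below 0.001, so every singleton falls in the rarest MAF category reported above.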
LASSO regression
To deal with the singular matrix in linear regression caused by the rare variants, we adopt a statistical method that effectively shrinks the coefficients of unassociated SNPs and reduces the variance of the estimated regression coefficients. Here, we apply the LASSO penalty [7] to implement this regression analysis.
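For a quantitative trait, the LASSO estimate in the standard form of [7] can be written as below. The symbols y_i (trait value of individual i), x_{ij} (genotype of individual i at SNP j), β_0, and β_j are notation assumed here, and software such as glmnet may scale the two terms by constant factors.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
\left\{ \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{L} x_{ij}\beta_j \Bigr)^{2}
+ \lambda \sum_{j=1}^{L} \lvert \beta_j \rvert \right\}
```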
where n is the number of individuals, L is the number of SNP sites, and λ is the tuning parameter. The LASSO regression was implemented in the R package glmnet.
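To make the shrinkage concrete, here is a minimal pure-Python sketch of cyclic coordinate descent for a LASSO fit within one gene. It is a stand-in for illustration, not the glmnet implementation used in the analysis; the function names, the 1/(2n) scaling of the residual sum of squares, and the toy data are all assumptions.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * RSS + lam * sum|beta_j| by cyclic coordinate descent.

    X is a list of n rows (each a list of L genotype codes) and y a list of
    n trait values. Columns are centred but not standardized.
    Returns (intercept, beta).
    """
    n, L = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(L)]
    y_mean = sum(y) / n
    # Centre columns and trait so the intercept can be recovered afterwards.
    Xc = [[row[j] - col_mean[j] for j in range(L)] for row in X]
    resid = [yi - y_mean for yi in y]          # residual when beta = 0
    col_ss = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(L)]
    beta = [0.0] * L
    for _ in range(n_iter):
        for j in range(L):
            if col_ss[j] == 0.0:               # monomorphic SNP: nothing to fit
                continue
            # Covariance of column j with the partial residual (beta_j removed).
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n
            rho += col_ss[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_ss[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_bj
    intercept = y_mean - sum(col_mean[j] * beta[j] for j in range(L))
    return intercept, beta


# Toy gene with one common SNP (column 0) and one rare SNP (column 1).
X = [[0, 0], [1, 0], [2, 0], [0, 1], [1, 0], [2, 0]]
y = [0.1, 2.0, 3.9, 0.2, 2.1, 4.0]
b0, beta = lasso_coordinate_descent(X, y, lam=0.5)
print(b0, beta)
```

Increasing `lam` drives more coefficients exactly to zero, which is the selection behavior that makes the number of nonzero coefficients a poor measure of model complexity.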
Gene-level association tests
which asymptotically follows the F distribution with (GDF(M) – 1, n – GDF(M)) degrees of freedom. The P-values for each gene are obtained from the F distribution given in Eq. (3).
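The degrees of freedom stated above are consistent with an F ratio of the usual form, comparing the residual sum of squares of an intercept-only model with that of the selected gene model M. The sketch below assumes that form; the function name `gene_level_f`, the argument names, and the toy numbers are hypothetical and are not taken from the original analysis.

```python
def gene_level_f(y, fitted, gdf):
    """F ratio comparing a fitted gene model with the intercept-only model.

    y       observed trait values
    fitted  fitted values from the gene-level model M
    gdf     (generalized) degrees of freedom of M, intercept included
    Returns (F, numerator df, denominator df).
    """
    n = len(y)
    y_mean = sum(y) / n
    rss_null = sum((yi - y_mean) ** 2 for yi in y)               # intercept only
    rss_model = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    df_num = gdf - 1.0
    df_den = n - gdf
    if df_num <= 0 or df_den <= 0 or rss_model <= 0:
        raise ValueError("degrees of freedom and RSS must be positive")
    f_stat = ((rss_null - rss_model) / df_num) / (rss_model / df_den)
    return f_stat, df_num, df_den


# Toy example: six individuals, a model using three degrees of freedom.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fitted = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
f_stat, df_num, df_den = gene_level_f(y, fitted, gdf=3.0)
print(f_stat, df_num, df_den)
```

A P-value would then be the upper tail of the F distribution at these degrees of freedom, for example `scipy.stats.f.sf(f_stat, df_num, df_den)` if SciPy is available.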
GDF and λ
In classical linear models, the number of covariates is fixed; therefore the number of degrees of freedom is equal to the number of covariates. However, the situation is different in a LASSO regression: the number of nonzero coefficients can no longer accurately measure the model complexity. For a LASSO regression, which involves variable selection, the generalized degrees of freedom (GDF) was introduced [8] to correct for selection bias and to accurately measure the degrees of freedom of the obtained model. The GDF of a model is defined as the average sensitivity of the fitted values to a small change in the observed values. The parametric bootstrapping method is used to estimate the GDF [8, 9].
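The definition suggests a simple Monte Carlo estimate: perturb the observed trait values with small Gaussian noise, refit, and measure how strongly each fitted value responds to its own perturbation. The sketch below is one way to do this, in the spirit of the perturbation algorithm of [8]; it is not the authors' code, and `estimate_gdf`, `fit`, `tau`, and `n_perturb` are names and settings assumed for illustration.

```python
import random


def estimate_gdf(fit, y, tau=0.5, n_perturb=100, seed=0):
    """Monte Carlo estimate of the generalized degrees of freedom.

    fit(y) must return the list of fitted values for a trait vector y.
    tau is the standard deviation of the Gaussian perturbations.
    """
    rng = random.Random(seed)
    n = len(y)
    deltas, fits = [], []
    for _ in range(n_perturb):
        delta = [rng.gauss(0.0, tau) for _ in range(n)]
        deltas.append(delta)
        fits.append(fit([yi + di for yi, di in zip(y, delta)]))
    gdf = 0.0
    for i in range(n):
        d = [deltas[t][i] for t in range(n_perturb)]
        m = [fits[t][i] for t in range(n_perturb)]
        d_bar = sum(d) / n_perturb
        m_bar = sum(m) / n_perturb
        sxx = sum((dt - d_bar) ** 2 for dt in d)
        sxy = sum((dt - d_bar) * (mt - m_bar) for dt, mt in zip(d, m))
        gdf += sxy / sxx      # regression slope: sensitivity of fit i to y_i
    return gdf


def mean_fit(v):
    """Intercept-only model: every fitted value is the sample mean."""
    m = sum(v) / len(v)
    return [m] * len(v)


y = [float(i) for i in range(20)]
print(estimate_gdf(mean_fit, y))   # should be close to 1 for this model
```

Any fitting procedure can be passed in as `fit`, including one that selects λ and then fits the LASSO, so the estimate accounts for the selection step as well as the final model.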
Thus the tuning parameter λ is selected to be the one that minimizes the extended AIC value.
Alternative methods: F_{linear} and the combined multivariate and collapsing method for quantitative traits
As a comparison, we also carry out the F test based on general linear regression for each gene, which we call F_{linear}. A second alternative method is the combined multivariate and collapsing (CMC) method [1], which is a unified approach that combines collapsing and multivariate tests for a binary trait. We modify the CMC method for the quantitative trait: markers are divided into rare and common subgroups on the basis of a predefined allele frequency threshold (δ), and within the rare subgroup an individual is coded 1 if a rare allele is present at any of the variant sites and 0 otherwise. After this collapsing, we compute an F test for association. We call this approach QCMC(δ) for convenience, and we consider δ = 0.01 and 0.05 in this paper.
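The collapsing step can be sketched as follows. The function `qcmc_design`, the choice to treat MAF strictly below δ as rare, and the toy genotypes and MAF values are assumptions for illustration; the MAFs are supplied directly rather than estimated from the four toy individuals.

```python
def qcmc_design(genotypes, mafs, delta):
    """Build the covariate matrix for a QCMC(delta)-style test on one gene.

    genotypes  list of rows, one per individual, coded as minor-allele counts
    mafs       minor allele frequency of each SNP (same order as the columns)
    delta      allele frequency threshold separating rare from common SNPs
    Returns (design, common_indices, rare_indices). Common SNPs keep their
    own columns; all rare SNPs are collapsed into one 0/1 indicator column.
    """
    rare = [j for j, f in enumerate(mafs) if f < delta]      # assumed: strict <
    common = [j for j, f in enumerate(mafs) if f >= delta]
    design = []
    for row in genotypes:
        new_row = [row[j] for j in common]
        if rare:
            # 1 if the individual carries a rare allele at any rare site.
            new_row.append(1 if any(row[j] > 0 for j in rare) else 0)
        design.append(new_row)
    return design, common, rare


genotypes = [[0, 1, 0], [1, 0, 0], [2, 0, 1], [0, 0, 0]]
mafs = [0.30, 0.004, 0.02]

design_01, common_01, rare_01 = qcmc_design(genotypes, mafs, delta=0.01)
design_05, common_05, rare_05 = qcmc_design(genotypes, mafs, delta=0.05)
print(design_01)
print(design_05)
```

The same gene yields different covariates, and hence different degrees of freedom for the F test, under the two thresholds, which is why the choice of δ matters.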
Results
We evaluated the power and false-positive rates of the F_{LASSO}, F_{linear}, QCMC(0.01), and QCMC(0.05) tests based on the 200 replicates of the GAW17 data set. The significance level of the tests was first set to 1.6 × 10^{–5}, which is the Bonferroni-corrected significance level of 0.05 adjusted by the number of genes, that is, 0.05/3,205. However, because of the small sample size of the GAW17 data set, all four tests had poor power at this level and could not be meaningfully compared. Therefore we also used the less stringent significance level of 0.01 for method comparison.
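The Bonferroni threshold quoted above follows from a one-line calculation, shown here with variable names chosen for illustration.

```python
N_GENES = 3_205          # number of gene-level tests
FAMILY_ALPHA = 0.05      # desired family-wise significance level

per_gene_alpha = FAMILY_ALPHA / N_GENES
print(f"{per_gene_alpha:.2e}")   # 1.56e-05, reported as 1.6 x 10^-5
```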
True variance contributions of 13 causal genes given in the GAW17 answers
Gene | VNN3 | VNN1 | SREBF1 | BCHE | VLDLR | SIRT1 | PDGFD | LPL | PLAT | RARB | GCKR | VWF | INSIG1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Number of SNPs | 15 | 7 | 24 | 29 | 27 | 24 | 11 | 20 | 29 | 11 | 1 | 8 | 5 |
Number of causal SNPs | 7 | 2 | 10 | 13 | 8 | 9 | 4 | 3 | 8 | 2 | 1 | 2 | 3 |
Average MAF of the causal SNPs | 0.0206 | 0.0882 | 0.0022 | 0.0010 | 0.0013 | 0.0012 | 0.0029 | 0.0060 | 0.0021 | 0.0029 | 0.0122 | 0.0032 | 0.0007 |
Variance contribution | 0.0239 | 0.0193 | 0.0125 | 0.0115 | 0.0111 | 0.0100 | 0.0098 | 0.0097 | 0.0090 | 0.0048 | 0.0034 | 0.0021 | 0.0002 |
False-positive rates at the significance levels of 0.01 and 1.6 × 10^{–5} (the Bonferroni-corrected significance level of 0.05)
Significance level | F_{LASSO} | F_{linear} | QCMC(0.01) | QCMC(0.05) |
---|---|---|---|---|
0.01 | 0.02793 | 0.02094 | 0.02195 | 0.02233 |
1.60 × 10^{–5} | 0.00016 | 0.00011 | 0.00011 | 0.00013 |
Discussion and conclusions
In this study, we used the LASSO regression and calculated the GDF for the F tests to avoid selection bias. This method requires a parametric bootstrap to obtain the GDF; therefore it is computationally slower than the linear regression and collapsing methods. In general, the F_{LASSO} test is more powerful than the other methods.
Linear regression is the least powerful approach because of the large number of rare SNPs and because nothing is done to reduce the large number of degrees of freedom. The collapsing test requires a predefined allele frequency threshold for grouping rare SNPs, and it is difficult to choose this threshold optimally because in reality the true disease model is never known. As an extreme example, the QCMC(0.001) test was identical to the linear regression approach, and the QCMC(0.1) test had no power at all in these data. Therefore, from this point of view, we recommend the LASSO approach for detecting rare SNPs.
Based on the power comparison of the SIRT1 and VLDLR genes, we observed some evidence that linkage disequilibrium played a significant role in detecting rare causal SNPs. If a rare causal SNP is in strong linkage disequilibrium with a common marker in the same gene, the power to detect it is much improved. It would be of interest to further investigate how linkage disequilibrium between common noncausal markers and rare causal SNPs affects the power to detect rare causal SNPs, and hence to determine a more powerful test.
Declarations
Acknowledgments
This work was supported by National Institutes of Health (NIH) grants HL074166, HL086718 from the National Heart, Lung, and Blood Institute, HG003054 from the National Human Genome Research Institute, RR03655 from the National Center for Research Resources, and P30 CAD43703 from the National Cancer Institute. The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. This work was also partially supported by National Natural Science Foundation of China grant 10901031.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
References
- Li B, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83: 311-321. 10.1016/j.ajhg.2008.06.024.
- Madsen BE, Browning SR: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009, 5: e1000384. 10.1371/journal.pgen.1000384.
- Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR: Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010, 86: 832-838. 10.1016/j.ajhg.2010.04.005.
- Zhu X, Feng T, Li Y, Lu Q, Elston RC: Detecting rare variants for complex traits using family and unrelated data. Genet Epidemiol. 2010, 34: 171-187. 10.1002/gepi.20449.
- Guo W, Lin S: Generalized linear modeling with regularization for detecting common disease rare haplotype association. Genet Epidemiol. 2009, 33: 308-316. 10.1002/gepi.20382.
- Zhou H, Sehl ME, Sinsheimer JS, Lange K: Association screening of common and rare genetic variants by penalized regression. Bioinformatics. 2010, 26: 2375-2382. 10.1093/bioinformatics/btq448.
- Tibshirani R: Regression shrinkage and selection via the LASSO. J R Stat Soc B. 1996, 58: 267-288.
- Ye J: On measuring and correcting the effects of data mining and model selection. J Am Stat Assoc. 1998, 93: 120-131. 10.2307/2669609.
- Li Y, Sung WK, Liu JJ: Association mapping via regularized regression analysis of single-nucleotide-polymorphism haplotypes in variable-sized sliding windows. Am J Hum Genet. 2007, 80: 705-715. 10.1086/513205.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.