Volume 5 Supplement 9
Detecting functional rare variants by collapsing and incorporating functional annotation in Genetic Analysis Workshop 17 mini-exome data
© Yan et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Association studies using tag SNPs have been successful in detecting disease-associated common variants. However, common variants, with rare exceptions, explain at most 5–10% of the heritability resulting from genetic factors, which has motivated the common disease/rare variant hypothesis. Indeed, recent studies using sequencing technologies have demonstrated that common diseases can be due to rare variants that could not be systematically studied earlier. Unfortunately, methods designed for common variants are not optimal when applied to rare variants. To identify rare variants that affect disease risk, several investigators have designed new approaches based on the idea of collapsing different rare variants inside the same genomic block (e.g., the same gene or pathway) to enrich the signal. Here, we consider three different collapsing methods in the multimarker regression model and compare their performance on the Genetic Analysis Workshop 17 data using the consistency of results across different simulations and the cross-validation prediction error rate. The comparison shows that the proportion collapsing method seems to outperform the other two methods and can find both truly associated rare and common variants. We also explore one way of incorporating functional annotations for the variants: nonsynonymous and synonymous variants are collapsed separately so that they can receive different penalties. The incorporation of functional annotations led to higher sensitivity and specificity levels when the detection results were compared with the answer sheet. The initial analysis was performed without knowledge of the simulating model.
Genome-wide association studies (GWAS) have successfully identified thousands of common variants associated with the risk of common diseases [1, 2]. To date, GWAS have been mostly conducted under the common disease/common variants (CDCV) hypothesis, which asserts that common diseases are mostly caused by common variants with small to modest effects [3–6]. Typically, only variants with a minor allele frequency (MAF) greater than 1–5% are considered in these studies. However, despite the identification of thousands of common variants that affect common disease risk, with rare exceptions these common variants can explain at most 5–10% of the heritable component of disease [7]. Theoretical studies based on evolutionary theories suggest that less common variations are more likely to be functional than common variations [8, 9]. Recent studies using sequencing technology have also detected many rare variants that are associated with disease, providing empirical evidence for the common disease/rare variant (CDRV) hypothesis. Taken together, these studies suggest that the etiology of complex disease can involve a mixture of common and rare variants.
Typical GWAS detect disease-associated variants using indirect linkage disequilibrium (LD) mapping, which captures the information of correlated single-nucleotide polymorphisms (SNPs) using a set of tag SNPs to reduce the number of tests. However, this strategy is not efficient when applied to rare variants because the correlation between the rare variants and the tag SNPs is often weak as a result of the low MAF of the rare variants. Alternative LD measures for fine mapping have been developed and offer some advantages over the traditional LD mapping [10]. In addition, direct mapping through exhaustive genotyping or sequencing is more appropriate for identifying functional rare variants.
To analyze the sequencing data, many investigators have developed association tests to detect disease-associated rare variants. These tests fall into three main types: (1) multiple univariate single-marker tests, (2) multiple-marker tests, and (3) collapsing methods. The univariate single-marker tests assess the significance of association for every rare variant independently. The multiple-marker tests instead test for the association of a set of variants simultaneously. Both single-marker and multiple-marker tests have reduced power because of the multiple testing correction. In addition, the power of single-marker tests for low-frequency variants is sensitive to the effect size. The collapsing methods combine information across multiple variants in the same genomic block (e.g., the same gene or pathway) so that the association signals can be enriched and the test’s degrees of freedom can be reduced [11–14].
Here, we consider three different collapsing methods for rare variants in the same gene. Regression with a LASSO (least absolute shrinkage and selection operator) penalty is then used to choose the significant collapsed rare variants or common variants. The three collapsing methods are compared based on the consistency across replicates, the cross-validation error rate of the fitted model, and the list of true causal variants. The most significant common variants and collapsed rare variants are shown. We also explore the incorporation of the functional annotation information of all the variants in the regression model. By comparing the results with the list of true causal variants, we find that incorporation of the functional annotation leads to higher sensitivity and specificity levels.
Collapsing rare variants
All the variants are divided into two groups. Variants with MAF > 5% fall into the common variants group, and all the other variants form the rare variants group. Note that this definition of rare variants is specific to this paper. We also considered a more common definition of rare variants with MAF ≤ 1% and came to the same conclusions (results not shown). The rare variants in the same gene are collapsed using the proportion coding (PROP), the data-adaptive sum (DAS), and the weighted-sum (WS) methods. Details and assumptions of these collapsing methods can be found in Dering et al. [15].
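The grouping and collapsing step can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes that genotypes are coded as minor-allele counts (0, 1, or 2), that MAF is estimated from the sample itself, and that proportion coding means the fraction of a gene's rare variants at which an individual carries at least one minor allele. The function and argument names are hypothetical, and the DAS and WS codings are not shown.

```python
from collections import defaultdict


def minor_allele_freq(column):
    """Sample MAF of one variant from per-individual minor-allele counts (0/1/2)."""
    return sum(column) / (2 * len(column))


def collapse_by_proportion(genotype_matrix, variant_genes, maf_cutoff=0.05):
    """Split variants at maf_cutoff and collapse the rare ones gene by gene.

    genotype_matrix[i][j] is the minor-allele count of individual i at variant j.
    variant_genes[j] names the gene that contains variant j.
    Returns (common, collapsed):
      common    maps variant index -> genotype column (MAF > maf_cutoff),
      collapsed maps gene -> per-individual proportion of that gene's rare
                variants at which the individual carries a minor allele
                (ASSUMED reading of proportion coding).
    """
    n_individuals = len(genotype_matrix)
    common = {}
    rare_by_gene = defaultdict(list)
    for j, gene in enumerate(variant_genes):
        column = [row[j] for row in genotype_matrix]
        if minor_allele_freq(column) > maf_cutoff:
            common[j] = column
        else:
            rare_by_gene[gene].append(column)
    collapsed = {
        gene: [
            sum(1 for col in columns if col[i] > 0) / len(columns)
            for i in range(n_individuals)
        ]
        for gene, columns in rare_by_gene.items()
    }
    return common, collapsed


# Toy data: 10 individuals, two rare variants in gene "A", one common in "B".
genotypes = [[0, 0, 1] for _ in range(10)]
genotypes[0][0] = 1  # single carrier of variant 0 (MAF 0.05, treated as rare)
genotypes[1][1] = 1  # single carrier of variant 1 (MAF 0.05, treated as rare)
common, collapsed = collapse_by_proportion(genotypes, ["A", "A", "B"])
```

In this toy example the two rare variants in gene "A" become a single predictor, while the common variant in gene "B" enters the model on its own.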
Multiple regression model
In the regression model, E_i is the vector of the environmental variables for individual i, β_E is the vector of coefficients for these variables, g(·) is the link function, and μ_i is the mean of Y_i. For binary disease status we use the logit link function, and for the three quantitative trait models we use the identity link function. For parameter estimation, we use the least absolute shrinkage and selection operator (LASSO) [16], which penalizes the likelihood function by adding the sum of the absolute values of the coefficients (the L1 penalty function). Because of the properties of the L1 penalty, many of the coefficients are shrunk to exactly 0.
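To make the shrinkage property concrete, the following Python sketch fits an L1-penalized linear model (the identity-link case only) by cyclic coordinate descent with soft-thresholding. The choice of solver, the 1/(2n) scaling of the squared-error loss, and all function names are assumptions made for illustration; the text does not say how the authors fitted their models, and the logit-link case for disease status is not covered here.

```python
def soft_threshold(value, penalty):
    """Shrink value toward 0 by penalty; return exactly 0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_fit(X, y, penalty, n_sweeps=100):
    """Minimize (1/(2n)) * sum((y - b0 - X b)^2) + penalty * sum(|b_j|).

    X is a list of rows, y a list of responses. The intercept b0 is not
    penalized; it is handled by centering X and y before the sweeps.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    residual = [yi - y_mean for yi in y]  # residual when all coefficients are 0
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            col = [Xc[i][j] for i in range(n)]
            norm = sum(v * v for v in col) / n
            if norm == 0.0:
                continue  # constant feature: its coefficient stays at 0
            # Covariance of feature j with the residual that leaves feature j out.
            rho = sum(col[i] * residual[i] for i in range(n)) / n + norm * beta[j]
            new_beta = soft_threshold(rho, penalty) / norm
            delta = new_beta - beta[j]
            for i in range(n):
                residual[i] -= col[i] * delta
            beta[j] = new_beta
    intercept = y_mean - sum(col_means[j] * beta[j] for j in range(p))
    return intercept, beta


# Toy data: the response depends on the first feature only.
X = [[0, 0], [1, 0], [0, 1], [1, 1]] * 2
y = [2 * row[0] for row in X]
intercept, beta = lasso_fit(X, y, penalty=0.1)
```

In this toy fit the coefficient of the irrelevant second feature is set to exactly 0, and the coefficient of the first feature is pulled below its unpenalized value of 2, which is the selection behavior the L1 penalty is used for.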
Comparing collapsing methods
In the consistency score, F is the set of all genes identified by the model fitted in at least one replicate data set, and |F| is the size of F. The three collapsing methods are compared based on this consistency score. The ability of the consistency score to evaluate the performance of the collapsing methods is debatable, because a method can be consistently bad and still have a good consistency score. Therefore we further compare the three collapsing methods using the cross-validation error rate of the fitted model. We fit one model for each of the 200 replicates and use the fitted model to predict the trait values in the other 199 replicates. The predictions are then compared with the true values to calculate the error rate. For the disease trait, an area under the curve (AUC) score is calculated for each of the 199 validation replicates and the average AUC score is reported, whereas for quantitative traits the mean-square error is used as the measure of prediction error.
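The cross-replicate evaluation described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the AUC is computed as the Mann-Whitney probability that a case scores above a control, with ties counted as one half, and the names `predict`, `replicates`, and `cross_replicate_error` are hypothetical. The consistency score is not implemented here.

```python
def mean_square_error(y_true, y_pred):
    """Average squared difference between observed and predicted trait values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_score(labels, scores):
    """Probability that a random case (label 1) outscores a random control (label 0).

    Ties count as one half. Requires at least one case and one control.
    """
    cases = [s for lab, s in zip(labels, scores) if lab == 1]
    controls = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def cross_replicate_error(predict, replicates, fit_index, metric):
    """Average metric over every replicate except the one used for fitting.

    predict    maps one feature row to a prediction (the model fitted on
               replicate fit_index),
    replicates is a list of (X, y) pairs, one per replicate,
    metric     is called as metric(y_true, y_pred).
    """
    results = [
        metric(y, [predict(row) for row in X])
        for r, (X, y) in enumerate(replicates)
        if r != fit_index
    ]
    return sum(results) / len(results)


# Toy usage with three replicates and a fixed linear predictor.
replicates = [([[1]], [2]), ([[1]], [3]), ([[2]], [4])]
error = cross_replicate_error(
    lambda row: 2 * row[0], replicates, fit_index=0, metric=mean_square_error
)
```

With 200 replicates, calling `cross_replicate_error` once per fitted model gives the average over the other 199 replicates, with `mean_square_error` for the quantitative traits and `auc_score` for the disease trait.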
Incorporating functional annotation
In the penalized objective, l(β) is the log-likelihood function, a_j = ns indicates that the corresponding variant is nonsynonymous, and a_j = s indicates that the variant is synonymous. The two penalty parameters λ_ns and λ_s are chosen based on the cross-validation error rate within each replicate data set.
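The definitions above determine the general shape of the penalized objective. The display below is a reconstruction from those definitions, not a copy of the authors' equation; the arg-max form and the absence of any scaling constant on the log-likelihood are assumptions.

```latex
\hat{\beta}
  = \arg\max_{\beta}
    \Big\{\, l(\beta)
      \;-\; \lambda_{ns} \sum_{j \,:\, a_j = ns} |\beta_j|
      \;-\; \lambda_{s} \sum_{j \,:\, a_j = s} |\beta_j|
    \,\Big\}
```

Setting λ_ns = λ_s recovers the ordinary LASSO with a single penalty, whereas choosing λ_ns < λ_s penalizes nonsynonymous variants less heavily and so makes them easier to select.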
Comparison of collapsing methods
Table: Consistency scores of the selected features from the 200 replicates using the three different collapsing methods.
Table: Improvement in the prediction accuracy of the fitted regression model in the testing replicates using the three collapsing methods. Rows: Q1 (mean-square error); Q2 (mean-square error); Q4 (mean-square error); Disease (average AUC score).
Identifying associated variants by incorporating functional annotations
The 10 most significant features selected for the disease model and the three quantitative traits when rare variants are collapsed using the proportion collapsing method
FLT1 (n) (200)
PDGFD (n) (50)
LPL (n) (38)
C1ORF122 (s) (8)
FLT1 (n) (25)
KDR (n) (175)
VLDLR (n) (32)
PTK7 (s) (16)
BCHE (n) (31)
FLJ16793 (s) (8)
ADCY5 (s) (12)
ARNT (n) (72)
SIRT1 (n) (28)
RY1 (s) (8)
HOXD11 (s) (12)
MAP2K7 (s) (62)
TXNL1 (n) (24)
ACOX3 (s) (7)
TFDP1 (s) (12)
NT5C2 (s) (50)
RARB (n) (24)
OR13A1 (s) (7)
OR8D4 (s) (11)
FOXO3 (s) (35)
CCNT1 (s) (11)
Discussion and conclusions
We compared three different collapsing methods using the GAW17 data and explored one way to incorporate the functional annotation information. The analysis shows that for the GAW17 data, the proportion collapsing method tends to have the best performance in terms of consistency across different simulations and cross-validation error rate. Furthermore, incorporation of the functional information leads to higher specificity and sensitivity levels. Finally, by comparing the identified genes with the true causal genes, we show that the LASSO method in combination with the rare-variants collapsing method is able to detect most of the true causal variants and genes for the three quantitative traits.
However, several issues need to be addressed with regard to the analysis. First, based on both the consistency score and the cross-validation error rate, the performance of the proportion collapsing method drops for Q2 and the disease trait relative to Q1. Q1 is affected by the covariates Age and Smoke, which are easy to detect consistently and therefore produce the best consistency score. For disease and Q2, the effect of the covariates is much weaker, which leads to worse consistency. These results suggest that the consistency score may not be an optimal measure of the performance of the collapsing methods.
Second, the improvement in the AUC score achieved by incorporating the functional annotation was modest for disease and Q1, even though all the functional variants in the simulation model are nonsynonymous. This may again be related to the higher residual heritability of Q1 resulting from variants not included in the data set. It also suggests that our current way of incorporating the functional annotation is not optimal.
Third, many important questions are not answered in this analysis. They include how to detect interactions between genes and environmental variables; how to incorporate the functional annotation in alternative ways, such as Bayesian methods with different prior probabilities for the synonymous and nonsynonymous variants; whether to add the quantitative traits to the disease models as predictors; and whether to apply a generalized additive model.
We thank the Yale University Biomedical High Performance Computing Center and the National Institutes of Health (NIH), which funded the instrumentation through grant RR19895. This research was supported in part by NIH grants R01 GM59507 and T15 LM07056 and by a fellowship award from the China Scholarship Council. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
1. Donnelly P: Progress and challenges in genome-wide association studies in humans. Nature. 2008, 456: 728-731. 10.1038/nature07631.
2. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA: Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009, 106: 9362-9367. 10.1073/pnas.0903103106.
3. Hirschhorn JN, Daly MJ: Genome-wide association studies for common diseases and complex traits. Nat Rev Genet. 2005, 6: 95-108.
4. Iyengar SK, Elston RC: The genetic basis of complex traits: rare variants or “common gene, common disease”?. Meth Mol Biol. 2007, 376: 71-84. 10.1007/978-1-59745-389-9_6.
5. Reich DE, Lander ES: On the allelic spectrum of human disease. Tr Genet. 2001, 17: 502-510. 10.1016/S0168-9525(01)02410-6.
6. Smith DJ, Lusis AJ: The allelic structure of common disease. Hum Mol Genet. 2002, 11: 2455-2461. 10.1093/hmg/11.20.2455.
7. Schork NJ, Murray SS, Frazer KA, Topol EJ: Common vs. rare allele hypotheses for complex disease. Curr Opin Genet Dev. 2009, 19: 212-219. 10.1016/j.gde.2009.04.010.
8. Pritchard JK: Are rare variants responsible for susceptibility to complex diseases?. Am J Hum Genet. 2001, 69: 124-137. 10.1086/321272.
9. Pritchard JK, Cox NJ: The allelic architecture of human disease genes: common disease-common variant … or not?. Hum Mol Genet. 2002, 11: 2417-2423. 10.1093/hmg/11.20.2417.
10. Graham J, Thompson EA: Disequilibrium likelihoods for fine-scale mapping of a rare allele. Am J Hum Genet. 1998, 63: 1517-1530. 10.1086/302102.
11. Han F, Pan W: A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered. 2010, 70: 42-54. 10.1159/000288704.
12. Li B, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83: 311-321. 10.1016/j.ajhg.2008.06.024.
13. Madsen BE, Browning SR: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009, 5: e1000384. 10.1371/journal.pgen.1000384.
14. Morris AP, Zeggini E: An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010, 34: 188-193. 10.1002/gepi.20450.
15. Dering C, Pugh E, Ziegler A: Statistical analysis of rare sequence variants: an overview of collapsing methods. Genet Epidemiol. 2011, X: X-X.
16. Tibshirani R: Regression shrinkage and selection via the lasso. J R Stat Soc B. 1996, 58: 267-288.
17. Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2. 10.1186/1753-6561-5-S9-S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.