Collapsing ROC approach for risk prediction research on both common and rare variants
© Wei and Lu; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Risk prediction that capitalizes on emerging genetic findings holds great promise for improving public health and clinical care. However, recent risk prediction research has shown that predictive tests formed on existing common genetic loci, including those from genome-wide association studies, lack sufficient accuracy for clinical use. Because most rare variants in the genome have not yet been studied for their role in risk prediction, future disease prediction research should shift toward a more comprehensive strategy that takes into account both common and rare variants. We propose a collapsing receiver operating characteristic (CROC) approach for risk prediction research on both common and rare variants. The new approach is an extension of a previously developed forward ROC (FROC) approach, with additional procedures for handling rare variants. The approach was evaluated using 533 single-nucleotide polymorphisms (SNPs) in 37 candidate genes from the Genetic Analysis Workshop 17 mini-exome data set. We found that a prediction model built on all SNPs was more accurate (AUC = 0.605) than one built on common variants alone (AUC = 0.585). We further evaluated the performance of the two approaches by gradually reducing the number of common variants in the analysis. We found that the CROC method attained higher accuracy than the FROC method as the number of common variants in the data decreased. In an extreme scenario, when there were only rare variants in the data, the CROC reached an AUC value of 0.603, whereas the FROC had an AUC value of 0.524.
The completion of hundreds of genome-wide association studies has brought numerous novel disease susceptibility loci to light. Yet for many diseases the common variants that have been identified explain only a small proportion of disease heritability. Additional genetic variants, including rare variants and gene-gene or gene-environment interactions, remain undiscovered. Among these, great attention has been given to the rare variants. Current genome-wide association studies include only single-nucleotide polymorphisms (SNPs) with a minor allele frequency (MAF) greater than 5% [1, 2]. Within the next few years, whole-genome sequencing will produce millions of rare variants, with the expectation that some of them might explain part of the missing heritability. In fact, experimental studies have already shown that rare variants are associated with complex diseases, such as obesity [3], schizophrenia [4], and colorectal cancer [5].
Rare variants, particularly those that are yet to be identified by future whole-genome sequencing studies, can be combined with the known common variants and clinical risk factors for more accurate disease prediction. However, few approaches are available for assessing the combined effect of both common and rare variants in early disease prediction. Statistical approaches, such as the collapsing approach [6] and the weighting approach [7], have recently been proposed to assess the association of rare variants with disease.
The collapsing approach first combines all rare variants into a single common variant and then analyzes it with other common variants using multivariate test statistics. Although this approach was originally proposed for genetic association studies, the idea can be used for genetic risk prediction research as well. We here develop a collapsing receiver operating characteristic (CROC) approach for risk prediction research that considers both common variants and rare variants. The new approach is an extension of the previously proposed forward receiver operating characteristic (FROC) approach, which was developed using optimal features of the likelihood ratio rule [8, 9]. A multistage collapsing procedure is added to the FROC approach to facilitate its use in sequencing data composed of both common and rare variants.
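The basic collapsing idea described above can be illustrated with a minimal sketch. It assumes genotypes are coded 0/1/2 as counts of the minor allele and that the rare columns have already been identified; the function name `collapse_rare_variants` and the carrier-indicator coding are illustrative choices, not taken from the original method description.

```python
def collapse_rare_variants(genotype_rows, rare_columns):
    """Replace all rare-variant columns by one carrier indicator.

    genotype_rows: one list of 0/1/2 minor-allele counts per individual
                   (assumed coding).
    rare_columns:  indices of the columns treated as rare variants.
    Returns rows holding the common-variant genotypes followed by a
    single 0/1 column that is 1 if the individual carries any rare allele.
    """
    rare = set(rare_columns)
    collapsed_rows = []
    for row in genotype_rows:
        carrier = int(any(row[j] > 0 for j in rare))
        common_part = [g for j, g in enumerate(row) if j not in rare]
        collapsed_rows.append(common_part + [carrier])
    return collapsed_rows


# Three SNPs; columns 0 and 1 are treated as rare, column 2 as common.
rows = [[0, 0, 2], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
collapsed = collapse_rare_variants(rows, rare_columns=[0, 1])
```

The collapsed column can then be analyzed alongside the common variants as if it were a single, more frequent locus.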
The receiver operating characteristic (ROC) curve is commonly used in genetic risk prediction research to evaluate the accuracy of a risk prediction model. The ROC curve plots the sensitivity of a prediction model against 1 - specificity (the false-positive rate) as the cutoff point is varied continuously over the whole range of possible outcomes. When the ROC curve is formed on the likelihood ratio (LR), which is defined as the ratio of the frequency of a particular test outcome in case subjects to that in control subjects, it attains the maximum performance at each cutoff point. The corresponding one-dimensional summary accuracy index, the area under the ROC curve (AUC), is also the highest among all approaches. Based on the optimal properties of the LR, we had previously developed a FROC approach for risk prediction on a large number of common genetic variants.
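To make the LR-based ROC concrete, the following sketch computes the LR of each discrete test outcome and the AUC obtained when individuals are ranked by that LR. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, outcomes never seen in controls are given an infinite LR, and the AUC is computed as the proportion of case-control pairs in which the case has the higher score (ties count one half).

```python
from collections import Counter


def likelihood_ratios(outcomes, labels):
    """LR of each outcome: frequency in cases / frequency in controls."""
    case_counts = Counter(o for o, y in zip(outcomes, labels) if y == 1)
    control_counts = Counter(o for o, y in zip(outcomes, labels) if y == 0)
    n_case = sum(case_counts.values())
    n_control = sum(control_counts.values())
    ratios = {}
    for outcome in set(outcomes):
        p_case = case_counts[outcome] / n_case
        p_control = control_counts[outcome] / n_control
        # Assumption: an outcome absent from controls gets an infinite LR.
        ratios[outcome] = float("inf") if p_control == 0 else p_case / p_control
    return ratios


def auc(case_scores, control_scores):
    """Probability that a random case outscores a random control."""
    concordant = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                concordant += 1.0
            elif c == k:
                concordant += 0.5
    return concordant / (len(case_scores) * len(control_scores))


# Toy single-locus example: genotype codes 0/1/2, label 1 = case.
genotypes = [0, 0, 0, 1, 1, 2]
labels = [0, 0, 1, 0, 1, 1]
lr = likelihood_ratios(genotypes, labels)
case_scores = [lr[g] for g, y in zip(genotypes, labels) if y == 1]
control_scores = [lr[g] for g, y in zip(genotypes, labels) if y == 0]
toy_auc = auc(case_scores, control_scores)
```

Because the LR is estimated and evaluated on the same toy data here, the resulting AUC is an in-sample value and would be optimistic in practice.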
The FROC approach was proposed for risk prediction research on common variants. When both common and rare variants exist, the FROC approach hardly selects rare variants because of their low frequency, which could lead to low accuracy of the prediction model. To deal with both common and rare variants, we have extended the FROC approach and are introducing the CROC approach here. The CROC adopts a multistage collapsing procedure to collapse rare variants into pseudo-common variants and then uses the forward selection algorithm of the FROC approach to search both common and pseudo-common variants for the best prediction model.
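A forward selection search of this kind can be sketched as follows. This is a simplified illustration, not the published FROC algorithm: it assumes that the LR is estimated on the joint genotype profile of the selected loci, that the locus giving the largest AUC gain is added at each step, and that the search stops when no candidate improves the AUC. The names `lr_auc`, `forward_select`, and `min_gain` are hypothetical.

```python
from collections import Counter


def lr_auc(profiles, labels):
    """AUC when individuals are ranked by the LR of their genotype profile."""
    case = Counter(p for p, y in zip(profiles, labels) if y == 1)
    ctrl = Counter(p for p, y in zip(profiles, labels) if y == 0)
    n_case, n_ctrl = sum(case.values()), sum(ctrl.values())

    def lr(p):
        # Assumption: profiles unseen in controls get an infinite LR.
        if ctrl[p] == 0:
            return float("inf")
        return (case[p] / n_case) / (ctrl[p] / n_ctrl)

    distinct = set(case) | set(ctrl)
    concordant = 0.0
    for a in distinct:      # profile carried by a case
        for b in distinct:  # profile carried by a control
            pairs = case[a] * ctrl[b]
            if lr(a) > lr(b):
                concordant += pairs
            elif lr(a) == lr(b):
                concordant += 0.5 * pairs
    return concordant / (n_case * n_ctrl)


def forward_select(variants, labels, min_gain=1e-9):
    """Greedily add the variant that raises the in-sample AUC the most."""
    selected, best_auc = [], 0.5
    candidates = list(variants)
    while candidates:
        scored = []
        for name in candidates:
            loci = selected + [name]
            profiles = list(zip(*(variants[v] for v in loci)))
            scored.append((lr_auc(profiles, labels), name))
        top_auc, top_name = max(scored, key=lambda t: t[0])
        if top_auc - best_auc <= min_gain:
            break
        selected.append(top_name)
        candidates.remove(top_name)
        best_auc = top_auc
    return selected, best_auc


labels = [1, 1, 1, 1, 0, 0, 0, 0]
variants = {
    "A": [1, 1, 0, 0, 0, 0, 0, 0],
    "B": [0, 0, 1, 1, 0, 0, 0, 0],
    "N": [1, 0, 1, 0, 1, 0, 1, 0],  # noise locus
}
selected, best_auc = forward_select(variants, labels)
```

In this toy data the two informative loci are selected and the noise locus is left out. The AUC used for selection is computed on the training data, so an independent evaluation set is still needed to judge the final model.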
The accuracy of the pseudo-common variant is then measured by its AUC value. In step k, we search the remaining rare variants for one locus that increases the AUC value most significantly and collapse it into the pseudo-common variant. The procedure keeps collapsing new rare variants into the pseudo-common variant until the AUC value stops increasing. A pseudo-common variant is thus formed.
We repeat the collapsing procedure on the remaining rare variants and generate a set of pseudo-common variants. The multistage collapsing procedure stops when there are no rare variants left in the data. One advantage of using a multistage collapsing procedure instead of the original collapsing procedure is that it could potentially accommodate bidirectional effects, because rare variants that increase risk and rare variants that decrease risk need not be pooled into the same pseudo-common variant. The forward selection algorithm is then used to search both common variants and pseudo-common variants for an optimal risk prediction model. Because the pseudo-common variants have a higher frequency than the individual rare variants, they are more likely to be selected by the forward selection algorithm, which could result in increased accuracy of the prediction model.
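The multistage collapsing procedure described above can be sketched as follows. Several details are assumptions made for illustration only: each stage is seeded with the single rare variant that has the highest AUC, collapsing is done by a logical OR of carrier indicators, and the AUC of a binary marker is taken in its better orientation (as ranking by the LR would do), which equals 0.5 + 0.5 |p_case - p_control|. The function names are hypothetical.

```python
def binary_auc(indicator, labels):
    """AUC of a 0/1 marker in its better orientation (LR ranking)."""
    cases = [x for x, y in zip(indicator, labels) if y == 1]
    controls = [x for x, y in zip(indicator, labels) if y == 0]
    p_case = sum(cases) / len(cases)
    p_control = sum(controls) / len(controls)
    return 0.5 + 0.5 * abs(p_case - p_control)


def multistage_collapse(carriers, labels):
    """Group rare variants into pseudo-common variants, stage by stage.

    carriers: dict mapping rare-variant name -> list of 0/1 carrier flags.
    Returns a list of (member names, collapsed indicator, AUC) tuples.
    """
    remaining = dict(carriers)
    pseudo_variants = []
    while remaining:
        # Assumed seeding rule: start from the best single rare variant.
        seed = max(remaining, key=lambda v: binary_auc(remaining[v], labels))
        current = list(remaining.pop(seed))
        members = [seed]
        best = binary_auc(current, labels)
        while remaining:
            scored = []
            for name, flags in remaining.items():
                merged = [max(a, b) for a, b in zip(current, flags)]
                scored.append((binary_auc(merged, labels), name, merged))
            top_auc, top_name, top_merged = max(scored, key=lambda t: t[0])
            if top_auc <= best:
                break  # the AUC stopped increasing: close this stage
            current, best = top_merged, top_auc
            members.append(top_name)
            del remaining[top_name]
        pseudo_variants.append((members, current, best))
    return pseudo_variants


labels = [1, 1, 1, 1, 0, 0, 0, 0]
carriers = {
    "r1": [1, 1, 0, 0, 0, 0, 0, 0],  # carried by cases
    "r2": [0, 0, 1, 0, 0, 0, 0, 0],  # carried by a case
    "r3": [0, 0, 0, 0, 1, 0, 0, 0],  # carried by a control
}
stages = multistage_collapse(carriers, labels)
```

In this toy example the two case-enriched variants are collapsed together in the first stage, while the control-enriched variant forms its own pseudo-common variant in the second stage, which illustrates how opposite effect directions can be kept apart.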
We evaluated the performance of the CROC approach using the simulated Genetic Analysis Workshop 17 (GAW17) mini-exome sequencing data. The data set comprises 697 individuals, of whom 209 are cases. Thirty-seven candidate genes were selected based on the simulation results provided. There are 533 SNPs in the candidate genes, including both disease susceptibility and noise loci. The MAFs of these SNPs range from 0.00072 to 0.45122. Among those, 400 SNPs are rare variants (MAF < 0.01). Using the GAW17 data, we investigated whether the accuracy of the risk prediction model could be improved by considering rare variants in the analysis. An additional analysis was also conducted to compare the performance of the CROC and FROC approaches.
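The MAF threshold used to separate rare from common variants can be illustrated with a short sketch. It assumes genotypes coded 0/1/2 as allele counts; the function names are hypothetical. With 697 individuals (1,394 chromosomes), a variant observed exactly once has a MAF of 1/1394, or about 0.00072, which matches the smallest MAF reported above.

```python
def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 allele counts, whichever allele was counted."""
    allele_freq = sum(genotypes) / (2 * len(genotypes))
    return min(allele_freq, 1 - allele_freq)


def split_by_maf(snp_genotypes, threshold=0.01):
    """Return (common, rare) SNP names using MAF < threshold as rare."""
    common, rare = [], []
    for name, genotypes in snp_genotypes.items():
        if minor_allele_frequency(genotypes) < threshold:
            rare.append(name)
        else:
            common.append(name)
    return common, rare


# A singleton and a more frequent variant in 697 individuals.
singleton = [1] + [0] * 696
frequent = [1] * 100 + [0] * 597
common, rare = split_by_maf({"singleton": singleton, "frequent": frequent})
```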
Risk prediction considering rare variants
We started the analysis by forming a risk prediction model on all common variants. To assess the accuracy improvement by adding rare variants, we also built a risk prediction model using all genetic variants. In all, 200 replicates were used for the analysis. Risk prediction models were built based on the first 100 replicates, using both the CROC and FROC approaches; they were then evaluated on the remaining 100 replicates. The reason for evaluating the model on a separate replicate is to ensure that the rare variants are evaluated. Because of the small sample size of the data set, rare variants are commonly carried by one individual or by a small number of individuals. If we split the data into training and testing data sets, then rare variants present in the training data set will likely be absent in the testing data set and therefore cannot be validated. Using two replicates ensures the presence of the same rare variants in both data sets. However, we should note that the estimation might be biased because of the potential correlation between two replicates.
Table: Accuracy improvement in the CROC approach by adding rare variants. Recovered labels: Common SNPs only; Mean of AUC value; SD of AUC value; Running time (s).
Comparison of CROC and FROC approaches
Next-generation sequencing technology is anticipated to be a powerful tool for uncovering novel genetic variants associated with complex disease, particularly for those variants with a low frequency. The rare variants identified through future whole-genome sequencing studies, if confirmed to have functional importance, may provide novel insights into underlying pathological and etiological processes. Yet, even if they are merely predictive and without functional importance, these rare variants can still be harnessed into clinical translational research applications. By incorporating these rare variants into current risk prediction models, we can predict disease outcomes more accurately.
We developed a CROC approach for future risk prediction research on sequencing data. The approach extends the previously developed FROC approach to deal with both common and rare variants by using a multistage collapsing procedure. The idea of collapsing was originally introduced in genetic association studies to deal with both common and rare variants. We have now integrated this idea into the CROC approach for risk prediction research on sequencing data. By applying the approach to the simulated mini-exome sequencing data, we demonstrated the advantage of using both common and rare variants in risk prediction research. However, in this application, only limited improvement was gained when rare variants were combined with common variants. This may be because rare variants account for only a small proportion of phenotype variation in these data. In a different scenario, rare variants might contribute substantially to phenotype variation. To mimic such a scenario, we artificially decreased the influence of common variants. When the number of common variants was decreased, we found that significant improvement could be attained by considering additional rare variants. The accuracy of the risk prediction models can be further improved by considering environmental risk predictors and gene-environment interactions. We ran an additional analysis including the environmental risk predictors and found that the prediction accuracy was significantly improved (data not shown).
We considered only candidate genes in our analysis. Current risk prediction studies commonly adopt this strategy. However, for high-dimensional risk prediction research using millions of SNPs, variable selection becomes important. Although a forward selection algorithm is incorporated into the CROC approach, it could still be subject to false positives when dealing with whole-genome sequencing data. More sophisticated selection algorithms will be needed to deal with a large number of common and rare variants.
We have developed a CROC approach for risk prediction analysis on sequencing data. By applying this new approach to the simulated GAW17 mini-exome sequencing data, we have illustrated that current risk prediction models built on common variants can be further improved by considering additional rare variants. In addition, we compared the CROC approach with the existing FROC approach. The CROC approach outperformed the FROC approach, especially when a large proportion of the considered variants were rare.
This work was supported by start-up funds from Michigan State University.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
- Bodmer W, Bonilla C: Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet. 2008, 40: 695-701. 10.1038/ng.f.136.
- Hirschhorn JN, Daly MJ: Genome-wide association studies for common diseases and complex traits. Nat Rev Genet. 2005, 6: 95-108.
- Ahituv N, Kavaslar N, Schackwitz W, Ustaszewska A, Martin J, Hebert S, Doelle H, Ersoy B, Kryukov G, Schmidt S, et al: Medical sequencing at the extremes of human body mass. Am J Hum Genet. 2007, 80: 779-791. 10.1086/513471.
- Walsh T, McClellan JM, McCarthy SE, Addington AM, Pierce SB, Cooper GM, Nord AS, Kusenda M, Malhotra D, Bhandari A, et al: Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science. 2008, 320: 539-543. 10.1126/science.1155174.
- Azzopardi D, Dallosso AR, Eliason K, Hendrickson BC, Jones N, Rawstorne E, Colley J, Moskvina V, Frye C, Sampson JR, et al: Multiple rare nonsynonymous variants in the adenomatous polyposis coli gene predispose to colorectal adenomas. Cancer Res. 2008, 68: 358-363. 10.1158/0008-5472.CAN-07-5733.
- Li BS, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83: 311-321. 10.1016/j.ajhg.2008.06.024.
- Madsen BE, Browning SR: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009, 5: e1000384. 10.1371/journal.pgen.1000384.
- Lu Q, Elston RC: Using the optimal receiver operating characteristic curve to design a predictive genetic test, exemplified with type 2 diabetes. Am J Hum Genet. 2008, 82: 641-651. 10.1016/j.ajhg.2007.12.025.
- Ye C, Cui Y, Wei C, Elston R, Zhu J, Lu Q: A nonparametric method for building predictive genetic tests on high-dimensional data. Hum Hered 2011, Genet Epidemiol. 2011, X (suppl X): X-X. [http://content.karger.com/produktedb/produkte.asp?DOI=000327299&typ=pdf]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.