Large-scale risk prediction applied to Genetic Analysis Workshop 17 mini-exome sequence data
- Gengxin Li^{1},
- John Ferguson^{1},
- Wei Zheng^{2},
- Joon Sang Lee^{1},
- Xianghua Zhang^{1, 3},
- Lun Li^{1, 4},
- Jia Kang^{1},
- Xiting Yan^{2} and
- Hongyu Zhao^{1}
https://doi.org/10.1186/1753-6561-5-S9-S46
© Li et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Abstract
We consider the application of Efron’s empirical Bayes classification method to risk prediction in a genome-wide association study using the Genetic Analysis Workshop 17 (GAW17) data. A major advantage of using this method is that the effect size distribution for the set of possible features is empirically estimated and that all subsequent parameter estimation and risk prediction is guided by this distribution. Here, we generalize Efron’s method to allow for some of the peculiarities of the GAW17 data. In particular, we introduce two ways to extend Efron’s model: a weighted empirical Bayes model and a joint covariance model that allows the model to properly incorporate the annotation information of single-nucleotide polymorphisms (SNPs). In the course of our analysis, we examine several aspects of the possible simulation model, including the identity of the most important genes, the differing effects of synonymous and nonsynonymous SNPs, and the relative roles of covariates and genes in conferring disease risk. Finally, we compare the three methods to each other and to other classifiers (random forest and neural network).
Background
The development of disease-risk prediction models based on genome-wide association data is a great challenge to statisticians. A major contributing factor to this difficulty is that the observed effects of the most significant features in any particular model are likely to be overestimates of their true effects [1]. Because of the complexities of a Bayesian analysis with hundreds of thousands of features, most of the shrinkage techniques that have been proposed to deal with this problem have a frequentist flavor, such as the LASSO (least absolute shrinkage and selection operator) and ridge regression [2]. Although these procedures tend to be computationally convenient, the resulting shrinkage can be considered ad hoc compared with an empirical Bayes alternative [3], in which shrinkage is guided directly by the estimated proportion of associated variants and by the effect sizes of that associated subset.
Genetic Analysis Workshop 17 (GAW17) provided a large-scale mini-exome sequence data set with a high proportion of rare variants. In this data set the number of genes far exceeds the number of samples, and, as a result, finding a good risk prediction model is challenging. Here, we demonstrate the use of an empirical Bayes algorithm, originally proposed by Efron [4] in a microarray case-control context, that is particularly well suited to this large-scale setting. This algorithm is a modified version of linear discriminant analysis (LDA) in which certain parameters, which represent standardized differences in the mean expression for case and control subjects, are shrunk before they are substituted into the LDA rule. In addition to describing some of the subtleties that need to be considered when applying Efron’s method to the GAW17 data (or other genome-wide association data), we develop two extensions that allow us to incorporate single-nucleotide polymorphism (SNP) annotation information into the prediction rule: the weighted empirical Bayes (WEB) model and the joint covariance (JC) model. To show the competitive performance of our proposed methods, we compare them with two other classifiers: the random forest and the neural network.
Methods
Choice of gene score
A gene score is a composite value calculated by combining all SNP information within the same gene. Several advantages are gained by applying Efron’s empirical Bayes method to such gene scores instead of to individual SNPs. First, by pooling SNPs appropriately, we can potentially enrich the signal-to-noise ratio of the data. Second, the dimensionality of the feature space is greatly reduced (from 24,487 SNPs to 3,205 gene scores). Finally, even though LDA as a technique does not require the feature variables to be normally distributed, it is an optimal procedure when they are. Although the number of rare alleles for a particular SNP cannot be considered a normal variable, the normality assumption is more plausible for the score of a gene that aggregates many SNPs.
where G_{ lj } is the number of rare variants for individual j at SNP l, K is the number of SNPs within gene i, and the empirical minor allele frequency (MAF) at SNP l determines the SNP-specific weight. In practice, the Madsen-Browning method [5], which up-weights SNPs with a lower MAF when calculating gene scores, gives more coherent results on the GAW17 data, and all gene scores are therefore calculated with this pooling method.
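As a concrete illustration, the Madsen-Browning pooling step can be sketched in a few lines of Python. This is a minimal sketch, not the paper's exact implementation: the pseudo-count used to stabilize the MAF estimate, and the pooling of allele counts over all samples rather than over unaffected individuals only, are simplifying assumptions.

```python
import math

def madsen_browning_score(genotypes, n_individuals):
    """Collapse the SNPs of one gene into a single score per individual.

    genotypes: list of per-SNP lists, genotypes[l][j] = rare-allele count
    (0, 1, or 2) for individual j at SNP l.  Each SNP is up-weighted by
    1 / sqrt(n * p_l * (1 - p_l)), the Madsen-Browning style weight,
    where p_l is the empirical minor allele frequency at SNP l.
    """
    scores = [0.0] * n_individuals
    for snp in genotypes:
        # empirical MAF with a pseudo-count so rare SNPs get a finite weight
        p = (sum(snp) + 1.0) / (2.0 * n_individuals + 2.0)
        w = 1.0 / math.sqrt(n_individuals * p * (1.0 - p))
        for j, g in enumerate(snp):
            scores[j] += w * g
    return scores
```

Because the weight grows as the MAF shrinks, a rare variant contributes more to the gene score than a common one carrying the same allele count.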
Model 1: empirical Bayes method
where S is the subset of genes showing the largest marginal association with the disease.
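The shrinkage at the heart of the method can be illustrated under a simple two-groups working model: each gene-score Z value is null with probability p0 and otherwise carries a normally distributed effect. The mixture form, the fixed parameter values, and the diagonal discriminant below are illustrative assumptions only; Efron's method estimates the effect-size distribution empirically rather than fixing it in advance.

```python
import math

def normal_pdf(x, sd=1.0):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def shrink(z, p0=0.95, tau2=4.0):
    """Posterior mean of the effect delta given an observed z-value.

    Working model (an assumption, not Efron's exact fit): delta = 0 with
    probability p0, delta ~ N(0, tau2) otherwise, and z | delta ~ N(delta, 1).
    The marginal of z is then a two-component normal mixture, and
    E[delta | z] has the closed form below.
    """
    f0 = normal_pdf(z)                          # null density
    f1 = normal_pdf(z, sd=math.sqrt(1 + tau2))  # marginal non-null density
    post_nonnull = (1 - p0) * f1 / (p0 * f0 + (1 - p0) * f1)
    return post_nonnull * (tau2 / (1 + tau2)) * z

def lda_score(x, shrunk_deltas):
    """Diagonal-LDA discriminant using shrunken standardized differences."""
    return sum(d * xi for d, xi in zip(shrunk_deltas, x))
```

Small z-values, which are mostly noise, are shrunk almost to zero, whereas large z-values retain most of their size; substituting the shrunken values into the discriminant is what distinguishes this rule from plain LDA.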
Model 2: weighted empirical Bayes model
where p_{ i |n} and p_{ i |s} are the p-values associated with the ith gene score computed from the nonsynonymous SNPs and the synonymous SNPs, respectively. These p-values were calculated by fitting a logistic regression model in which the disease trait is regressed on either the synonymous or the nonsynonymous gene score together with the Smoke covariate. A larger w_{ i } implies that the nonsynonymous SNPs of the ith gene show a relatively strong association with the disease trait compared with the synonymous SNPs. Throughout this section, the superscripts n and s refer to nonsynonymous and synonymous, respectively; the other notation is consistent with that introduced in the Model 1 subsection.
where the remaining quantities are defined as in the Model 1 subsection.
still follows a normal distribution with variance 1, so applying LDA assigns an individual in the same way as before.
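The gene-specific weight can be illustrated as follows. The exact formula for w_{ i } is not reproduced here; the form below is a hypothetical weighting, introduced only for illustration, that behaves as described in the text: it grows toward 1 when the nonsynonymous p-value is small relative to the synonymous one.

```python
import math

def annotation_weight(p_nonsyn, p_syn):
    """Hypothetical gene weight contrasting nonsynonymous vs. synonymous
    evidence (illustrative assumption, not the paper's exact formula).
    Returns a value in [0, 1] that increases as the nonsynonymous p-value
    becomes small relative to the synonymous p-value."""
    a = -math.log10(max(p_nonsyn, 1e-300))  # evidence from nonsynonymous SNPs
    b = -math.log10(max(p_syn, 1e-300))     # evidence from synonymous SNPs
    return a / (a + b) if (a + b) > 0 else 0.5
```

When both classes of SNPs carry equal evidence the weight is 0.5, and a gene whose signal comes almost entirely from nonsynonymous SNPs receives a weight close to 1.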
Model 3: joint covariance model
Here we consider the nonsynonymous and synonymous effects to be different populations of parameters, and, as a result, the associated empirically estimated prior distributions should differ. This motivates shrinking the nonsynonymous and synonymous Z values separately and then substituting the resulting Bayes estimates into Eq. (17). If the data show that the nonsynonymous SNPs are more powerful in distinguishing disease from nondisease, then the synonymous SNPs will be shrunk more heavily, implicitly giving the nonsynonymous gene scores higher weight in the prediction rule.
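The essential point, that the two SNP classes get their own prior scales and hence different amounts of shrinkage, can be sketched with a method-of-moments estimate under the same two-groups working model used above (the fixed null proportion p0 = 0.95 is an illustrative assumption; the paper's empirical fit is more general).

```python
def estimate_tau2(zs, p0=0.95):
    """Method-of-moments estimate of the non-null prior variance tau2.

    Under the working mixture model z = delta + N(0, 1) with delta = 0
    w.p. p0 and delta ~ N(0, tau2) otherwise, Var(z) = 1 + (1 - p0) * tau2,
    so tau2 = (Var(z) - 1) / (1 - p0), truncated at zero.
    """
    n = len(zs)
    m = sum(zs) / n
    var = sum((z - m) ** 2 for z in zs) / n
    return max((var - 1.0) / (1.0 - p0), 0.0)
```

Fitting this separately to the nonsynonymous and synonymous Z values yields two variance estimates; the class with the smaller estimate is shrunk more aggressively, which is exactly the behavior described above.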
Other issues: multiple replicates, treatment of covariates, and cross-validation and selection
to account for the imbalance between case and control samples in the actual GAW17 data.
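A standard way for an LDA-type rule to respect unequal case and control counts is to shift the decision cutoff by the log prior odds of being a case. Whether this matches the paper's exact adjustment is not shown here, so the sketch below is illustrative only.

```python
import math

def classify(score, n_cases, n_controls, threshold=0.0):
    """Assign case (1) or control (0) from an LDA-type score.

    The cutoff is shifted by the log prior odds log(n_cases / n_controls),
    the standard LDA correction, so that an excess of controls in the
    training data does not bias predictions toward the control class.
    """
    prior_shift = math.log(n_cases / n_controls)
    return 1 if score + prior_shift > threshold else 0
```

With 200 cases and 500 controls, for example, a mildly positive score is still assigned to the control class because the prior odds favor it.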
Other classifiers
To evaluate the competitive performance of our proposed methods, we also fitted a random forest classifier [7] and a neural network classifier to the GAW17 data. The random forest is known to perform remarkably well on a wide variety of risk prediction problems [8] and has been used extensively in genomic applications: its performance is comparable to that of classifiers such as diagonal linear discriminant analysis (DLDA), K-nearest neighbors (KNN), and support vector machines (SVMs) in a microarray study [9], and it has been applied successfully to a large genome-wide association data set [10]. The technique fits a large number of classification or regression trees to bootstrapped versions of the original data set and averages over these trees to form the prediction rule. The neural network is another efficient learning method that has been widely used in many fields, including risk prediction [8].
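Throughout the comparisons that follow, classifiers are judged by the area under the ROC curve (AUC). This quantity can be computed directly, without tracing the curve, as the Mann-Whitney probability that a randomly chosen case outscores a randomly chosen control:

```python
def auc(scores_cases, scores_controls):
    """Area under the ROC curve: the probability that a random case scores
    higher than a random control, counting ties as 1/2.  Equivalent to the
    Mann-Whitney U statistic divided by n_cases * n_controls."""
    wins = 0.0
    for s1 in scores_cases:
        for s0 in scores_controls:
            if s1 > s0:
                wins += 1.0
            elif s1 == s0:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_controls))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is the scale on which the values in the Results tables should be read.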
Results
Prediction rule of three proposed methods
Feature | Empirical Bayes method | Weighted empirical Bayes method | Joint covariance model | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Genes | #SNP | MAF | Genes | #Syn SNP | #Non SNP | MAF | Genes | #Syn SNP | #Non SNP | MAF | |
1 | Age | Age | Age | ||||||||
2 | Smoke | Smoke | Smoke | ||||||||
3 | ATP11A | 1 | 0.29 | SUSD2 | 13 | 23 | <0.01 | ATP11A | 1 | 0.29 | |
2 | 4 | 0.01–0.05 | |||||||||
1 | 2 | ≥0.05 | |||||||||
4 | FLT1 | 25 | <0.01 | FLT1 | 8 | 17 | <0.01 | BUD13 | 1 | 0.11 | |
7 | 0.01–0.05 | 5 | 2 | 0.01–0.05 | |||||||
3 | ≥0.05 | 2 | 1 | ≥0.05 | |||||||
5 | SUSD2 | 36 | ATP11A | 1 | 0.29 | C10ORF107 | 1 | 0.13 | |||
6 | |||||||||||
3 | |||||||||||
6 | BUD13 | 1 | 0.11 | RIPK3 | 4 | 13 | <0.01 | RIPK3 | 4 | 13 | <0.01 |
1 | 1 | 0.01–0.05 | 1 | 1 | 0.01–0.05 | ||||||
1 | 1 | ≥0.05 | 1 | 1 | ≥0.05 | ||||||
7 | RIPK3 | 17 | <0.01 | BUD13 | 1 | 0.11 | SUSD2 | 13 | 23 | <0.01 | |
2 | 0.01–0.05 | 2 | 4 | 0.01–0.05 | |||||||
2 | ≥0.05 | 1 | 2 | ≥0.05 | |||||||
8 | C10ORF107 | 1 | 0.13 | ADAMTS4 | 10 | 23 | <0.01 | FLT1 | 8 | 17 | <0.01 |
2 | 2 | 0.01–0.05 | 5 | 2 | 0.01–0.05 | ||||||
1 | 2 | ≥0.05 | 2 | 1 | ≥0.05 | ||||||
9 | ADAMTS4 | 33 | <0.01 | WNT16 | 8 | 7 | < 0.01 | GPR158 | 1 | 0.1 | |
4 | 0.01–0.05 | 1 | 2 | 0.01–0.05 | |||||||
3 | ≥0.05 | 2 | ≥0.05 | ||||||||
10 | MAP3K12 | 14 | <0.01 | GOLGA1 | 1 | <0.01 | ANAPC5 | 14 | 12 | <0.01 | |
3 | 0.01–0.05 | 1 | 0.01–0.05 | 1 | 0.01–0.05 | ||||||
1 | ≥0.05 | ≥0.05 |
Comparison of the prediction rule between the empirical Bayes and other classifiers
Feature | Empirical Bayes method | Random forest classifier | Logistic regression | ||||||
---|---|---|---|---|---|---|---|---|---|
Genes | #SNP | MAF | Genes | #SNP | MAF | Genes | #SNP | MAF | |
1 | Age | Age | Age | ||||||
2 | Smoke | Smoke | Smoke | ||||||
3 | ATP11A | 1 | 0.29 | FLT1 | 25 | <0.01 | SUSD2 | 36 | <0.01 |
7 | 0.01–0.05 | 6 | 0.01–0.05 | ||||||
3 | ≥0.05 | 3 | ≥0.05 | ||||||
4 | FLT1 | 25 | <0.01 | SUSD2 | 36 | <0.01 | ATP11A | 1 | 0.29 |
7 | 0.01–0.05 | 6 | 0.01–0.05 | ||||||
3 | ≥0.05 | 3 | ≥0.05 | ||||||
5 | SUSD2 | 36 | SHD | 10 | < 0.01 | BUD13 | 1 | 0.11 | |
6 | 1 | 0.01–0.05 | |||||||
3 | 2 | ≥0.05 | |||||||
6 | BUD13 | 1 | 0.11 | RIPK3 | 17 | <0.01 | RIPK3 | 17 | <0.01 |
2 | 0.01–0.05 | 2 | 0.01–0.05 | ||||||
2 | ≥0.05 | 2 | ≥0.05 | ||||||
7 | RIPK3 | 17 | <0.01 | ADAMTS4 | 23 | <0.01 | FLT1 | 25 | <0.01 |
2 | 0.01–0.05 | 4 | 0.01–0.05 | 7 | 0.01–0.05 | ||||
2 | ≥0.05 | 3 | ≥0.05 | 3 | ≥0.05 | ||||
8 | C10ORF107 | 1 | 0.13 | CECR1 | 8 | <0.01 | MAP3K12 | 14 | <0.01 |
0.01–0.05 | 3 | 0.01–0.05 | |||||||
4 | ≥0.05 | ≥0.05 | |||||||
9 | ADAMTS4 | 33 | <0.01 | GOLGA1 | 1 | <0.01 | ADAMTS4 | 33 | <0.01 |
4 | 0.01–0.05 | 1 | 0.01–0.05 | 4 | 0.01–0.05 | ||||
3 | ≥0.05 | 1 | ≥0.05 | 3 | ≥0.05 | ||||
10 | MAP3K12 | 14 | <0.01 | C14orf108 | 16 | <0.01 | C10ORF107 | 1 | 0.13 |
3 | 0.01–0.05 | 1 | 0.01–0.05 | ||||||
2 | ≥0.05 |
Cross-validation error and AUC value for the three methods
Item | Model | Statistic | Empirical Bayes method | Weighted empirical Bayes method | Joint covariance model
---|---|---|---|---|---
Cross-validation error | Gene + environment | Mean | 0.26 | 0.24 | 0.24
 | | SE | 0.0020 | 0.0011 | 0.0012
AUC value | Gene + environment | Mean | 0.76 | 0.80 | 0.78
 | | SE | 0.0102 | 0.0015 | 0.0148
AUC value | Gene | Mean | 0.60 | 0.64 | 0.62
 | | SE | 0.0191 | 0.0183 | 0.0191
Comparison of AUC value for the empirical Bayes and other classifiers
Item | Model | Statistic | Empirical Bayes model | Random forest classifier | Neural network 1 | Neural network 2
---|---|---|---|---|---|---
AUC value | Gene + environment | Mean | 0.76 | 0.67 | 0.68 | 0.70
 | | SE | 0.0102 | – | – | –
Prediction rule for two classifiers based on one replicate
Feature | Empirical Bayes classifier | Random forest classifier | ||||
---|---|---|---|---|---|---|
Genes | #SNP | MAF | Genes | #SNP | MAF | |
1 | Age | Age | ||||
2 | Smoke | Smoke | ||||
3 | GOLGA1 | 1 | <0.01 | OR1L6 | <0.01 | |
1 | 0.01–0.05 | 3 | 0.01–0.05 | |||
1 | ≥0.05 | 1 | ≥0.05 | |||
4 | FLT1 | 25 | <0.01 | VTI1B | 9 | <0.01 |
7 | 0.01–0.05 | 1 | 0.01–0.05 | |||
3 | ≥0.05 | 1 | ≥0.05 | |||
5 | NFKBIA | 6 | <0.01 | DENND1A | 19 | <0.01 |
0.01–0.05 | 3 | 0.01–0.05 | ||||
2 | ≥0.05 | 4 | ≥0.05 | |||
6 | DGKZ | 17 | <0.01 | C9ORF66 | 4 | <0.01 |
4 | 0.01–0.05 | 3 | 0.01–0.05 | |||
1 | ≥0.05 | 4 | ≥0.05 | |||
7 | SMTN | 23 | <0.01 | CECR1 | 8 | <0.01 |
4 | 0.01–0.05 | 0.01–0.05 | ||||
2 | ≥0.05 | 4 | ≥0.05 | |||
8 | PAK7 | 1 | 0.30 | MAP3K12 | 14 | <0.01 |
3 | 0.01–0.05 | |||||
≥0.05 | ||||||
9 | ADAM15 | 22 | <0.01 | SLC20A2 | 24 | <0.01 |
5 | 0.01–0.05 | 4 | 0.01–0.05 | |||
3 | ≥0.05 | 1 | ≥0.05 | |||
10 | ADAMTS4 | 33 | <0.01 | ALK | 9 | <0.01 |
4 | 0.01–0.05 | 1 | 0.01–0.05 | |||
3 | ≥0.05 | 6 | ≥0.05 |
Cross-validation error and AUC value for the empirical Bayes and random forest methods based on one replicate
Item | Model | Statistic | Empirical Bayes method | Random forest method
---|---|---|---|---
Cross-validation error | Gene + environment | Mean | 0.26 | 0.23
 | | SE | 0.009 | –
AUC value | Gene + environment | Mean | 0.72 | 0.66
 | | SE | 0.058 | –
Conclusions
It is well known that developing a good disease risk prediction model based on genome-wide association data is a difficult task; the number of predictors can be orders of magnitude larger than the number of genotyped samples. This is certainly the case in the GAW17 mini-exome data set, which contains information on 24,487 SNPs for only 697 samples. In this paper, we have exploited the favorable properties of the empirical Bayes prediction model that Efron [4] developed in a large-scale microarray context to build a prediction model for these data. An interesting feature of the GAW17 data is that they provide annotation information for each SNP in the form of a synonymous/nonsynonymous indicator. Because only nonsynonymous SNPs alter the protein product, we expect them, rather than synonymous SNPs, to be more likely directly involved in disease pathogenesis. We propose two ways (the weighted empirical Bayes model and the joint covariance model) to properly incorporate this annotation information into the prediction model. The weighted empirical Bayes model provides the best performance, with a relatively small cross-validation error and the largest AUC value. We also compare the three empirical Bayes classifiers with two other popular classifiers (random forest and neural network); the empirical Bayes classifiers have superior prediction performance in terms of AUC value. Based on this analysis, we think that Efron’s empirical Bayes risk prediction model, extended in the manner described here, is a useful and powerful tool for disease risk prediction in genome-wide association studies.
Declarations
Acknowledgments
We thank the reviewers and editors for their insightful and constructive comments on our manuscript. We also thank the Yale University Biomedical High Performance Computing Center and the National Institutes of Health (NIH), which funded the instrumentation through grant RR19895. This project is supported in part by NIH grants R01 GM59507 and T15 LM07056 and by a fellowship award from the China Scholarship Council. The Genetic Analysis Workshop 17 is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
Authors’ Affiliations
References
- Zhong H, Prentice RL: Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics. 2008, 9: 621-634. doi:10.1093/biostatistics/kxn001.
- Tibshirani R: Regression shrinkage and selection via the Lasso. J R Stat Soc B. 1996, 58: 267-288.
- Robert C: The Bayesian Choice. 2nd edition. 2001, New York, Springer Texts in Statistics.
- Efron B: Empirical Bayes estimates for large-scale prediction problems. J Am Stat Assoc. 2009, 104: 1015-1028. doi:10.1198/jasa.2009.tm08523.
- Madsen BE, Browning SR: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009, 5: e1000384. doi:10.1371/journal.pgen.1000384.
- Almasy L, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2.
- Breiman L: Random forests. Mach Learn. 2001, 45: 5-32. doi:10.1023/A:1010933404324.
- Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition. 2009, New York, Springer Series in Statistics.
- Diaz-Uriarte R, Alvarez de Andres S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006, 7: 3. doi:10.1186/1471-2105-7-3.
- Goldstein BA, Hubbard AE, Cutler A, Barcellos LF: An application of random forests to a genome-wide association data set: methodological considerations and new findings. BMC Genet. 2010, 11: 49.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.