Volume 3 Supplement 7
Genetic Analysis Workshop 16
Predictive modeling in casecontrol singlenucleotide polymorphism studies in the presence of population stratification: a case study using Genetic Analysis Workshop 16 Problem 1 dataset
 Niloofar Arshadi^{1},
 Billy Chang^{1} and
 Rafal Kustra^{1}Email author
DOI: 10.1186/175365613S7S60
© Arshadi et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Abstract
In this paper, we apply the gradientboosting machine predictive model to the rheumatoid arthritis data for predicting the casecontrol status. QQplot suggests severe population stratification. In univariate genomewide association studies, a correction factor for ethnicity confounding can be derived. Here we propose a novel strategy to deal with population stratification in the context of multivariate predictive modeling. We address the problem by clustering the subjects on the axes of genetic variations, and building a predictive model separately in each cluster. This allows us to control ethnicity without explicitly including it in the model, which could marginalize the genetic signal we are trying to discover. Clustering not only leads to more similar ethnicity groups but also, as our results show, increases the accuracy of our model when compared to the nonclustered approach. The highest accuracy is achieved with the model adjusted for population stratification, when the genetic axes of variation are included among the set of predictors, although this may be misleading given the confounding effects.
Background
Predictive modeling of casecontrol singlenucleotide polymorphism (SNP) data, using modern statistical and machine learning methods is a viable alternative to classical statistical genetic approaches that utilize onemarkeratatime hypothesis testing. With emphasis on prediction, a well executed modeling strategy can lead to a generalizable prognostic model that may form a basis for a genetic test. Some machine learning models, such as random forests [1] and gradient boosting machine (GBM) [2], can also rank covariates (here, SNPs) in terms of their predictive power. This presents a useful alternative to univariate testing because the promising SNPs are identified in the context of a multivariate predictive model that can discover multiloci associations. Further, an unbiased estimate of predictive performance of such models also presents a useful summary of diagnostic value as opposed to a sea of pvalues one obtains from association studies.
In this paper, we describe results when a dataset with potential ethnicity complexity is subjected to predictive analysis using GBM predictive model, and compare the obtained results to analysis based on original data with ethnicity complexity.
Methods
We have set aside 33% of the data as a test set (285 cases and 400 controls) for the final assessment of predictive performance. The remainder of the data (1,370 subjects), training data, is split into five folds and is used to develop and optimize the GBM genetic predictive model for lifetime risk of developing RA. More precisely, during the parameter optimization process, the model is trained using four folds  called CVtraining folds  and is evaluated on the fifth fold  validation set.
Data preprocessing
We first preprocessed the data for outlier detection, which resulted in seven controls (D0009324, D0009459, D0005318, D0011466, D0012257, D0006047, and D0012446) marked as outliers. We followed the general description of Plenge et al. [3], i.e., a subjecttosubject covariance matrix of a set of 120,000 minimally correlated SNPs is computed, and the four eigenvectors corresponding to the top four eigenvalues of the covariance matrix are extracted (a screeplot of eigenvalues was used to decide on the number of selected eigenvalues). Those subjects whose respective entry in at least one eigenvector differed from the mean by more than six standard deviations were removed from any downstream analysis, and the eigenvectors were recomputed.
Then the SNPs were preprocessed as follows: 1) SNPs with less than 1% minor allele frequency were removed; 2) SNPs with less than 5% missing values were replaced with major alleles, or were excluded from the analysis if missing values were more than 5%; 3) tag SNPs were selected with linkage disequilibrium (LD) greater than 0.8 as measured by R^{2} in PLINK 1.03 using the following option: indeppairwise 50 10 0.80; and 4) because we are particularly interested in discovering novel SNPs predicting the RA risk, we removed SNPs in 6p21 and 8p23 regions. However, we also present some results with these two regions included.
At the end of preprocessing step, we were left with 2,055 subjects and 351,660 SNPs per subject. This data was then divided into training and test sets, and crossvalidation (CV) folds as described above.
SNP selection and crossvalidation
The aim of the SNP selection process is to find a subset of SNPs to include in the predictive model to maximize its predictive accuracy. The selected SNPs may be investigated further by bioinformatics methods or lowthroughput assays.
The SNP subsets we considered were based on univariate pvalue thresholds. We used pvalue correction procedure described previously [4] and PLINK software 1.03 [5] to compute pvalues when the four eigenvectors of the covariance matrix (as explained in the data preprocessing section) are included in the logistic regression model for a given SNP. Then, for various pvalue thresholds a GBM model is built using the SNPs below the threshold. This is done repeatedly within a CV framework as follows. The four folds of the training set are used as a CVtraining set. The pvalues for the subjects in the four folds are recomputed at each iteration, and SNPs with pvalues below the threshold are selected. Then, a clustered GBM model is built, and the value of area under curve (AUC) of the accuracy of predictive model is computed on the fifth fold  validation set. The process is repeated five times, and we report the average AUC.
Model correction for ancestry confounding
The population stratification present in the dataset, which seems to cause significant pvalue inflation, can also be detrimental to building and validating a multivariate predictive model. To deal with that, we build separate predictive models on subsets of the data that are derived by clustering the subjects using the axis of genetic variation. Hence, each cluster will have a reduced overall genetic variability, which is usually assumed to be reducing the underlying ethnicity differences. Another way to see our procedure is to think of building predictive models conditional on certain ethnicity, P(YX, T = t). Here Y is an outcome variable, X, the genetic (SNP) vector, and T is a latent ethnicity value, which is assumed to be the same for all subjects in a cluster. Of course we only have crude approximations to underlying ethnicities so our conditioning on T = t is only an approximation. A more usual way would be to build a model that is adjusted for ethnicities by incorporating variable T directly into the model. However, as we demonstrate in the later section (with an "adjusted" GBM model), the axis of variation variables are strong predictors and can easily overshadow true genetic signal. Further, the predictive performance of such a model cannot be properly estimated using a dataset with population stratification present.
Deriving axes of genetic variation
Here we present our approach to deriving the axis of variation which is equivalent to that presented by Price et al. [4], but allows us to calculate the eigenvectors of a test case based on the axis derived on the training set. This is necessary so that we can obtain unbiased estimates of prediction error.
We cluster the training subjects by performing partitioning around medoids (PAM) on the first four column vectors of U, which will provide us three medoids (m_{i}, i = 1, 2, 3) representing the centroids of the three clusters in the fourdimensional space.
where . is the Euclidean norm. The above is equivalent to the EigenStrat procedure [4], which obtains the same matrix U as in Eq. (1). Using singular value decomposition allows us to project a test set observation onto the axes of genetic variation of the training data. Therefore, we can perform the clustering step of model building using only the training (or CVtraining) set, and to subsequently predict the case status on an unseen data in the test or validation set.
Hence, all steps of our model building (SNP selection, clustering, and model training) are performed only on the training set. This allows us to obtain unbiased estimated of the predictive performance of our modeling strategy.
Gradient boosting machine (GBM)
To model P_{ t }(YX, T = t), we use an ensemble of predictive models (GBM) that is based on the boosting paradigm [2]. The basic model unit in GBM is a small decision tree that allows us to model genetic interaction effects. Many such trees are combined in what is effectively a regularized logistic model with trees as individual covariates.
In our experiments, we utilize the 'gbm' package in R, where we set the interaction depth of each tree to 5, shrinkage factor to 0.01, bag fraction to 0.8, and the number of minimum subjects in each node to 10. These parameters values were picked empirically by examining the result of a small experiment involving various parameter settings, but were not formally optimized. The number of trees in a model should depend on a sample size and was simply set equal to the number of subjects in the training set, which varies among clusters. We also present the results with GBM applied to the whole dataset (unclustered) and applied to the whole dataset augmented with axis of variation variables (adjusted GBM) using the same parameter settings.
Results
To decide on the number of clusters, we use the average Silhouette value [6] when clustering the training set into two, three, and four groups. Our results on the training set show that the average Silhouette value is higher for three and four groups, but because since there are as few as 93 subjects in a fourgroup split, we set the number of clusters to three.
In order to explore the topranked SNPs selected by GBM, we use the relative influence measure in the GBM package. There are 19,701 SNPs used in the GBM model with pvalue = 0.05 (we chose this traditional threshold because around pvalue = 0.05, the performance of the GBM classifiers seems to stabilize, Figure 2). Using the four eigenvectors V_{1}, V_{2}, V_{3}, and V_{4} (as explained in the Methods section), we cluster the whole dataset into three clusters, C_{1}, C_{2}, C_{3}, with 1370, 475, and 210 subjects, respectively. We compare the top 20 SNPs selected by the adjusted GBM, as well as top 20 SNPs chosen by GBM models in each of the three clusters in two scenarios: 1) 6p21 and 8p23 regions are excluded; 2) all SNPs that satisfy our quality control constraints and LD threshold (as explained in the Data preprocessing section) included.
In the first scenario, as expected, V_{1}, V_{2}, and V_{4} appear at the top of the list for the adjusted GBM (the pvalues for these eigenvectors are less than 10^{24}), and there are five SNPs (including rs6596147 on chromosome 5) that appear in both C_{1} cluster and adjusted GBM model. Also, rs2476601, one of the SNPs in PTPN22 on chromosome 1 [3], appears in the top 20 list for the adjusted GBM and for the C_{2} cluster.
In the second scenario that includes SNPs from 6p21 and 8p23, we notice 11 common SNPs between C_{1} cluster and adjusted GBM, but 7 of these are in 6p21 region. That suggests that SNPs on the 6p21 region are strongly associated with RA and are chosen by both fulldata adjusted GBM model and GBM from the largest cluster C_{1}, perhaps overshadowing other SNPs with more moderate signal strengths. However, the top 20 list for C_{3} shows only two SNPs in 6p21 (one of which also appears in the adjusted and C_{1} and C_{2} models), but has 16 SNPs in common with the top 20 list for the C_{3} model that excludes SNPs from both 6p21 and 8p23 regions. That suggests that while the RA susceptibility of subjects in cluster C_{1} (and to a lesser extent in C_{2}, which shares 5 SNPs in 6p21) is influenced by 6p21, genetic risk profile of subjects in C_{3} is much less dependent on that region. We also notice rs6596147 in 5q31.1 region appears again in the selected SNPs by adjusted and C_{1} GBM model, a finding that should be further explored given the mixed results reported in the literature on the association of 5q31 and the RA disease [7, 8]. Also, rs2476601 in PTPN22 falls from the 8^{th} and 10^{th} to the 35^{th} and 169^{th} positions in the adjusted GBM and C_{2}, respectively, suggesting that including SNPs on 6p21 may shadow signals of SNPs in other regions.
Comparing the top ranked SNPs from our models to their univariate pvalues, we notice SNPs such as rs2395117 in 6p21, which has the smallest pvalue = 1.6 × 10^{56} (that is in top 20 list of all models in the second scenario), and SNPs such as rs2691269 on chromosome 19 with pvalue = 8.7 × 10^{3} (that is ranked 3,905 SNPs in the univariate pvalue analysis), implying that topranked SNPs in GBM do not necessarily have the lowest pvalues. It should be noted that there was no SNP in the topranked SNP list of any GBM model from the 8p23 region.
Discussion
The main purpose of our fivefold CV was to select the pvalue threshold that maximizes the accuracy of the prediction model on our training set. However, as Figure 2 shows, the GBM model is relatively robust with respect to the number of irrelevant SNPs included in the model. This is encouraging news, suggesting that such an expensive CV exercise may not be necessary with the GBM model.
We notice that the adjusted GBM outperforms all other predictors: this was expected because the four eigenvectors are highly correlated to the case status (their corresponding pvalues were less than 10^{8}). However, such a model may not be able to discover novel SNPs with moderate signals because they may be overshadowed by strong eigenvector covariates. Also, its prediction error is likely underestimated: if deployed in the population with much different ethnicity distribution, such a model would likely fare much worse. This is because much of its predictive performance gains are likely due to predicting underlying ethnicity rather than disease status, which appears strongly confounded in this dataset.
Clustered GBM exhibits higher prediction accuracy compared to the unclustered GBM. Following the paradigm of "divide and conquer", that could be explained by the fact that clustering can make the classification problem easier by dividing the training data into more homogenous subject groups.
We also observe that we achieve a higher performance on the test set than the CV results. This is likely due to the 25% increase in the sample size for the model built for the test set prediction as compared with models built during CV.
Conclusion
In this paper, we developed a prediction model of RA risk using a GBM predictive framework. To avoid confounding by ethnicity, the data were first clustered on the axes of genetic variations. An ensemble of three GBM classifiers were then trained on the three resulting clusters, which are assumed to be ethnically more homogenous. For predicting the casecontrol status of a subject in the test set, we first determined the cluster that the subject belonged to by projecting on the axis of genetic variation, and then applied the corresponding GBM model.
We are interested in further investigating the set of SNPs ranked higher by the clustered GBM model, and in comparing them with those reported in the literature. It would also be interesting to validate our model on another RA dataset with a different ethnicity distribution to confirm that we have, at least partially, avoided confounding when developing our prediction model.
List of abbreviations used
 AUC:

Area under curve
 CV:

Crossvalidation
 GBM:

Gradient boosting machine
 LD:

Linkage disequilibrium
 RA:

Rheumatoid arthritis
 SNP:

Singlenucleotide polymorphism.
Declarations
Acknowledgements
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. The authors thank Mathieu Lemire for sharing his code for covariance matrix computation and outlier detection.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/3?issue=S7.
Authors’ Affiliations
References
 Breiman L: Random forests. Machine Learning. 2001, 45: 532. 10.1023/A:1010933404324.View ArticleGoogle Scholar
 Friedman H: Greedy function approximation: a gradient boosting machine. Ann Stat. 2001, 29: 11891232. 10.1214/aos/1013203451.View ArticleGoogle Scholar
 Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, Ding B, Liew A, Khalili H, Chandrasekaran A, Davies LR, Li W, Tan AK, Bonnard C, Ong RT, Thalamuthu A, Pettersson S, Liu C, Tian C, Chen WV, Carulli JP, Beckman EM, Altshuler D, Alfredsson L, Criswell LA, Amos CI, Seldin MF, Kastner DL, Klareskog L, Gregersen PK: TRAF1C5 as a risk locus for rheumatoid arthritisa genomewide study. New Engl J Med. 2007, 357: 11991209. 10.1056/NEJMoa073491.PubMed CentralView ArticlePubMedGoogle Scholar
 Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genomewide association studies. Nat Genet. 2006, 38: 904909. 10.1038/ng1847.View ArticlePubMedGoogle Scholar
 Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for wholegenome association and populationbased linkage analyses. Am J Hum Genet. 2007, 81: 559575. 10.1086/519795.PubMed CentralView ArticlePubMedGoogle Scholar
 Rousseeuw P: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987, 20: 5365. 10.1016/03770427(87)901257.View ArticleGoogle Scholar
 Wellcome Trust Case Control Consortium: Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447: 661678. 10.1038/nature05911.View ArticleGoogle Scholar
 Marinou I, Till SH, Moore DJ, Wilson AG: Lack of association or interactions between the IL4, IL4Rα and IL13 genes, and rheumatoid arthritis. Arthritis Res Ther. 2008, 10: R8010.1186/ar2454.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.