Data
The first data set of GAW17 consists of a collection of 697 unrelated individuals from the 1000 Genomes Project. There are 200 replicates of simulated trait information and a number of nongenetic covariates such as age, sex, and smoking status. SNP genotypes were obtained from the sequence alignment files provided by the 1000 Genomes Project for their pilot3 study [11]. Included are 24,487 autosomal SNPs from 3,205 genes.
Risk prediction models
To assess the effect of rare variants on global disease risk prediction, we consider prediction models built using an SVM algorithm. The SVM is a popular classifier in machine learning and delivers state-of-the-art performance in a wide variety of biological applications [5]. In essence, the SVM is a supervised learning method that constructs a linear boundary in a transformed version of the feature space (here, SNP genotypes), where the transformation is defined by a kernel function; the resulting boundary can be nonlinear in the original feature space and is chosen to achieve maximum separation between two classes of subjects (case group vs. control group). Unlike traditional regression-based methods, the SVM is particularly useful in classifying high-dimensional data because it accommodates a large number of input features, such as SNPs or genes. We include in the prediction model those genetic variants with p-values less than a prespecified threshold from association analysis, with adjustment for covariates. Here, rare variants are defined as SNPs with minor allele frequency (MAF) less than 5% [12, 13].
The association between disease and common SNPs (MAF ≥ 5%) is evaluated using Fisher’s exact test by comparing allele counts between case subjects and control subjects. SNPs with p-values less than a prespecified threshold (e.g., 1.0 × 10⁻³) are used for disease risk assessment in the next step. For the analysis of rare variants (MAF < 5%), SNPs are first collapsed by the presence or absence of minor alleles within each gene in each individual [14–17]. For each gene, we consider two sets of rare SNPs: the set of all rare variants and the set of all nonsynonymous rare variants. The collapsing approach is applied to each of the two sets. For each set of variants, the disease status is modeled in a logistic regression framework as a function of the presence or absence of a rare allele in the SNP set. Genes reaching a predefined statistical threshold are included in the risk prediction model. For a gene for which both rare variant sets reach the threshold, the set with the smaller Akaike information criterion (AIC) is selected to model the effect of rare variants in the gene. The p-value threshold used to select variants ranged from 1.0 × 10⁻⁵ to 0.01.
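The allele-count comparison for common SNPs can be illustrated with a minimal sketch. It assumes genotypes are stored as minor-allele counts (0, 1, or 2) per subject and uses a two-sided test that sums the probabilities of all tables no more likely than the observed one. The function names and the toy genotypes are hypothetical and are not taken from the GAW17 data or from the original analysis code.

```python
from math import comb


def allele_counts(genotypes):
    """Return (minor, major) allele counts from genotypes coded 0/1/2."""
    minor = sum(genotypes)
    return minor, 2 * len(genotypes) - minor


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed table;
    # the small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


# Toy genotypes at one common SNP (illustrative values only).
cases = [2, 1, 1, 0, 1, 2]
controls = [0, 0, 1, 0, 0, 0]
a, b = allele_counts(cases)      # minor and major allele counts in cases
c, d = allele_counts(controls)   # minor and major allele counts in controls
p_value = fisher_exact_two_sided(a, b, c, d)
selected = p_value < 1.0e-3      # example threshold quoted in the text
```

With real data, the same test would be run once per common SNP, and the SNPs passing the threshold would be carried forward to the prediction model.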
The SVM training algorithm is applied to these variants and to the covariates Age, Sex, and Smoking status. The genotype data for common SNPs are coded 0, 1, or 2, reflecting the number of minor alleles. Rare variants are coded 1 or 0, corresponding to the presence or absence, respectively, of minor alleles within each gene. Prediction models are built to discriminate between case subjects and control subjects. The risk prediction model is built using the SVM algorithm in the training data set, and the prediction error of the model is assessed in the validation data sets.
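The coding scheme for rare variants can be sketched as follows. The example assumes genotypes are held in a dictionary keyed by SNP, with one minor-allele count per subject, and that a SNP-to-gene map is available. The SNP and gene identifiers are made up for illustration, and MAF is computed from the toy sample itself.

```python
def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as minor-allele counts (0, 1, or 2)."""
    return sum(genotypes) / (2 * len(genotypes))


def collapse_rare_variants(genotype_matrix, snp_gene, maf_cutoff=0.05):
    """Collapse rare SNPs to one 0/1 indicator per gene and subject.

    genotype_matrix: dict mapping SNP id -> list of genotypes, one per subject.
    snp_gene: dict mapping SNP id -> gene id.
    Returns a dict mapping gene id -> list of 0/1 indicators, where 1 means the
    subject carries at least one minor allele at any rare SNP in that gene.
    """
    n_subjects = len(next(iter(genotype_matrix.values())))
    collapsed = {}
    for snp, genotypes in genotype_matrix.items():
        if minor_allele_frequency(genotypes) >= maf_cutoff:
            continue  # common SNP: keeps its own 0/1/2 coding
        indicator = collapsed.setdefault(snp_gene[snp], [0] * n_subjects)
        for i, g in enumerate(genotypes):
            if g > 0:
                indicator[i] = 1
    return collapsed


# Toy data for 20 subjects (hypothetical identifiers).
genotype_matrix = {
    "snp_a": [1] + [0] * 19,        # MAF 0.025, rare
    "snp_b": [0, 0, 1] + [0] * 17,  # MAF 0.025, rare
    "snp_c": [1, 2, 0, 1] * 5,      # MAF 0.5, common
}
snp_gene = {"snp_a": "GENE1", "snp_b": "GENE1", "snp_c": "GENE2"}
collapsed = collapse_rare_variants(genotype_matrix, snp_gene)
```

Restricting the input to nonsynonymous rare SNPs before calling the function would give the second variant set described earlier.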
To evaluate the predictive value of rare variants, we conducted two experimental studies. In the first experiment, the set of SNPs included in the risk prediction model was selected from the first trait replicate, and the prediction model was built on the same data set. Prediction error was assessed on the remaining 199 trait replicates. In the second experiment, for each trait replicate we randomly divided the data into a training set and a validation set. SNP selection and model building were performed on the training set, and prediction error was estimated from the validation set. We repeated this procedure in each of the 200 trait data sets. In this second experiment, the size of the training set was varied from 300 to 600 in increments of 100.
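The random division used in the second experiment can be sketched as below. The text does not describe how the randomization was implemented, so the seeding scheme and the function name are assumptions made only to keep the example reproducible.

```python
import random


def split_indices(n_subjects, n_train, seed):
    """Randomly split subject indices into a training and a validation set."""
    rng = random.Random(seed)  # seeded only for reproducibility of the sketch
    indices = list(range(n_subjects))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


N_SUBJECTS = 697  # number of unrelated individuals in the data set

# Training-set sizes 300, 400, 500, 600; the remaining subjects form the validation set.
splits = {}
for n_train in range(300, 601, 100):
    splits[n_train] = split_indices(N_SUBJECTS, n_train, seed=n_train)
```

In the full procedure, this split would be redrawn for each of the 200 trait replicates, with SNP selection and model fitting restricted to the training indices.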
We used the R package e1071 to build the risk prediction models. This package is an interface to the LIBSVM implementation of the SVM algorithm (current version 3.0, http://www.csie.ntu.edu.tw/~cjlin/libsvm). We trained soft-margin linear SVM classifiers [18] on the training data sets with the SVM penalty parameter set to C = 1, the default value in the R package.
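To make the role of the penalty parameter C concrete, the following is a minimal sketch of a soft-margin linear SVM fitted by full-batch subgradient descent on the primal objective 0.5·||w||² + C·Σ max(0, 1 − y(w·x + b)). This is not the method used by LIBSVM or e1071, which solve the dual problem with a dedicated solver; the learning rate, epoch count, function names, and toy data are all assumptions chosen for illustration.

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Fit w, b by subgradient descent on the soft-margin primal objective.

    X: list of feature vectors; y: labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5 * ||w||^2 regularization term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge loss is active for this subject
                for j, xj in enumerate(xi):
                    grad_w[j] -= C * yi * xj
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decision_value(w, b, x):
    """Signed score; positive values are classified as cases."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Toy, linearly separable data: -1 = control, +1 = case (illustrative only).
X = [[-2.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
y = [-1, -1, 1, 1]
w, b = train_linear_svm(X, y, C=1.0)
predictions = [1 if decision_value(w, b, x) > 0 else -1 for x in X]
```

Larger values of C penalize margin violations more heavily, while smaller values favor a wider margin. The decision values, rather than the hard labels, are what a ROC analysis would rank.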
To evaluate the performance of risk prediction models, we applied receiver operating characteristic (ROC) curve analysis to the validation data sets. The ROC curve is a widely used tool to evaluate the discrimination ability of a binary classifier. In ROC analysis, the discriminatory power of the prediction model is usually measured as the area under the ROC curve (AUC). The AUC is the probability that a randomly chosen positive sample will have a higher predicted risk than a randomly chosen negative sample. We compared the AUC values of prediction models combining both common and rare variants with the AUC values of models incorporating only common variants.
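The probabilistic interpretation of the AUC translates directly into a pairwise comparison of predicted risks, sketched below. Counting ties as one half is a common convention and is an assumption here, as are the function name and the toy scores.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy predicted risks for validation subjects (illustrative values only).
case_scores = [0.9, 0.8, 0.4]
control_scores = [0.5, 0.3]
auc_value = auc(case_scores, control_scores)  # 5 of the 6 pairs are ranked correctly
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, so the comparison of interest is the difference in AUC between models with and without the rare-variant features.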