Material
The first replicate of the genome-wide SNP data set and, as phenotype data, RA affection status, gender, lifetime smoking, age at ascertainment, as well as DR alleles from father and mother, were utilized. To mimic a case-control study, we randomly selected one affected sibling per affected pair for the cases and one unaffected sibling per control family for the controls, thus obtaining 1500 cases and 2000 unrelated controls. For the two-stage approach, we randomly split the data into two sets with 750 cases and 1000 controls each.
First stage of analysis
The first stage of our approach was designed to screen for variables most likely to differentiate between cases and controls. Using the phenotype and genotype information for first of the two data sets, RFs with classification trees (CART) were grown to pre-select the most important variables [3].
The importance of the variables was estimated as the permutation importance in a RF. To this end, the number of correct classifications of the out-of-bag (OOB) cases is calculated in every single tree grown in the forest. Then, the values of the specific variable are randomly permuted in the OOB individuals, and these are then re-classified using these new values. Finally, the number of correct classifications with the permuted values was compared with the number of correct classifications in the original data. The difference between these fractions, averaged over all trees in the RF, gives the permutation importance for the respective variable.
To grow RFs and estimate permutation importance values, we used the software R [4] with the randomForest package by Liaw and Wiener. Because of computational limitations, we were not able to grow one RF containing all variables with estimating importance via the permutation procedure. Instead, 155 RFs were grown based on subsets of 5000 variables, randomly selected without replacement. For every RF, 500 trees were grown with a random selection of 20 variables per node. On average, each variable was contained in a RF 84 times (min = 60, max = 106). The average importance scores across all RFs were used as the global importance of a variable.
Díaz-Uriarte et al. [5] proposed a backward elimination heuristic for RFs to obtain a small set of predictive variables. They calculated importance values for all variables once only. To then select variables, they iteratively fitted RFs; for each iteration, they discarded 20% of the least important variables of the previous variable set and calculated OOB error fractions regarding the remaining variables. They finally selected the set of variables which yielded the lowest OOB error across all iterations. With a similar idea, we applied the following forward-elimination approach:
1. Compute global importance score for every variable as described above.
2. Sort variables according to their score.
3. Grow a RF with the most important variable as single predictor.
4. Compute OOB error for this RF.
5. Add next important variable to the set of predictors and grow a RF.
6. Repeat steps 4 and 5.
On the basis of the resulting OOB prediction error estimates, we chose the smallest set of variables leading to a small prediction error (see below for more information).
To avoid multicollinearity of the variables in the second stage, we applied the haplotype tagging approach by Chapman et al. [6] using the mean estimated coefficient of determination across haplotypes R2 ≥ 0.5 as criterion for SNP selection.
Second stage of analysis
The aim of the second step was to obtain valid parameter estimates for the selected variables in a logReg. To reduce the amount of overfitting because of data-dependent variable selection in the first stage, we used an independent data set. Lacking a specific biological hypothesis, an additive genetic effect for each SNP was assumed as recommended [7], and the logReg included all variables that were selected in the first stage. To correct for the multiple testing of the selected variables, nominal p-values were adjusted according to the Bonferroni-Holm procedure [8]. Because model parameters were estimated in this stage, stringent external validation of the model is still required. In our study, results are compared with the simulated models.