Data selection
For our study we choose the GAW15 North American Rheumatoid Arthritis Consortium (NARAC) candidate gene data set (Problem 2) consisting of 20 biallelic markers (0 ≤ r2 ≤ 0.66) from at least 13 RA candidate genes residing within at least 10 different chromosomal locations; with on average one or two markers per gene. Because some cases within this data set originated from the same pedigree, independent individuals were selected at random to give 463 (out of the original 839) cases and 855 independent controls, all of Caucasian origin. Hardy-Weinberg equilibrium (HWE) was tested in the complete sample and in the controls. One rare single-nucleotide polymorphism (SNP), HugotSNP8ms2 (minor allele frequency <0.05), showed nominally significant deviations from HWE in the controls (p = 0.008) and in the full sample (p = 0.006) due to an overrepresentation of rare homozygotes, which might be the result of chance or genotyping error. This would have greatest effect on genotypic or haplotype analysis, neither of which was carried out within this study. All tests for HWE were non-significant after correction for multiple testing (20 markers).
Imputation of missing values
Since RFs cannot handle missing values [3], data imputation was performed. Missing values in the original data set were either replaced by a RF strategy choosing the median of all marker alleles in the class of the missing value (median-replaced), or by imputation according to 'pseudo'-diplotype frequencies (imputed). Those frequencies were based on haplotype frequencies in the combined case-control sample, derived by an Expectation-maximization algorithm such as implemented in UNPHASED (v.2) [6]. Imputed marker genotypes were randomly selected from a multivariate uniform distribution. We included HugotSNP8ms2 in the imputation process since we observed low frequencies for haplotypes carrying the rare allele (<0.05), which are unlikely to affect the multivariate uniform distribution of diplotypes. Nine individuals with more than 50% missing genotypes were excluded as well as rs2240340, which had more than 65.4% dropouts. Missing values for the remaining markers ranged between less than 0.01% and 16.65%. In total, five imputed data sets were generated.
Single-marker and pairwise interaction analysis using LR
Marker genotypes were recoded in terms of additive components [7] and single marker allelic effects and pairwise marker interactions investigated with LR, using the R program suite (v.2.4) [8]. LR models with large SEs (>3) were excluded. Adjustment for multiple testing of interaction terms was made by permutation tests (1000 permutations). For imputed data sets, LR estimates (based on Wald tests) were combined respectively as outlined by Rubin [9]; analyses of the full interaction model were combined using average differences in deviance between the model containing main and interaction effects, and the null model (χ2-test, 3 df).
RF analysis of single- and joint-marker importance
The predictive importance of single markers [3] and marker pairs [5] was measured by Z-scores and significance levels. Z-scores can be inferred from AvImp measures through division by their associated standard error (SE), which is determined by the AvImp measure and the number of trees grown [3]. AvImp measures increase with the number of trees and their rank becomes stable, provided the number of trees is sufficiently large, such as observed by Lunetta et al. [4]. The associated Z-scores and their significance, however, also continue to increase with increasing number of trees, even when the RF reaches stability (see Figure 1). We investigated the correlation between AvImp measures for RFs with 5000 trees and RFs with 50, 100, 200, 300, 400, 500, and 1000 trees, respectively (median-replaced data); and in a similar way, the correlation between Z-scores (Spearman rank correlation). We observed high correlations (rAvImp-5000 ≥ 0.98, rZscore-5000 ≥ 0.96) for RFs with ~500 trees and more, for both single and joint AvImp measures (see Figure 1), implying a) a stable rank among their most important markers and b) that significant markers identified by these RFs remain significant when analyzed using RF grown on 5000 trees. We therefore selected stable but small RFs (500 trees), which give similar AvImp ranks as RFs grown on 5000 trees.
Joint-importance analysis was implemented in the RF program (v.5.1) [3] based on the original Fortran code provided by A. Bureau (personal communication). Genotypes were coded as outlined for LR analysis. When trees were grown, the best split at each node was selected, which most efficiently reduced the out-of-bag error (balanced for cases and controls). Importance measures for imputed data sets were combined using AvImp estimates and pooled variances.