Data
The phenotype (NAPHENO) and candidate gene (CANDGENE.DAT) data were provided by the North American Rheumatoid Arthritis Consortium [4] and consist of 839 RA cases from sib-pair families and 855 unrelated controls. SNPs with missing data of 15% or greater were excluded from the analysis, resulting in a total of 17 SNPs from 13 genes (PTPN22, CTLA4, HAVCR1, IBD5, SLC22A4, IL3, IL4, SUMO4, MAP3K71P2, DLG5, CARD15, RUNX1, and MIF) being considered. The missing genotypes were estimated using the proximity calculated by RFs on the training subjects [3].
In this study, we generated two replicate data sets in which the subjects were independent within samples and were dependent across samples. This approach ignores dependencies between the data sets that in general lead to inflation of type I error in the replication set. However, we feel that this approach is justified to simplify the confirmatory analyses using the available data. The data set was split into two replication data sets by randomly selecting a case from each affected sib pair for one of the data sets and the remaining case contributing to the second data set, with the controls almost evenly divided into the two data sets. Data set 1 includes 398 cases and 427 controls, while data set 2 includes 387 cases and 428 controls. It should be noted that due to the relatedness between the replication and the analysis data sets, the confirmation from the second data set should be interpreted with caution.
The data from the genome-wide linkage panel (a total of 5742 SNPs, excluding chromosome Y SNPs) were available for 1998 individuals from families with RA patient (NAILL01–NAILL25). For families with two or more siblings, one was randomly selected from each family for data set 1, and the second subject was then randomly selected from the remaining of samples for data set 2. The singletons were randomly divided into the two data sets. Each of the data sets consists of 740 unrelated subjects. The analysis of the replication data set should be considered exploratory.
The RF method is not particularly efficient at imputing missing data for large number of SNPs. Therefore, we applied an extension of the expectation-maximization (EM) algorithm implemented in HelixTree (Golden Helix Inc., Bozeman, MT) to impute the missing genotypes for the genome-wide linkage data set. In a simulation study, this method achieved outstanding performance of imputation accuracy above 95% with missing rate ranging from 1% to 10% (Sun et al., unpublished data).
Association analysis
The association between each candidate gene SNP and RA status was tested using all samples. Two methods were used for the candidate gene data set to adjust for the degree of relatedness among the affected siblings, a generalized estimating equations (GEE) method with likelihood ratio statistics (SAS 8.2, SAS Institute Inc., Cary, NC, USA, 1999) and Risch and Teng's statistic [5]. Results from the Risch and Teng's statistic and GEE likelihood ratio were consistent so that only results from the GEE analysis are reported. SNPs with an association with RA at p < 0.05 were also analyzed in a multiple SNP model.
For the genome-wide data set, single-SNP associations with RA were assessed using a χ2 test when all three genotypes were observed and the least frequent genotype observation was more than five. In situations in which these criteria were not met, Fisher's exact test was used to test the association. The p-values were then used to rank the significance of all SNPs.
Identification of phenotypic subgroups among RA patients
An unsupervised clustering analysis was carried out on the RA cases using the detailed phenotypic data (demographic and disease-related clinical and laboratory variables) to identify clinically distinct subgroups of RA patients. For the purpose of the analysis, the phenotypic variables were dichotomized using a threshold of the 75th percentile for AgeOnset (≥50), TenderJtCt (≥13), SwollenJtCt (≥13), JAMScore (≥52), SeverityLH (≥5), and SeverityRH (≥5). DRB1_1 and DRB1_2 marker variables were combined and dichotomized on the basis of the presence or absence of RA-associated epitopes (*01 and *04 alleles). Patient self-reported American Rheumatism Association (ARA) sub-scores and the overall ARA score were not included in the analysis. All other clinical measurements, except BMI (51% missing), were used in the analysis.
Clustering analysis was carried out using modified 'Jaccard' coefficient (proportion of non-zero pairs that are similar between subjects) as the distance measure [6]. Multidimensional scaling (MDS) of the distance matrix was used to visualize and identify sub-groups in the patient population. After identifying subgroups of RA patients characterized by clinical and laboratory phenotypes, we further investigated whether the SNPs were associated with the sub-groups. Specifically, we attempted to predict the different subgroups within RA cases using the SNP data from the candidate gene study.
RA classification using RFs
In addition to the conventional association analysis methods that were used in the study to identify RA susceptibility loci, we also used RFs [3]. The method was used for classification or prediction of cases and controls using the SNP data, and five-fold cross-validations (CVs) were used to evaluate classification accuracy. The receiver operating characteristic (ROC) curve was calculated based on the vectors of sensitivity and specificity for each of the five CVs. The values of area under curve (AUC) from the five CVs were averaged to compare the predictive ability and stability.
All of the 17 candidate gene SNPs and the 5742 SNPs from the genome-wide linkage panel were included in the RFs model. RFs were also applied using selected subsets of SNPs as predictor variables. These subsets were selected on the basis of their ranking on the importance measurement implemented in the RFs package. The SNPs that were highly ranked were selected for the RFs model.
All statistical analyses were performed using the R statistical software package (version 2.1.0) from R Project http://www.r-project.org/. The R package for Random Forests (randomForest 4.5.-15) was installed and utilized following the instructions.