We used the 100 replicates of simulated data from Problem 3 provided by the Genetic Analysis Workshop 15 (GAW15). We performed analyses with knowledge of the "real" answers but screened all 9187 SNPs, distributed across the genome to mimic a 10 K SNP chip, without regard to the generating model. We used disease status for rheumatoid arthritis (RA) as the outcome and smoking, sex, and DR genotype (the simulated DRB1 genotype) as covariates.

### Subjects

To obtain biologically independent cases, for each replicate we randomly selected one affected sibling from each of 1500 nuclear families. These 1500 cases were then divided at random into a training data set of 750 affected subjects and a test data set of 750 cases. The GAW data provided 2000 unrelated unaffected individuals for use as controls, from which two independent sets of 750 controls were selected at random to serve as training and test data set controls. Thus, for each replicate we had independent training and test data sets, each consisting of 750 cases and 750 controls.
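The sampling scheme above can be sketched as follows. This is a minimal illustration: the data structures and the `seed` parameter are hypothetical, not part of the GAW15 distribution.

```python
import random

def build_datasets(families, controls, seed=0):
    """Select one affected sibling per nuclear family, then split the
    resulting cases and the unrelated controls into independent
    training and test sets of 750 subjects each."""
    rng = random.Random(seed)
    # One randomly chosen affected sibling from each of the 1500 families
    cases = [rng.choice(affected_sibs) for affected_sibs in families]
    rng.shuffle(cases)
    train_cases, test_cases = cases[:750], cases[750:1500]
    # Two disjoint random sets of 750 controls from the 2000 unaffecteds
    ctrl = rng.sample(controls, 1500)
    train_ctrl, test_ctrl = ctrl[:750], ctrl[750:]
    return (train_cases, train_ctrl), (test_cases, test_ctrl)
```

Because the two control sets are drawn without replacement from the same pool of 2000, the training and test sets share no individuals.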

### Random Forests

RFs grow a large number of classification or regression trees with no trimming or pruning. The RF method produces an importance score for each variable that quantifies the relative contribution of that variable to the prediction accuracy. We used this score to prioritize the predictor variables. The RF also produces prediction errors for the individuals, which we used for evaluation of the method.

We used Random Forests version 5 [6] to analyze the training data. We used an iterative process similar to a strategy previously proposed for gene expression analysis [7] in which, at each iteration, we built a random forest using the training data and kept the 50% of variables with the highest importance scores to build the next forest. The random forests built at successive iterations were named IT_{0}, IT_{1},..., IT_{n}, and the prediction error of the training data set was estimated for the forest built at each iteration. We call the forest with the best prediction error IT_{bp}. The variables included in IT_{bp} were then used in second-stage analysis. We compared the performance of the iterative procedure that resulted in the forest IT_{bp}, in terms of keeping the true risk variables and removing noise variables, to a simple procedure of selecting the top 50 variables ranked by importance from iteration 0, evaluated in the test data sets. Specifically, we compared the prediction error of the IT_{bp} forest, the IT_{0} forest (all variables used; no selection), and a forest built using only the top 50 SNPs from the IT_{0} forest ("IT_{top50}"). Because the iterative procedure retained an average of 53 SNPs in the final forest, we chose 50 SNPs from IT_{0} to yield a forest with approximately the same number of SNPs. We computed prediction error using the test data ("test") and using the out-of-bag observations of the training data set ("training").
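The iterative halving procedure can be sketched as below. This is a sketch under stated assumptions: it uses scikit-learn's `RandomForestClassifier` in place of the original Random Forests version 5 program, and the function name and parameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def iterative_rf(X, y, n_trees=500, seed=0):
    """At each iteration, fit a forest, record its out-of-bag (OOB)
    prediction error, and keep the top 50% of variables by importance
    for the next fit. Return the variable subset of the forest with
    the best OOB error (the analogue of IT_bp)."""
    keep = np.arange(X.shape[1])          # indices of surviving variables
    best_err, best_vars = np.inf, keep
    while len(keep) >= 2:
        rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                    random_state=seed).fit(X[:, keep], y)
        err = 1.0 - rf.oob_score_         # OOB estimate of prediction error
        if err < best_err:
            best_err, best_vars = err, keep
        order = np.argsort(rf.feature_importances_)[::-1]
        keep = keep[order[:len(keep) // 2]]   # halve the variable set
    return best_vars, best_err
```

The OOB error plays the role of the training-set prediction error used to pick IT_{bp}; the held-out test set would then be scored only once, on the selected forest.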

### Network inference

Bayesian networks (BN) are directed acyclic graphs representing the joint probability distribution of a set of variables. A network for discrete variables, e.g., Figure 1, is specified by the graph structure (nodes and arcs) and the conditional probability table (CPT) at each node (the CPT for node chr6_162 is shown). Each node is a variable, and each directed arc implies association and the direction of dependency between the two variables. The node at the origin of an arc is called the parental node, and the node the arc points to is called a child node. A child node is conditionally independent of all other nodes given its parental nodes. Thus, the joint probability of *n* variables simplifies to \(P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_{i1}, x_{i2}, \ldots, x_{ik_i})\), where \(x_{i1}, x_{i2}, \ldots, x_{ik_i}\) in the conditioning set are the parental nodes of \(x_i\) and form a subset of \(\{x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n\}\). BN models are useful for describing complex relationships among variables, as well as for making predictions for variables that are regarded as outcomes.
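The factorization can be illustrated with a toy three-node network. The structure and probabilities below are invented for illustration; they are not taken from Figure 1 or the GAW15 data.

```python
# Toy BN: A -> C <- B, all variables binary. C's parents are A and B,
# so the joint distribution factorizes as
#   P(A, B, C) = P(A) * P(B) * P(C | A, B).
p_a = {0: 0.7, 1: 0.3}
p_b = {0: 0.6, 1: 0.4}
p_c_given_ab = {           # CPT at node C, indexed by parental values (a, b)
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.4, 1: 0.6},
    (1, 1): {0: 0.2, 1: 0.8},
}

def joint(a, b, c):
    """Joint probability of one configuration via the BN factorization."""
    return p_a[a] * p_b[b] * p_c_given_ab[(a, b)][c]

# Sanity check: the joint must sum to 1 over all 2^3 configurations.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Storing three small CPTs in place of the full 2^3-entry joint table is exactly the economy the factorization buys; with many SNPs as nodes the savings become essential.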

CaMML [8], Causal Minimum Message Length (MML), is a program for generating Bayesian networks. Its general goal is to find a model that maximizes the posterior probability of the model given the data. CaMML searches over all possible structures (models) using the Metropolis algorithm, with MML as a metric that includes a penalty on model complexity to control the sampling process. We evaluated the performance of CaMML on the set of variables used in the IT_{bp} forest described above. We used the test data set to predict case status and estimate prediction error.