Data set
The data set used was the GAW15 simulated rheumatoid arthritis (RA) data. We chose disease status for RA as the phenotype of interest. We used 50 replications of SNP data to create 50 samples of 1000 cases and 1000 controls. We created the cases by randomly sampling one affected member of each simulated family.
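As an illustration only, the assembly of one such replicate could be sketched as follows, assuming the pedigree and control data are held in pandas DataFrames with hypothetical columns family_id and affected (the actual GAW15 file layouts and field names differ):

```python
import pandas as pd

def build_replicate(families: pd.DataFrame, controls: pd.DataFrame,
                    n: int = 1000, seed: int = 0) -> pd.DataFrame:
    """Assemble one case-control sample: one affected member per simulated family
    as a case, plus an equal number of unrelated controls.
    Column names (family_id, affected, status) are hypothetical."""
    # one randomly chosen affected member from each simulated family
    affected = families[families["affected"] == 1]
    cases = (affected.groupby("family_id", group_keys=False)
                     .sample(n=1, random_state=seed)   # one member per family
                     .sample(n=n, random_state=seed)   # n case families
                     .assign(status=1))
    # n unrelated controls drawn from the simulated control pool
    ctrl = controls.sample(n=n, random_state=seed).assign(status=0)
    return pd.concat([cases, ctrl], ignore_index=True)
```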
We did our analyses knowing the full model and simulation parameters used to generate the data. To explore the ability of LOTUS to select important predictors, we used some of the trait loci from the answers. For each sample, an initial set of 309 SNPs was considered, 303 of which were the simulated SNPs on chromosome 18; the remaining six were the trait loci A, B, C, D, E, and F. We used sample sizes of 250, 500, and 1000 cases, with equal numbers of controls.
We also considered, as a more realistic scenario, only the 303 SNPs on chromosome 18 (trait locus E excluded). The sample size was 1000.
To estimate the false-positive rate, that is, the probability of selecting a SNP not at a causative locus, we used 4 of the 303 SNPs on chromosome 18.
LOTUS
LOTUS (logistic tree with unbiased selection) is a method for automatic construction of logistic regression trees. LOTUS fits a piecewise (multiple or simple) linear logistic regression model by recursively partitioning the data and fitting a different logistic regression in each partition. This allows nonlinear features of the data to be modeled without requiring variable transformations. A few features make LOTUS especially appropriate for analysis and interpretation of large data sets: negligible bias in split variable selection, relatively fast training speed, applicability to quantitative and categorical variables, choice of multiple or simple linear logistic node models, and suitability for data sets with missing values.
LOTUS constructs logistic regression trees in a top-down fashion [3, 4, 6]. It deals with the selection bias that arises when some predictors take more values than others, and it distinguishes nonlinear from linear effects through the use of a Cochran-Armitage trend-adjusted chi-square test. It can fit either a multiple or a simple logistic regression at each node. Once the initial tree is grown, it is pruned back with a method similar to that of the classification and regression trees (CART) algorithm [7], except that LOTUS uses deviance as the 'cost-complexity measure' instead of the sum of squared residuals. The tree with the lowest prediction deviance is chosen based on an independent test set or ten-fold cross-validation.
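For concreteness, the prediction deviance that drives the pruning step is minus twice the Bernoulli log-likelihood of the held-out observations under the fitted node models. A minimal sketch of that quantity is given below; the tree-growing and pruning machinery itself belongs to the LOTUS program and is not reproduced here.

```python
import numpy as np

def prediction_deviance(y_true: np.ndarray, p_hat: np.ndarray) -> float:
    """Deviance of predicted case probabilities p_hat for binary outcomes y_true:
    D = -2 * sum[ y*log(p) + (1-y)*log(1-p) ].
    This plays the role of the cost-complexity measure in place of the sum of
    squared residuals."""
    eps = 1e-12                       # guard against log(0)
    p = np.clip(p_hat, eps, 1 - eps)
    return float(-2.0 * np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# The subtree with the smallest deviance on an independent test set (or averaged
# over ten cross-validation folds) would be the one retained.
```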
LOTUS allows one of three roles to be chosen for each quantitative predictor variable: f-variable, for fitting only; s-variable, for splitting only; and n-variable, for both splitting and fitting. In our application we treated each locus genotype as an n-variable. We fitted a multiple stepwise linear logistic regression tree, using a p-value threshold of 0.05 for forward selection and backward elimination and allowing at most ten predictor variables to be selected at each node.
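A rough sketch of these node-model settings, written as forward selection with backward elimination on Wald p-values at the 0.05 level and at most ten predictors per node, is shown below. The exact stepwise criteria inside LOTUS may differ in detail, and the function and variable names here are ours.

```python
import numpy as np
import statsmodels.api as sm

def stepwise_logistic(X: np.ndarray, y: np.ndarray,
                      alpha: float = 0.05, max_terms: int = 10) -> list:
    """Greedy forward selection / backward elimination for a multiple linear
    logistic node model.  X: (n, p) array of genotype codes, y: binary outcome."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_terms:
        # forward step: try each remaining predictor, keep the most significant
        pvals = {}
        for j in remaining:
            design = sm.add_constant(X[:, selected + [j]])
            fit = sm.Logit(y, design).fit(disp=0)
            pvals[j] = fit.pvalues[-1]            # p-value of the candidate term
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
        # backward step: drop any term whose p-value has risen above alpha
        fit = sm.Logit(y, sm.add_constant(X[:, selected])).fit(disp=0)
        for v in [v for v, p in zip(selected, fit.pvalues[1:]) if p >= alpha]:
            selected.remove(v)
            remaining.append(v)
    return selected
```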
The LOTUS computer program is freely available [8].
MDR
MDR (multifactor dimensionality reduction) is a nonparametric, combinatorial, model-free data-mining method that has been successful in identifying gene × gene interactions in balanced case-control designs. With MDR, multilocus genotypes are pooled into high-risk and low-risk groups, thereby reducing the genotype predictors from many dimensions to one. That is, MDR employs constructive induction [9], the process of defining a new predictor as a function of two or more other predictors. The new one-dimensional multilocus-genotype predictor is used to choose the best set of loci from each one- to L-locus set according to classification and prediction errors. The MDR algorithm has reasonable power to detect epistasis [10].
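A minimal sketch of this pooling step for one candidate set of loci is given below, assuming genotypes coded 0/1/2 and a balanced design, so that the high-risk threshold on the case:control ratio is 1. The function and variable names are ours, not those of the MDR software.

```python
import numpy as np
from itertools import product

def mdr_pool(geno: np.ndarray, y: np.ndarray, loci: list, threshold: float = 1.0):
    """Collapse the multilocus genotypes at `loci` into one binary high/low-risk
    predictor: a genotype cell is high-risk if its case:control ratio exceeds
    `threshold` (1 for a balanced case-control design)."""
    high_risk = set()
    for cell in product([0, 1, 2], repeat=len(loci)):
        in_cell = np.all(geno[:, loci] == cell, axis=1)
        cases = np.sum(in_cell & (y == 1))
        controls = np.sum(in_cell & (y == 0))
        if (controls == 0 and cases > 0) or (controls > 0 and cases / controls > threshold):
            high_risk.add(cell)
    # new one-dimensional predictor: 1 = high-risk cell, 0 = low-risk cell
    new_var = np.array([1 if tuple(g) in high_risk else 0 for g in geno[:, loci]])
    classification_error = np.mean(new_var != y)
    return new_var, classification_error
```

The best model among the candidate locus sets would then be the one with the lowest prediction error, estimated by cross-validation.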
Selection of interesting SNPs
We studied two procedures for selecting a small set of interesting SNPs from an initial large set. In the first procedure we simply selected all SNPs in the final tree produced by LOTUS. However, if one selects as interesting only the SNPs in the final regression tree produced by LOTUS, some important SNPs, and possibly trait loci, might be missed. Their effects might be overlooked because of the strong effects of some of the selected loci, higher-order interactions, or the parameter settings of the algorithm, such as the maximum number of predictors at each node. To address this, we incorporated MDR in our second selection procedure. We used MDR to select the best model among all possible one- to four-locus subsets of the predictor set selected by LOTUS. The markers in this best model were then removed from the initial set of SNPs and LOTUS was run again. The markers selected in the first and second runs of LOTUS constitute the final set of interesting SNPs.
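Put together, the second procedure can be sketched as follows, with lotus_select and mdr_best_model standing in as hypothetical wrappers around the LOTUS program and an exhaustive MDR search; neither is a real interface of those packages.

```python
from itertools import combinations

def select_interesting_snps(snps, geno, y, lotus_select, mdr_best_model):
    """Two-stage selection: a LOTUS run, MDR over all 1- to 4-locus subsets of the
    LOTUS picks, removal of the best MDR model's markers, and a second LOTUS run."""
    first = lotus_select(snps, geno, y)               # SNPs in the final LOTUS tree
    candidates = [c for k in range(1, 5)              # all 1- to 4-locus subsets
                  for c in combinations(first, k)]
    best_model = mdr_best_model(candidates, geno, y)  # lowest prediction error
    reduced = [s for s in snps if s not in set(best_model)]
    second = lotus_select(reduced, geno, y)           # rerun LOTUS without those markers
    return sorted(set(first) | set(second))           # final set of interesting SNPs
```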
Our second procedure may lead to a larger final set of interesting SNPs and thereby an increased false-positive rate. However, the incorporation of MDR might improve the selection of important SNPs, and our goal is not to miss markers possibly involved in disease etiology. Although the proposed procedures are intended for initial selection and the false-positive rate is not of major concern, we use at most two LOTUS runs to keep it reasonably low.
LOTUS can process many thousands of predictors in one run. However, because of the computational limitations of MDR, we considered only hundreds of SNPs in our simulations and chose the parameters of LOTUS accordingly.