Data set
We have chosen to use the simulated data (Problem 3) from GAW15. This data simulation was set up to mimic the familial pattern of RA, including a strong effect of DR type at the HLA 2 locus on chromosome 6. A large population of nuclear families (two parents and two offspring) was generated. This population contains close to 2 million sibling pairs (3.6 million subjects). RA affection status was determined for everyone from a complex genetic and environmental model. There were four loci (A on chromosome 16, B on chromosome 8, and C and D both on chromosome 6) in addition to a strong effect of the DR alleles that directly, or through interactions with smoking and gender, modeled RA status. Additional loci modeled severity and other related RA outcomes. From this population, a random sample of 1500 families was selected from among families with two affected offspring (the affected sib-pair (ASP) group) and another random sample of 2000 families was selected from among families where no member was affected (control group). Within the 2000 families selected for the control group, one offspring was randomly selected to be in the final control group.
Microsatellites and SNPs were generated on 22 autosomes. These markers were designed to be like real human autosomes in terms of genetic and physical map lengths. The marker and trait loci were generated to have similar properties, such as linkage disequilibrium, to those observed in real data. We chose to analyze the SNPs, for all controls and the first sibling in the ASP group in Replicate 1. In addition, we used similar data from Replicates 2 and 3 as tuning and test data sets.
LASSO-Patternsearch algorithm
The LASSO-Patternsearch algorithm [6] is an approach to identify patterns of risk factors that is built on a global core. We modify the original algorithm for use with genetic (SNP) data (add a screen step, consider only main effects and second-order interactions, and tune the smoothing parameters with a tuning set). Through the use of a series of basis functions described below, we can build a model for the relation between phenotype and variables that embodies main effects and two-factor interactions ("patterns"). The basis functions we use assume dichotomous risk factors. Responses are coded 1 for cases and 0 for controls; females are coded 1 and males 0; smokers as 1 and non-smokers as 0. Age is the only continuous risk factor and we code an elder group (≥ 55) as 1 and a younger group (<55) as 0. Because nearly all SNPs have three levels: normal, one variant allele, and two variant alleles, we retain this information by initially coding them as 0, 1, and 2, respectively. HLA DR also has three levels and we initially code them as DRX = 0, DR1 = 1, and DR4 = 2. For these three level variables, we define basis functions in a generalized way described below, which is equivalent to introducing two dummy variables.
The modified algorithm has three steps:
We first define our coding basis functions to be used: let x
j
be the jth variable and x be (x1, x2,..., x
N
) where N is the number of variables. Let if x
j
= 1, and 0 otherwise, and let if x
j
= 2, and 0 otherwise. We call these "main effects" basis functions. Let if x
j
= r, x
k
= s, r, s = 1, 2, and 0 otherwise (so there are four basis functions for each pair (j, k)). We call these "two-factor interaction" basis functions. These basis functions will be used to code the variables into logistic or penalized logistic regression models. Let p(x) be the probability that y = 1, given x, and let f(x) = log[p(x)/(1 - p(x))]. The negative log likelihood function is given by:
(1)
We will code the dependence on x by f(x) = μ + ∑cℓBℓ(x), where the Bℓ will be specified subsets of the basis functions defined above, and μ and {cℓ} are estimated by minimizing L(y, f). The goal is to select those basis functions that encode the variables or pairs of variables that best separate cases from controls.
Because there are more than 9000 SNPs on all 22 chromosomes, incorporating 9000 main effects and two-factor interactions, there will be more than 108 basis functions and we cannot deal with them all simultaneously. We first prescreen for main effects with a logistic regression model as follows. For each j = 1,..., N, we find μ, and to minimize the negative log likelihood L(y, f
j
), where . We test the hypothesis at the 0.05 level that is different from 0 and if it is, the jth variable will go to the second part of the prescreen step and the basis function (x) will go to Step 1. Note that each SNP may contribute two basis functions and they are not necessarily significant simultaneously. In that case, the significant basis function will go to the LASSO step and the SNP is still eligible for the screening of interactions. For each pair of variables (x
j
, x
k
), we construct the model and minimize L(y, f
jk
). We test the hypotheses that the coefficients are different from 0 at the 0.002 level and the basis functions (patterns) that survive go to Step 1. At this point we would like to select the largest number of candidates that can be comfortably handled by the core global LASSO step. The significance level of 0.002 was chosen in an ad hoc manner to select candidates for the next step and resulted in a large set that, very roughly, met this goal.
• Step 1: The LASSO step
In this step, we use the LASSO penalty (l1) to do variable selection. We relabel the basis functions that survive Step 0 as B
l
, l = 1, 2..., N
B
, where N
B
is the total number of the basis functions. We estimate f by minimizing
>I
λ
(y, f) = L(y, f) + λJ(f),
where and . The smoothing parameter λ balances the trade-off between data fitting and the sparsity of the model. We will choose the smoothing parameter by the prediction accuracy on a separate tuning set. This is done as follows: for each trial value of λ, the minimizer of Eq. (2) produces f
λ
(x) and hence p
λ
(x). We make the important observation that the ratio of cases and controls in the training set is the same as the ratio of cases and controls in the tuning set. Thus, if one had a perfect estimate of p(x) for the population that generated both the test and tuning set, and costs of misclassification were the same for both types of misclassification, then the Bayes rule (to minimize expected cost) for classifying members of the tuning set would be to classify a member as case if p(x) > 0.5 and as a control if p(x) < 0.5 [7]. Therefore we are motivated to examine the actual error rates on the tuning set, for each choice of λ, by using p
λ
(x) = 0.5 as the classifier.
• Step 2: The logistic regression step
Step 1 produces a relatively sparse model, but we have seen a general tendency for the LASSO-Patternsearch to err on the conservative side in selecting basis functions, that is, there is a very high probability of including all relevant basis functions, at the expense of including some noise terms [6, 8]. Thus, we took a closer look at the terms that passed Step 1 by putting them into a parametric logistic regression and testing the significance of each term at level α. Rather than choose α on an ad hoc basis, it is selected based on prediction accuracy on the tuning set. The significant term goes into the logistic regression model again and that gives the final model.
It is believed that this multi-step process is an effective procedure to meet two goals simultaneously, sparsity and generalizability, and the results below tend to bear this out.