### Model selection using penalties

Let *y*_{i}, *i* = 1,..., *m*, be the binary response variable and let *x*_{ij}, *j* = 1,..., *p*, be the predictor variables. Further define the monotone logit transformation *η*_{i} = log{*p*_{i}/(1 - *p*_{i})}, where *p*_{i} is the probability of observing *y*_{i} = 1. For the logistic regression model \eta_{i} = \beta_{0} + \sum_{j=1}^{p}\beta_{j}x_{ij}, where *β*_{0} denotes the offset, the binomial log-likelihood *l* is

l(\beta) = \sum_{i=1}^{m}\left\{ y_{i}\log p(x_{i};\beta) + (1-y_{i})\log\bigl(1-p(x_{i};\beta)\bigr)\right\}.

(1.1)

Here *β* = {*β*_{0}, *β*_{j}} and the vector *x*_{i} includes the constant term 1.
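The log-likelihood in Eq. (1.1) can be checked numerically. The following sketch is illustrative only (the parameter values and toy data are hypothetical, not from the paper's analyses):

```python
import math

# Numeric check of Eq. (1.1): l(beta) = sum_i [ y_i log p_i + (1-y_i) log(1-p_i) ],
# with p_i = 1/(1 + exp(-eta_i)) and eta_i = x_i . beta, where each x_i
# includes the constant 1 for the offset beta_0. Toy data are made up.

def log_likelihood(beta, X, y):
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, x_i))  # linear predictor
        p = 1.0 / (1.0 + math.exp(-eta))             # inverse logit
        total += y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return total

X = [(1, 0.5), (1, -1.0), (1, 2.0)]  # first column is the constant term
y = [1, 0, 1]
print(log_likelihood((0.2, 1.1), X, y))
```

Because each term is a log-probability, the log-likelihood is always non-positive, and it is maximized over *β* when fitting the model.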

When the number of predictors is large, it is often desirable to determine a smaller subset with the strongest effects. Our first strategy is stepwise selection with Akaike's Information Criterion (AIC) [2], defined as AIC = -2*l*(*β*) + 2*k*, where *k* is the number of parameters included in the model; the 2*k* term penalizes the addition of parameters.
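A minimal sketch of the AIC trade-off (the log-likelihood values below are hypothetical, not results from the paper):

```python
# AIC = -2*l(beta) + 2*k trades goodness of fit against the number k of
# parameters; the model with the smaller AIC is preferred.

def aic(log_likelihood, k):
    """Akaike's Information Criterion; smaller is better."""
    return -2.0 * log_likelihood + 2.0 * k

# Hypothetical step in a stepwise search: adding two SNPs barely improves
# the fit, so the 2*k penalty favors the smaller model.
aic_small = aic(log_likelihood=-120.4, k=6)  # intercept + 5 SNPs
aic_large = aic(log_likelihood=-119.9, k=8)  # intercept + 7 SNPs
print(aic_small < aic_large)  # → True
```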

As an alternative we impose the so-called *lasso* penalty [2–4]. The above log-likelihood [Eq. (1.1)] can be modified as follows:

l^{\lambda_{s}}(\beta) = l(\beta) - \lambda_{s}\Vert\beta\Vert_{s},

(1.2)

where *s* = 1 and ||*β*||_{1} = ∑_{j}|*β*_{j}|, the *L*_{1} norm of the parameter vector *β*. This lasso-type penalization is useful for variable selection because it shrinks some coefficients exactly to zero. Note that only the *β*_{j} values are subject to penalization, not the offset *β*_{0}. An optimal *λ* can be determined by AIC or by 10-fold cross-validation (CV).
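The shrinkage-to-zero behavior can be illustrated with a small proximal-gradient (ISTA) sketch of the penalized log-likelihood in Eq. (1.2) with *s* = 1. The data, step size, and *λ* below are hypothetical; the paper's analyses used existing R packages, not this code:

```python
import math
import random

# ISTA for L1-penalized logistic regression: gradient step on the negative
# log-likelihood, then soft-thresholding of the beta_j (the offset beta_0
# is updated without penalty, matching the text).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(v, t):
    """Proximal operator of t * |v|: pulls v toward zero by t."""
    return v - t if v > t else (v + t if v < -t else 0.0)

def lasso_logistic(X, y, lam, step=0.1, iters=1000):
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        # Residuals p(x_i; beta) - y_i drive the gradient of the
        # negative log-likelihood.
        resid = [sigmoid(beta0 + sum(b * v for b, v in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        beta0 -= step * sum(resid) / n  # offset: plain gradient step
        for j in range(p):
            g = sum(r * row[j] for r, row in zip(resid, X)) / n
            beta[j] = soft_threshold(beta[j] - step * g, step * lam)
    return beta0, beta

# Tiny synthetic example: only the first of four predictors carries signal.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [1 if random.random() < sigmoid(2.0 * row[0]) else 0 for row in X]
b0, b = lasso_logistic(X, y, lam=0.1)
print(b)  # the noise coefficients are typically shrunk exactly to zero
```

With a larger *λ* more coefficients are set to zero, which is what makes the penalty usable as a variable selector.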

From a Bayesian point of view, the maximizer of Eq. (1.2) can be interpreted as a posterior mode, combining a flat prior on *β*_{0} with independent double-exponential (Laplace) priors on the *β*_{j} values. The lasso parameter *λ*_{1} is then inversely related to the scale, and hence the variance, of this prior. From this perspective, it becomes clear why Markov chain Monte Carlo (MCMC) techniques can be applied for better handling of model uncertainty.
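In more detail, under Tibshirani's Bayesian reading of the lasso, placing independent double-exponential (Laplace) priors \pi(\beta_{j}) \propto \exp(-\lambda_{1}|\beta_{j}|) on the coefficients and a flat prior on *β*_{0} gives the log posterior

\log \pi(\beta \mid y) = l(\beta) - \lambda_{1}\Vert\beta\Vert_{1} + \mathrm{const},

so the posterior mode coincides with the maximizer of Eq. (1.2) with *s* = 1.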

### Bayesian model selection

We implemented a fully Bayesian approach to variable selection for the logistic regression model, with a hierarchical specification on the regression parameter vector and a logit link on the class probabilities. A flat prior was assumed for the intercept term *β*_{0}, and the *β*_{j} values were independently modeled as N(0, \zeta^{-1}\nu^{2}I_{k})-distributed random variables, where *ν* is a known scale and 1/*ζ* a rescaling factor, such that *ζ* has a Gamma(*a*, *b*) distribution with *a* and *b* positive real numbers. We assumed a uniform prior on SNP location choices (i.e., from a variable selection point of view, all SNPs have equal prior probability). An auxiliary-variable approach was implemented to generate the logistic model via mixture modeling within an ordinary normal regression model [5, 6]. A hybrid MCMC sampler was applied, based on random choices among three steps that add (birth), remove (death), or move (swap) a SNP in the model. For more details we refer to Mertens [7], Green [8], and Holmes and Held [6]. We ran the sampler for 100,000 iterations and discarded the first 50,000 as burn-in. The classification performance of the model was assessed using the marginal mean posterior class probabilities; sensitivities and specificities were presented along with the receiver operating characteristic (ROC) curve. The Bayesian logistic regression variable selection model was implemented in MATLAB.
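The birth/death/move step can be sketched as follows. This is a hypothetical illustration of the proposal mechanics only (the paper's sampler was written in MATLAB); a real sampler would accept or reject each proposal with the reversible-jump Metropolis-Hastings ratio of Green [8] under the auxiliary-variable likelihood of Holmes and Held [6]:

```python
import random

# Propose one model move: add (birth), remove (death), or swap (move) a
# SNP. A "model" is the set of indices of currently included SNPs.

def propose(model, n_snps, rng):
    """Return (move_type, proposed_model) for a model = set of SNP indices."""
    moves = []
    if len(model) < n_snps:
        moves.append("birth")
    if model:
        moves += ["death", "move"]
    kind = rng.choice(moves)
    new = set(model)
    if kind == "birth":
        new.add(rng.choice([j for j in range(n_snps) if j not in model]))
    elif kind == "death":
        new.remove(rng.choice(sorted(model)))
    else:  # move: drop one SNP, then add one not currently included
        new.remove(rng.choice(sorted(model)))
        new.add(rng.choice([j for j in range(n_snps) if j not in new]))
    return kind, new

rng = random.Random(1)
model = {3, 10, 42}
kind, proposal = propose(model, n_snps=73, rng=rng)
print(kind, sorted(proposal))
```

Over many iterations, the visited models yield posterior inclusion frequencies per SNP and, by averaging class probabilities across models, the marginal mean posterior class probabilities used for classification.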

### Evaluation of the model selection regarding prediction performance

Because it is difficult to defend a model that predicts poorly, we also examined prediction performance. The first simulated data set, Replicate 1, was used as a training set to determine the optimal set of variables, and Replicates 2 to 10 were used for validation by calculating the average prediction error, defined as the average of the classification errors obtained when predicting each of the nine data sets with the selected SNPs. The computations were performed in the programming language R, modifying various existing packages.
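The validation scheme amounts to the following sketch (the `predict` callable and the toy labels are hypothetical stand-ins; the paper's computations were done in R):

```python
# Average prediction error: classify each validation replicate with the
# model trained on Replicate 1, then average the nine error rates.

def classification_error(y_true, y_pred):
    """Fraction of misclassified observations."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def average_prediction_error(replicates, predict):
    """replicates: list of (X, y) validation sets; predict: X -> labels."""
    errors = [classification_error(y, predict(X)) for X, y in replicates]
    return sum(errors) / len(errors)

# Toy check with a trivial always-case predictor on made-up labels.
def always_one(X):
    return [1] * len(X)

reps = [([None] * 4, [1, 1, 0, 1]), ([None] * 4, [0, 1, 1, 1])]
print(average_prediction_error(reps, always_one))  # → 0.25
```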

### Materials

We used ten replicates of simulated data comprising a dense map of 17,820 SNPs on chromosome 6, modeled after the rheumatoid arthritis (RA) data. We selected a high-LD region of approximately 5 Mb that includes the trait locus (DR/C locus) with a large trait effect; that is, we knew the "answers". Setting a threshold of *p*-value < 0.001 using the Cochran-Armitage trend test, we obtained 73 SNPs. Further, we chose 200 cases and 200 controls according to RA status.
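For reference, the screening statistic can be sketched as below. This is a generic implementation of the Cochran-Armitage trend test from its standard formula, not the paper's code, and the genotype counts are made up:

```python
import math

# Cochran-Armitage trend test on a 2x3 table of genotype counts
# (cases/controls by copies of the minor allele), with the usual additive
# weights (0, 1, 2). Returns the z statistic and a two-sided p-value from
# the normal approximation.

def trend_test(cases, controls, weights=(0, 1, 2)):
    n = [a + b for a, b in zip(cases, controls)]  # column totals
    R = sum(cases)
    N = R + sum(controls)
    swa = sum(w * a for w, a in zip(weights, cases))
    swn = sum(w * m for w, m in zip(weights, n))
    sw2n = sum(w * w * m for w, m in zip(weights, n))
    num = N * swa - R * swn                      # trend numerator
    var = R * (N - R) * (N * sw2n - swn ** 2)    # null variance * N
    z = num / math.sqrt(var / N)
    p = math.erfc(abs(z) / math.sqrt(2.0))       # two-sided p-value
    return z, p

# Made-up counts for one SNP where cases carry the risk allele more often;
# such a SNP would pass the p < 0.001 screen.
z, p = trend_test(cases=(40, 100, 60), controls=(80, 90, 30))
print(z, p)
```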