Penalized regression approaches are an attractive option for the analysis of large numbers of predictor variables (such as genotypes at many genetic loci) that may influence a response variable (such as disease status). Most genome-wide studies use single-locus association tests such as the Cochran-Armitage trend test, or, equivalently, logistic regression with a single predictor variable (encoding the effect of a particular locus) included in the regression equation at any given time. Theoretically, regression methods allow the simultaneous inclusion of several different variables in the regression equation, e.g., variables coding for genotype rather than allele effects (thus modeling "dominance"), or variables that encode effects at several different loci. However, standard regression methods fail when the sample size (the number of people) is small compared to the number of predictors.
Standard linear regression can be formulated as finding the vector β of parameter estimates (regression coefficients) βj (j = 1,...,p) at p predictors that minimizes the sum of squared differences

$$g(\beta_0, \beta) = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2,$$

where, for person i (i = 1,...,n), yi is a quantitative outcome variable and xij is a predictor variable (such as a genotype variable taking values 0, 1, or 2 according to the number of risk alleles at locus j). In penalized regression, one minimizes this function subject to a constraint on the coefficients such as Σj |βj| ≤ s (an L1 constraint) or Σj βj² ≤ s (an L2 constraint). The theory of Lagrange multipliers suggests that this problem may be re-formulated as minimizing

$$f(\beta_0, \beta) = g(\beta_0, \beta) + h(\lambda, \beta),$$

where g(β0, β) corresponds to the original sum of squared differences, h(λ, β) is a penalty term, and λ is a tuning parameter (or vector of parameters) that controls the strength of penalization. Ridge regression [1] uses a so-called L2 penalty,

$$h(\lambda, \beta) = \lambda \sum_{j=1}^{p} \beta_j^2,$$
producing coefficients that are scaled down or "shrunk" towards zero and prediction models that often perform better than least squares owing to a bias-variance trade-off [2]. All predictors remain in the model, some with small coefficients. The lasso [3],

$$h(\lambda, \beta) = \lambda \sum_{j=1}^{p} |\beta_j|,$$

uses an L1 penalty, resulting in both shrinkage and variable selection, in that many of the coefficients are set exactly to zero. Zou and Hastie [2] proposed a penalty h(λ1, λ2, β) that is a linear combination of the lasso and ridge penalties,

$$h(\lambda_1, \lambda_2, \beta) = \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2,$$

which they termed the naïve elastic net. However, this method can over-shrink the coefficients and performs poorly unless either λ1 or λ2 is close to 0. Zou and Hastie [2] therefore instead proposed a modified version of the elastic net that essentially scales up the naïve elastic net coefficients by a factor of (1 + λ2).
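To make the f(β0, β) = g(β0, β) + h(λ, β) decomposition concrete, here is a minimal Python sketch that evaluates each of these penalties for a given coefficient vector. The function names and the simulated genotype matrix are illustrative assumptions, not code from any of the cited implementations.

```python
import numpy as np

def rss(beta0, beta, X, y):
    """g(beta0, beta): sum of squared differences for linear regression."""
    return np.sum((y - beta0 - X @ beta) ** 2)

def ridge_penalty(lam, beta):
    """L2 (ridge) penalty: lam * sum_j beta_j^2."""
    return lam * np.sum(beta ** 2)

def lasso_penalty(lam, beta):
    """L1 (lasso) penalty: lam * sum_j |beta_j|."""
    return lam * np.sum(np.abs(beta))

def naive_elastic_net_penalty(lam1, lam2, beta):
    """Naive elastic net: linear combination of lasso and ridge penalties."""
    return lasso_penalty(lam1, beta) + ridge_penalty(lam2, beta)

# Simulated example: 100 people genotyped at 50 loci (coded 0/1/2),
# with only the first locus affecting the quantitative outcome.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 50)).astype(float)
y = 0.5 * X[:, 0] + rng.normal(size=100)

beta = np.zeros(50)
f = rss(0.0, beta, X, y) + naive_elastic_net_penalty(1.0, 1.0, beta)
```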
The naïve and modified elastic net approaches enjoy a grouping property whereby predictors that are highly correlated tend to have similar coefficient estimates [2]. An alternative penalization method with a similar property is the group lasso [4], which minimizes the objective function f(β0, β) = g(β0, β) + h(λ, β) with

$$g(\beta_0, \beta) = \frac{1}{2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2$$

(i.e., half the sum of squared differences) and penalty term

$$h(\lambda, \beta) = \lambda \sum_{g=1}^{G} \sqrt{l_g - f_g + 1}\,\|\beta_g\|_2, \quad \text{where } \|\beta_g\|_2 = \left(\sum_{j=f_g}^{l_g} \beta_j^2\right)^{1/2}.$$
Here, the predictors are divided into G groups (g = 1,...,G), and fg and lg indicate the first and last predictor in group g. The penalty term in the group lasso is intermediate between the L1 penalty of the lasso and the L2 penalty used in ridge regression and, as pointed out by Wu and Lange [5], provides a natural coupling between parameters in the same group. Wu and Lange [5] actually propose an alternative approach, which is to minimize the objective function f(β0, β) = g(β0, β) + h(λ, β), with g(β0, β) equal to either half the sum of squared differences as above (denoted l2 regression) or to the sum of absolute differences

$$g(\beta_0, \beta) = \sum_{i=1}^{n}\left|y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right|$$

(denoted l1 regression), with the penalty term taking the form

$$h(\lambda_1, \lambda_2, \beta) = \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{g=1}^{G} \|\beta_g\|_2.$$

This is similar in form to the naïve elastic net penalty, except that, like the group lasso, it uses ||βg||2 rather than the sum of squared coefficients Σj βj² (taken over the predictors in group g) in the group-specific penalty controlled by λ2.
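The group-wise penalties can be sketched the same way. In the snippet below, groups is a list of (fg, lg) index pairs (0-based, inclusive); the square-root-of-group-size weighting in the group lasso term follows the standard formulation of that method and, like the function names, is an assumption rather than code from [4] or [5].

```python
import numpy as np

def group_norms(beta, groups):
    """||beta_g||_2 for each group, with groups given as (f_g, l_g)
    index pairs (0-based, inclusive) marking the first and last
    predictor in group g."""
    return np.array([np.sqrt(np.sum(beta[f:l + 1] ** 2)) for f, l in groups])

def group_lasso_penalty(lam, beta, groups):
    """Group lasso penalty: lam * sum_g sqrt(l_g - f_g + 1) * ||beta_g||_2."""
    sizes = np.array([l - f + 1 for f, l in groups])
    return lam * np.sum(np.sqrt(sizes) * group_norms(beta, groups))

def lasso_plus_group_penalty(lam1, lam2, beta, groups):
    """Wu-and-Lange-style penalty: an L1 term plus a group term that
    uses ||beta_g||_2 rather than the sum of squared coefficients."""
    return lam1 * np.sum(np.abs(beta)) + lam2 * np.sum(group_norms(beta, groups))

# Example: 6 predictors in two groups of 3 (e.g., two loci, each
# represented by three genotype-coding variables).
beta = np.array([0.4, -0.2, 0.0, 0.0, 0.0, 0.0])
groups = [(0, 2), (3, 5)]
print(group_lasso_penalty(1.0, beta, groups))
```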
Penalization is an attractive option in genetic studies because it allows the grouping of predictors that relate to the same genetic variant or region, and also because we genuinely expect the vast majority of loci to have regression coefficient 0. Although originally developed for quantitative outcomes, penalization methods have been extended to deal with binary outcomes (such as disease). Penalization is achieved by minimizing an objective function f(β0, β) = g(β0, β) + h(λ, β) with the penalization term h(λ, β) taking one of the forms above, and g(β0, β) equalling minus one [6] or two [7] times the log likelihood of the data. Software implementations include the R package "glmnet", which fits the lasso or elastic-net regularization path for linear, logistic, and multinomial regression models, and the R package "grplasso", which fits a variant of the group lasso approach for binary outcome data.
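glmnet and grplasso themselves are R packages; as a rough Python analogue (an illustration, not the software named above), scikit-learn's LogisticRegression with an L1 penalty fits the lasso-penalized logistic model just described, with g(β0, β) equal to minus the log likelihood and the parameter C acting as an inverse of λ. The simulated data are, again, assumptions for the sake of the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated case-control data: 200 people genotyped at 100 loci (coded
# 0/1/2), with only the first locus affecting disease risk.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 100)).astype(float)
risk = 1.0 / (1.0 + np.exp(-(X[:, 0] - 1.0)))
y = rng.binomial(1, risk)

# L1-penalized logistic regression: minimizes minus the log likelihood
# plus an L1 penalty; C is the inverse of the regularization strength.
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
fit.fit(X, y)
print(int(np.sum(fit.coef_ != 0)), "loci retained with nonzero coefficients")
```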