Volume 3 Supplement 7
Genetic Analysis Workshop 16
Analysis of North American Rheumatoid Arthritis Consortium data using a penalized logistic regression approach
 Pascal Croiseau^{1} and
 Heather J Cordell^{1}
DOI: 10.1186/1753-6561-3-S7-S61
© Croiseau and Cordell; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Abstract
We applied a penalized regression approach to single-nucleotide polymorphisms in regions on chromosomes 1, 6, and 9 of the North American Rheumatoid Arthritis Consortium data. Results were compared with a standard single-locus association test. Overall, the penalized regression approach did not appear to offer any advantage with respect to either detection or localization of disease-associated polymorphisms, compared with the single-locus approach.
Background
Penalized regression approaches are an attractive option for the analysis of large numbers of predictor variables (such as genotypes at many genetic loci) that may influence a response variable (such as disease status). Most genome-wide studies use single-locus association tests such as the Cochran-Armitage trend test, or, equivalently, logistic regression with a single predictor variable (encoding the effect of a particular locus) included in the regression equation at any given time. Theoretically, regression methods allow the simultaneous inclusion of several different variables in the regression equation, e.g., variables coding for genotype rather than allele effects (thus modeling "dominance"), or variables that encode effects at several different loci. However, standard regression methods fail when the sample size (the number of people) is small compared to the number of predictors.
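Standard linear regression can be formulated as finding the vector β of parameter estimates (regression coefficients) β_{j} (j = 1,...,p) at p predictors that minimizes the sum of squared differences Σ_{i} (y_{i} − β_{0} − Σ_{j} x_{ij} β_{j})^{2}, where y_{i} is the response of individual i, x_{ij} is the value of predictor j in individual i, and β_{0} is the intercept.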
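Penalized regression adds to this sum of squares a penalty on the size of the coefficients: ridge regression [1] uses the penalty λ Σ_{j} β_{j}^{2}, and the lasso [3] uses λ Σ_{j} |β_{j}|. Zou and Hastie [2] proposed combining the two into a single penalty of the form λ_{1} Σ_{j} |β_{j}| + λ_{2} Σ_{j} β_{j}^{2}, which they termed the naïve elastic net. However, this method can overshrink the coefficients and performs poorly unless either λ_{1} or λ_{2} is close to 0. Zou and Hastie [2] therefore proposed a modified version of the elastic net that essentially scales up the naïve elastic net coefficients by a factor of (1 + λ_{2}).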
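The penalties discussed here can be written down in a few lines. The following is a minimal Python sketch, assuming the standard forms of the ridge, lasso, and naïve elastic net penalties; the function names are illustrative and are not taken from any package.

```python
def ridge_penalty(beta, lam):
    """Ridge penalty: lam times the sum of squared coefficients."""
    return lam * sum(b * b for b in beta)


def lasso_penalty(beta, lam):
    """Lasso penalty: lam times the sum of absolute coefficients."""
    return lam * sum(abs(b) for b in beta)


def naive_elastic_net_penalty(beta, lam1, lam2):
    """Naive elastic net penalty: a lasso term plus a ridge term."""
    return lasso_penalty(beta, lam1) + ridge_penalty(beta, lam2)


def rescale_naive_coefficients(beta_naive, lam2):
    """Elastic net correction: scale the naive estimates up by (1 + lam2)."""
    return [(1.0 + lam2) * b for b in beta_naive]
```

Setting lam2 to 0 recovers the lasso penalty and setting lam1 to 0 recovers the ridge penalty, which are the two regimes in which the naïve elastic net is said to perform acceptably.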
This is similar in form to the naïve elastic net penalty, except that, like the group lasso [4], it uses the Euclidean norm ||β_{g}||_{2} instead of its square ||β_{g}||_{2}^{2} in the group-specific penalty controlled by λ_{2}.
Penalization is an attractive option in genetic studies because it allows the grouping of predictors that relate to the same genetic variant or region, and also because we genuinely expect the vast majority of loci to have regression coefficient 0. Although originally developed for quantitative outcomes, penalization methods have been extended to deal with binary outcomes (such as disease). Penalization is achieved by minimizing an objective function f(β_{0}, β) = g(β_{0}, β) + h(λ, β) with the penalization term h(λ, β) taking one of the forms above, and g(β_{0}, β) equalling minus one [6] or two [7] times the log likelihood of the data. Software implementations include the R package "glmnet", which fits the lasso or elastic-net regularization path for linear, logistic, and multinomial regression models, and the R package "grplasso", which fits a variant of the group lasso approach for binary outcome data.
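As an illustration of the objective function f(β_{0}, β) = g(β_{0}, β) + h(λ, β) for a binary outcome, the sketch below evaluates it in plain Python. The names `log_likelihood`, `objective`, and the `scale` argument (1 or 2, for minus one or minus two times the log likelihood) are assumptions made for this example; this is not how "glmnet" or "grplasso" are implemented internally.

```python
import math


def log_likelihood(beta0, beta, X, y):
    """Logistic regression log likelihood for rows X and 0/1 outcomes y."""
    ll = 0.0
    for xi, yi in zip(X, y):
        eta = beta0 + sum(b * x for b, x in zip(beta, xi))
        # log(1 + exp(eta)) computed in a numerically stable way
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        ll += yi * eta - log1pexp
    return ll


def objective(beta0, beta, X, y, penalty, scale=1.0):
    """f = g + h, with g = -scale * log likelihood and h = penalty(beta)."""
    return -scale * log_likelihood(beta0, beta, X, y) + penalty(beta)
```

Here `penalty` is any function of the coefficient vector, for example `lambda b: 3.0 * sum(abs(v) for v in b)` for a lasso penalty with λ = 3. The intercept β_{0} is passed separately so that it is left unpenalized.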
Methods
Data
We analyzed the North American Rheumatoid Arthritis Consortium (NARAC) data, consisting of 868 rheumatoid arthritis (RA) cases and 1194 controls genotyped at 545,080 single-nucleotide polymorphisms (SNPs) across 22 autosomal chromosomes. These data were recently used in combination with additional samples [8] to perform genome-wide association analysis, confirming previously proposed associations between disease and variants in HLA and PTPN22, and also reporting a new locus on chromosome 9. We therefore focused on these regions for application of our penalized regression approach.
Quality control
We used the software PLINK [9] to perform basic quality control checks. SNPs were excluded based on a SNP genotype call rate of <95%, minor allele frequency <1%, and Hardy-Weinberg equilibrium (HWE) p-value < 10^{-7}. We also removed individuals with >5% missing genotypes. We used multidimensional scaling of the Genetic Analysis Workshop (GAW) 16 data, together with publicly available HapMap data on 210 unrelated individuals from four populations, to confirm that the individuals from the GAW data had European ancestry and were not related.
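The per-SNP exclusion criteria can be illustrated with a short Python sketch. This is not PLINK's implementation: the function name `snp_qc` is hypothetical, genotypes are assumed to be coded as 0, 1, or 2 copies of one allele with `None` for a missing call, and a simple 1-df chi-square approximation is assumed for the HWE test, which may differ from the test PLINK applies. The per-individual missingness filter is not shown.

```python
import math


def snp_qc(genotypes, min_call_rate=0.95, min_maf=0.01, hwe_threshold=1e-7):
    """Apply call-rate, minor-allele-frequency, and HWE filters to one SNP."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / len(genotypes)
    if n == 0:
        return {"call_rate": 0.0, "maf": 0.0, "hwe_p": 1.0, "keep": False}

    freq = sum(called) / (2 * n)          # frequency of the counted allele
    maf = min(freq, 1.0 - freq)

    if maf == 0.0:
        hwe_p = 1.0                       # monomorphic: HWE test undefined
    else:
        observed = [called.count(k) for k in (0, 1, 2)]
        expected = [n * (1 - freq) ** 2, 2 * n * freq * (1 - freq), n * freq ** 2]
        chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        hwe_p = math.erfc(math.sqrt(chi2 / 2.0))   # chi-square tail, 1 df

    keep = call_rate >= min_call_rate and maf >= min_maf and hwe_p >= hwe_threshold
    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p, "keep": keep}
```

The default thresholds mirror the values stated above (call rate 95%, minor allele frequency 1%, HWE p-value 10^{-7}).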
Single-locus analysis
We used PLINK to perform a Cochran-Armitage trend test at each SNP. Unlike Plenge et al. [8], we made no attempt to correct for population stratification, as we wished to compare our single-locus results with those from group lasso penalized regression, which does not (in its current software implementation) allow inclusion of additional covariates such as principal-component scores from an eigenvector analysis [10].
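For reference, the Cochran-Armitage trend test for a single SNP can be computed directly from the genotype counts in cases and controls. The sketch below is a plain Python illustration with additive weights (0, 1, 2), not PLINK's code; the function name is hypothetical and the p-value uses the 1-df chi-square distribution.

```python
import math


def cochran_armitage_trend(case_counts, control_counts, weights=(0, 1, 2)):
    """Return (chi-square statistic, p-value) for the trend test.

    case_counts and control_counts give the numbers of individuals
    carrying 0, 1, and 2 copies of the variant allele.
    """
    R = sum(case_counts)
    S = sum(control_counts)
    N = R + S
    totals = [r + s for r, s in zip(case_counts, control_counts)]

    sum_tr = sum(t * r for t, r in zip(weights, case_counts))
    sum_tn = sum(t * n for t, n in zip(weights, totals))
    sum_t2n = sum(t * t * n for t, n in zip(weights, totals))

    trend = N * sum_tr - R * sum_tn
    variance = R * S * (N * sum_t2n - sum_tn ** 2) / N
    if variance == 0:
        return 0.0, 1.0                   # monomorphic SNP or empty group

    chi2 = trend * trend / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))   # chi-square tail, 1 df
    return chi2, p_value
```

For example, `cochran_armitage_trend([10, 20, 30], [30, 20, 10])` gives a chi-square statistic of 20 on 1 degree of freedom.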
Penalized regression analysis using the group lasso procedure
We applied the group lasso procedure proposed by Meier et al. [6] implemented in the R package "grplasso" to SNP data in the three regions of association (chromosomes 1, 6, and 9) detected by Plenge et al. [8]. Because the software required data to be available at all predictor variables, PLINK was first used to impute any missing genotypes on the basis of linkage disequilibrium (LD) patterns with observed genotypes. We chose this particular penalization approach and software because it is one of the few available methods that deal with binary (case/control) as opposed to quantitative outcomes, and because we were attracted by the natural coupling of parameters that could potentially be achieved through use of the group lasso penalty term.
Consideration of groups of predictors simultaneously could be useful if one wished to include more than one predictor per SNP (e.g., to model genotype effects rather than allelic effects, or interactions only in the presence of main effects) or to impose some other grouping based on (for example) biological function. However, in our analyses, we used only a single predictor variable per locus (coded 0, 1, or 2 according to the number of variant alleles), and thus each SNP formed a group by itself.
The group lasso estimate is obtained by minimizing −l(β_{0}, β) + λ Σ_{g} s(df_{g}) ||β_{g}||_{2}, where l(β_{0}, β) is the logistic regression log-likelihood function, ||β_{g}||_{2} is the Euclidean norm of the coefficients in group g, and the function s(·) is used to rescale the penalty with respect to the dimensionality df_{g} of the parameter vector for group g (not relevant here, as each group contains a single parameter). The choice of the tuning parameter λ controls the amount of penalization. A natural way to estimate λ is to use cross-validation [5]; however, this can be very time consuming, particularly when coupled with the bootstrapping approach that we describe below. Instead, we used the simpler proposal by Meier et al. [6] to take λ equal to log(G), where G is the number of groups, in our case the number of SNPs to be fitted in the model. Thus, λ varied from log(1000) = 6.9 to log(7000) = 8.85 in the results described below.
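The choice λ = log(G) and the group-wise penalty can be sketched in Python as follows. The rescaling function is assumed here to be the square root of the group size, and the function names are illustrative; neither is taken from the "grplasso" source.

```python
import math


def default_lambda(n_groups):
    """Tuning parameter taken as the natural log of the number of groups."""
    return math.log(n_groups)


def group_lasso_penalty(beta_groups, lam):
    """lam * sum over groups of sqrt(group size) * Euclidean norm of the group.

    beta_groups is a list of lists, one inner list of coefficients per group.
    The sqrt(group size) rescaling is an assumption of this sketch.
    """
    total = 0.0
    for group in beta_groups:
        norm = math.sqrt(sum(b * b for b in group))
        total += math.sqrt(len(group)) * norm
    return lam * total
```

When every SNP forms a group by itself, as in the analyses described here, each norm reduces to an absolute value and the penalty coincides with the ordinary lasso penalty λ Σ_{j} |β_{j}|.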
The output from a penalized regression procedure consists of an estimated regression coefficient for each predictor in the model: model selection is performed by estimation rather than hypothesis testing [5]. Because we do not have any measure of the variability of the estimated coefficient, interpretation of the importance or significance of any particular predictor can be problematic. Ideally, we would like to present results in the form of a significance test for each coefficient in order to perform comparisons with standard single-locus tests of association. To address this limitation, we used a bootstrap: the penalized regression procedure was performed 50 times on 50 different bootstrap replicates constructed by selecting observations (people) with replacement from the original sample. This allowed us to estimate the variance of each regression coefficient. We then constructed a z-score at each locus by dividing the observed regression coefficient by its estimated standard error, and converted this to a p-value, assuming the z-score to be normally distributed. This procedure is not, strictly speaking, correct, because penalized regression does not enjoy the asymptotic properties of standard regression procedures: shrinkage of the regression coefficients means their distribution cannot be assumed to be asymptotically normal. However, we hoped that this procedure would provide us with a ballpark estimate of the relative significance of the regression coefficients (relative to one another), even if the exact significance levels could not be considered reliable.
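The bootstrap procedure described above can be sketched in Python as follows. The argument `fit` is a hypothetical placeholder for any routine that returns one coefficient per predictor (for example, a wrapper around a penalized regression fit); the two-sided normal p-value and the handling of a zero standard error are assumptions of this sketch rather than details given in the text.

```python
import math
import random


def bootstrap_p_values(fit, X, y, n_boot=50, seed=0):
    """Approximate p-values from bootstrap standard errors of coefficients.

    fit(X, y) must return a list of coefficients, one per predictor.
    """
    rng = random.Random(seed)
    n = len(y)
    observed = fit(X, y)

    # Refit on n_boot samples of individuals drawn with replacement.
    replicates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        replicates.append(fit([X[i] for i in idx], [y[i] for i in idx]))

    p_values = []
    for j, coef in enumerate(observed):
        values = [rep[j] for rep in replicates]
        mean = sum(values) / n_boot
        se = math.sqrt(sum((v - mean) ** 2 for v in values) / (n_boot - 1))
        if se == 0.0:
            # Coefficient never varied (e.g. always shrunk to 0): z undefined.
            p_values.append(1.0 if coef == 0.0 else float("nan"))
        else:
            z = coef / se
            p_values.append(math.erfc(abs(z) / math.sqrt(2.0)))  # two-sided
    return p_values
```

A coefficient that is shrunk to exactly zero in every replicate has no bootstrap variance, which is one practical sign that the normal approximation is only a rough guide for penalized estimates.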
Results
Discussion
Penalization approaches are an appealing alternative to standard regression techniques for analysis of large numbers of predictor variables in the context of genome-wide association studies. Use of such techniques is just beginning to emerge: ridge regression [11] has been used for distinguishing between causative and non-causative variants for quantitative phenotypes, and penalized logistic and least angle regression have been used for identifying gene-gene interactions in binary traits [7, 12]. A closely related Bayesian penalized regression procedure [13] has also been suggested for genome-wide and/or fine-mapping studies. Although, theoretically, the simultaneous inclusion of many markers across the genome in a single regression analysis has some appeal (on account of the reduction in residual variance that can be achieved), it is unclear whether one would genuinely expect this to improve upon single-locus analysis with respect to detection of disease-associated polymorphisms. A more promising application is the fine-mapping problem, in which one is interested in determining from a smaller (although still potentially large) set of strongly correlated predictors in a region, which ones drive the association and are thus potentially causal or lie close to causal variant(s). Simulations suggest that penalized regression may offer some improvement over single-locus methods in this regard [11, 13], although interpretation is complicated by difficulties in defining criteria for "true" and "false" detections in this context. In the analyses described here, we did not find the group lasso approach to offer any advantage over single-locus methods with respect to either detection or localization of disease-associated polymorphisms. Single-locus analysis provided a clear and localized signal of association, whereas the penalized approach generated a number of somewhat isolated signals, some with unusually small p-values, across the regions investigated.
Further investigation (data not shown) suggests that use of a higher penalty may produce better results: ideally one might wish to use cross-validation to choose the best value of λ from a range of possible values; however, this is likely to be prohibitively time-consuming on a genome-wide scale. Further investigation of alternative penalization algorithms and of methods for choosing penalization parameters and assessing significance is warranted.
List of abbreviations used
GAW: Genetic Analysis Workshop
HWE: Hardy-Weinberg equilibrium
LD: Linkage disequilibrium
NARAC: North American Rheumatoid Arthritis Consortium
RA: Rheumatoid arthritis
SNPs: Single-nucleotide polymorphisms
Declarations
Acknowledgements
This work was supported by the Wellcome Trust, grant reference 074524. The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/3?issue=S7.
Authors’ Affiliations
References
 1. Hoerl AE, Kennard R: Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970, 12: 55-67. 10.2307/1267351.
 2. Zou H, Hastie T: Regularization and variable selection via the elastic net. J R Statist Soc Ser B. 2005, 67: 301-320. 10.1111/j.1467-9868.2005.00503.x.
 3. Tibshirani R: Regression shrinkage and selection via the lasso. J R Statist Soc Ser B. 1996, 58: 267-288.
 4. Yuan M, Lin Y: Model selection and estimation in regression with grouped variables. J R Statist Soc Ser B. 2006, 68: 49-67. 10.1111/j.1467-9868.2005.00532.x.
 5. Wu TT, Lange K: Coordinate descent algorithms for lasso penalized regression. Ann Appl Statist. 2008, 2: 224-244. 10.1214/07-AOAS147.
 6. Meier L, van de Geer S, Bühlmann P: The group lasso for logistic regression. J R Statist Soc Ser B. 2008, 70: 53-71.
 7. Park MY, Hastie T: Penalized logistic regression for detecting gene interactions. Biostatistics. 2008, 9: 30-50. 10.1093/biostatistics/kxm010.
 8. Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, Ding B, Liew A, Khalili H, Chandrasekaran A, Davies LR, Li W, Tan AK, Bonnard C, Ong RT, Thalamuthu A, Pettersson S, Liu C, Tian C, Chen WV, Carulli JP, Beckman EM, Altshuler D, Alfredsson L, Criswell LA, Amos CI, Seldin MF, Kastner DL, Klareskog L, Gregersen PK: TRAF1-C5 as a risk locus for rheumatoid arthritis - a genomewide study. N Engl J Med. 2007, 357: 1199-1209. 10.1056/NEJMoa073491.
 9. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
 10. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38: 904-909. 10.1038/ng1847.
 11. Malo N, Libiger OJ, Schork NJ: Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am J Hum Genet. 2008, 82: 375-385. 10.1016/j.ajhg.2007.10.012.
 12. Zhang Z, Zhang S, Wong MY, Wareham NH, Sha Q: An ensemble learning approach jointly modelling main and interaction effects in genetic association studies. Genet Epidemiol. 2008, 32: 285-300. 10.1002/gepi.20304.
 13. Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ: Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genetics. 2008, 4: e1000130. 10.1371/journal.pgen.1000130.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.