Volume 3 Supplement 7
Genetic Analysis Workshop 16
Comparison between the stochastic search variable selection and the least absolute shrinkage and selection operator for genomewide association studies of rheumatoid arthritis
 Sudeep Srivastava^{1} and
 Liang Chen^{1}Email author
DOI: 10.1186/175365613S7S21
© Srivastava and Chen; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Abstract
Background
Because multiple loci control complex diseases, there is great interest in testing markers simultaneously instead of one by one. In this paper, we applied two model selection algorithms: the stochastic search variable selection (SSVS) and the least absolute shrinkage and selection operator (LASSO) to two quantitative phenotypes related to rheumatoid arthritis (RA).
Results
The Genetic Analysis Workshop 16 data includes 2,062 unrelated individuals and 545,080 singlenucleotide polymorphism markers from the Illumina 550 k chip. We performed our analyses on the cases as the quantitative phenotype data was not provided for the controls. The performance of the two algorithms was compared. Using sure independence screening as the prescreening procedure, both SSVS and LASSO give small models. No markers are identified in the human leukocyte antigen region of chromosome 6 that was shown to be associated with RA. SSVS and LASSO identify seven common loci, and some of them are on genes LRRC8D, LRP1B, and COLEC12. These genes have not been reported to be associated with RA. LASSO also identified a common locus on gene KTCD21 for the two phenotypes (marker rs230662 and rs483731, respectively).
Conclusion
SSVS outperforms LASSO in simulation studies. Both SSVS and LASSO give small models on the RA data, however this depends on model parameters. We also demonstrate the ability of both LASSO and SSVS to handle more markers than the number of samples.
Introduction
It is now feasible to perform largescale, highdensity genomewide association studies (GWAS) to search for common genetic variants underlying common diseases (reviewed in [1, 2]). Due to their computational feasibility, singlemarker tests remain the primary tools in the analysis of GWAS data. However, most common diseases are complex and are caused by multiple genetic variants, each having only a small effect. The possible interactions among genetic variants and the interactions between genes and environment present additional challenges for disease mapping.
Because multiple loci contribute to complete diseases, testing markers simultaneously instead of one by one may increase statistical power. Advanced technology can provide us thousands of highquality marker genotypes. To identify the correct set of genetic variants from thousands of markers, efficient and reasonable model selection algorithms are urgently needed. Two popular model selection methods have been proposed: the stochastic search variable selection (SSVS) [3] and the least absolute shrinkage and selection operator (LASSO) [4]. With SSVS, a latent variable γ is introduced to perform variable selection for regression models. If γ_{ j }= 1 the j^{th} variable is included in the model; if γ_{ j }= 0, the j^{th} variable is excluded from the model. A homogenous ergodic Markov chain can be generated by the Gibbs sampler. The empirical distribution of γ based on the Markov chain will converge to the actual posterior distribution of γ [5]. LASSO method proposed by Tibshirani is a shrinkagebased selection method for linear regression. LASSO minimizes the residual sum of squares subject to the constraint on the sum of absolute value of coefficients. This L1Norm constraint produces shrunk coefficients with some of them exactly equal to zero, which leads to interpretable models. In 2004, Efron et al. proposed the least angle regression (LARS) [6], which is a computationally efficient modelselection algorithm. There is a close connection between LARS and LASSO. A simple modification of the LARS algorithm can yield all LASSO solutions. LASSO is one of the most popular model selection methods that involve minimization of the mean square error with respect to some constraints. On the other hand, SSVS is based on the Gibbs sampler, which belongs to the broader class of MarkovChain Monte Carlo methods. Hence, a comparison of the two methods would be of great interest.
In this paper, we focus on two quantitative phenotypes related to rheumatoid arthritis (RA). The data was provided by the Genetic Analysis Workshop 16. RA is an autoimmune disease that causes chronic inflammation in joints, resulting in loss of function and disability. We applied SSVS and LASSO to the two quantitative phenotypes to identify genetic variants associated with RA. We compared the results based on 545,080 SNPs.
Methods
SSVS
where Y is n × 1, X = [X_{1},..., X_{ p }] is n × p, β = (β_{1},..., β_{ p }), and θ^{2} is scalar.
which is an ergodic Markov chain that converges to its stationary distribution. We ran the Gibbs sampler for 2,000 iterations to achieve stationarity and then ran it for an additional 8,000 iterations to estimate the posterior probabilities of the γ_{ i }values.
LASSO
This is a quadratic programming problem but can be solved by a simple modification to the LARS algorithm by Efron et al. [6]. LARS, a modification of the classic modelselection method known as forward selection, orders regression covariates on the basis of their decreasing association with the output variable. By modifying LARS to enforce the sign consistency restriction of LASSO, we can generate all LASSO solutions corresponding to different values of the free parameter λ. Selecting the active model at a given iteration would give us LASSO solution corresponding to a particular value of λ. Hence, using a cutoff for the number of iterations allowed for the modified LARS, we can control λ.
Simulation studies
where λ_{ i }is the iteration at which the i^{th} marker enters the model.
Sure independence screening for the RA data
SIS selects the largest componentwise magnitudes of the vector. Here, we selected the 1,000 markers with the largest correlations with the quantitative traits for SSVS and LASSO. Because the number of samples are 746 and 867 for the IgM and the antiCCP phenotypes, respectively, we demonstrate the use of SSVS and LASSO to select a model in which the number of markers is greater than the number of samples. SIS reduces computational time required by the modelselection methods. SIS has been shown to retain important variables in the screened model with a large probability when a set of conditions are satisfied [9].
Results
Simulation studies
Simulation results of QTL mapping
AUC_{b}  

Coefficient  Marker density^{a}  SSVS (τ= 1)  SSVS (τ= 2)  SSVS (τ= 3)  LASSO  Ftest 
0.5  200  0.898  0.891  0.884  0.684  0.729 
0.5  100  0.923  0.918  0.913  0.7  0.753 
0.5  10  0.949  0.944  0.942  0.783  0.800 
0.5  1  0.950  0.947  0.942  0.798  0.814 
1  200  0.969  0.955  0.947  0.873  0.814 
1  100  0.987  0.982  0.978  0.886  0.854 
1  10  0.997  0.996  0.995  0.965  0.922 
1  1  0.999  0.998  0.997  0.979  0.932 
1.5  200  0.989  0.985  0.976  0.926  0.838 
1.5  100  0.998  0.998  0.996  0.939  0.887 
1.5  10  1  1  0.999  0.991  0.951 
1.5  1  1  1  1  0.995  0.951 
2  200  0.991  0.988  0.986  0.952  0.823 
2  100  0.999  0.997  0.997  0.962  0.874 
2  10  1  1  0.999  0.995  0.963 
2  1  1  1  1  0.998  0.966 
RA
Markers with a minor allele frequency less than 0.01 were screened out. After this step, 425,555 markers for the IgM phenotype and 425,586 markers for the antiCCP phenotype remained. After removing the samples with missing phenotypes, 746 samples for the IgM phenotype and 867 samples for the antiCCP phenotype were left. The correlation between the phenotype and the genotype was calculated. The top 1,000 markers were chosen for each of the two phenotypes. Then, SSVS and LASSO were run on both phenotypes.
The LASSO algorithm was run for 500 iterations unless it terminated earlier automatically because there were no more significant markers. The optimal cutoff for the number of iterations was selected using a 10fold crossvalidation. The minimum crossvalidation error was achieved at 31 iterations for the IgM phenotype and at 33 iterations for the antiCCP phenotype. It identified 31 significant markers for the IgM phenotype and 33 for the antiCCP phenotype. The marker positions are also shown in Figure 1. No common markers were identified for the two phenotypes. However two markers, rs230662 for the IgM phenotype and rs483731 for the antiCCP phenotype, are within 100 Kb of each other and both lie on the same gene KTCD21 at chromosome location 11q14.1. These markers have been enclosed in a diamond in Figure 1.
SSVS and LASSO identified a few common loci for the IgM phenotype and the antiCCP phenotype. For the IgM phenotype, SSVS selected rs1938032 and LASSO selected rs12032393; both markers are located on the gene LRRC8D on chromosome 1 (enclosed by a star in Fig. 1). On chromosome 7, SSVS selected rs2392467 and LASSO selected rs13234380 for the IgM phenotype. Both markers are within 100 kb of each other and are shown in Figure 1 (enclosed by a star). For the antiCCP phenotype, SSVS selected rs10183908 and LASSO selected rs972485 in the gene LRP1B on chromosome 2 (enclosed by a triangle). On chromosome 4, SSVS selected rs192068 and LASSO selected rs1540052, while on chromosome 10, SSVS selected rs807013 and LASSO selected rs17113682 (enclosed by triangles). Both marker pairs are within 100 kb of each other, but neither of them are located on annotated genes. SSVS and LASSO selected two common markers, rs2186830 and rs334438, on chromosome 18 (enclosed by triangles). Among them, marker rs2186830 is located on the gene COLEC12. None of these genes have been reported to be associated with RA in the literature.
Discussion
SSVS and LASSO have advantages and disadvantages in their application to the model selection problem. SSVS is computationally intensive and cannot handle a very large set of markers. Hence, a more aggressive marker screening step needs to be done. The Gibbs sampler also requires the selection of prior parameters. In our simulation studies, by changing the prior of the γ values, we did not see a significant difference in the power of SSVS (data not shown). However, changing σ and τ will change the results. Further study is required on how to select these parameters. On the other hand, LASSO is very fast and hence the model is selected very quickly. It can handle large number of markers because the computational time in each step is proportional to the size of the model at that iteration. However, a big problem with LASSO is the selection of the cutoff. The Cp type score mentioned in Efron et al. [6] does not work in this data set because it does not give a small enough model (e.g., the model size using the Cp type score exceeds 200 markers for a given phenotype). LASSO may also overshrink parameters because it shrinks both nonsignificant coefficients and significant coefficients at the same rate. This can lead to overshrinking of the nonzero coefficients and result in models with large size as more variables are required to fit the data. To rectify this problem, cross validation is performed using the ordinary lease squares estimates of the model selected at each iteration rather than using LASSO estimates for β values. The problem of overshrinking can be minimized by using different penalties for different coefficients, which is performed in the adaptive LASSO [10]. In addition, it is very important to choose the prescreening steps carefully. A total of 1,000 markers were used for our studies to demonstrate the ability of SSVS and LASSO to handle more markers than the sample size. In future studies, it would be of great interest to address the number of markers chosen in the prescreening step to increase power.
For SSVS, we found that for different marker coeffcients (β), the best results were achieved by different values of τ. SSVS with τ = 1 had the maximum power compared with other values of τ. While for β = 2, SSVS with τ = 2 has the maximum power compared to τ = 1,3. It indicates that different values of hyperparameters are required to detect markers with different effects.
SSVS and LASSO both rely on the assumption that the predictors are independent. However, the markers are dependent on each other due to linkage disequilibrium, which would need to be considered to make an accurate statistical inference. SIS also assumes independence, but is more robust to dependent markers than SSVS or LASSO. A modification of LASSO known as the adaptive LASSO [10] deals with correlated predictors by using adaptive weights for different predictors. It would be useful to compare the results of adaptive LASSO to that of LASSO.
SSVS and LASSO both assume a linear model and independent predictors. Using simulation studies, we have demonstrated that both SSVS and LASSO select common causal markers. When the markers are farther apart, i.e. they are independent of each other, and the prediction accuracy increases. In the real data, these assumptions may be violated. In complex diseases like RA, the individual marker effect is very weak and the marker effect may be nonlinear. The markers are not independent of each other due to high linkage disequilibrium. In addition, there are other confounding variables such as population structure and epistatic effects. All of the factors will affect the performance of the two methods to different degrees. Therefore, only a few common markers are selected by the two methods. It would be of great interest to study the behavior of SSVS and LASSO when these effects are incorporated into a more complex simulation model.
The human leukocyte antigen (HLA) region on chromosome 6 has been reported to be associated with the disease by various association and linkage studies [11, 12]. However, we did not identify any markers within that region using LASSO or SSVS. This is an interesting result because markers in the HLA region correlate very strongly with the disease status. A possible explanation could be the missing quantitative phenotype data for the controls. QTL studies focus on quantitative traits instead of the binary disease status information. By incorporating disease status via a different method like logistic regression, random forests, and so on, one can build a more powerful model for the association study. This would be extremely useful in our data set because the disease status shows a very high association with the HLA region, but the quantitative phenotype data shows no association with the HLA region.
Conclusion
In this paper, we compared SSVS and LASSO on simulated and real data. SSVS outperforms LASSO as well as the singlemarker Ftest in the simulation studies. The two methods were compared on the RA data provided by the Genetic Analysis Workshop 16 workshop after a dimension reduction using the SIS. Association studies were carried out for the two quantitative phenotypes using the two algorithms. The two methods identified two common markers for the antiCCP antibody phenotype and two and three marker pairs that are located close to each other for the rheumatoid factor IgM phenotype and the antiCCP antibody phenotype, respectively.
Some of these markers were found to be in annotated genes such as LRRC8D, LRP1B, and COLEC12. However these genes have not been reported to be associated with RA. In summary, SSVS and LASSO are very useful tools in GWAS studies for quantitative data.
List of Abbreviations used
 antiCCP:

Anticyclic citrullinated peptide
 AUC:

Area under the curve
 GWAS:

Genomewide association study
 HLA:

Human leukocyte antigen
 IgM:

Immunoglobulin M
 LARS:

Least angle regression
 LASSO:

Least absolute shrinkage and selection operator
 RA:

Rheumatoid arthritis
 ROC:

Receiver operator characteristic
 SIS:

Sure independence screening
 SSVS:

Stochastic search variable selection.
Declarations
Acknowledgements
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. This work was partially supported by the NIH grant P50 HG 002790 and the AFAR research grant.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/3?issue=S7.
Authors’ Affiliations
References
 McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN: Genomewide association studies for complex traits: consensus, uncertainty and challenges. Nature Rev Genet. 2008, 9: 356369. 10.1038/nrg2344.View ArticlePubMedGoogle Scholar
 Kruglyak L: The road to genomewide association studies. Nature Rev Genet. 2008, 9: 314318. 10.1038/nrg2316.View ArticlePubMedGoogle Scholar
 George E, McCulloch R: Variable selection via Gibbs sampling. J Am Stat Assoc. 1993, 88: 881889. 10.2307/2290777.View ArticleGoogle Scholar
 Tibshirani R: Regression shrinkage and selection via LASSO. J Roy Stat Soc Ser B. 1996, 58: 267288.Google Scholar
 Casella G, George E: Explaining the Gibbs sampler. Am Stat. 1992, 46: 167174. 10.2307/2685208.Google Scholar
 Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Ann Stat. 2004, 32: 407499. 10.1214/009053604000000067.View ArticleGoogle Scholar
 Huizinga TW, Amos CI, Helmvan Mil van der AH, Chen W, van Gaalen FA, Jawaheer D, Schreuder GM, Wener M, Breedveld FC, Ahmad N, Lum RF, de Vries RR, Gregersen PK, Toes RE, Criswell LA: Refining the complex rheumatoid arthritis phenotype based on specificity of the HLADRB1 shared epitope for antibodies to citrullinated proteins. Arthritis Rheum. 2005, 52: 34333438. 10.1002/art.21385.View ArticlePubMedGoogle Scholar
 Ma S, Huang J: Combining multiple markers for classification using ROC. Biometrics. 2007, 63: 751757. 10.1111/j.15410420.2006.00731.x.View ArticlePubMedGoogle Scholar
 Fan J, Lv J: Sure independence screening for ultrahigh dimensional feature space. J Roy Stat Soc Ser B. 2008, 70: 849911. 10.1111/j.14679868.2008.00674.x.View ArticleGoogle Scholar
 Zou H: The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006, 101: 14181429. 10.1198/016214506000000735.View ArticleGoogle Scholar
 Fisher S, Lanchbury J, Lewis C: Metaanalysis of four rheumatoid arthritis genomewide linkage studies: confirmation of a susceptibility locus on chromosome 16. Arthritis Rheum. 2003, 48: 12001206. 10.1002/art.10945.View ArticlePubMedGoogle Scholar
 Osorio Y, Fortéa J, Bukulmez H, PetitTeixeira E, Michou L, Pierlot C, CailleauMoindrault S, Lemaire I, Lasbleiz S, Alibert O, Quillet P, Bardin T, Prum B, Olson JM, Cornélis F: Dense genomewide linkage analysis of rheumatoid arthritis, including covariates. Arthritis Rheum. 2004, 50: 27572765. 10.1002/art.20458.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.