Volume 3 Supplement 7
Detecting single-nucleotide polymorphism by single-nucleotide polymorphism interactions in rheumatoid arthritis using a two-step approach with machine learning and a Bayesian threshold least absolute shrinkage and selection operator (LASSO) model
© González-Recio et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
The objective of this study was to detect interactions between relevant single-nucleotide polymorphisms (SNPs) associated with rheumatoid arthritis (RA). Data from Problem 1 of the Genetic Analysis Workshop 16 were used. These data consisted of 868 cases and 1,194 controls genotyped with the 500 k Illumina chip. First, machine learning methods were applied for preselecting SNPs. One hundred SNPs outside the HLA region and 1,500 SNPs in the HLA region were preselected using information-gain theory. The software weka was used to reduce colinearity and redundancy in the HLA region, resulting in a subset of 6 SNPs out of 1,500. In a second step, a parametric approach to account for interactions between SNPs in the HLA region, as well as HLA-nonHLA interactions was conducted using a Bayesian threshold least absolute shrinkage and selection operator (LASSO) model incorporating 2,560 covariates. This approach detected some main and interaction effects for SNPs in genes that have previously been associated with RA (e.g., rs2395175, rs660895, rs10484560, and rs2476601). Further, some other SNPs detected in this study may be considered in candidate gene studies.
Rheumatoid arthritis (RA) is a chronic disease with known autoimmune pathophysiology. RA is a heritable condition and association studies have already identified a genomic region in chromosome 6 (the HLA region). While this represents progress in elucidating genetic contributions to RA, much is still unknown about the underlying genetic causes, and there is plenty of evidence that there exist other genes affecting disease risk, both as major effects and in epistasis . Genome-wide association studies (GWAS) of diseases and complex traits have become a major focus of research in human genetics. GWAS may provide robust results for acquiring knowledge on the underlying genetic behavior of RA. One of the most difficult challenges of GWAS is how to deal with the large p, small n problem, arising when the number of variables considered (p) is much larger than the number of subjects (n). The problem becomes particularly difficult when one seeks to estimate single-nucleotide polymorphism (SNP) by SNP interactions. One approach for efficiently handling high dimensional GWA data consists of two steps: 1) reducing dimensionality by filtering non-informative markers, and 2) applying a more sophisticated model to quantify the effect of the selected SNPs and their interactions. Information gain and the wrapper procedure are examples of machine learning that have shown benefits over linear regression for Step 1. These methods are easy to implement and may deal with crude, noisy, and inconsistent information. They alleviate redundancy, colinearity, and the assumption of multivariate normality, making them appealing in genomic studies. The function that relates covariates to observations is unspecified, providing more flexibility in the model. Further, they may deal with non-additive effects, which are of interest in genetic epidemiology. Some drawbacks to these methods exist: information gain may not completely remove colinearity in the system and the wrapper procedure it is too computationally demanding to use on a large number of records and SNPs. Hence, a second step is necessary. A large variety of methods have been previously proposed for Step 2. Here, we propose a novel method that is able to reduce colinearity and make a higher shrinkage to zero than other methods for less relevant SNP and SNP × SNP effects.
The Genetic Analysis Workshop 16 (GAW16) provides an opportunity for testing novel methods, such as those proposed above, on a well characterized dataset, to compare results and interpretations, and to discuss current problems in genetic analysis. The aim of this study was to identify additional disease susceptibility loci in the GAW16 RA data in a two-step approach: first, reducing the number of SNPs to be tested through machine learning algorithms; second, identifying interactions or epistatic effects between HLA and non-HLA SNPs using a Bayesian threshold LASSO model.
Data from the North American Rheumatoid Arthritis Consortium (NARAC) provided by the GAW16 were used in the analyses. The initial batch consisted of 868 cases and 1,194 controls genotyped with the 500 k Illumina chip (545,080 SNPs). Further description of the initial data can be found in Plenge et al. . Local institutional review board (IRB) approval was obtained.
SNPs showing >2% missing genotypes were excluded (64,041). Monomorphic SNPs were omitted (1,920). SNPs with minor allele frequency <0.05 that did not show association with the disease through a Fisher's exact test (p < 0.001) were also omitted (47,190). Finally, SNPs that deviated from Hardy-Weinberg equilibrium (p < 0.0001) in the controls were discarded (11,835). The number of SNPs remaining in the analysis was 420,094.
Stage 1: Preselection of significant SNPs using machine learning
Information gain or entropy theory
where . At this point, SNPs were divided in two groups based on whether they were in the HLA region or somewhere else along the genome (non-HLA SNPs). The HLA region was defined starting at HLA-F (29,799,096 to 29,803,052) and extending to HLA-DPB1 (33,151,738 to 33,162,954). The 100 non-HLA SNPs with the highest information gain were selected to test for interactions with HLA SNPs in Stage 2. The HLA SNPs that passed to Stage 2 were selected as follows.
Selection of independent SNPs in the HLA region on chromosome 6
The most relevant SNPs in this region were selected using the wrapper procedure . This procedure aims to reduce redundancy and colinearity in a feature subset. The HLA SNPs with an information gain above the 99.65 percentile (approximately the top 1,500 SNPs) were considered as candidates, and were included in this wrapper. This method involves searching through all possible combinations of SNPs in the data to find an 'optimal' subset of SNPs that best classify the phenotype outcome (binary: case or control), using an attribute evaluator and a search method. The attribute evaluator used was the naïve Bayesian classifier, with a bidirectional hill-climbing search method. The naïve Bayesian evaluator can be described as follows:
We chose a five-fold cross-validation scenario in which a different list of SNPs was generated in each fold. Therefore, SNPs that appeared in three or more folds were extracted and included in Stage 2. The wrapper approach was implemented using the weka software .
Stage 2: Selection of significant SNPs and interactions
To identify interactions among SNPs in the HLA region and between an HLA SNP and a SNP elsewhere in the genome, we applied a Bayesian version of the LASSO (least absolute shrinkage and selection operator) . The LASSO constrains the sum of absolute values of the regression coefficients, leading some coefficient estimates to be exactly zero. This can be viewed as a feature selection, and is suitable for quantifying estimates, as well.
where λis the vector of liabilities for all individuals, βare the LASSO estimates with their respective incidence matrix X, and, as a modeling choice, e was considered the vector of residuals independent and identically distributed as . In accordance with tradition, we fixed the threshold to be 0 and the residual variance to be 1; alternate choices result in the same model.
Let Xβ = X M β M + XHLA-HLAβHLA-HLA+ XHLA-nonHLAβHLA-nonHLA, where β M is the vector of major effects, βHLA-HLAcorresponds to the vector for interaction effects between HLA SNPs, and βHLA-nonHLAis to the vector for interaction effects between HLA and non-HLA SNPs, with X M , XHLA-HLA, and XHLA-nonHLAbeing the corresponding incidence matrices. These incidences matrices were constructed such that each major effect for SNP j was codified as = (-1, 0, or 1) for aa, Aa, and AA, respectively. Interactions were codified as follows: for each SNP j we defined two covariates, and , with = 1if the genotype was coded as 0 (0, otherwise), and = 1 if the genotype was coded as 1 (0, otherwise). Then, for each SNP j × SNP k interaction, we set four covariates, for m, n equal to 1 or 2.
Samples from posterior distributions of those estimates were drawn from the Gibbs sampling algorithm described in Park and Casella , with a chain length of 100,000 samples discarding the first 50,000 as burning, after checking convergence.
Information gain was calculated for each SNP across all chromosomes. The total entropy of the data was 0.98, which is the maximum information gain that a feature (i.e., SNP) could provide. The highest information gain was found in the HLA region on chromosome 6, as expected, with a maximum value of 0.19 for SNP rs2395175. Thirteen other SNPs in this region had an information gain higher than 0.10. The highest information gain outside of the HLA region was 0.017 (rs2476601 on chromosome 1). Within the HLA region, 472 out of the 1,323 SNPs had information gain in the 99.65 percentile. The wrapper procedure selected 6 out of the 472 SNPs.
A pre-screening stage seems necessary in genome-wide association studies to reduce the large p, small n problem. The machine learning approach (information gain + wrapper) used in this study detected the most important known region associated with RA and reduced the number of SNPs in both the HLA region and across the genome. In a second stage, the BTL model selected covariates of major and epistatic effects strongly associated with RA, some of them already known. The major effects of the two HLA SNPs with highest information gain did appear in the top 27 covariates, showing their importance on the liability to RA.
Shi et al.  used a LASSO model on the simulated data from the GAW15 Problem 3. However, they used a different pre-screening method and a non-Bayesian version of the LASSO. The Bayesian counterpart provides a measure of the reliability of the estimates. Because data used in the GAW16 RA problem are real data, the accuracy of the proposed approach cannot be tested immediately. However, the results in this study can be compared to the results generated by other methods applied to the same data, and also with past and future analyses. Further, SNPs found in this study with unknown previous function might act as markers of candidate genes in future research. Proving the benefits of these methods over others widely used in the field is a challenge for the future.
List of abbreviations used
Bayesian threshold LASSO
Genetic Analysis Workshop 16
Genome-wide association studies
Institutional review board
Least absolute shrinkage and selection operator
North American Rheumatoid Arthritis Consortium
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. OG-R acknowledges financial support from the Babcock Institute for International Dairy Research and Development at the University of Wisconsin-Madison. EL's research was supported by the Wisconsin Agriculture Experiment Station and by grant NSF-DMS-044371. T. ATV's research was supported by Spanish Ministry of Health grant FIS BA08/90039.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
- Mei L, Li X, Yang K, Cui J, Fang B, Guo X, Rotter JI: Evaluating gene × gene and gene × smoking interaction in rheumatoid arthritis using candidate genes in GAW15. BMC Proc. 2007, 1 (suppl 1): S17-10.1186/1753-6561-1-s1-s17.PubMed CentralView ArticlePubMedGoogle Scholar
- Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, Ding B, Liew A, Khalili H, Chandrasekaran A, Davies LR, Li W, Tan AK, Bonnard C, Ong RT, Thalamuthu A, Pettersson S, Liu C, Tian C, Chen WV, Carulli JP, Beckman EM, Altschuler D, Alfredsson L, Criswell LA, Amos CI, Seldin MF, Katner DL, Klareskog L, Gregersen PK: TRAF1-C5 as a risk locus for rheumatoid arthritis--a genomewide study. N Engl J Med. 2007, 357: 1199-1209. 10.1056/NEJMoa073491.PubMed CentralView ArticlePubMedGoogle Scholar
- Mackay DJC: Information Theory, Inference, and Learning Algorithms. 2003, Cambridge, Cambridge University PressGoogle Scholar
- Kohavi R, John GH: Wrappers for feature subset selection. Artif Intell. 1997, 97: 273-324. 10.1016/S0004-3702(97)00043-X.View ArticleGoogle Scholar
- Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. 2005, San Francisco, Morgan Kaufmann, 2Google Scholar
- Tibshirani R: Regression shrinkage and selection via the lasso. J Royal Stat Soc Ser B. 1996, 58: 267-288.Google Scholar
- Park T, Casella G: The Bayesian Lasso. J Am Stat Assoc. 2008, 103: 681-686. 10.1198/016214508000000337.View ArticleGoogle Scholar
- Wright S: An analysis of variability in number of digits in an inbred strain of guinea pigs. Genetics. 1934, 19: 506-536.PubMed CentralPubMedGoogle Scholar
- Källberg H, Padyukov L, Plenge RM, Rönnelid J, Gregersen PK, Helm-van Mil van der AHM, Toes REM, Huizinga TW, Klareskog L, Alfredsson L, Epidemiological Investigation of Rheumatoid Arthritis (EIRA) Study Group: Gene-gene and gene-environment interactions involving HLA-DRB1, PTPN22, and smoking in two subsets of rheumatoid arthritis. Am J Hum Genet. 2007, 80: 867-875. 10.1086/516736.PubMed CentralView ArticlePubMedGoogle Scholar
- Schmidt D, Goronzy JJ, Weyand CM: CD4+ CD7- CD28- T cells are expanded in rheumatoid arthritis and are characterized by autoreactivity. J Clin Invest. 1996, 97: 2027-2037. 10.1172/JCI118638.PubMed CentralView ArticlePubMedGoogle Scholar
- Jorgensen C, Noel D, Gross G: Could inflammatory arthritis be triggered by progenitor cells in the joints?. Ann Rheum Dis. 2002, 61: 6-9. 10.1136/ard.61.1.6.PubMed CentralView ArticlePubMedGoogle Scholar
- Shi W, Lee KE, Wahba G: Detecting disease-causing genes by LASSO-Patternsearch algorithm. BMC Proc. 2007, 1 (suppl 1): S60-10.1186/1753-6561-1-s1-s60.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.