Allelic based gene-gene interactions in rheumatoid arthritis

The detection of gene-gene interaction is an important approach to understand the etiology of rheumatoid arthritis (RA). The goal of this study is to identify gene-gene interaction of SNPs at the allelic level contributing to RA using real data sets (Problem 1) of North American Rheumatoid Arthritis Consortium (NARAC) provided by Genetic Analysis Workshop 16 (GAW16). We applied our novel method that can detect the interaction by a definition of nonrandom association of alleles that occurs when the contribution to RA of a particular allele inherited in one gene depends on a particular allele inherited at other unlinked genes. Starting with 639 single-nucleotide polymorphisms (SNPs) from 26 candidate genes, we identified ten two-way interacting genes and one case of three-way interacting genes. SNP rs2476601 on PTPN22 interacts with rs2306772 on SLC22A4, which interacts with rs881372 on TRAF1 and rs2900180 on C5, respectively. SNP rs2900180 on C5 interacts with rs2242720 on RUNX1, which interacts with rs881375 on TRAF1. Furthermore, rs2476601 on PTPN22 also interacts with three SNPs (rs2905325, rs1476482, and rs2106549) in linkage disequilibrium (LD) on IL6. The other three SNPs (rs2961280, rs2961283, and rs2905308) in LD on IL6 interact with two SNPs (rs477515 and rs2516049) on HLA-DRB1. SNPs rs660895 and rs532098 on HLA-DRB1 interact with rs2834779 and four SNPs in LD on RUNX1. Three-way interacting genes of rs10229203 on IL6, rs4816502 on RUNX1, and rs10818500 on C5 were also detected.


Background
Rheumatoid arthritis (RA) is a complex, chronic inflammatory disease whose etiology remains unknown. It has been known that RA is a result of the complicated networks of multiple genes along with the environmental factors such as smoking. It is more common in females. Through a combined linkage and association study [1], the HLA gene cluster on 6p21 has been shown to have the most likely predisposing loci for RA. In addition to HLA, numerous genetic variants influence the pathology of RA.

Open Access
Unfortunately, detection of gene-gene interaction has been challenging due to an issue of high dimensionality from multi-locus combinations that require a large sample size.
In this study, we applied a novel approach to detect gene-gene interaction influencing RA using the casecontrol subjects provided by the North American Rheumatoid Arthritis Consortium (NARAC). In contrast to the previously available method searching for the interaction at the genotype level, our approach focuses on the detection of interaction of at the allelic level with a novel definition: the allele-based gene-gene interaction occurs when a particular allele in one gene and a particular allele at another unlinked genes are dependent on the contribution to RA (Figure 1) [1]. Based on the 639 SNPs from 26 candidate genes related to RA pathology, we performed a score test based on logistic regression and a F-test based on the Cochran-Armitage regression model developed for the detection of allelic based gene-gene interaction [2,3].

Characteristics of data
As a regular quality control procedure, population stratification analysis using all 531,688 single-nucleotide polymorphisms (SNPs) was performed by EIGENSTRAT as we included the subjects of the four populations from HapMap database (Yoruba, CEPH, Japanese, and Han Chinese). The result showed that the case and control subjects are confirmed as European Americans. Additionally, we tested sex inconsistency between X chromosome and the clinical report and removed seven subjects whose data were discrepant. After the removal, 866 cases and 1189 controls were used for the further analysis. The 26 candidate genes were selected based on the following reasons: 1) previous reported results (HLA-DRB1, PTPN22, and TNF) [4,5]; 2) genes related to macrophage migration inhibitory factor and linked to the production of inflammatory cytokines (MIF, IL6, IL1B, IL3, IL4, and IL13) [8,9]; 3) genes playing an immunologically important role in down-regulating the immune response (CTLA4, RUNX1, STAT4, and SLC22A4) [8]; 4) genes used to test for interaction by Mei et al. and Ding et al. [6,7]; 5) genes relevant to inflammatory disease [8]. We also removed SNPs deviating from Hardy-Weinberg equilibrium (p-value < 10 -5 ) and having a minor allele frequency of less than 0.01. The names, locations, and the number of SNPs being tested for the 26 genes are provided in Table 1.

Statistical model
The underlying principle of our method is to identify the association of allelic combination between two unlinked markers with a disease trait so that subjects are assigned an allelic score given their observed genotype information. The score is a conditional probability of obtaining the particular allelic combination given the observed genotypes at the two loci of each subject. For example, a subject with AA (at marker M 1 ) and Bb (at marker M 2 ) genotype has 1/2 in the AB combination and 1/2 in the Ab combination, X AB = P(AB|M 1 = AA, M 2 = Bb) = 1/2, X Ab = P(Ab|M 1 = AA, M 2 = Bb) = 1/2 Table 2 shows the allelic scores of a subject whose genotype is given [2].

Score statistic by logistic regression model
Denote y i = 1 if i th subject is affected by RA and y i = 0 otherwise. In the non-parametric maximum likelihood  solution that allows an arbitrary covariate distribution, fitting a standard prospective logistic regression in casecontrol sampling design is equivalent to fitting a retrospective logistic regression except that an intercept in case-control sampling needs the information of sampling fraction of cases and controls [10,11]. Therefore, the prospective logistic regression model is used due to the equivalence in parameter estimates of interaction effect. The likelihood function of the standard logistic regression is ) . X i, AB = P(AB|M 1 , M 1 ) is the allelic score of A allele and B allele from M 1 and M 2 genotypes of i th subject and b k is interaction effect of k th {AB, Ab, aB} allelic combination. The overall proportion of y is R/N, where R is the number of case subjects and N is the number of total subjects. Under the assumption of no covariates, let U T = (U AB , U Ab , U aB ) T be the score function, which is a derivative of the log-likelihood function with respect to which is the observed Fisher information matrix corresponding to U τ = (U AB , U Ab , U aB ) τ . Detailed derivation and theoretical justification were published by Jung and Zhao [3].

Extension of Cochran-Armitage trend regression
With the same allelic scores in Table 2 at the two markers, we can model a linear trend of proportion of cases over total number of subjects at each allelic combination, p k, j = r k, j /n k, j , where n k, j = r k, j /s k, j for k (AB, Ab, aB, ab), j (0,1/4,1/2,1) for two markers. r k, j and s k, j are the number of affected subjects and unaffected subjects having j score at k allelic combination, respectively. It has been shown that regressing p k, j on Z AB, j , Z Ab, j , Z aB, j is equivalent to regressing y i on Z AB, j , Z Ab, j , Z aB, j [12]. As an extension of Cochran-Armitage trend method, the interaction effect of two markers on RA trait can be modeled as Under the assumption of no covariates, the theoretical regression coefficients are functions of linkage disequilibrium (LD) between a marker and a disease locus as follows: where K P P P P P P P P P P P P P P P P ( . ) ( . ) ( 0 5 1 0 5 0 5 1 0 5 1 0 5 1 P P P P P P P P P P P P P . ( P P P P P P P P P P P P P P P P P .
are the average effect of the g e n e s u b s t i t u t i o n a t e a c h d i s e a s e l o c i a n d ( ) The global test statistic for interaction effect over all allelic combinations under the null hypothesis H 0 : , which follows F(3, N-4) distribution with l = 0. The analytical properties of two methods were derived by Jung and Zhao [3] and simulation studies showed that the score test and F test by Cochran-Armitage trend are asymptotically equivalent.
Simulation study of power and type I error rates and comparison of genotype-based method A simulation study was performed to study power and type I error rates at the 1% significance level [3]. Six two-way interaction models were simulated using the simulation of software SNaP [3]. Most of the models were designed based on the combination of dominant and recessive inheritance at the genotypic level at each marker. These models are 1) For comparison of genotype-based method, logistic regression of two-way interaction is modeled as follows: . Additionally, we calculated the empirical power of multifactor dimensionality reduction (MDR) in the same simulated data sets. Table 3 illustrates that the power of the score test of the allelic based method over the six two-interaction models are higher than that of the two genotype-based methods. Type I error rates of the allele-based method are close to nominal value of 0.01. Further detailed results of simulation and analytical derivation are available in Jung and Zhao [3].

Non-nested model comparison using Cochran-Armitage regression method
Technically, the proposed two-way interaction and threeway interaction model are not nested models, so an artificial nesting approach was employed to select the best model as follows: Under the assumption of normal errors, an artificial nesting approach called the J test [14] was utilized as follows: Because f is linear in b, the comparison requires that one estimates δ and then fits a linear regression and test for a = 0 using the ordinary t-statistic [13,14]. On the other hand, we can compare Akaike information criterion (AIC) and Bayesian information criterion (BIC) for a two-way model with that of a three-way interaction model.

Analysis procedure
The procedure to search for the best interaction model consists of multiple steps based on the proposed methods.
Step 1 When performing a two-way interaction analysis [Model (1) and (3)] of two SNPs, each is selected from each gene and the global test for an interaction is performed. Note that two SNPs in the same gene are removed from the interaction analysis. We then compared the interaction model (3) with a main effect model (6) in order to search for the pure interacting SNPs that are not confounded with the main effects, and selected the best two-way interaction models which met three criteria: 1) the p-value less than 2.5 × 10 -7 from interaction test (the total 203,841 combination; adjusted for Bonferroni correction), 2) the p-value of the test for comparison between the interaction model and the main effect model less than 0.01, and 3) the testing SNPs should have both the smallest AIC and the smallest BIC.
The following models were considered: No genetic effect model Main effect model : : : :  Step 2 Based on the pair-wise SNPs selected by Step 1, we conducted three-way interaction model analysis as we added one SNP at a time from one of the remaining genes, which is the scheme of the forward selection procedure. With the same procedure of a two-way model selection and an additional comparison of the three-way model with two-way interaction model, the best three-way interaction models were selected by the same criteria described in Step 1.
Step 3 We continued these steps until no further high-dimensional interaction model was identified. Table 4 lists ten pairs of two-way interacting genes and the function of SNPs identified. Because we analyzed all SNPs in LD with a gene, there are multiple SNPs in a gene interacting with a SNP of the other gene. SNP rs2476601 on PTPN22 interacts with rs2306772 on SLC22A4, which interacts with rs881372 on TRAF1 and rs2900180 on C5, respectively. SNP rs2900180 interacts with rs2242720 on RUNX1, which interacts with rs881375 on TRAF1. SNPs rs881375 and rs2900180 are in LD (R 2 = 0.89). Furthermore, rs2476601 on PTPN22 interacts with three SNPs (rs2905325, rs1476482, and rs2106549) on IL6. Three SNPs that are not in LD on IL6 interact with two SNPs (rs477515 and rs2516049) on HLA-DRB1. SNP rs660895 on the same gene interacts with rs2834779 on the 5' UTR region on RUNX1, and rs660895 and rs532098 interact with four SNPs on RUNX1. Additionally, rs4947324 on NFKBIL1 is not in LD with three SNPs on HLA-DRB1, but it interacts with them. Two SNPs (rs6586516 and rs2477142) on PADI4 interact with rs3761847 on TRAF1, which interacts with five SNPs in LD on RUNX1. Furthermore, we detected three-way interacting genes which are rs10229203 on IL6, rs4816502 on RUNX1 and rs10818500 on C5. Figure 2 summarized the pathway of ten pairs genes and one group of three interacting genes (indicated in red).

Discussion
In this study, the allele-based gene-gene interaction analyses were applied to case-control data sets of RA. Based on the analysis with 639 SNPs from 26 candidate genes that were previously detected through linkage study or fine mapping, we identified ten two-way interacting genes with multiple SNPs in LD from a gene and one three-way interaction. We have not identified any four-way interaction effects. However, the 26 candidate genes selected in this study may not represent all candidate genes for RA and we observed that Illumina 550k chip may not have a good gene-wide coverage for SNPs because no SNPs of SUMO4 and VEGFA in the platform are available.
A more standard interaction model using a logistic regression consisting of two main effect terms (X = 0, 1, 2 according to the number of alleles) and a multiplicative term of the main effect (additive × additive) was applied to the same data set. There is no interacting SNPs by an even more lenient criteria (p-value<10 -5 ).
Three criteria to justify the significant interaction models were used. For the interaction models, Bonferroni correction was used for multiple testing, and for the comparison of the interaction model with a main-effect model (significance level of 0.01), the smaller AIC and BIC were utilized. The final selected interacting SNPs satisfied all of the criteria, which may be conservative and may cause false-negative error. There still remains the issue of multiple comparisons in the high-dimensional interactions and the complexity of the procedure to screen the interaction effects.

Conclusion
As shown in the results, the proposed allele-based approach allows us to identify multiple interactions that may not have been identified as risk factors for RA. PTPN22, SLC22A4, HLA-DRB1, IL6, PADI4, TRAF1, NFkBIL1, C5, and RUNX1 may play interactive roles for RA, especially PTPN22 and SLC22A4, which are related to the reaction of antigen for RA. Therefore, our method taking into account the nonrandom association of all allelic combinations may help detect novel genetic variants and interpret biological pathways.