Assessment of gene-covariate interactions by incorporating covariates into association mapping

The HLA region is considered to be the main genetic risk factor for rheumatoid arthritis. Previous research demonstrated that HLA-DRB1 alleles encoding the shared epitope are specific for disease that is characterized by antibodies to cyclic citrullinated peptides (anti-CCP). In the present study, we incorporated the shared epitope and either anti-CCP antibodies or rheumatoid factor into linkage disequilibrium mapping, to assess the association between the shared epitope or antibodies with the disease gene identified. Incorporating the covariates into the association mapping provides a mechanism 1) to evaluate gene-gene and gene-environment interactions and 2) to dissect the pathways underlying disease induction/progress in quantitative antibodies.


Background
Rheumatoid arthritis (RA) represents a complex disease in which genes and environmental factors interplay to manifest the symptoms of this condition. Despite the phenotypic heterogeneity involved in this disease, the genetic contribution to RA is estimated to be 50-60% [1], and the HLA region has the largest influence on genetic risk. Gregersen et al. [2] showed that HLA-DRB1 alleles encoding a common amino-acid sequence (the shared epitope (SE)) in the third hypervariable region of the DRB1 molecule have been identified as risk alleles for RA. The functional significance implied by the location of this SE sequence has stimulated efforts to search for the putative RA antigen [3]. In addition, because most RA patients have autoantibody responses, including rheumatoid factor (RF, a measurement of the reactive IgM antibodies) and cyclic citrullinated peptides (anti-CCPs), Huizinga et al. [3] compared the HLA profile in a healthy population and Page 1 of 6 (page number not for citation purposes)

BioMed Central
Open Access in RA patients who did or did not produce anti-CCP antibodies, demonstrating strong interactions between the SE and anti-CCP antibodies. Based on these findings, we incorporated the number of SE alleles (categorized by NN, SN, and SS for 0, 1, and 2 SE alleles, respectively) and anti-CCP or RF into our association mapping to search for the disease locus and to assess the associations between SE, antibodies, and the disease gene in a region of 6p21 (previously identified in our linkage analyses). Two dummy variables, NN and SN, were created; SS was been treated as the reference group (NN = 1 if NN, 0 otherwise; SN = 1 if SN, 0 otherwise). Incorporating this association profile allows us to investigate associations between the disease locus and the covariates, and will help to elucidate etiopathologies between different phenotypes and the disease genes.
Recently, Liang and Chiu [4] proposed a robust multipoint association mapping approach using case-control data. This approach provides an estimate of the genetic effect and the location of the disease locus τ, along with sampling uncertainty to help investigators narrow down chromosomal regions that putatively harbor a disease gene (τ). The genetic effect denoted by "C" characterizes the excess disease allele frequency among cases compared to controls. Chiu et al. [5] extended this method to estimate C by incorporating covariates into the mapping, in order to estimate τ more efficiently. The covariates could be either quantitative or qualitative. Hence, we incorporated anti-CCP or IgM RF as well as the number of SE alleles into our association mapping to estimate the disease locus for RA and to assess interactions between the disease gene and the covariates simultaneously [6].

Materials
A total of 868 cases and 1194 controls in the North American Rheumatoid Arthritis Consortium study (NARAC) were included in analyses. We used PLINK [7,8] to clean the data, excluding two cases and one control from the data set due to their genome-wide heterozygosity <30%. We also checked the call rate, cryptic relatedness, and multidimensional scaling. No additional subjects were excluded.
From the total 35,574 single-nucleotide polymorphisms (SNPs) available originally, we excluded the SNPs with a call rate <90% or MAF (minor allele frequency) <0.05, and those that failed the Hardy-Weinberg equilibrium test (p < 0.05) as measured by the PLINK software. A total of 29,616 SNPs remained for further gene mapping.
The 29,616 SNPs were used to make plots for the average excess target allele frequency in cases compared to controls, divided by the estimated magnitude of linkage disequilibrium (LD) (as shown on page 146 in Chiu et al.) [5]. A region harboring 2732 SNPs with higher peaks between 28 and 40 cM was identified. A total of 1561 SNPs ranging from about 28,000 to 40,000 kb, with blocks defined by r 2 > 0.7, was selected as tag SNPs using Haploview software. We converted base pair (bp) into centimorgan (cM) approximately by dividing bp by 10 6 . Hence, our analyses focused on 706 cases with RF, anti-CCP, and number of SEs and 1191 controls with number of SEs.

Association approaches
The method hinges on the following expression [4,5]: The genetic effect C, quantified by the excess high-risk allele frequency among cases compared to controls at the disease locus τ, can then be modeled as a function of the covariates through logistic regression [6] as follows: The regression coefficients b T characterize the associations between the genetic effect at τ and the covariates. For covariates available for cases and controls, the approach from Chiu et al. [5] was directly applicable. On the other hand, incorporating covariates into LD mapping can improve the efficiency in estimating τ when the covariate carries additional information on the association under study. Because this method represents an extension of the approach proposed by Liang and Chiu [4] that is based on the generalized estimating equation approach, it is also robust, given that no assumption about the genetic mechanism is required, other than that the region contains no more than one susceptibility locus for the qualitative trait while incorporating multiple markers into the analysis simultaneously. No assumption about the underlying genetic mechanism of an incorporated quantitative covariate is required either.
The LD mapping was conducted through the sliding window approach, in which sequential analyses were performed for every 100 SNPs. We examined the associations between the estimated disease locus and the covariates in the case-control study. In addition, we compared the results from incorporating the covariates with those from the original search, where no covariates were incorporated. This allowed us to evaluate the improvement in efficiency provided by the additional information from the given covariates. To make a comparison with other association analyses, we also performed individual SNPs analysis using the chi-square test in PLINK. Figure 1 illustrates the average excess target allele frequency in cases compared to controls divided by the estimated magnitude of LD [4,5] for all SNPs located between 28,000 and 40,000 kb on 6p21. Several local peaks appeared in this region, the global peak being located within the region of 32,500-33,000 kb. Theoretically, the SNPs in high LD with τ or at τ should have a higher excess of the target allele (presumably the highrisk allele) among cases compared to controls than other SNPs in the region. Figure 1 provides an illustration of approximately where the peak (the location of τ) may be and whether the one-disease-locus assumption is reasonable. The results with and without incorporating covariates from the 706 cases and 1191 controls are displayed in Tables 1, 2, 3. Several local peaks were identified in this region with and without incorporating covariates (data not shown). The global peak for the estimate of τ was located at 32.623 cM (95% CI = [32.6191, 32.6264]): it had the highest estimated C value of 0.9, along with a p-value less than 10 -15 for testing C = 0 (the absence of association). The estimated location happened to be the locus of HLA-DRB1 gene, indicating that this gene is strongly associated with disease risk (i.e., it might be the disease gene or is in strong LD with the disease gene (Table 1). In the single-SNP analysis (Figure 2), 251 out of 1561 SNPs had a p-value ≤ 3.20 × 10 -5 (the significance level was chosen after applying the Bonferroni correction for multiple tests). The most significant SNP appeared to be the SNP rs660895 at 32.685360 cM (p-value = 1.42 × 10 -107 ), as displayed in Table 4. The number of SE alleles is significantly associated with the estimated disease locus, with both p-values for SN and NN less than 10 -15 . There was, however, no evidence that the genetic effect at this location was associated with RF or anti-CCP levels. We also assessed the interaction between the SE numbers and the estimated disease locus by testing whether b 2 (the regression coefficient for SN) = b 3 (the regression coefficient for NN) (Tables 2, 3). The interactions were very significant for the estimated disease locus at 32.6 cM, revealing that these loci have interactions with SE alleles. In addition to the quantitative covariate, we also incorporated two dummy variables reflecting male-male and male-female casecontrol pairs, respectively, into the mapping. We also tested whether there was gene-sex interaction. The results suggested that there was no sex effect, or sex-gene interaction in our analyses (data not presented).

Discussion
Based on Figure 1 and the analysis results, the onedisease-locus assumption seems to be reasonable for the data in this region. Because results from previous studies suggested the presence of an association between anti-CCP and SE [4], we examined the linear relationship between anti-CCP and the number of SE alleles in the 706 cases using a linear regression model. Our results suggested that the anti-CCP levels were significantly different when comparing cases with 1 and 2 SE alleles (p = 0.0050), but were not significantly different when comparing cases with 0 and 2 SE alleles (p = 0.85).

Figure 1
The average excess target allele frequency in cases compared to controls divided by the estimated magnitude of LD for all SNPs located between 28,000-40,000 kp on 6p21. Green "X", the estimated disease locus, green bracket indicates the 95% confidence interval for the disease locus.

Conclusion
We applied a robust multipoint LD mapping approach to locate disease genes for RA with incorporation of  covariates as well as to assess whether the disease genes were associated with the covariates. Our results suggest that the disease loci in the region of 6p21 were strongly associated with the SE alleles. The efficiency in estimating the disease genes remained similar when incorporating RF or anti-CCP into the mapping, revealing that these two quantitative covariates did not provide additional information on the disease locus localization.
Through this application, we demonstrated that while performing multipoint fine mapping, this approach not only facilitates examination of gene-gene interactions and gene-covariate interactions, but also helps to elucidate the pathways of complex diseases.