Association mapping of susceptibility loci for rheumatoid arthritis

We analyzed a case-control data set for chromosome 18q from the Genetic Analysis Workshop 15 to detect susceptibility loci for rheumatoid arthritis (RA). A total number of 460 cases and 460 unaffected controls were genotyped on 2300 single-nucleotide polymorphisms (SNPs) by the North American Rheumatoid Arthritis Consortium. Using a multimarker approach for association mapping under the framework of the Malecot model and composite likelihood, we identified a region showing significant association with RA (p < 0.002) and the predicted disease locus was at a genomic location of 53,306 kb with a 95% confidence interval (CI) of 53,295–53,331 kb. A common haplotype in this region was protective against RA (p = 0.002). In another region showing nominal significant association (51,585 kb, 95% CI: 51,541–51,628 kb, p = 0.037), a haplotype was also protective (p = 0.002). We further demonstrated that reducing SNP density decreased power and accuracy of association mapping. SNP selection based on equal linkage disequilibrium (LD) distance generally produced higher accuracy than that based on equal kilobase distance or tagging.


Background
Rheumatoid arthritis (RA) is a common chronic disease, with a moderately strong genetic component. Chromosome 18q has shown evidence for linkage in the U.S. and French linkage scans [1]. The North American Rheumatoid Arthritis Consortium (NARAC) performed fine mapping on a 10-Mb region on 18q with a dense singlenucleotide polymorphism (SNP) map and the data were collected by the Genetic Association Workshop (GAW) 15 for Problem 2. Here we applied a novel association mapping approach based on the Malecot model and composite likelihood to identify disease associated regions and predict the locations of possible disease loci [2]. Haplotype analysis on the candidate regions was performed. We also studied the effect of region length and SNP density on the accuracy of association mapping in comparison with our analysis of the simulated data in Problem 3 of GAW15 [3].

Data
A total of 2300 SNPs in a 9,519.224 kb region of 18q were genotyped by NARAC in 460 cases of RA and 460 controls. Controls were recruited from a New York City population. Seven SNPs showing significant departure from Hardy-Weinberg equilibrium (HWE) in the control samples using a likelihood ratio chi-square test (χ 2 ≥ 10) were removed, resulting in a total of 2293 SNPs [4]. Further removal of 81 SNPs with a minor allele frequency (MAF) of less than 5% resulted in a total of 2212 SNPs for our main data analysis.

LD map
Physical locations of these SNPs were determined from build 35 (UCSC May 2004) of the human genome sequence. An LD map expressed in linkage disequilibrium (LD) units (LDUs) was created using the control samples with the LDMAP-cluster program, a parallel version of LDMAP program that rapidly constructed the map http:www.som.soton.ac.uk/research/geneticsdiepidemiology/ldmap/ [5]. LDU is determined by the product of the ε and distances in kilobases for an interval of two adjacent SNPs and is additive, where ε represents the exponential decline of LD with distance for that interval. The LD map length was 151.115 LDUs, which is essentially the same as the 2293 SNPs containing rare SNPs.
We also used the LD map built from the CEU samples of the HapMap Phase II data http://www.som.soton.ac.uk/ research/geneticsdiv/epidemiology/LDMAP/ map2.htm [5]. The same region on the CEU LD map contains 8086 SNPs with a length of 202 LDUs. Despite its higher SNP density, 185 SNPs were missing and therefore their LDU locations were linearly interpolated. However, alternative LD maps did not seem to exert a significant effect on the results (data not shown), and we hereby only report the results using the LD map constructed from the GAW15 data.

Association mapping
The 10-Mb segment of 18q was divided into 14 non-overlapping consecutive regions. Each region had a minimum of 10 LDUs and 30 SNPs by default without breaking LD blocks and was analyzed individually. We also used a 5-LDU region length, resulting in a total of 26 regions for association analysis. In the Malecot model, association is a function of several parameters, the most important of which is S, the predicted location of the disease variant [2]. Composite likelihood combines information of all pairwise marker-disease associations in each region. S and its 95% confidence interval (CI) are estimated by fitting the model to the data and maximizing the composite likelihood. Significance tests are carried out by contrasting two hierarchical models. Model A assumes no association and no parameters are estimated. Model D assumes an association and S and two other parameters are estimated with ε specified. The difference in the -2 natural log composite likelihood (denoted as Λ) between the two models (denoted as Λ A -Λ D ) is a statistic monotonic to a chisquare with 3 degrees of freedom ( ). A permutation test was performed for each region with hundreds of replicates under the null hypothesis of no association by shuffling case-control status to obtain an empirical pvalue [2]. The specified value for ε of 1.0543 was obtained by fitting the LD map to the genotype data for the control samples. However, similar values for ε did not seem to have an appreciable effect on the results (data not shown).
This approach has been implemented in the CHROM-SCAN program and a parallel version, CHROMSCANcluster, based on cluster computing was used for permutations with 1000 replicates http://www.som.soton.ac.uk/ research/geneticsdiv/epidemiology/chromscan/.
Pearson's χ 2 s were obtained for allelic associations between single SNPs and RA.

Haplotype analysis for candidate regions
Haplotype analysis using the PHASE program version 2 [6] was performed for candidate regions showing nominal significant association. The five most common haplotypes and their frequencies were compared between cases and controls. A chi-square test was applied to identify significant associations by testing each haplotype in turn against all others, including rare ones.

SNP density and region length
To generate different SNP density, we used Tagger in Haploview software to select tagging SNPs based on pairwise LD (r 2 ) using the control samples with rare SNPs included [7]. For comparison, we selected the same number of SNPs as Tagger but by equidistance in LDU or kilobases. To do this, SNPs with the same LDU were reassigned LDU locations by linear interpolation (tilting) so that every SNP would have a unique location. By centering at the "disease locus", we also studied the effect of LDU region length on the results.

Association mapping of disease locus
We found a nominally significant association between region 5 and RA. The estimated location of the disease locus S 1 was at 53,306 kb near a SNP of global maximal chi-square (rs3745064, χ 2 = 12.25, p = 0.00047) using the LD map with a 10-LDU region length (p = 0.002). After Bonferroni correction for 14 regions the results were still statistically significant (p c = 0.02). Removing this SNP did not change the results. However, inclusion of SNPs with MAF < 5% resulted in a wider 95% CI (53,274-53,342 kb, point location at 53,307 kb). Figure 1 shows that the estimated location for the disease variant was in a 10-kb LD An LD map in relation to the putative disease loci S 1 and S 2 block where a cluster of SNPs showed modest association with RA. At such a significance level, none of the SNPs were statistically significant after correction for multiple comparisons. This is in contrast to multimarker approaches, in which one cluster of SNPs is considered at a time in light of LD among nearby SNPs, markedly reducing the number of tests.
At 5-LDU region length with a total of 26 regions, we found a locus (S 2 ) at 51,585 kb showing nominal significant association (p = 0.04, Table 1 and Figure 1). Similar results were obtained using the data with rare SNPs (MAF < 5%) included. In this region there was also a cluster of SNPs associated with RA. However, consideration of multiple testing this region would not be statistically significant.

Haplotype analysis for candidate regions
Haplotype analysis was performed on sub-regions containing S 1 and S 2 with the majority of nominally associated SNPs included. S 1 sub-region (53,297-53,312 kb, 0.043 LDUs) contained 16 SNPs (Table 2) and S 2 subregion (51,556-51,616 kb, 0.368 LDUs) contained 21 SNPs (Table 3). Haplotypes H 1 and H 3 at S 1 and H 5 at S 2 appeared to be significantly associated with RA, with the first two haplotypes being almost complementary. Both H 1 at S 1 and H 5 at S 2 showed protective effects against RA.

Genes and mRNA at the candidate regions
The UCSC genome browser (May 2004) was used to find genes and mRNAs within the 95% CI of loci S 1 and S 2 . No known genes have been found nearby S 1 , but the area of the 95% CI for locus S 1 contains four human mRNA (CR590917, AK021717, AK124558, and BC013134), two of which span the point estimate of S 1 . Therefore, this region might contain genes not yet identified. In addition, this area is highly conserved across species, implying functional importance of the genomic sequence. A known  gene (AK127787) is within the 95% CI of S 2 , but is 10 kb away from its point estimate.

Region length and SNP density
Point estimates of S were identical for all region lengths centred at S 1 . The 95% CI was also relatively stable, with a slow increase with region length (Table 4). Enlarged region length compromised the significance levels, perhaps due to noise from distant SNPs, given that the informative SNPs were clustered in a rather small region.
Computing time prolonged with increasing number of SNPs. Small region lengths, however, resulted in a heavy penalty for multiple testing. Four LDUs provided the most significant result for S 1 (P c = 30 × 0.0008 = 0.02, Table 4).
Our analysis of simulated data indicated that reduced SNP density decreased mapping accuracy, and SNP selection based on equal LD distance produced smaller location errors than that based on equal kilobase distance or tagging [3]. Interestingly, among the three selection approaches for S 1 region, Tagger selected the most number of SNPs while equal kilobase distance, the least number.
SNPs selected by equal LDU distance generally provided the highest location accuracy (Table 5). Power was reduced with decreasing density, as indicated by the values of Λ A -Λ D ( Table 5). Using the kilobase map resulted in higher location errors in most cases and lower Λ A -Λ D values, indicating reduced power in all circumstances compared with using the LD map (data not shown).

Conclusion
We reported a significant association between a region of 18q and RA. The estimated genomic location of the disease variant was at 53,306 kb. The Malecot model and composite likelihood approach has narrowed the possible disease locus to a 36-kb candidate region. A haplotype significantly associated with reduced risk of RA was identified in this region. DNA sequences between 53,295-53,331 kb of this region are highly conserved in vertebrates. A haplotype around 51,585 kb was also identified as reducing the risk of RA. Further sequencing or functional studies may be helpful to identify the disease variants. Reducing SNP density decreases power and location accuracy. We also conclude that SNP selection based on   equal LD distance can maximally retain the prediction accuracy of the disease loci than that based on equal physical distance or SNP tagging.

Competing interests
The author(s) declare that they have no competing interests. a For the studied region for Tagger, Equal LD (E_LD) and kb (E_kb) distance, respectively. b Assuming S 1 is the "disease locus," the region was fixed at 10 LDUs centering at S 1 and the location error was calculated as S-53,306 kb.