A genome-wide association scan for rheumatoid arthritis data by Hotelling's T2 tests.

We performed a genome-wide association scan on the North American Rheumatoid Arthritis Consortium (NARAC) data using Hotelling's T2 tests, i.e., TH based on allele coding and TG based on genotype coding. The objective was to identify associations between single-nucleotide polymorphisms (SNPs) or markers and rheumatoid arthritis. In specific candidate gene regions, we evaluated the performance of Hotelling's T2 tests. Then Hotelling's T2 tests were used as a tool to identify new regions that contain SNPs showing strong associations with disease. As expected, the strongest association evidence was found in the region of the HLA-DRB1 locus on chromosome 6. In the region of the TRAF1-C5 genes, we identified two SNPs, rs2900180 and rs3761847, with the largest and the second largest TH and TG scores among all SNPs on chromosome 9. We also identified one SNP, rs2476601, in the region of the PTPN22 gene that had the largest TH score and the second largest TG score among all SNPs on chromosome 1. In addition, SNPs with the largest TH score on each chromosome were identified. These SNPs may be located in the regions of genes that have modest effects on rheumatoid arthritis. These regions deserve further investigation.


BioMed Central, Open Access

Background
Rheumatoid arthritis (RA) is the most common inflammatory joint disease and has an autoimmune etiology. The exact cause of RA is still unknown, but RA is well known to have a strong genetic component [1]. The HLA-DRB1 locus has been clearly demonstrated to be associated with RA [2][3][4]. Other candidate genes, such as PTPN22 and TRAF1-C5, which confer a modest level of risk of RA, have also been identified recently [5,6]. We conducted a genome-wide association analysis of the data from the North American Rheumatoid Arthritis Consortium (NARAC). The objective of this analysis was to identify associations between single-nucleotide polymorphisms (SNPs) or markers and RA. In specific candidate gene regions, we evaluated the performance of Hotelling's $T^2$ tests on known associations. Then, we used Hotelling's $T^2$ tests to identify additional SNPs that showed strong association with RA. These SNPs are located in regions that are very likely related to the disease and deserve further investigation.

Methods
We used the Hotelling's $T^2$ test developed by Fan and Knapp [7] and Xiong et al. [8] to analyze the NARAC data. Consider a case-control design with $N$ cases from an affected population and $M$ controls from an unaffected population. When analyzing SNPs, we study bi-allelic markers with two alleles, denoted by 1 and 2, which can form three genotypes: 1/1, 1/2, and 2/2. A coding vector can then be defined for each case/control by either i) genotype coding or ii) allele coding. Let $X_i$ and $Y_j$ denote the coding vectors for the $i$th case and the $j$th control, respectively. In our study, the genotype coding used $X_i = (1,0)^\tau$ for genotype 1/1, $X_i = (0,1)^\tau$ for genotype 1/2, and $X_i = (0,0)^\tau$ for genotype 2/2, whereas the allele coding simply counts the number of copies of allele 1 in a genotype. If multiple markers are available, the coding vectors of each case/control can be concatenated. For instance, the allele coding vector of a case/control for $n$ SNPs is $n$-dimensional, and the genotype coding vector of a case/control for $n$ SNPs is $2n$-dimensional. For multi-allelic markers, the coding method is described by Fan and Knapp [7]. Let us define a pooled-sample variance-covariance matrix by

$$S = \frac{1}{N+M-2}\left[\sum_{i=1}^{N}(X_i-\bar{X})(X_i-\bar{X})^\tau + \sum_{j=1}^{M}(Y_j-\bar{Y})(Y_j-\bar{Y})^\tau\right],$$

where $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$ and $\bar{Y} = \frac{1}{M}\sum_{j=1}^{M} Y_j$ are the mean vectors of cases and controls, respectively. The Hotelling's $T^2$ test statistic [9] is defined as

$$T^2 = \frac{NM}{N+M}\,(\bar{X}-\bar{Y})^\tau S^{-1}(\bar{X}-\bar{Y}).$$

In the following, we denote the Hotelling's $T^2$ for allele coding as $T_H$ and the Hotelling's $T^2$ for genotype coding as $T_G$. Assume the sample sizes $N$ and $M$ are large enough that large-sample theory applies. Under the null hypothesis of no association, the statistic $T_H$ (or $T_G$) is asymptotically distributed as a central chi-square ($\chi^2$) statistic with $n$ (or $2n$) degrees of freedom if $n$ SNPs are used in the analysis. Under the alternative hypothesis of association, $T_H$ (or $T_G$) is asymptotically distributed as a non-central $\chi^2$ statistic [7,8,10].
Based on the Hotelling's $T^2$ test statistics, we developed a SAS macro (hotel_cc.sas) to implement the method, which is available online [11].
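To make the two coding schemes and the test statistic concrete, the following is a minimal Python sketch using only the standard library. It is not the hotel_cc.sas macro; the function names (allele_coding, genotype_coding, code_individual, hotelling_t2) and the eight toy genotypes are hypothetical and chosen only for illustration. The sketch assumes the standard two-sample form of Hotelling's $T^2$ with a pooled covariance matrix divided by $N + M - 2$.

```python
def allele_coding(genotype):
    """Allele coding: number of copies of allele 1 in a genotype such as (1, 2)."""
    return [genotype.count(1)]

def genotype_coding(genotype):
    """Genotype coding: (1,0) for 1/1, (0,1) for 1/2, (0,0) for 2/2."""
    return {2: [1, 0], 1: [0, 1], 0: [0, 0]}[genotype.count(1)]

def code_individual(genotypes, coding):
    """Concatenate the per-SNP codes of one individual into one vector."""
    vec = []
    for g in genotypes:
        vec.extend(coding(g))
    return vec

def _mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def _scatter(rows, mean):
    """Sum of outer products of deviations from the mean vector."""
    d = len(mean)
    s = [[0.0] * d for _ in range(d)]
    for row in rows:
        dev = [x - m for x, m in zip(row, mean)]
        for r in range(d):
            for c in range(d):
                s[r][c] += dev[r] * dev[c]
    return s

def _solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    aug = [list(a[r]) + [b[r]] for r in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, d):
            f = aug[r][col] / aug[col][col]
            for c in range(col, d + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * d
    for r in range(d - 1, -1, -1):
        rest = sum(aug[r][c] * z[c] for c in range(r + 1, d))
        z[r] = (aug[r][d] - rest) / aug[r][r]
    return z

def hotelling_t2(cases, controls):
    """T^2 = NM/(N+M) * (xbar - ybar)' S^{-1} (xbar - ybar), S pooled."""
    n, m = len(cases), len(controls)
    xbar, ybar = _mean(cases), _mean(controls)
    sx, sy = _scatter(cases, xbar), _scatter(controls, ybar)
    d = len(xbar)
    s = [[(sx[r][c] + sy[r][c]) / (n + m - 2) for c in range(d)]
         for r in range(d)]
    diff = [x - y for x, y in zip(xbar, ybar)]
    z = _solve(s, diff)  # z = S^{-1} (xbar - ybar)
    return n * m / (n + m) * sum(di * zi for di, zi in zip(diff, z))

# Hypothetical toy data: one SNP, four cases and four controls.
case_genotypes = [[(1, 1)], [(1, 2)], [(1, 1)], [(1, 2)]]
control_genotypes = [[(2, 2)], [(1, 2)], [(2, 2)], [(1, 2)]]

t_h = hotelling_t2([code_individual(g, allele_coding) for g in case_genotypes],
                   [code_individual(g, allele_coding) for g in control_genotypes])
t_g = hotelling_t2([code_individual(g, genotype_coding) for g in case_genotypes],
                   [code_individual(g, genotype_coding) for g in control_genotypes])
print(t_h, t_g)  # both are 6 (up to floating-point rounding) for this toy data
```

With real data, the genotype coding can produce a singular pooled matrix when a genotype class is absent from both groups, which is why the sketch raises an error rather than returning a score in that case.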

Results
First, we applied the Hotelling's $T^2$ test statistics in a genome-wide scan of the NARAC data, analyzing one SNP at a time. The NARAC data contained a total of 2062 individuals (868 cases and 1194 controls). Our analysis used data from 22 autosomes. The RA data of Genetic Analysis Workshop (GAW) 16 included 545,080 SNP-genotype fields from an Illumina 550k chip (22 autosomes, sex chromosomes, and mitochondria). We dropped all SNPs with low call rates (less than 95%), all SNPs not in Hardy-Weinberg equilibrium in the controls (p-value < $10^{-5}$), and all SNPs not on the autosomes. After this filtering, 490,613 SNPs on 22 autosomes were used in our analysis. The strongest signal was found in the region of the HLA-DRB1 gene on chromosome 6 at location 32,654,524-32,686,031 bp. In Figure 1, Graphs I and II show the Hotelling's test scores for chromosome 6. Both $T_H$ and $T_G$ scores reached their highest values around 32.5 Mb, in the region of HLA-DRB1. Graphs III and IV show the results in the region of the HLA-DRB1 gene (the legend indicates the location of the HLA-DRB1 gene). Most of the test scores in the region were highly significant.
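The SNP filtering step described above can be sketched as follows. The thresholds (call rate of at least 95%, Hardy-Weinberg p-value of at least $10^{-5}$ in controls, autosomes only) come from the text. The choice of a 1-degree-of-freedom chi-square goodness-of-fit test for Hardy-Weinberg equilibrium is an assumption, since the text does not say which test was used, and the function names hwe_chisq_pvalue and passes_qc are hypothetical.

```python
import math

def hwe_chisq_pvalue(n11, n12, n22):
    """Chi-square (1 df) goodness-of-fit p-value for Hardy-Weinberg equilibrium.

    Assumption: a chi-square test is used; an exact test is a common alternative.
    """
    n = n11 + n12 + n22
    if n == 0:
        return 1.0
    p = (2 * n11 + n12) / (2 * n)  # frequency of allele 1
    q = 1.0 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: no deviation can be tested
    observed = [n11, n12, n22]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

def passes_qc(chrom, call_rate, control_counts,
              min_call_rate=0.95, hwe_threshold=1e-5):
    """Keep autosomal SNPs with adequate call rate and HWE in controls."""
    if chrom not in range(1, 23):  # autosomes are chromosomes 1 to 22
        return False
    if call_rate < min_call_rate:
        return False
    return hwe_chisq_pvalue(*control_counts) >= hwe_threshold

# Hypothetical genotype counts (1/1, 1/2, 2/2) in controls.
print(passes_qc(6, 0.99, (25, 50, 25)))  # counts in exact HWE
print(passes_qc(6, 0.99, (50, 0, 50)))   # no heterozygotes at all
```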
We present the six SNPs on chromosome 6 with the highest test scores in the left-hand part of Table 1. The most significant result was found at SNP rs2395175 (p-value = $9.25 \times 10^{-144}$). These SNPs are all located around the HLA-DRB1 gene. Both $T_H$ and $T_G$ reached their four highest scores at the same SNPs (rs2395175, rs660895, rs6910071, and rs2395163). $T_H$ reached its 5th highest score at SNP rs3763309 and its 6th highest at SNP rs3763312; conversely, $T_G$ reached its 5th highest score at SNP rs3763312 and its 6th highest at SNP rs3763309. The order of the two SNPs with the 7th and 8th highest scores is likewise switched between $T_H$ and $T_G$, and both statistics reached their 9th to 13th highest scores at the same SNPs (data not shown). Thus, the region of the HLA-DRB1 gene contains multiple SNPs that are highly associated with RA. In addition, the p-values of the test $T_G$ were generally smaller than those of $T_H$, i.e., the genotype coding test $T_G$ led to more significant results than the allele coding test $T_H$. This observation is consistent with the evidence for non-additivity of DRB1 effects [12].
It is well known that the HLA-DRB1 alleles are associated with RA [1,2]. We performed an analysis in which the HLA-DRB1 alleles *0101, *0102, *0401, *0404, *0405, *0408, and *1001, which are components of the shared epitope, were treated as risk alleles, and the other alleles were collapsed into one. Here we used the multi-allelic version of the Hotelling's $T^2$ tests [7]. The test score for allele coding was $T_H$ = 650.81 with 7 degrees of freedom (p-value = $2.76 \times 10^{-136}$), and the test score for genotype coding was $T_G$ = 694.82 with 35 degrees of freedom (p-value = $1.36 \times 10^{-123}$). The results were consistent with those from the individual SNPs above. On the basis of the individual SNP analysis, we performed a forward analysis of multiple SNPs. Using the most significant SNP, rs2395175, as the baseline, we added one SNP at a time for an analysis of two SNPs. Each of three SNPs, rs660895, rs6910071, and rs3763312, contributed significant association in addition to the contribution of the base SNP rs2395175 (p-value < 0.01). The most significant two-SNP result was from rs2395175 and rs660895. We then added one SNP at a time to these two SNPs and found that each of the two SNPs rs6910071 and rs3763312 contributed significant association (p-value < 0.01). Finally, the four SNPs together were found to be significantly associated with RA (rs2395175, rs660895, rs6910071, and rs3763312; p-value < 0.01).
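The degrees of freedom reported above for the multi-allelic HLA-DRB1 analysis can be checked with a short calculation. The sketch assumes that the coding drops one reference category, so that a marker with k allele classes gives k - 1 dimensions under allele coding and k(k + 1)/2 - 1 dimensions under genotype coding (one per unordered genotype, minus the reference). The function names are hypothetical.

```python
def allele_coding_df(k):
    """Dimension of the allele coding vector for a marker with k allele classes."""
    return k - 1

def genotype_coding_df(k):
    """Dimension of the genotype coding vector: unordered genotypes minus one."""
    return k * (k + 1) // 2 - 1

# Seven shared-epitope risk alleles plus one collapsed class of all other alleles.
k = 7 + 1
print(allele_coding_df(k), genotype_coding_df(k))  # 7 and 35, as in the text

# For a bi-allelic SNP (k = 2) the same formulas give 1 and 2 dimensions.
print(allele_coding_df(2), genotype_coding_df(2))
```

The bi-allelic case agrees with the n and 2n degrees of freedom stated in the Methods for n SNPs.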
Graphs V-VIII of Figure 1 show the results for chromosome 9 (the legend indicates the location of the TRAF1-C5 genes). In Plenge et al. [6], SNP rs3761847 at position 120,769,793 bp and SNP rs2900180 at position 120,785,936 bp were found to be significantly associated with RA in the region of the TRAF1-C5 genes. Our results were consistent: $T_H$ = 34.21 for SNP rs2900180 was the largest score among all SNPs on chromosome 9 (p-value = $4.95 \times 10^{-9}$), and $T_H$ = 32.17 for SNP rs3761847 was the second largest (p-value = $1.41 \times 10^{-8}$). Other SNPs on chromosome 9 with the highest scores are reported on the right-hand side of Table 1. The SNPs identified via $T_H$ were the same as those identified via $T_G$ (the right-hand side of Table 1). As for the HLA-DRB1 region on chromosome 6, we performed a forward analysis of multiple SNPs. Using rs2900180 as the baseline, we found no other SNP that contributed significant additional association (p-value > 0.05). Thus, the association signal in the region of the TRAF1-C5 genes appears to be captured by SNP rs2900180.
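The p-values quoted above for the two chromosome 9 SNPs follow from the asymptotic chi-square distribution of $T_H$ with 1 degree of freedom for a single SNP. As a minimal sketch using only the Python standard library, the upper tail probability can be written in closed form for 1 and 2 degrees of freedom; the function names are hypothetical.

```python
import math

def chi2_sf_1df(t):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(t / 2.0))

def chi2_sf_2df(t):
    """Upper tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-t / 2.0)

# Single-SNP allele coding scores reported in the text for chromosome 9.
for snp, t_h in [("rs2900180", 34.21), ("rs3761847", 32.17)]:
    print(snp, t_h, chi2_sf_1df(t_h))
```

The 2-degree-of-freedom function is the corresponding reference distribution for a single-SNP $T_G$ score under genotype coding.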
In the region of the PTPN22 gene on chromosome 1, we identified one SNP (rs2476601) that was reported to be associated with RA by Begovich et al. [5]. From the results in the candidate regions on chromosomes 6, 9, and 1, we noticed that the highest test scores of $T_H$ and $T_G$ were from SNPs located very close to the candidate genes HLA-DRB1, TRAF1-C5, and PTPN22, respectively. Therefore, the SNPs with high test scores are of interest for further investigation to identify genes that have modest effects on RA. In Table 2, we present the SNPs that showed the highest $T_H$ scores among all SNPs on each chromosome. We chose to present the results based on the test statistic $T_H$ because it is more robust than $T_G$ in the sense of having more stable type I error rates [7]. For comparison, we also present the most significant results from PLINK in Table 2. The SNPs identified by the statistic $T_H$ are the same as those identified by PLINK, except for rank switches on chromosomes 11 and 16. Other SNPs with high test scores may also be worthy of further study. Because of the limited length of this article, we could not present detailed genome-wide test data here, but we will provide detailed information on request.

Discussion
The results of our genome-wide scan yielded a large number of SNPs with high test scores. One reason for this is the large sample size of the NARAC data. For further study, one may start with the regions that contain the SNPs with the highest test scores, i.e., the regions with the strongest signals. The Hotelling's $T^2$ tests do not [...] Tables 1 and 2 can be analyzed similarly.
We compared our results with those in the literature [2][3][4][5][6] and found them to be consistent. In addition, we analyzed the data using PLINK and found results similar to those in Table 1 and Table 2; partial results are presented in Table 2. Hence, our analyses of candidate regions and of the genome-wide scan showed that the Hotelling's tests performed well. Furthermore, multiple SNPs can be used jointly in the analysis, as we did for the data of chromosomes 6 and 9.

Conclusion
We performed a genome-wide association scan of RA data by applying Hotelling's $T^2$ tests. In the candidate regions of the HLA-DRB1, TRAF1-C5, and PTPN22 genes, we identified SNPs that have the highest test scores on chromosomes 6, 9, and 1, respectively. Given the encouraging results in the candidate gene regions, the regions containing SNPs with high test scores are of interest for further investigation to map genes that have modest effects on RA. We reported the SNPs with the largest scores on each chromosome, together with their positions.