Volume 3 Supplement 7
A principal-components-based clustering method to identify multiple variants associated with rheumatoid arthritis and arthritis-related autoantibodies
© Black and Watanabe; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Multivariate techniques are an important area of investigation for studying contributions of multiple genetic variants to disease onset and pathology. We analyzed the Genetic Analysis Workshop 16 North American Rheumatoid Arthritis Consortium (NARAC) data using a principal-components analysis (PCA) with an orthoblique rotation to identify specific subsets of single-nucleotide polymorphisms (SNPs) in the major histocompatibility complex (MHC) region associated with rheumatoid arthritis (RA) and rheumatoid factor IgM (RFUW), and compared this method with a traditional PC approach. Using the orthoblique PC-based clustering method, we identified new clusters of SNPs across the MHC region associated with RA and RFUW, and replicated known SNP cluster associations with RA, such as those in the HLA-DRB region.
Testing a candidate gene or region for association with phenotypes typically involves testing multiple single-nucleotide polymorphisms (SNPs). This necessarily introduces the issue of multiple test corrections, which reduce power in order to control the type 1 error rate. Therefore, development of multivariate methods to identify causal loci and to reduce the burden of multiple testing is an area of ongoing investigation for complex diseases. Several multivariate techniques have been used to examine whether multiple SNPs are associated with disease or quantitative traits [1–4]. However, such methods typically suffer from low power under various scenarios or the inability to reduce the large number of SNPs to a smaller subset that may point to a specific location within the region.
Gauderman et al. introduced a principal-components method to assess whether multiple variants within a candidate gene are associated with disease. Principal-components analysis (PCA) is used to derive linear transformations of the original SNP data, in which eigenvectors are chosen to maximize the variance of each PC relative to the overall variation in the region. Each eigenvalue represents the variance of a particular PC, and typically, only a subset of PCs that account for a large proportion of the total variation are chosen for analysis, reducing the number of parameters to be tested. These PCs serve as covariates in an omnibus test of association with disease or trait [3, 4].
The PC approach has been shown to have greater power than standard joint SNP or haplotype-based tests to detect association between multiple SNPs and disease, especially when the number of haplotypes is large. However, the coefficients of each eigenvector are derived from pair-wise correlations among the SNPs, and thus lack specific interpretation. Eigenvector loadings of the original variables on a PC do not reflect the true importance of the SNPs to that PC, making the association between multiple PCs and disease outcomes difficult to interpret.
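The selection of a subset of PCs by explained variance can be sketched as follows. This is a minimal illustration, not the authors' SAS code: the function name `n_components_for_variance` and the example eigenvalues are hypothetical, and the 0.80 default is simply the threshold this paper reports using in its methods.

```python
def n_components_for_variance(eigenvalues, threshold=0.80):
    """Return the smallest number of leading PCs whose eigenvalues
    sum to at least `threshold` of the total variance."""
    if not eigenvalues:
        raise ValueError("need at least one eigenvalue")
    # Each eigenvalue is the variance of one PC; take the largest first.
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= threshold * total:
            return count
    return len(ordered)


# Hypothetical eigenvalues for a 6-SNP correlation matrix (they sum to 6).
example = [3.0, 1.5, 0.75, 0.45, 0.2, 0.1]
print(n_components_for_variance(example))
```

With these made-up eigenvalues, the first three PCs carry 87.5% of the variance, so three covariates would enter the omnibus test instead of six SNPs.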
We propose a PC-based clustering method as an alternative approach that reduces dimensionality of the data and maintains the power of a PC approach, but allows for unique identification of multiple SNPs in the region being tested. The algorithm uses an orthoblique rotation of PCs on genotype data to form distinct clusters, where each cluster is defined by a specific array of SNPs. A subset of clusters that explains a large proportion of the total locus variation is selected, such that those clusters can be tested for association with disease outcomes.
The PC approach and our proposed oblique PC-based clustering method were applied to the analysis of rheumatoid arthritis (RA) data from Genetic Analysis Workshop 16 (GAW16) (Problem 1). We compare and contrast results from these two approaches and compare findings with previously published results for these data [5–8].
Sample data included genome-wide association data from the Affymetrix GeneChip 100 k Mapping Array containing 116,204 SNPs for RA and RA-related traits, such as rheumatoid factor IgM (RFUW) and anti-cyclic citrullinated peptide (anti-CCP), on 2,062 North American Rheumatoid Arthritis Consortium (NARAC) subjects. We restricted analysis of the 1,250 SNPs in the major histocompatibility complex (MHC) region on 6p21, which spans 3.2 Mb, to individuals with RA status and complete genotype data (n = 1,187 individuals and 838 SNPs).
Pre-analysis processing of data
Observed genotype frequencies were assessed for deviation from Hardy-Weinberg equilibrium and allele frequencies were estimated using the computer program Haploview (V.4.1). For these analyses, we excluded markers with minor allele frequencies (MAFs) < 0.01, and coded genotypes as 0, 1, or 2 according to the number of minor alleles. Log transforms were applied to quantitative trait data to approximate univariate normality. We performed PCA as described by Gauderman et al. The number s of PCs used in the analysis was the number that accounted for 80% of the total variation.
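The MAF filter and minor-allele coding described above can be sketched in a few lines. This is an illustrative sketch only: the function `code_genotypes`, the two-character genotype strings, and the alphabetical tie-break for the minor allele are assumptions made for the example, and complete genotype data are assumed, as in the analysis sample.

```python
from collections import Counter


def code_genotypes(genotypes, maf_min=0.01):
    """Code one biallelic SNP as minor-allele counts (0, 1, or 2).

    `genotypes` is a list of two-character strings such as "AG".
    Returns (codes, maf); codes is None when the SNP fails the MAF filter.
    """
    allele_counts = Counter(a for g in genotypes for a in g)
    if len(allele_counts) > 2:
        raise ValueError("expected a biallelic SNP")
    if len(allele_counts) < 2:
        return None, 0.0  # monomorphic marker: MAF is zero
    total = sum(allele_counts.values())
    # Minor allele = less frequent allele (ties broken alphabetically).
    minor, minor_count = min(allele_counts.items(),
                             key=lambda kv: (kv[1], kv[0]))
    maf = minor_count / total
    if maf < maf_min:
        return None, maf
    return [g.count(minor) for g in genotypes], maf


codes, maf = code_genotypes(["AA", "AG", "GG", "AA", "AA"])
print(codes, maf)
```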
Cluster scores are linear combinations of the SNP genotypes, $C_n = c_n^{\top} g$, where $C_n$ is the nth cluster score, $c_n$ is a vector of standardized cluster coefficients, and $g$ is the vector of SNPs $[g_1, g_2, \ldots, g_k]$. While all SNPs are subdivided into a total of n clusters, the number of SNPs within each cluster varies, yielding cluster coefficients equal to zero for SNPs not included in the nth cluster. Given the cluster coefficients and individual genotype scores, cluster scores are computed for each individual. For comparison with traditional PC analysis, n was determined by the number of clusters that accounted for 80% of the total locus variation.
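As a small worked illustration of how a cluster score is formed from cluster coefficients and genotypes, consider the sketch below. The coefficient values, the five-SNP genotype vector, and the function name `cluster_score` are all hypothetical; the genotypes are assumed to be on whatever scale the coefficients were derived for.

```python
def cluster_score(coefficients, genotypes):
    """Cluster score for one individual: the sum over SNPs of
    (cluster coefficient) x (genotype value)."""
    if len(coefficients) != len(genotypes):
        raise ValueError("one coefficient is needed per SNP")
    return sum(c * g for c, g in zip(coefficients, genotypes))


# Five SNPs split into two clusters. A SNP outside a cluster has a
# coefficient of zero there, so it contributes nothing to that score.
cluster_1 = [0.5, 0.5, 0.0, 0.0, 0.0]   # SNPs 1-2 (made-up values)
cluster_2 = [0.0, 0.0, 0.4, 0.3, 0.3]   # SNPs 3-5 (made-up values)
individual = [2, 1, 0, 1, 2]            # minor-allele counts

scores = [cluster_score(c, individual) for c in (cluster_1, cluster_2)]
print(scores)
```

Because each SNP has a non-zero coefficient in exactly one cluster, a significant cluster score points back to a specific, non-overlapping set of SNPs.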
For continuous outcomes, we fit a multiple linear regression model with PC scores or cluster scores as covariates. Likelihood-ratio tests were used to contrast the null model (intercept only) to that with either s PCs (if traditional PCA) or n clusters (if cluster analysis) to assess significance, with s or n degrees of freedom (df). Given significant association under the omnibus test of all PCs or PC-based clusters, 1-df Wald tests were used to test association between RA or an RA-related trait and each PC or PC-based cluster, conditional upon all PCs or clusters. Because PC and cluster scores are estimated from the correlation structure of the genotype data, it should be noted that p-values resulting from any association testing framework may not be completely accurate. A bootstrap or randomization procedure that includes computation of PC scores or cluster components would likely yield more accurate p-values. For the purposes of this paper, nominal p-values are reported. All analyses were performed using SAS (v.9.1).
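The omnibus likelihood-ratio test for a continuous outcome can be sketched as below. This is not the SAS analysis used in the paper: it is a standard-library illustration that assumes Gaussian errors, for which the statistic reduces to n·ln(RSS_null / RSS_full) and is referred to a chi-square distribution with one df per covariate. The helper names `ols_rss`, `chi2_sf`, and `lr_test` and the toy data are hypothetical.

```python
import math


def ols_rss(X, y):
    """Residual sum of squares of y regressed on the columns of X
    plus an intercept (normal equations, Gaussian elimination)."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    c = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            c[r] -= f * c[col]
    b = [0.0] * p
    for i in reversed(range(p)):
        b[i] = (c[i] - sum(A[i][k] * b[k] for k in range(i + 1, p))) / A[i][i]
    return sum((yi - sum(bi * ri for bi, ri in zip(b, r))) ** 2
               for r, yi in zip(rows, y))


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        term = math.exp(-h)
        total = term
        for i in range(1, df // 2):
            term *= h / i
            total += term
        return total
    total = math.erfc(math.sqrt(h))
    term = math.exp(-h) * math.sqrt(h) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= h / (i + 0.5)
    return total


def lr_test(X, y):
    """Likelihood-ratio test of all covariates in X against intercept only."""
    n, df = len(y), len(X[0])
    rss_null = ols_rss([[] for _ in y], y)
    rss_full = ols_rss(X, y)
    stat = n * math.log(rss_null / rss_full)
    return stat, df, chi2_sf(stat, df)


# Toy data: one score (for example a single PC or cluster score) per person.
X = [[0], [1], [2], [0], [1], [2]]
y = [1.0, 2.1, 2.9, 1.1, 1.9, 3.0]
print(lr_test(X, y))
```

The same comparison applies whether the columns of `X` are PC scores or cluster scores; only the number of columns, and hence the df, changes.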
Table 1. Characteristics of cases (n = 515) and controls (n = 672): sex (male and female, n (%)), anti-CCP (units/mL, mean (SD)), and RFUW (IU/mL, mean (SD)).
Table 2. HLA region association with rheumatoid arthritis and arthritis-related traits
Traditional PC analysis: RA affection status, p = 4.8 × 10⁻⁸⁵
PC-based cluster analysis: RA affection status, p = 1.4 × 10⁻⁷⁴; RFUW, p = 5.7 × 10⁻⁷
Our PC-based clustering algorithm identified 188 clusters that accounted for 80% of the variance and were used to test for association. Cluster size ranged from 1 to 14 SNPs. Using likelihood-ratio tests, the PC-based cluster method also found significant association between the MHC region and RA (p = 1.4 × 10⁻⁷⁴) and RFUW (p = 5.7 × 10⁻⁷) (Table 2). Similar to the PC analysis, the PC-based clustering method showed no evidence for association between the MHC region and anti-CCP (p = 0.21). Twenty-four SNP clusters were associated with RA (p ≤ 0.05) and 36 SNP clusters showed association with RFUW (p ≤ 0.05); 2 clusters were common to both outcomes.
Cluster associations with rheumatoid arthritis case-control status
Cluster associations with RFUW
Traditionally, investigators examining gene regions or specific candidate genes might genotype hundreds of SNPs, possibly perform tag SNP selection, and test each SNP for association with disease or disease-related traits. Unfortunately, this approach necessitates multiple test correction, resulting in a significant reduction in power. PC analysis has been suggested as an exploratory approach that parses the information contained in a large number of correlated SNPs into a smaller number of orthogonal PCs that can be analyzed for association instead of individual SNPs [3, 4]. A significant omnibus test of PCs indicates statistical association between a given region, as represented by the SNPs genotyped, and disease outcomes. However, PCA cannot be used to identify the specific SNPs contributing to the association, and therefore still requires testing of individual variants, to isolate the specific SNP(s) contributing to the association. We introduce a PC-based clustering method that retains many of the favorable attributes of PC regression, but allows for identification of the subset of SNPs contributing to the evidence for association, which reduces the multiple testing burden. We compared the traditional PC approach to the PC-clustering method using the NARAC data, and demonstrate that PC-clustering identifies variants in the 3.2-Mb MHC region contributing to RA risk and variation in RA-related traits.
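To make the multiple-testing burden concrete, the short calculation below compares per-test significance thresholds for 822 individual SNP tests and 188 cluster tests, the two counts reported for this region. The Bonferroni correction is used here purely as an assumption for illustration; the paper itself reports nominal p-values and does not specify a correction method.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


per_snp = bonferroni_threshold(0.05, 822)      # one test per SNP
per_cluster = bonferroni_threshold(0.05, 188)  # one test per cluster
print(f"{per_snp:.2e} {per_cluster:.2e}")      # prints "6.08e-05 2.66e-04"
```

Under this assumption, testing clusters rather than individual SNPs relaxes the per-test threshold by a factor of about 4.4 (822/188).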
While traditional PC analysis makes it possible to analyze only the subset of PCs that represent most of the variation in a candidate region, PCs still represent linear combinations of all SNPs in the data set, which makes interpretation of significant PCs difficult. Upon inspection of the 29 PCs from the full model found to be significantly associated with RA status, we found the 822 eigenvector loadings on these PCs to range from -0.148 to 0.145, with most hovering close to 0. Thus, we were only able to infer from PC analysis that variation in the MHC region, as represented by these 822 SNPs, is strongly associated with RA risk. Additional interpretation of the specific SNP(s) driving significant associations between PCs and phenotypes can only be achieved by testing all 822 SNPs individually for association. In contrast, the PC-based clustering algorithm we employed reduced 822 SNPs to 188 discernible SNP clusters that also accounted for 80% of the regional variation. The clusters, which are subsets of the 822 SNPs analyzed, allow unique identification of those SNPs that may contribute to the evidence for association. For example, of the 24 SNP clusters associated with RA status, Cluster 1 and Cluster 23 were found to be the most significant. Cluster 1 represents a distinct set of SNPs covering ~883 kb of the 3.2-Mb region examined, while Cluster 23 covers a non-overlapping region of ~295 kb. While Cluster 1 represents SNPs flanking HLA-C and HLA-B, Cluster 23 comprises SNPs surrounding the HLA-DRA, HLA-DRB5, and HLA-DRB1 loci. In fact, rs3099844 and rs2857595 found in Cluster 1 were previously identified by Lee et al. as belonging to a haplotype associated with anti-CCP positive RA, and 98% of cases in the present study were anti-CCP positive. Additionally, rs2395175 in Cluster 23 ranked among the top ten SNPs for association with RA in a recent genome-wide association study by Plenge et al.
The clustering algorithm also identified 36 SNP clusters found to be associated with variation in RFUW among RA cases. The most significant associations included Clusters 2, 5, 20, 24, and 183. Clusters 2, 5, 24, and 183 are composed of SNPs located in the chromosomal region between HLA-A and HLA-C, with Clusters 2 and 5 capturing the specific variation in and around HLA-C. Interestingly, Yen et al. demonstrated that HLA-C alleles may modulate the pattern of RA progression. Moreover, Lee et al. found rs887464 in Cluster 183 to be associated with RA affection. Cluster 20, composed of nine SNPs, represents variants located within and proximal to HLA-DQB2. Previous examination of genes in the MHC class II region, conditional on the HLA-DRB loci, has shown the HLA-DQB2 locus to have a vital role in RA [11, 12]. As RA is heterogeneous in terms of the progression of joint destruction, further examination of the SNPs in these clusters may provide information regarding genetic determinants of RA progression or symptom severity.
While our PC-based clustering method offers the interpretability a traditional PC approach lacks, there are other issues to be considered. First, we required more clusters than PCs to satisfy the 80% explained-variance threshold, which increased the degrees of freedom used for the omnibus test of association. The additional degrees of freedom usually result in reduced power to detect global association compared to the traditional PC approach. This may be because, while PCs are orthogonal, or independent, cluster components formed by the clustering algorithm are oblique. At each iteration, PC1 and PC2 are computed from a distinct set of SNPs that have been assigned to a given cluster, such that the first PC of one cluster may be correlated with the first PC of another cluster. Thus, although each SNP is assigned to the cluster with which it has the highest squared correlation, all SNPs share some degree of correlation with the clusters to which they were not assigned. This underlying correlation among clusters may be indicative of the correlation pattern among SNPs, although not necessarily haplotype blocks, and thus may better reflect the true relationship of the variants within the MHC candidate region, but may also result in slightly reduced power to detect association.
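The assignment rule mentioned above, in which each SNP joins the cluster whose component it is most strongly correlated with, can be illustrated with the sketch below. It shows a single assignment step only, not the full iterative orthoblique algorithm used in the paper; the function names and the component values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def assign_to_cluster(snp, components):
    """Return (index of the cluster component with the highest squared
    correlation with the SNP, list of all squared correlations)."""
    r2 = [pearson(snp, comp) ** 2 for comp in components]
    best = max(range(len(r2)), key=r2.__getitem__)
    return best, r2


# One SNP (minor-allele counts in six people) and two made-up components.
snp = [0, 1, 2, 1, 0, 2]
component_a = [-1.0, 0.1, 1.2, 0.0, -0.9, 1.1]
component_b = [0.5, -0.2, 0.3, 0.9, -0.4, 0.1]

best, r2 = assign_to_cluster(snp, [component_a, component_b])
print(best, r2)
```

In this toy case the SNP is assigned to the first component, yet its squared correlation with the second is not zero, which is the residual between-cluster correlation the paragraph above describes.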
Both traditional PC and PC-based clustering methods indicate the MHC gene region is significantly associated with RA and RFUW. However, traditional PCA is unable to highlight which SNPs contributed to this association. In contrast, the PC-based clustering method maintains many of the virtues of the traditional PC approach, but has the advantage of isolating the SNP(s) contributing to evidence for association. Therefore, the PC-based clustering method may be a better approach to testing multiple variant associations with phenotypes of interest.
List of abbreviations used
Anti-CCP: Anti-cyclic citrullinated peptide
GAW16: Genetic Analysis Workshop 16
MAF: Minor allele frequency
MHC: Major histocompatibility complex
NARAC: North American Rheumatoid Arthritis Consortium
RFUW: Rheumatoid factor IgM
The authors thank John Morrison for technical assistance in accessing the GAW16 data and the NARAC investigators for contributing their data to GAW16. MHB was supported by a USC training grant in the cellular, molecular and biochemical sciences.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
- Vermeulen SHHM, Den Heijer M, Sham P, Knight J: Application of multi-locus analytical methods to identify interacting loci in case-control studies. Ann Hum Genet. 2007, 71: 689-700. 10.1111/j.1469-1809.2007.00360.x.
- Heidema AG, Boer JMA, Nagelkerke N, Mariman ECM, Van Der ADL, Feskens EJM: The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet. 2006, 7: 23. 10.1186/1471-2156-7-23.
- Gauderman JW, Murcray C, Gilliland F, Conti D: Testing association between disease and multiple SNPs in a candidate gene. Genet Epidemiol. 2007, 31: 383-395. 10.1002/gepi.20219.
- Wang K, Abbott D: A principal components regression approach to multilocus genetic association studies. Genet Epidemiol. 2008, 32: 108-118. 10.1002/gepi.20266.
- Lee HS, Lee AT, Criswell LA, Seldin MF, Amos CI, Carulli JP, Navarrete C, Remmers EF, Kastner DL, Plenge RM, Li W, Gregersen PK: Several regions in the major histocompatibility complex confer risk for anti-CCP-antibody positive rheumatoid arthritis, independent of the DRB1 locus. Mol Med. 2008, 14: 293-300. 10.2119/2007-00123.Lee.
- Irigoyen P, Lee AT, Wener MH, Li W, Kern M, Batliwalla F, Lum RF, Massarotti E, Weisman M, Bombardier C, Remmers EF, Kastner DL, Seldin MF, Criswell LA, Gregersen PK: Regulation of anti-cyclic citrullinated peptide antibodies in rheumatoid arthritis: contrasting effects of HLA-DR3 and the shared epitope alleles. Arthritis Rheum. 2005, 52: 3813-3818. 10.1002/art.21419.
- van Gaalen FA, van Aken J, Huizinga TW, Schreuder GM, Breedveld FC, Zanelli E, van Venrooij WJ, Verweij CL, Toes RE, de Vries RR: Association between HLA class II genes and autoantibodies to cyclic citrullinated peptides (CCPs) influences the severity of rheumatoid arthritis. Arthritis Rheum. 2004, 50: 2113-2121. 10.1002/art.20316.
- Plenge RM, Cotsapas C, Davies L, Price AL, de Bakker PI, Maller J, Pe'er I, Burtt NP, Blumenstiel B, DeFelice M, Parkin M, Barry R, Winslow W, Healy C, Graham RR, Neale BM, Izmailova E, Roubenoff R, Parker AN, Glass R, Karlson EW, Maher N, Hafler DA, Lee DM, Seldin MF, Remmers EF, Lee AT, Padyukov L, Alfredsson L, Coblyn J, Weinblatt ME, Gabriel SB, Purcell S, Klareskog L, Gregersen PK, Shadick NA, Daly MJ, Altshuler D: Two independent alleles at 6q23 associated with risk of rheumatoid arthritis. Nat Genet. 2007, 39: 1477-1482. 10.1038/ng.2007.27.
- Harris CW, Kaiser HF: Oblique factor analytic solutions by orthogonal transformations. Psychometrika. 1964, 29: 347-362. 10.1007/BF02289601.
- Yen JH, Moore BE, Nakajima T, Scholl D, Schaid DJ, Weyand CM, Goronzy JJ: Major histocompatibility complex class I-recognizing receptors are disease risk genes in rheumatoid arthritis. J Exp Med. 2001, 193: 1159-1167. 10.1084/jem.193.10.1159.
- Shiina T, Inoko H, Kulski JK: An update of the HLA genomic region, locus information and disease associations: 2004. Tissue Antigens. 2004, 64: 631-639. 10.1111/j.1399-0039.2004.00327.x.
- Kochi Y, Yamada R, Kobayashi K, Takahashi A, Suzuki A, Sekine A, Mabuchi A, Akiyama F, Tsunoda T, Nakamura Y, Yamamoto K: Analysis of single-nucleotide polymorphisms in Japanese rheumatoid arthritis patients shows additional susceptibility markers besides the classic shared epitope susceptibility sequences. Arthritis Rheum. 2004, 50: 63-71. 10.1002/art.11366.
- Weyand CM, Klimiuk PA, Goronzy JJ: Heterogeneity of rheumatoid arthritis: from phenotypes to genotypes. Springer Semin Immunopathol. 1998, 20: 5-22. 10.1007/BF00831996.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.