Volume 3 Supplement 7
Principal-component-based population structure adjustment in the North American Rheumatoid Arthritis Consortium data: impact of single-nucleotide polymorphism set and analysis method
© Peloso et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Population structure occurs when a sample is composed of individuals with different ancestries and can result in excess type I error in genome-wide association studies. Genome-wide principal-component analysis (PCA) has become a popular method for identifying and adjusting for subtle population structure in association studies. Using the Genetic Analysis Workshop 16 (GAW16) NARAC data, we explore two unresolved issues concerning the use of genome-wide PCA to account for population structure in genetic associations studies: the choice of single-nucleotide polymorphism (SNP) subset and the choice of adjustment model. We computed PCs for subsets of genome-wide SNPs with varying levels of LD. The first two PCs were similar for all subsets and the first three PCs were associated with case status for all subsets. When the PCs associated with case status were included as covariates in an association model, the reduction in genomic inflation factor was similar for all SNP sets. Several models have been proposed to account for structure using PCs, but it is not yet clear whether the different methods will result in substantively different results for association studies with individuals of European descent. We compared genome-wide association p-values and results for two positive-control SNPs previously associated with rheumatoid arthritis using four PC adjustment methods as well as no adjustment and genomic control. We found that in this sample, adjusting for the continuous PCs or adjusting for discrete clusters identified using the PCs adequately accounts for the case-control population structure, but that a recently proposed randomization test performs poorly.
Cases for the North American Rheumatoid Arthritis Consortium (NARAC) study were collected across the United States; controls were collected from Long Island, NY . Because of the differences in the mix of European ancestry across the US, we expect to see population structure in the Genetic Analysis Workshop 16 (GAW16) NARAC sample. Population structure can lead to spurious association when the distributions of both the trait and the genotype of interest vary among subpopulations. Even modest differences among subpopulations can lead to spurious associations in large samples: population structure bias is of particular concern for genome-wide studies in which large numbers of subjects are needed to detect modest effects of a single-nucleotide polymorphism (SNP) with the phenotype .
Several methods have been developed to detect and adjust for population structure, including genomic control , structured association , and genome-wide principal-components analysis (PCA) [5, 6]. Here, we focus on the PCA approach. The principal components (PCs) are computed using the genotype matrix from genome-wide SNP data, where each SNP genotype is a count (0, 1, or 2) of the number of copies of the minor allele [5, 6]. Each PC is a weighted sum of all SNP genotype counts. The first PC accounts for the largest proportion of genetic variation; each additional PC accounts for successively smaller proportions. The PCs have been shown to capture genetic differences among individuals of European heritage due to ancestry . These components can be used as covariates in an association analysis to account for differences in ancestry among individuals .
When limited numbers of SNPs are available to compute the PCs, each SNP will contribute a greater proportion of information to the PCA. In this situation, SNPs in strong linkage disequilibrium (LD) may have excessive influence on one or a set of PCs . However, it is not clear whether the same issue arises in the context of dense genome-wide SNPs, where some investigators have advocated for the use of all SNPs to compute the PCs , while others have proposed the use of a subset of SNPs in linkage equilibrium . In previous analyses of the NARAC data set, investigators noted that the third and fourth PCs from a genome-wide PCA heavily weighted SNPs within a 3.8-Mb region on 8p23 with a known inversion polymorphism , and advocated the omission of SNPs in this region, as well as the MHC region that is strongly associated with RA. Thus, conflicting views remain on how best to select a set of SNPs to use for PCA for the purpose of adjustment for population structure in genetic association studies.
In addition to the controversy concerning choice of SNPs for PCA, investigators have proposed several methods to adjust for the structure. The developers of the computer program EIGENSTRAT recommend adjustment for the PCs as linear covariates in an association model, while other investigators have proposed approaches that cluster individuals, arguing that the PCs, phenotype, and allele frequencies may not be linearly related, particularly when the PCs create distinct clusters . Kimmel et al. propose a randomization test as an alternative to adjusting for PCs within a traditional regression framework . Using simulation, they show that their population structure association test (PSAT) is considerably more powerful than the linear adjustment approach of EIGENSTRAT in some situations . It is unclear whether the proposed approaches will result in substantively different results in typical genome-wide association studies using individuals of European ancestry.
To address these questions, we compare six choices for subsetting SNPs for PCA, and four methods for adjusting for structure identified using PCA, as well as no adjustment and genomic control, using the NARAC rheumatoid arthritis (RA) genome-wide SNP data provided by GAW16. Our goal is to determine to what extent SNP choice and method of population structure adjustment affects the results of the genome-wide association study, measured by the genomic control inflation factor, and by specific association results for previously reported positive associations between SNPs and RA [11–13].
GAW16 NARAC samples (868 cases, 1194 controls) were genotyped using the Illumina 550 k chip. All subjects had fewer than 10% missing genotypes. Seven subjects had X chromosome heterozygosity incompatible with their phenotypic sex (homozygosity > 0.2 for females or <0.8 for males)  and were omitted from all analyses. We excluded SNPs with departure from Hardy-Weinberg equilibrium in controls (p < 10-6), ≥ 2% missing genotypes, or minor allele frequency <0.01, leaving a total of 470,943 autosomal SNPs.
SNP subsets for PCA
We computed the genotypic PCs using the 2055 NARAC individuals for six subsets of SNPs: (S1) all SNPs passing quality control ("filtered SNPs"), (S2) filtered SNPs, omitting SNPs in the MHC region, (S3) filtered SNPs omitting SNPs in the MHC region and the inversion polymorphism on chromosome 8, (S4) filtered SNPs omitting all of chromosome 6 and the inversion polymorphism on chromosome 8, (S5) filtered SNPs omitting one SNP from each pair of SNPs having allelic correlation r2 > 0.4 , and (S6) filtered SNPs omitting one SNP from each pair of SNPs with r2 > 0.2.
For SNP Sets (S2) and (S3), we omitted all SNPs within 10 kb of 6q21.3 , the MHC region that is highly associated with RA. We defined the chromosome 8 inversion polymorphism as the region spanning 8 Mb to 11.5 Mb . Previous studies have omitted these regions when conducting genome-wide PCA to avoid bias due to a region strongly linked to case-status, and a region with known high LD and extreme PC SNP weights, respectively .
Adjustment for structure in association analyses
To simplify presentation, we selected a single SNP subset (S3, filtered SNPs omitting MHC and inversion polymorphism SNPs) for exploration of adjustment methods (see Results for rationale for the choice of Set S3). We determined the distributions of genome-wide p-values excluding the highly associated MHC region and association results for two specific SNPs that have been reported to be associated with RA: rs2476601 [11, 12] and rs3761847  for six models: (A) logistic regression adjusting for the DRB locus (no adjustment for structure), (B) logistic regression adjusting for the DRB locus and PC1-4 as continuous covariates, (C) logistic regression adjusting for the DRB locus and the five clusters obtained by clustering, (D) logistic regression adjusting for the DRB locus, PC1-4 as continuous covariates, and the five PC clusters, (E) PSAT  with disease probability assigned based on the five PC-based clusters, and (F) genomic control using the logistic model adjusting for the DRB locus. For all analyses, we coded the SNP genotype as the count of the number of minor alleles. This model assumes a linear relationship between the log odds of case status and the number of minor SNP alleles. The first four PCs for SNP Subset (S3) were associated with RA status (Wilcoxon rank sum test p-value < 0.05); we therefore used these PCs for adjustment and clustering in association analyses [Methods (B)-(E)]. For association Models (C), (D), and (E) we used the partitioning around medoids algorithm ("pam" function in R ) to cluster the individuals using the four PCs associated with RA case status. A five-cluster solution was optimal based on silhouette widths. The DRB locus is defined as the number of high- and medium-risk alleles each individual carried .
Logistic regression analyses [Models (A)-(D)] were performed using the computer program PLINK  and analysis for Method (E) was performed using the PSAT software . PSAT tests the association between a SNP and disease, accounting for population structure using a randomization test. The user specifies the probability of being a case for each cluster based on ancestry information. For each replicate, case status for each individual is re-assigned at random according to the baseline probability for the cluster to which the individual is assigned. The probabilities of disease and sample sizes in the five clusters were 0.08 (n = 77), 0.11 (n = 68), 0.12 (n = 460), 0.37 (n = 230), and 0.63 (1120). One million samplings were used to assess significance.
Summary of PCA with six SNP subsets
SNP Subset for PCA
PCs significantly associated with RA (α = 0.05)
S1. Filtered SNPs
1, 2, 3, 5, 6, 8, 9
S2. Removing the MHC region
1, 2, 3, 5
S3. Removing MHC and inversion on chr 8
1, 2, 3, 4
S4. Removing chr 6 and inversion on chr 8
1, 2, 3, 4
S5. Removing LD between SNPs with r2 > 0.4
1, 2, 3, 4
S6. Removing LD between SNPs with r2 > 0.2
1, 2, 3, 4, 5
Correlation between S3 PCs and other SNP subset PCs.
PCA removing MHC and inversion on chromosome 8 (S3)b
PC8, PC9, PC10
(0.35, 0.25, 0.37)
Adjustment for structure
When using PCA to account for population structure, the primary goal is to reduce type I error to the nominal (expected) level without reducing power for true positive associations. The genomic control inflation factor (λ), defined as the median association test statistic across all SNPs divided by its expected value, is an overall indicator of the inflation in the test statistics . We computed λ for genome-wide analyses excluding the MHC region, adjusting for the PCs significantly associated with RA at α = 0.05 as linear covariates. For the six SNP subsets, λ ranged between 1.022 and 1.030, indicating good control of inflation compared to the substantial inflation (λ = 1.24) observed for analyses unadjusted for structure (Table 1).
Linear adjustment for PCs to account for population structure has been shown to have lower power and higher type I error than alternatives in some situations [9, 10], therefore we explored other methods for adjusting for structure. To simplify presentation, we selected a single SNP subset (S3) for exploration of adjustment methods. SNP Sets S1 and S2 produced several PCs that were significantly associated with RA and also had heavily weighted SNPs in some genomic regions (Figure 1). Because heavy weights in a region will reduce the power to detect true associations with SNPs in that region, we selected the S3 SNP subset, which used the maximum number of SNPs to determine the PCs while minimizing overweighting of SNPs in any region.
Results for positive control SNPs for six methods of adjusting for structure
βa ± SE
βa ± SE
A. Logistic regression adjusting for DRB
0.61 ± 0.11
3.70 × 10-8
0.42 ± 0.07
8.04 × 10-9
B. Logistic regression adjusting for DRB and PC1-4
0.47 ± 0.12
1.32 × 10-4
0.39 ± 0.08
2.08 × 10-6
C. Logistic regression adjusting for DRB and cluster
0.46 ± 0.12
1.57 × 10-5
0.45 ± 0.08
4.21 × 10-8
D. Logistic regression adjusting for DRB, PC1-4, and cluster
0.46 ± 0.13
2.44 × 10-4
0.42 ± 0.08
4.55 × 10-7
E. PSAT based on PC1-4 cluster
1.17 × 10-6
2.80 × 10-8
F. Genomic control
0.16 ± 0.12
7.44 × 10-7
0.42 ± 0.08
2.22 × 10-7
0.50 ± 0.15
6.60 × 10-4
0.35 ± 0.04
4.00 × 10-14
We performed PCA for six subsets of genome-wide SNPs; the first two PCs were similar for all subsets, and were highly associated with RA case status. The third and fourth PCs were very similar for SNP sets that either removed all SNPs in LD or removed the chromosome 8 inversion region known for high LD. SNP sets that included all SNPs had SNPs with extreme weights for some PCs. In this data set, structure adjustment using PCs obtained from any of the SNP subsets similarly reduces the genomic inflation factor. However, SNP sets that result in PCs that over-weight a genomic region may be prone to loss of power because strong PC weights on a true-positive SNP may adjust away the SNP effect.
We evaluated four methods of adjusting for structure using PCs. Adjustment for the four PCs associated with case status as linear covariates or for the cluster assignments in logistic regression substantially reduces both the genomic inflation factor and the significance level of the most highly associated SNPs (Figure 2), suggesting that the structure that exists in this sample does result in some bias. A criticism of adjusting for PCs as linear covariates is that the PCs and disease frequency may not be linearly related, especially when the PCs create distinct clusters . However, in this sample we found that adjusting for either cluster assignment or PCs as linear covariates in logistic regression produced similar genomic inflation factor (λ) and comparable effect estimates for two specific positive control SNPs. Although the PSAT randomization method has been reported to provide improved power and better control of type I error [9, 10], in this sample the method produced a considerably higher inflation factor for the genome-wide test statistics than the other structure adjustment methods (Figure 2). Because PSAT relies on clustering individuals based on PCs or other distance metrics, it may ultimately be more useful for accounting for population structure at the cross-continent level, rather than the more subtle structure observed in populations of European descent.
Exploring existing genome-wide association study data allows us to gain experience with methods for adjusting for population structure under realistic conditions, including variable LD due to inversion polymorphisms and regions of highly associated SNPs. We have found that population structure exists in the GAW16 RA data, and that the structure is associated with case status. Therefore, the association tests are potentially subject to confounding by genetic ancestry. In this sample, we have determined that the choice of SNP subset for PCA does not substantively affect the resulting reduction in genomic inflation factor for genome-wide association analyses. However, for some SNP subsets, the PCs associated with case status and used in the structure adjustment heavily weighted small numbers of SNPs, which could lead to reduced power to detect true associations in those genomic regions. Thus, the SNP set selected could have an impact on power. We recommend that in practice, for studies with modest levels of structure such as that expected when subjects are all of European ancestry, investigators check the PCs that will be used for structure adjustment for extreme weights and choose alternative SNP sets if any genomic regions are heavily weighted.
List of abbreviations used
Genetic Analysis Workshop 16
North American Rheumatoid Arthritis Consortium
Population structure association test
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. We give special thanks to the following people who helped us obtain and clean the data: A Broka, LA Cupples, J Dupuis, A Hendricks, and A Manning. We thank G Kimmel for his assistance with PSAT.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
- Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, Ding B, Liew A, Khalili H, Chandrasekaran A, Davies LR, Li W, Tan AK, Bonnard C, Ong RT, Thalamuthu A, Pettersson S, Liu C, Tian C, Chen WV, Carulli JP, Beckman EM, Altshuler D, Alfredsson L, Criswell LA, Amos CI, Seldin MF, Kastner DL, Klareskog L, Gregersen PK: TRAF1-C5 as a risk locus for rheumatoid arthritis--a genomewide study. N Engl J Med. 2007, 357: 1199-1209. 10.1056/NEJMoa073491.PubMed CentralView ArticlePubMedGoogle Scholar
- Marchini J, Cardon LR, Phillips MS, Donnelly P: The effects of human population structure on large genetic association studies. Nat Genet. 2004, 36: 512-517. 10.1038/ng1337.View ArticlePubMedGoogle Scholar
- Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.PubMed CentralPubMedGoogle Scholar
- Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55: 997-1004. 10.1111/j.0006-341X.1999.00997.x.View ArticlePubMedGoogle Scholar
- Patterson N, Price AL, Reich D: Population structure and eigen analysis. PLoS Genet. 2006, 2: e190-10.1371/journal.pgen.0020190.PubMed CentralView ArticlePubMedGoogle Scholar
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38: 904-909. 10.1038/ng1847.View ArticlePubMedGoogle Scholar
- Price AL, Butler J, Patterson N, Capelli C, Pascali VL, Scarnicci F, Ruiz-Linares A, Groop L, Saetta AA, Korkolopoulou P, Seligsohn U, Waliszewska A, Schirmer C, Ardlie K, Ramos A, Nemesh J, Arbeitman L, Goldstein DB, Reich D, Hirschhorn JN: Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 2008, 4: e236-10.1371/journal.pgen.0030236.PubMed CentralView ArticlePubMedGoogle Scholar
- Yu K, Wang Z, Li Q, Wacholder S, Hunter DJ, Hoover RN, Chanock S, Thomas G: Population substructure and control selection in genome-wide association studies. PLoS One. 2008, 3: e2551-10.1371/journal.pone.0002551.PubMed CentralView ArticlePubMedGoogle Scholar
- Li Q, Yu K: Improved correction for population stratification in genome-wide association studies by identifying hidden population structures. Genet Epidemiol. 2008, 32: 215-226. 10.1002/gepi.20296.View ArticlePubMedGoogle Scholar
- Kimmel G, Jordan MI, Halperin E, Shamir R, Karp RM: A randomization test for controlling population stratification in whole-genome association studies. Am J Hum Genet. 2007, 81: 895-905. 10.1086/521372.PubMed CentralView ArticlePubMedGoogle Scholar
- Begovich AB, Carlton VE, Honigberg LA, Schrodi SJ, Chokkalingam AP, Alexander HC, Ardlie KG, Huang Q, Smith AM, Spoerke JM, Conn MT, Chang M, Chang SY, Saiki RK, Catanese JJ, Leong DU, Garcia VE, McAllister LB, Jeffery DA, Lee AT, Batliwalla F, Remmers E, Criswell LA, Seldin MF, Kastner DL, Amos CI, Sninsky JJ, Gregersen PK: A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. Am J Hum Genet. 2004, 75: 330-337. 10.1086/422827.PubMed CentralView ArticlePubMedGoogle Scholar
- Carlton VE, Hu X, Chokkalingam AP, Schrodi SJ, Brandon R, Alexander HC, Chang M, Catanese JJ, Leong DU, Ardlie KG, Kastner DL, Seldin MF, Criswell LA, Gregersen PK, Beasley E, Thomson G, Amos CI, Begovich AB: PTPN22 genetic variation: evidence for multiple variants associated with rheumatoid arthritis. Am J Hum Genet. 2005, 77: 567-581. 10.1086/468189.PubMed CentralView ArticlePubMedGoogle Scholar
- Plenge RM, Cotsapas C, Davies L, Price AL, de Bakker PI, Maller J, Pe'er I, Burtt NP, Blumenstiel B, DeFelice M, Parkin M, Barry R, Winslow W, Healy C, Graham RR, Neale BM, Izmailova E, Roubenoff R, Parker AN, Glass R, Karlson EW, Maher N, Hafler DA, Lee DM, Seldin MF, Remmers EF, Lee AT, Padyukov L, Alfredsson L, Coblyn J, Weinblatt ME, Gabriel SB, Purcell S, Klareskog L, Gregersen PK, Shadick NA, Daly MJ, Altshuler D: Two independent alleles at 6q23 associated with risk of rheumatoid arthritis. Nat Genet. 2007, 39: 1477-1482. 10.1038/ng.2007.27.PubMed CentralView ArticlePubMedGoogle Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.PubMed CentralView ArticlePubMedGoogle Scholar
- Newton JL, Harney SM, Wordsworth BP, Brown MA: A review of the MHC genetics of rheumatoid arthritis. Genes Immun. 2004, 5: 151-157. 10.1038/sj.gene.6364045.View ArticlePubMedGoogle Scholar
- The comprehensive R archive network. [http://cran.r-project.org]
- Campbell CD, Ogburn EL, Lunetta KL, Lyon HN, Freedman ML, Groop LC, Altshuler D, Ardlie KG, Hirschhorn JN: Demonstrating stratification in a European American population. Nat Genet. 2005, 37: 868-872. 10.1038/ng1607.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.