Volume 3 Supplement 7
Comparison of methods for correcting population stratification in a genome-wide association study of rheumatoid arthritis: principal-component analysis versus multidimensional scaling
© Wang et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Population stratification (PS) represents a major challenge in genome-wide association studies. Using the Genetic Analysis Workshop 16 Problem 1 data, which include samples of rheumatoid arthritis patients and healthy controls, we compared two methods that can be used to evaluate population structure and correct PS in genome-wide association studies: the principal-component analysis method and the multidimensional-scaling method. While both methods identified similar population structures in this dataset, principal-component analysis performed slightly better than the multidimensional-scaling method in correcting for PS in genome-wide association analysis of this dataset.
In the past few years, the genome-wide association (GWA) approach has become a widely used tool for identifying genetic loci related to disease risk. Population stratification (PS) is a major challenge in GWA studies (GWAS), because of the risk of generating false positives that represent genetic differences from ancestry rather than genes associated with a disease. Among the methods developed for correcting PS in GWAS, the principal-component analysis (PCA) method [1, 2] and the multidimensional-scaling (MDS) method [3, 4] are also capable of detecting population structure. The PCA method identifies principal components that represent the population structure based on genetic correlations among individuals. The MDS method detects meaningful underlying dimensions that explain observed genetic distance, e.g., pairwise identity-by-state (IBS) distance, among individuals. While other methods for addressing population structure exist, we focused on these two methods in this study.
The objectives of this study were: 1) to compare the population structures identified by PCA and MDS in the rheumatoid arthritis (RA) dataset of Genetic Analysis Workshop 16 (GAW16); and 2) to evaluate the performance of these two approaches for correcting PS in GWA analyses.
GAW16 Problem 1 data
GAW16 Problem 1 data, provided by the North American Rheumatoid Arthritis Consortium (NARAC), contained genome-wide data on 868 RA cases and 1,194 controls. Genotype data on 545,080 single-nucleotide polymorphisms (SNPs) were available for analysis.
Genotype data quality control
Quality control of genotype data was conducted at both the individual level and the SNP level. At the individual level, a call rate of at least 0.95 was required. Sex discrepancies were examined using the heterozygosity rate of X-chromosome. At the SNP level, a call rate of at least 0.90, a minor allele frequency of at least 0.01, and a p-value from the Hardy-Weinberg equilibrium test of at least 0.05/545,080 were required.
PCA was performed using the computer program EIGENSOFT 2.0 [1, 2]. Theoretically, the leading components should reflect population structure. In this case, some of the leading components appeared to be dominated by a small set of markers all mapped to a few very small chromosome regions that showed extended linkage disequilibrium (LD). To deal with this problem, we applied a modified version of the PCA as described by Fellay et al. . A first round PCA was conducted using all autosomal SNPs with minor allele frequency >0.01. SNP loadings for the leading components were compared with a normal distribution to determine whether these components depended on many SNPs across the genome or if they were dominated by relatively few SNPs all mapped to a few small chromosome regions with extended LD, as would be expected when the given component reflected population structure or a more localized LD effect, respectively. To correct for the local effects, the PCA was re-applied in a reduced SNP set. In this reduced SNP set, i) SNPs with loadings that deviated from their expected normal quantiles with a distance greater than one were excluded along all leading components; ii) remaining SNPs were pruned using the "indep-pairwise" option in PLINK 1.03  such that all SNPs within a given window size of 100 had pairwise r2 < 0.2; iii) each SNP was regressed on the previous two SNPs, and the residual entered into the PCA. SNP loadings on all components deemed significant by the Tracy-Widom statistic  were re-inspected to make sure that no component was dominated by a small LD region of the genome. In case there were still leading components dominated by local LD regions, the second round of PCA was repeated with adjusted parameters until no component was dominated by a small LD region. Population outliers were excluded along all significant components.
MDS analysis was performed using PLINK1.03 . All SNPs that passed quality control were pruned such that all SNPs within a given window size of 100 had pairwise r2 < 0.2. Pairwise IBS distance was calculated using all autosomal SNPs that remained after pruning. Five nearest neighbors were identified for each individual based upon the pairwise IBS distance. IBS distance to each of the five nearest neighbors was then transformed into a Z score. Individuals with a minimum Z score among the five nearest neighbors less than -4 were excluded from analysis as population outliers. MDS dimensions were extracted using the "MDS-plot" option.
Genome-wide association analyses
Three GWA analyses were performed using PLINK 1.03 . These three GWA analyses were Cochran-Armitage trend test without any adjustment for PS, logistic regression with the final set of significant principal components as covariates, and logistic regression with leading MDS dimensions as covariates. The genomic inflation factor  was calculated for each GWA analysis.
Results and discussion
Genotype data quality control
All individuals had call rates >0.95 at the individual level. An examination of sex led to the exclusion of seven individuals due to incorrect or ambiguous sex information when compared with phenotype data. At the SNP level, 5,449 SNPs with call rates <0.90, 23,205 SNPs with minor allele frequencies <0.01, 1,389 SNPs with p-values from Hardy-Weinberg equilibrium test <0.05/545,080, and 10 SNPs on Y-chromosome were excluded from analysis. After genotype data quality control, there were 2,055 subjects and 515,741 SNPs in the analysis dataset.
PCA, MDS, and population structure
In the first round of PCA, 59 components met the criteria for statistical significance using the Tracy-wisdom statistic. Nearly half of the SNPs that deviated from their expected normal quantiles with a distance of at least 1 (4,413 of 9,980) were in the HLA region (chr6: 25,000,000-33,500,000 bp), a region that had been reported with higher genetic heterogeneity across different populations. After removing SNPs that deviated from their expected normal quantiles with a distance of at least one and pruning the SNPs based on LD information, 81,636 autosomal SNPs were included in the second round of PCA. This analysis resulted in eight significant components using the Tracy-Widom statistic. A small number of individuals (n = 9) were excluded as outliers with scores on one of the significant principal components more than six standard deviations beyond the sample mean score.
In the MDS analysis, 81,652 autosomal SNPs were used to calculate the pairwise IBS distance after SNP pruning. A small number of individuals (n = 7) were excluded as outliers with a minimum Z score among five nearest neighbors less than -4. Among the seven outliers, five were also among the nine outliers excluded by PCA. The first eight dimensions were retained for correcting the PS in GWA analysis.
Correlation between first eight principal components and first eight MDS dimensions
Top eight principal components
Top eight MDS dimensions
p-Values at SNP rs2476601 from three GWA analyses and their rankings in non-HLA SNPs
Rank in non-HLA SNPs
Trend test without adjustment
5.42 × 10-12
Logistic regression adjusted for principal components
Logistic regression adjusted for MDS dimensions
Although PCA performed slightly better than MDS in correcting PS in the GWA analysis of this dataset, it would be inappropriate to conclude that PCA is a preferred approach in all GWAS. MDS is a more flexible method in general as compared with PCA. First, PCA requires that underlying data follow a multivariate normal distribution, while MDS imposes no such restriction. Second, PCA requires computation of a covariance matrix first, while MDS can be applied to any kind of distances or similarities. Pairwise IBS distance is only one example of many distance measures to which MDS can be applied. As a special case, MDS can be applied to the covariance matrix used in PCA as well. In this case, the performance of MDS in correcting PS will be equivalent to it of PCA .
In this paper, we compared the performance of PCA and MDS in identifying population structure and correcting for PS in GWAS using data provided to GAW16 participants by the NARAC. While the two methods identified similar population structures in this dataset, PCA performed slightly better than MDS in correcting for PS in the GWA analyses of this data set.
List of abbreviations used
Genetic Analysis Workshop 16
North American Rheumatoid Arthritis Consortium
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
- Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genet. 2006, 2: e190-10.1371/journal.pgen.0020190.PubMed CentralView ArticlePubMedGoogle Scholar
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38: 904-909. 10.1038/ng1847.View ArticlePubMedGoogle Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.PubMed CentralView ArticlePubMedGoogle Scholar
- Li Q, Yu K: Improved correction for population stratification in genome-wide association studies by identifying hidden population structures. Genet Epidemiol. 2008, 32: 215-226. 10.1002/gepi.20296.View ArticlePubMedGoogle Scholar
- Fellay J, Shianna KV, Ge D, Colombo S, Ledergerber B, Weale M, Zhang K, Gumbs C, Castagna A, Cossarizza A, Cozzi-Lepri A, De Luca A, Easterbrook P, Francioli P, Mallal S, Martinez-Picado J, Miro JM, Obel N, Smith JP, Wyniger J, Descombes P, Antonarakis SE, Letvin NL, McMichael AJ, Haynes BF, Telenti A, Goldstein DB: A whole-genome association study of major determinants for host control of HIV-1. Science. 2007, 317: 944-947. 10.1126/science.1143767.PubMed CentralView ArticlePubMedGoogle Scholar
- Tracy C, Widom H: Level-spacing distributions and the Airy kernel. Commun Math Phys. 1994, 159: 151-174. 10.1007/BF02100489.View ArticleGoogle Scholar
- Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55: 997-1004. 10.1111/j.0006-341X.1999.00997.x.View ArticlePubMedGoogle Scholar
- Gregersen PK, Silver J, Winchester RJ: The shared epitope hypothesis. An approach to understanding the molecular genetics of susceptibility to rheumatoid arthritis. Arthritis Rheum. 1987, 30: 1205-1213. 10.1002/art.1780301102.View ArticlePubMedGoogle Scholar
- Newton JL, Harney SM, Wordsworth BP, Brown MA: A review of the MHC genetics of rheumatoid arthritis. Genes Immun. 2004, 5: 151-157. 10.1038/sj.gene.6364045.View ArticlePubMedGoogle Scholar
- Jawaheer D, Seldin MF, Amos CI, Chen WV, Shigeta R, Etzel C, Damle A, Xiao X, Chen D, Lum RF, Monteiro J, Kern M, Criswell LA, Albani S, Nelson JL, Clegg DO, Pope R, Schroeder HW, Bridges SL, Pisetsky DS, Ward R, Kastner DL, Wilder RL, Pincus T, Callahan LF, Flemming D, Wener MH, Gregersen PK, North American Rheumatoid Arthritis Consortium: Screening the genome for rheumatoid arthritis susceptibility genes: a replication study and combined analysis of 512 multicase families. Arthritis Rheum. 2003, 48: 906-916. 10.1002/art.10989.View ArticlePubMedGoogle Scholar
- Irigoyen P, Lee AT, Wener MH, Li W, Kern M, Batliwalla F, Lum RF, Massarotti E, Weisman M, Bombardier C, Remmers EF, Kastner DL, Seldin MF, Criswell LA, Gregersen PK: Regulation of anti-cyclic citrullinated peptide antibodies in rheumatoid arthritis: contrasting effects of HLA-DR3 and the shared epitope alleles. Arthritis Rheum. 2005, 52: 3813-3818. 10.1002/art.21419.View ArticlePubMedGoogle Scholar
- Begovich AB, Carlton VE, Honigberg LA, Schrodi SJ, Chokkalingam AP, Alexander HC, Ardlie KG, Huang Q, Smith AM, Spoerke JM, Conn MT, Chang M, Chang SY, Saiki RK, Catanese JJ, Leong DU, Garcia VE, McAllister LB, Jeffery DA, Lee AT, Batliwalla F, Remmers E, Criswell LA, Seldin MF, Kastner DL, Amos CI, Sninsky JJ, Gregersen PK: A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. Am J Hum Genet. 2004, 75: 330-337. 10.1086/422827.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.