Volume 3 Supplement 7
Genetic Analysis Workshop 16
Comparison of methods for correcting population stratification in a genomewide association study of rheumatoid arthritis: principalcomponent analysis versus multidimensional scaling
 Dai Wang^{1}Email author,
 Yu Sun^{1},
 Paul Stang^{2},
 Jesse A Berlin^{2},
 Marsha A Wilcox^{2} and
 Qingqin Li^{1}
DOI: 10.1186/175365613S7S109
© Wang et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Abstract
Population stratification (PS) represents a major challenge in genomewide association studies. Using the Genetic Analysis Workshop 16 Problem 1 data, which include samples of rheumatoid arthritis patients and healthy controls, we compared two methods that can be used to evaluate population structure and correct PS in genomewide association studies: the principalcomponent analysis method and the multidimensionalscaling method. While both methods identified similar population structures in this dataset, principalcomponent analysis performed slightly better than the multidimensionalscaling method in correcting for PS in genomewide association analysis of this dataset.
Background
In the past few years, the genomewide association (GWA) approach has become a widely used tool for identifying genetic loci related to disease risk. Population stratification (PS) is a major challenge in GWA studies (GWAS), because of the risk of generating false positives that represent genetic differences from ancestry rather than genes associated with a disease. Among the methods developed for correcting PS in GWAS, the principalcomponent analysis (PCA) method [1, 2] and the multidimensionalscaling (MDS) method [3, 4] are also capable of detecting population structure. The PCA method identifies principal components that represent the population structure based on genetic correlations among individuals. The MDS method detects meaningful underlying dimensions that explain observed genetic distance, e.g., pairwise identitybystate (IBS) distance, among individuals. While other methods for addressing population structure exist, we focused on these two methods in this study.
The objectives of this study were: 1) to compare the population structures identified by PCA and MDS in the rheumatoid arthritis (RA) dataset of Genetic Analysis Workshop 16 (GAW16); and 2) to evaluate the performance of these two approaches for correcting PS in GWA analyses.
Methods
GAW16 Problem 1 data
GAW16 Problem 1 data, provided by the North American Rheumatoid Arthritis Consortium (NARAC), contained genomewide data on 868 RA cases and 1,194 controls. Genotype data on 545,080 singlenucleotide polymorphisms (SNPs) were available for analysis.
Genotype data quality control
Quality control of genotype data was conducted at both the individual level and the SNP level. At the individual level, a call rate of at least 0.95 was required. Sex discrepancies were examined using the heterozygosity rate of Xchromosome. At the SNP level, a call rate of at least 0.90, a minor allele frequency of at least 0.01, and a pvalue from the HardyWeinberg equilibrium test of at least 0.05/545,080 were required.
Principalcomponent analysis
PCA was performed using the computer program EIGENSOFT 2.0 [1, 2]. Theoretically, the leading components should reflect population structure. In this case, some of the leading components appeared to be dominated by a small set of markers all mapped to a few very small chromosome regions that showed extended linkage disequilibrium (LD). To deal with this problem, we applied a modified version of the PCA as described by Fellay et al. [5]. A first round PCA was conducted using all autosomal SNPs with minor allele frequency >0.01. SNP loadings for the leading components were compared with a normal distribution to determine whether these components depended on many SNPs across the genome or if they were dominated by relatively few SNPs all mapped to a few small chromosome regions with extended LD, as would be expected when the given component reflected population structure or a more localized LD effect, respectively. To correct for the local effects, the PCA was reapplied in a reduced SNP set. In this reduced SNP set, i) SNPs with loadings that deviated from their expected normal quantiles with a distance greater than one were excluded along all leading components; ii) remaining SNPs were pruned using the "indeppairwise" option in PLINK 1.03 [3] such that all SNPs within a given window size of 100 had pairwise r^{2} < 0.2; iii) each SNP was regressed on the previous two SNPs, and the residual entered into the PCA. SNP loadings on all components deemed significant by the TracyWidom statistic [6] were reinspected to make sure that no component was dominated by a small LD region of the genome. In case there were still leading components dominated by local LD regions, the second round of PCA was repeated with adjusted parameters until no component was dominated by a small LD region. Population outliers were excluded along all significant components.
Multidimensional scaling
MDS analysis was performed using PLINK1.03 [3]. All SNPs that passed quality control were pruned such that all SNPs within a given window size of 100 had pairwise r^{2} < 0.2. Pairwise IBS distance was calculated using all autosomal SNPs that remained after pruning. Five nearest neighbors were identified for each individual based upon the pairwise IBS distance. IBS distance to each of the five nearest neighbors was then transformed into a Z score. Individuals with a minimum Z score among the five nearest neighbors less than 4 were excluded from analysis as population outliers. MDS dimensions were extracted using the "MDSplot" option.
Genomewide association analyses
Three GWA analyses were performed using PLINK 1.03 [3]. These three GWA analyses were CochranArmitage trend test without any adjustment for PS, logistic regression with the final set of significant principal components as covariates, and logistic regression with leading MDS dimensions as covariates. The genomic inflation factor [7] was calculated for each GWA analysis.
Results and discussion
Genotype data quality control
All individuals had call rates >0.95 at the individual level. An examination of sex led to the exclusion of seven individuals due to incorrect or ambiguous sex information when compared with phenotype data. At the SNP level, 5,449 SNPs with call rates <0.90, 23,205 SNPs with minor allele frequencies <0.01, 1,389 SNPs with pvalues from HardyWeinberg equilibrium test <0.05/545,080, and 10 SNPs on Ychromosome were excluded from analysis. After genotype data quality control, there were 2,055 subjects and 515,741 SNPs in the analysis dataset.
PCA, MDS, and population structure
In the first round of PCA, 59 components met the criteria for statistical significance using the Tracywisdom statistic. Nearly half of the SNPs that deviated from their expected normal quantiles with a distance of at least 1 (4,413 of 9,980) were in the HLA region (chr6: 25,000,00033,500,000 bp), a region that had been reported with higher genetic heterogeneity across different populations. After removing SNPs that deviated from their expected normal quantiles with a distance of at least one and pruning the SNPs based on LD information, 81,636 autosomal SNPs were included in the second round of PCA. This analysis resulted in eight significant components using the TracyWidom statistic. A small number of individuals (n = 9) were excluded as outliers with scores on one of the significant principal components more than six standard deviations beyond the sample mean score.
In the MDS analysis, 81,652 autosomal SNPs were used to calculate the pairwise IBS distance after SNP pruning. A small number of individuals (n = 7) were excluded as outliers with a minimum Z score among five nearest neighbors less than 4. Among the seven outliers, five were also among the nine outliers excluded by PCA. The first eight dimensions were retained for correcting the PS in GWA analysis.
Correlation between first eight principal components and first eight MDS dimensions
Top eight principal components  Top eight MDS dimensions  

dim1  dim2  dim3  dim4  dim5  dim6  dim7  dim8  
evec1  0.998 ^{ a }  0.01  0.02  0.01  0.005  0.01  0.002  0.0004 
evec2  0.00  0.98  0.17  0.03  0.01  0.01  0.01  0.002 
evec3  0.04  0.17  0.96  0.11  0.01  0.01  0.01  0.01 
evec4  0.02  0.00  0.12  0.89  0.06  0.17  0.07  0.03 
evec5  0.02  0.01  0.05  0.38  0.20  0.43  0.20  0.03 
evec6  0.01  0.01  0.02  0.02  0.13  0.00  0.08  0.17 
evec7  0.02  0.001  0.03  0.01  0.17  0.07  0.17  0.31 
evec8  0.02  0.01  0.02  0.02  0.26  0.18  0.32  0.06 
GWAS results
pValues at SNP rs2476601 from three GWA analyses and their rankings in nonHLA SNPs
Analysis method  pValue  Rank in nonHLA SNPs 

Trend test without adjustment  5.42 × 10^{12}  3 
Logistic regression adjusted for principal components  0.000018  18 
Logistic regression adjusted for MDS dimensions  0.000022  59 
Although PCA performed slightly better than MDS in correcting PS in the GWA analysis of this dataset, it would be inappropriate to conclude that PCA is a preferred approach in all GWAS. MDS is a more flexible method in general as compared with PCA. First, PCA requires that underlying data follow a multivariate normal distribution, while MDS imposes no such restriction. Second, PCA requires computation of a covariance matrix first, while MDS can be applied to any kind of distances or similarities. Pairwise IBS distance is only one example of many distance measures to which MDS can be applied. As a special case, MDS can be applied to the covariance matrix used in PCA as well. In this case, the performance of MDS in correcting PS will be equivalent to it of PCA [4].
Conclusion
In this paper, we compared the performance of PCA and MDS in identifying population structure and correcting for PS in GWAS using data provided to GAW16 participants by the NARAC. While the two methods identified similar population structures in this dataset, PCA performed slightly better than MDS in correcting for PS in the GWA analyses of this data set.
List of abbreviations used
 GAW16:

Genetic Analysis Workshop 16
 GWA:

Genomewide association
 GWAS:

GWA studies
 IBS:

Identitybystate
 LD:

Linkage disequilibrium
 MDS:

Multidimensional scaling
 NARAC:

North American Rheumatoid Arthritis Consortium
 PCA:

Principalcomponent analysis
 PS:

Population stratification
 QQ:

Quantilequantile
 RA:

Rheumatoid arthritis
 SNP:

Singlenucleotide polymorphism.
Declarations
Acknowledgements
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/3?issue=S7.
Authors’ Affiliations
References
 Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genet. 2006, 2: e19010.1371/journal.pgen.0020190.PubMed CentralView ArticlePubMedGoogle Scholar
 Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genomewide association studies. Nat Genet. 2006, 38: 904909. 10.1038/ng1847.View ArticlePubMedGoogle Scholar
 Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for wholegenome association and populationbased linkage analyses. Am J Hum Genet. 2007, 81: 559575. 10.1086/519795.PubMed CentralView ArticlePubMedGoogle Scholar
 Li Q, Yu K: Improved correction for population stratification in genomewide association studies by identifying hidden population structures. Genet Epidemiol. 2008, 32: 215226. 10.1002/gepi.20296.View ArticlePubMedGoogle Scholar
 Fellay J, Shianna KV, Ge D, Colombo S, Ledergerber B, Weale M, Zhang K, Gumbs C, Castagna A, Cossarizza A, CozziLepri A, De Luca A, Easterbrook P, Francioli P, Mallal S, MartinezPicado J, Miro JM, Obel N, Smith JP, Wyniger J, Descombes P, Antonarakis SE, Letvin NL, McMichael AJ, Haynes BF, Telenti A, Goldstein DB: A wholegenome association study of major determinants for host control of HIV1. Science. 2007, 317: 944947. 10.1126/science.1143767.PubMed CentralView ArticlePubMedGoogle Scholar
 Tracy C, Widom H: Levelspacing distributions and the Airy kernel. Commun Math Phys. 1994, 159: 151174. 10.1007/BF02100489.View ArticleGoogle Scholar
 Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55: 9971004. 10.1111/j.0006341X.1999.00997.x.View ArticlePubMedGoogle Scholar
 Gregersen PK, Silver J, Winchester RJ: The shared epitope hypothesis. An approach to understanding the molecular genetics of susceptibility to rheumatoid arthritis. Arthritis Rheum. 1987, 30: 12051213. 10.1002/art.1780301102.View ArticlePubMedGoogle Scholar
 Newton JL, Harney SM, Wordsworth BP, Brown MA: A review of the MHC genetics of rheumatoid arthritis. Genes Immun. 2004, 5: 151157. 10.1038/sj.gene.6364045.View ArticlePubMedGoogle Scholar
 Jawaheer D, Seldin MF, Amos CI, Chen WV, Shigeta R, Etzel C, Damle A, Xiao X, Chen D, Lum RF, Monteiro J, Kern M, Criswell LA, Albani S, Nelson JL, Clegg DO, Pope R, Schroeder HW, Bridges SL, Pisetsky DS, Ward R, Kastner DL, Wilder RL, Pincus T, Callahan LF, Flemming D, Wener MH, Gregersen PK, North American Rheumatoid Arthritis Consortium: Screening the genome for rheumatoid arthritis susceptibility genes: a replication study and combined analysis of 512 multicase families. Arthritis Rheum. 2003, 48: 906916. 10.1002/art.10989.View ArticlePubMedGoogle Scholar
 Irigoyen P, Lee AT, Wener MH, Li W, Kern M, Batliwalla F, Lum RF, Massarotti E, Weisman M, Bombardier C, Remmers EF, Kastner DL, Seldin MF, Criswell LA, Gregersen PK: Regulation of anticyclic citrullinated peptide antibodies in rheumatoid arthritis: contrasting effects of HLADR3 and the shared epitope alleles. Arthritis Rheum. 2005, 52: 38133818. 10.1002/art.21419.View ArticlePubMedGoogle Scholar
 Begovich AB, Carlton VE, Honigberg LA, Schrodi SJ, Chokkalingam AP, Alexander HC, Ardlie KG, Huang Q, Smith AM, Spoerke JM, Conn MT, Chang M, Chang SY, Saiki RK, Catanese JJ, Leong DU, Garcia VE, McAllister LB, Jeffery DA, Lee AT, Batliwalla F, Remmers E, Criswell LA, Seldin MF, Kastner DL, Amos CI, Sninsky JJ, Gregersen PK: A missense singlenucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. Am J Hum Genet. 2004, 75: 330337. 10.1086/422827.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.