We analyzed the genome-wide association study data from the North American Rheumatoid Arthritis Consortium (NARAC) provided as Problem 1 for Genetic Analysis Workshop 16 [10, 11]. This dataset is composed of cases from several sources: families, sib-pairs, sporadic cases, persons with long-standing disease, and new-onset cases. Control participants were selected from a population-based cancer study in New York and were frequency-matched to case participants on self-reported ethnic origin. Genotyping was performed with the Illumina Infinium HumanHap550 (version 1.0) platform (San Diego, CA), which covers 545,080 single-nucleotide polymorphisms (SNPs), for all case participants and 48% of control participants; 33% of controls were genotyped using HumanHap550 version 3.0 and 20% with the HumanHap300 and HumanHap240S arrays. The multiple sources of case and control participants in these data argue for careful examination of the role of population stratification in any associations found.
We followed the basic quality control procedures outlined by Fellay et al., excluding SNPs that had extensive missingness (missingness > 5%), deviations from Hardy-Weinberg equilibrium (p-value < 0.001 in controls), or low minor allele frequency (<1%). After removing duplicated and contaminated samples, information was available for 2058 individuals (868 cases; 1190 controls). Of these, 568 individuals were male and 1490 were female. A total of 501,228 SNPs were used in subsequent analyses. The average genotyping rate for subjects was 0.994. PLINK was used for data cleaning and to calculate both the unstratified and the stratified Mantel-Haenszel allelic association tests. p-Values for the max(T) statistic were computed using both the Bonferroni method and 10,000 permutation datasets.
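As an illustration of the three SNP-level filters described above, the following minimal sketch applies them to a single SNP. It is not the PLINK implementation: the function names (`hwe_chisq_pvalue`, `passes_qc`) and the genotype coding (0, 1, or 2 copies of one allele, `None` for a missing call) are hypothetical, and the Hardy-Weinberg check uses a simple 1-degree-of-freedom chi-square goodness-of-fit test as a stand-in for whatever test the QC software applies.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square goodness-of-fit test for Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(stat / 2.0))


def passes_qc(genotypes, is_control, max_missing=0.05, min_maf=0.01, hwe_alpha=0.001):
    """Return True if one SNP passes the missingness, MAF, and HWE filters.

    genotypes:  per-individual allele counts (0, 1, 2) or None when missing
    is_control: per-individual booleans; HWE is tested in controls only
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    missing_rate = 1.0 - len(called) / len(genotypes)
    if missing_rate > max_missing:
        return False
    freq = sum(called) / (2 * len(called))
    if min(freq, 1.0 - freq) < min_maf:
        return False
    controls = [g for g, c in zip(genotypes, is_control) if c and g is not None]
    p_hwe = hwe_chisq_pvalue(controls.count(0), controls.count(1), controls.count(2))
    return p_hwe >= hwe_alpha
```

The default thresholds mirror the values stated in the text (5% missingness, 1% minor allele frequency, p < 0.001 for Hardy-Weinberg in controls); everything else is an assumption made for illustration.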
We used the stratification score of Epstein et al. to adjust our analyses for confounding due to population stratification. Those authors focused on adjusting association tests using a limited number of ancestry-informative markers and therefore used partial least squares (PLS) to estimate the stratification score. Here, no such marker panel was readily available; hence, we used markers from across the genome. Applying PLS to these data would likely result in substantial overfitting of the stratification score, leading to a loss of power [14, 15]. A different approach was therefore needed to use this genome-scale information appropriately, and we used a modified principal-component (PC) approach based on Fellay et al. in place of PLS. Starting with the 501,228 SNPs that passed our quality control procedure, this modified PC approach captures the large-scale genetic variation in the data while preventing a few regions of high linkage disequilibrium (LD) from dominating the PCs. This is accomplished by excluding from the PC analysis SNPs that reside in regions of known high LD and then further pruning the PC SNP set to minimize the LD between the remaining SNPs. After this pruning procedure, 81,500 SNPs remained. Using the first few PCs, four individuals (D0009459, D0011466, D0012257, and D0012446) were found to be significant outliers, suggesting appreciable non-European ancestry. These individuals were excluded from subsequent analyses and, when the PC analysis was repeated, no further outliers were identified. The first 10 PCs were then used in a logistic model of disease to estimate each individual's stratification score, that is, their predicted probability of being a case given the genomic information contained in their PCs. Five strata were then formed based on the quantiles of the stratification scores, for use in a stratified association analysis.
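The last two steps of this procedure (fitting a logistic model on the PCs and binning the fitted probabilities into five strata) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: it assumes the PC scores have already been computed and roughly standardized, it fits the logistic model by plain batch gradient ascent rather than a standard statistical package, and the names `fit_logistic`, `predict_prob`, and `assign_strata` are hypothetical.

```python
import math


def predict_prob(beta, x):
    """Predicted probability of being a case; beta = [intercept, b1, ..., bk]."""
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    # Numerically stable logistic function
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def fit_logistic(X, y, n_iter=500, lr=0.1):
    """Fit a logistic regression of case status y (0/1) on PC scores X.

    Uses batch gradient ascent on the log-likelihood (an assumption made
    to keep the sketch dependency-free).
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            resid = yi - predict_prob(beta, xi)
            grad[0] += resid
            for j in range(k):
                grad[j + 1] += resid * xi[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def assign_strata(scores, n_strata=5):
    """Assign each individual to one of n_strata equal-sized bins by score rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(scores)
    return strata
```

With `X` holding one row of PC scores per individual and `y` the case indicator, `scores = [predict_prob(beta, x) for x in X]` gives the stratification scores and `assign_strata(scores)` returns stratum labels 0 to 4. Ranking rather than computing explicit cut points keeps the strata equal in size; tied scores are split by input order, which is a design choice of this sketch.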
We note that the computational demands of this procedure are minimal; it took approximately 30 minutes to generate the principal components and calculate the stratification score on a Linux workstation with two dual-core 2.39-GHz Opteron processors and 6 GB of RAM.
We measured confounding by population stratification using the variance inflation factor (VIF), defined as the median of the observed χ² test statistics divided by the expected value of this median under the null hypothesis of no association of any SNP with rheumatoid arthritis (RA).
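The definition above translates directly into a few lines of code. The sketch below assumes that each association test statistic follows a χ² distribution with 1 degree of freedom under the null, so that the expected median is approximately 0.4549; the function name `variance_inflation_factor` and the constant name are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom (approx. 0.4549).
# Assumes each SNP's test statistic is 1-df chi-square under the null.
CHI2_1DF_MEDIAN = 0.4549364


def variance_inflation_factor(chisq_stats):
    """Median of the observed chi-square statistics over its null expectation."""
    return statistics.median(chisq_stats) / CHI2_1DF_MEDIAN
```

A value near 1 indicates little inflation of the test statistics, while values above 1 suggest residual confounding such as population stratification.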