Principal components ancestry adjustment for Genetic Analysis Workshop 17 data

Statistical tests on rare variant data may well have type I error rates that differ from their nominal levels. Here, we use the Genetic Analysis Workshop 17 data to estimate the type I error rates and power of three models for identifying rare variants associated with a phenotype: (1) by using the number of minor alleles, age, and smoking status as predictor variables; (2) by using the number of minor alleles, age, smoking status, and the identity of the population of the subject as predictor variables; and (3) by using the number of minor alleles, age, smoking status, and ancestry adjustment using 10 principal component scores. We studied both a quantitative phenotype and a dichotomized phenotype. The model with principal component adjustment has type I error rates that are closer to the nominal level of significance of 0.05 for single-nucleotide polymorphisms (SNPs) in noncausal genes for the selected phenotype than the model directly adjusting for population. The principal component adjustment model type I error rates are also closer to the nominal level of 0.05 for noncausal SNPs located in causal genes for the phenotype. The power for causal SNPs with the principal component adjustment model is comparable to the power of the other methods. The power using the underlying quantitative phenotype is greater than the power using the dichotomized phenotype.


Background
One limitation of genome-wide association studies is that population stratification can be a confounding variable. Population stratification occurs when there are systematic ancestry differences in allele frequencies between case subjects and control subjects. If not taken into account, population stratification can cause false-positive and/or false-negative findings [1] and can produce spurious associations [2]. Principal components analysis can be used to correct for population stratification by applying methods that infer genetic ancestry [3]. Population stratification is mainly due to the demographic history of a population, natural selection, and random fluctuations resulting from admixture. In this paper we examine the statistical properties of analysis procedures used in genome-wide association studies that adjust for principal components (PCs) computed across the whole genome. Another approach is to use local PC adjustment [4], but the Genetic Analysis Workshop 17 (GAW17) genotype data are not sufficiently extensive to consider this strategy.
The GAW17 data set is composed of mini-exome simulated data using 697 unrelated subjects from the 1000 Genomes Project. The quantitative phenotypes Q1 and Q2 are generated as normally distributed phenotypes. We document the p-value of the test of the coefficient of a genotype with and without adjusting for population stratification in selected genes known not to cause the phenotypes Q1 and Q2. We compare the power of the regression coefficient test when using PCs for ancestry adjustment with the power when using the seven populations given as ancestry controls for all the genes known to cause phenotypes Q1 and Q2. We study two types of phenotype, quantitative and dichotomized, test all the single-nucleotide polymorphisms (SNPs) that cause Q1 and Q2, and examine selected noncausal SNPs for these two traits.

Methods
A SNP that causes a trait is one that is specified in the function used to simulate the trait [5,6]. Any other SNP is called noncausal. SNPs on chromosomes 12, 21, and 22 are used as SNPs not causing Q1. SNPs on chromosomes 21 and 22 are used as SNPs not causing Q2. Table 1 lists the distribution of the minor allele frequencies (MAFs) of the SNPs in the genes studied.
We dichotomize the quantitative measures Q1 and Q2 so that the top 25% of each of the 200 replicates is scored as affected (1) and others as unaffected (0). The independent variables in these analyses are selected from the number of minor alleles in the ith SNP genotype (SNP_i), the participant's age (Age) and smoking status (Smoking), six indicator variables of the populations (POP_1, …, POP_6), and the 10 ancestry-adjusted PC scores (GPC_1, …, GPC_10). We use the FamCC software [7] to calculate these 10 PCs. All 24,487 SNPs are used in the calculations.
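The two preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the FamCC algorithm: it assumes a subjects-by-SNPs matrix of minor allele counts and obtains PC scores from a plain singular value decomposition of the column-centered matrix. The function names `dichotomize_top_quartile` and `top_pc_scores` are hypothetical.

```python
import numpy as np


def dichotomize_top_quartile(q):
    """Score the top 25% of one replicate's quantitative values as affected (1)."""
    q = np.asarray(q, dtype=float)
    cutoff = np.quantile(q, 0.75)  # 75th percentile within this replicate
    return (q > cutoff).astype(int)


def top_pc_scores(genotypes, n_pcs=10):
    """Return the first n_pcs principal component scores per subject.

    genotypes: subjects x SNPs array of minor allele counts (0, 1, or 2).
    Assumption: simple centering and SVD; FamCC may normalize differently.
    """
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)  # center each SNP column
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]  # scores = left singular vectors scaled by singular values
```

In this sketch the dichotomization is applied separately to each of the 200 replicates, whereas the PC scores depend only on the genotypes and so would be computed once.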
We use the PLINK software [8] to fit three logistic regression models to assess the association between each SNP in the genes studied and the dichotomized phenotype. The ith SNP is considered associated with the phenotype when the permutation p-value of the coefficient of SNP_i reported in the PLINK logistic regression analysis is less than 0.05. Because Q1 is affected by age and smoking, the models considered are the following: (1) the SNP model, in which each SNP is adjusted for age and smoking; (2) the population adjustment model, in which each SNP is adjusted for the populations, age, and smoking; and (3) the PC adjustment model, in which each SNP is adjusted for age, smoking, and the 10 ancestry adjustment PCs. For the population adjustment model, only six indicators are needed to represent seven populations. The Luhya population is the reference population for the dichotomized phenotype, and the CEU population (European-descended residents of Utah) is the reference population for the quantitative phenotype. Because Q2 is not associated with either age or smoking, the covariates Age and Smoking are not used in the models for Q2. We also fit the three models to the continuous phenotypes Q1 and Q2 using PLINK. Each model is fitted to the 200 replicates provided.
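Written out from the variable definitions above, the three models for the dichotomized Q1 phenotype would plausibly take the following form. This is a sketch: the coefficient symbols (β, γ, δ) are illustrative notation and the logit link is assumed from the use of logistic regression. Here Y denotes the dichotomized phenotype.

```latex
\begin{align*}
\text{SNP model:}\quad
  & \operatorname{logit} P(Y = 1) = \beta_0 + \beta_1\,\mathrm{SNP}_i
    + \beta_2\,\mathrm{Age} + \beta_3\,\mathrm{Smoking} \\
\text{Population adjustment model:}\quad
  & \operatorname{logit} P(Y = 1) = \beta_0 + \beta_1\,\mathrm{SNP}_i
    + \beta_2\,\mathrm{Age} + \beta_3\,\mathrm{Smoking}
    + \sum_{k=1}^{6} \gamma_k\,\mathrm{POP}_k \\
\text{PC adjustment model:}\quad
  & \operatorname{logit} P(Y = 1) = \beta_0 + \beta_1\,\mathrm{SNP}_i
    + \beta_2\,\mathrm{Age} + \beta_3\,\mathrm{Smoking}
    + \sum_{k=1}^{10} \delta_k\,\mathrm{GPC}_k
\end{align*}
```

In each case the test of association is the test of β_1 = 0. For Q2 the Age and Smoking terms would be dropped, and for the quantitative phenotypes the left-hand side would be the expected value of the phenotype rather than a logit.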

Results
The type I error rate (i.e., false-positive rate) for noncausal genes is the fraction of tests of noncausal SNPs with a permutation p-value less than 0.05. Table 2 contains the type I error rates for Q1 and Q2. The PC adjustment model has a type I error rate closer to 0.05 than the type I error rates for the SNP model and the population adjustment model. For Q2, the type I error rates are relatively close to the nominal value of 0.05 for each model. Tables 3 and 4 contain the results for Q1 and Q2 using all causal and noncausal SNPs in causal genes that determine that trait. For noncausal SNPs in causal genes for both Q1 and Q2, the PC adjustment model has permutation type I error rates that are closest to 0.05, although the type I error rates are slightly above the nominal value of 0.05. For Q1, the PC adjustment model has the lowest power for causal SNPs, possibly because of better control of the type I error rate. For Q2, where all null type I error rates are relatively close to the nominal rate of 0.05, the power for causal SNPs is roughly the same for the three models.
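The rate defined at the start of this section reduces to a simple proportion. A minimal sketch, assuming the permutation p-values have already been collected into an array (the function name `rejection_rate` and the example values are hypothetical, not GAW17 results):

```python
import numpy as np


def rejection_rate(p_values, alpha=0.05):
    """Fraction of tests whose p-value falls below alpha.

    Applied to noncausal SNPs this estimates the type I error rate;
    applied to causal SNPs it estimates power.
    """
    p = np.asarray(p_values, dtype=float)
    return float(np.mean(p < alpha))


# Made-up p-values for illustration only: two of the four fall below 0.05.
example_rate = rejection_rate([0.01, 0.20, 0.04, 0.50])
```

The same function serves for both quantities because only the set of SNPs supplied changes, which mirrors how the type I error rate and power are tabulated side by side for the three models.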

Discussion
Because the disease status of interest is dichotomous in many studies, we study these dichotomized phenotypes. Chromosomes 21 and 22 have no causal SNPs for either Q1 or Q2. Therefore we define the SNPs on these two chromosomes as noncausal SNPs. Because other GAW17 participants have reported highly significant association between SNPs on chromosome 12 and Q1, [...] and causal SNPs within the gene resulting from linkage disequilibrium. It may also result from multiple testing. The power of the PC adjustment model is relatively strong and increases as the MAF increases, as expected. The power of regression modeling for the quantitative phenotype is greater than the power of logistic regression modeling of the dichotomized phenotype for both Q1 and Q2.
In this study, we compare the PC adjustment model with a model including population of origin as a factor. The PC adjustment model has both a type I error rate closer to the nominal level of 0.05 and high power. This is because, in general, PCs calculated using all SNPs contain more information about demographic history, natural selection, and random fluctuation in admixture than the population to which a participant is assigned. That is, participants' genes may still hold genetic information that distinguishes them from the population from which they originated.
The data used here were simulated rather than real. We set our significance level to 0.05 because the number of replicates is 200. As a result, the expected number of null rejections is 10, which allows for meaningful statistical comparison. We also studied a nominal significance level of 0.01 (data not shown) and found similar control of the type I error rate except for SNPs with MAF < 0.005, where the type I error rate was 0.036, somewhat higher than expected. We could not study the type I error rate using typical genome-wide significance levels, such as 10^-8.
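The arithmetic behind this choice can be made explicit. For a single noncausal SNP tested once in each of N = 200 replicates with a well-calibrated test (an idealizing assumption), the expected number of null rejections at level α is Nα:

```latex
\[
E[\text{null rejections}] = N\alpha:\qquad
200 \times 0.05 = 10,\qquad
200 \times 0.01 = 2,\qquad
200 \times 10^{-8} = 2 \times 10^{-6}.
\]
```

At a genome-wide level the expected count per SNP is effectively zero, so 200 replicates cannot yield a usable estimate of the type I error rate at that threshold.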

Conclusions
The PC adjustment model with permutation p-value controls the type I error rate in the GAW17 Q1 and Q2 phenotypes. The power of the regression analysis of the quantitative phenotype is greater than the power of the analysis of the dichotomized phenotype. There is a slight decrease in power for the PC adjustment model even when MAF < 0.005.