Volume 3 Supplement 7
A method to correct for population structure using a segregation model
© Feng et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
To overcome the "spurious" association caused by population stratification in population-based association studies, we propose a principal-component-based method that can use family and unrelated samples at the same time. More specifically, we adapt for association analysis the multivariate logistic model, which is often used in segregation analysis and allows for the family correlation structure. To correct for the effect of hidden population structure, the first ten principal components calculated from the matrix of marker genotype data are incorporated as covariates in the model. To test for association, the marker of interest is also incorporated as a covariate. We applied the proposed method to the second generation (i.e., the Offspring Cohort) of the Genetic Analysis Workshop 16 Framingham Heart Study 50 k data set to evaluate its performance. Although convergence may have been difficult while maximizing the likelihood function, as indicated by a flat likelihood, the distribution of the empirical p-values for the test statistic shows that the method has a correct type I error rate whenever the variance-covariance matrix of the estimates can be computed.
Several methods have been proposed recently to overcome a potential problem ("spurious" association) caused by population stratification in population-based association studies. These approaches are favored because case-control studies of unrelated individuals are considered more powerful than family-based studies, and DNA samples are easier to collect for them.
The "Genomic Control" (GC) method adjusts the standard chi-square statistic X² to X²/λ, where λ can be estimated from genomic marker data, and the adjusted statistic follows a chi-square distribution [1–4]. When the sample arises from a population in which population structure exists, the ordinary chi-square statistic may follow a noncentral chi-square distribution X²(δ) with noncentrality parameter δ. The GC method depends on the estimate of λ, which in turn depends on the markers selected for controlling the effect of population stratification, so the resulting test statistic may be either conservative or liberal [5, 6].
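The GC adjustment described above can be sketched as follows. This is a minimal illustration, not the procedure of any cited paper: it assumes λ is estimated as the median of the observed 1-df chi-square statistics divided by the median of a central 1-df chi-square distribution (about 0.4549), and it assumes λ is floored at 1. The function name and the input statistics are hypothetical.

```python
from statistics import median

# Median of a central chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control(chi2_stats):
    """Return (lambda_hat, adjusted statistics) for 1-df chi-square statistics."""
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    # Assumption: lambda is floored at 1 so that statistics are never inflated.
    lam = max(lam, 1.0)
    return lam, [x / lam for x in chi2_stats]

# Made-up statistics from markers assumed to be unassociated with the trait.
null_marker_stats = [0.2, 0.5, 0.91, 1.4, 3.0]
lam, adjusted = genomic_control(null_marker_stats)
```

Because the estimate of λ is driven entirely by the chosen markers, a different marker set can give a different λ, which is the sensitivity noted in the text.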
Another approach, named "Structured Association" (SA), is a Markov-chain Monte Carlo (MCMC)-based method that uses independent genomic markers to infer the number of subpopulations and the ancestry probabilities of individuals from putative unstructured subpopulations, and this inferred information is further used in the test for association [5, 7]. The method was also extended to inferring the population structure while simultaneously estimating the model parameters and testing for association. However, when the number of subpopulations is large, the SA method might be computationally intensive.
Recently, principal-component analysis (PCA)-based methods, which calculate the principal components of marker genotype data to represent the genetic background, have been widely used in association studies [6, 9–13]. Specifically, Price et al. proposed a method of regressing both the phenotype and marker genotype values on the principal components for unrelated data, and association between the phenotype and the marker is tested by using the residual correlation. More recently, Zhu et al. extended the method by allowing both family and unrelated samples in the association test while correcting for population stratification.
In this report, we propose a method to test the association between a binary trait and a marker by using a segregation model that allows for the family correlation structure. We apply an idea similar to that of Zhu et al. to correct for the effect of population stratification. We apply the method to the Genetic Analysis Workshop 16 (GAW16) Framingham Heart Study 50 k data set to evaluate its performance.
We use a regression model to specify a phenotype as a function of the genotype of a single-nucleotide polymorphism (SNP) of interest. Because our focus is on binary traits, we apply the usual approach of logistic regression. We summarize the genotype data by a principal-component analysis to infer axes of genetic variation, which are defined as the top eigenvectors of a covariance matrix between samples. In data sets with population structure, axes of variation often have a geographic interpretation. Thus, incorporating principal components in the model can reduce the effect of population stratification by adjusting the logit by an amount attributable to ancestry along each axis.
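As a rough illustration of this adjustment for a single unrelated individual, the sketch below computes a logit with a marker term plus one term per principal component, then converts it to a probability. It deliberately ignores the family correlation terms of the full model. The function names and all coefficient values are hypothetical.

```python
import math

def logit_with_pcs(beta0, beta_g, genotype, pc_betas, pc_scores):
    """Intercept + marker effect + ancestry adjustment along each PC axis."""
    if len(pc_betas) != len(pc_scores):
        raise ValueError("need one coefficient per principal component")
    ancestry_shift = sum(b * t for b, t in zip(pc_betas, pc_scores))
    return beta0 + beta_g * genotype + ancestry_shift

def affection_probability(logit):
    """Inverse-logit transform."""
    return 1.0 / (1.0 + math.exp(-logit))

# Made-up coefficients and scores for two principal components.
unadjusted = logit_with_pcs(-1.0, 0.2, 1, [0.0, 0.0], [0.8, -0.3])
adjusted = logit_with_pcs(-1.0, 0.2, 1, [0.5, -0.4], [0.8, -0.3])
prob = affection_probability(adjusted)
```

With these made-up values, two individuals carrying the same genotype but different principal-component scores receive different logits, which is how ancestry is kept from being attributed to the marker.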
Let $X_{ij}$ denote the vector of marker genotypes for individual $j$ of family $i$, and let $\Sigma$ denote the variance-covariance matrix of the marker data for the unrelated individuals in our data, computed about $\bar{X}$, the overall mean of $X$. Let $e_l$ be the $l$th eigenvector, corresponding to the $l$th largest eigenvalue of $\Sigma$, $l = 1, 2, \ldots, M$. The $l$th principal component for individual $j$ of family $i$, $t_{ijl}$, can be calculated by $t_{ijl} = (X_{ij} - \bar{X})^T e_l$, where $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, k_i$, and $l = 1, 2, \ldots, M$. In our analysis we consider only the first $L = 10$ principal components, as suggested by Zhu et al. and Price et al.
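The projection step can be sketched as below: a covariance matrix is built from the unrelated individuals only, its leading eigenvectors are found, and any individual (including relatives) is then projected onto them. This is a dependency-free sketch under stated assumptions: power iteration with deflation stands in for whatever eigen-solver was actually used, the divisor n - 1 is assumed because the normalization is not stated in the text, and the genotype values and function names are made up.

```python
import math

def mean_vector(rows):
    """Column means of a list of equal-length genotype vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance(rows, mean):
    """Marker-by-marker sample covariance (divisor n - 1 is an assumption)."""
    centered = [[x - mu for x, mu in zip(row, mean)] for row in rows]
    m = len(mean)
    return [[sum(c[a] * c[b] for c in centered) / (len(rows) - 1)
             for b in range(m)] for a in range(m)]

def mat_vec(matrix, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

def top_eigenpairs(matrix, count, iterations=200):
    """Leading (eigenvalue, eigenvector) pairs by power iteration with deflation."""
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(count):
        v = [1.0] * len(work)
        for _step in range(iterations):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                raise ValueError("fewer non-zero eigenvalues than requested")
            v = [x / norm for x in w]
        value = sum(x * y for x, y in zip(v, mat_vec(work, v)))
        pairs.append((value, v))
        # Deflate: subtract the component just found before the next pass.
        for a in range(len(work)):
            for b in range(len(work)):
                work[a][b] -= value * v[a] * v[b]
    return pairs

def pc_scores(genotypes, mean, eigenvectors):
    """t_l = (X - mean)^T e_l for each eigenvector e_l."""
    centered = [x - mu for x, mu in zip(genotypes, mean)]
    return [sum(c * e for c, e in zip(centered, vec)) for vec in eigenvectors]

# Toy data: 4 unrelated individuals, 3 markers coded as allele counts 0/1/2.
unrelated = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]
mean = mean_vector(unrelated)
pairs = top_eigenpairs(covariance(unrelated, mean), count=2)
eigvecs = [vec for _, vec in pairs]

# A relative who was not used to build the covariance matrix.
relative_scores = pc_scores([1, 2, 1], mean, eigvecs)
```

Restricting the covariance matrix to unrelated individuals keeps family relatedness from shaping the axes, while the final projection still yields scores for every family member.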
A multivariate logistic model
In this model, $B_{ij} = \beta_0 + \beta_g g_{ij} + \beta_1 t_{ij1} + \cdots + \beta_L t_{ijL} + \beta_I I_{ij}$, and $\delta_{i1}$, $\delta_{i2}$, $\delta_{i3}$, and $\delta_{i4}$ are parameters that, respectively, measure the association between parent 1 and offspring, parent 2 and offspring, sib-pairs, and spouse pairs. These $\delta_i$ values are restricted to a range that must hold for all $B_{ij}$ and $B_{il}$.
The overall likelihood $L$ is the product of the likelihoods for all families. We consider a marker and the principal components as covariates. To estimate the unknown parameters, we used the program SEGREG in S.A.G.E., which is based on maximum likelihood estimation (MLE). For simplicity, we set $\delta_{i4} = 0$, i.e., husband and wife were assumed to be independent, and we further assumed that all the remaining $\delta$ values are the same. Because our null hypothesis is no association, $H_0: \beta_g = 0$, rather than testing for a major gene, we maximized the likelihood function under the no-major-gene model, i.e., we assumed a single multivariate logistic distribution rather than a mixture of such distributions, which would be necessary for segregation analysis. We used the Wald statistic to test the null hypothesis.
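For a single coefficient, the Wald test can be sketched as follows. The sketch assumes the usual form, the squared ratio of the estimate to its standard error, referred to a chi-square distribution with 1 degree of freedom. It does not reproduce SEGREG output, and the estimate and standard error used here are made-up values.

```python
import math

def wald_test(beta_hat, std_error):
    """Return (W, p) for H0: beta = 0, where W ~ chi-square(1) under H0."""
    if std_error <= 0:
        raise ValueError("standard error must be positive")
    w = (beta_hat / std_error) ** 2
    # For 1 df, P(chi-square > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2)).
    p = math.erfc(math.sqrt(w / 2.0))
    return w, p

# Hypothetical marker effect estimate and standard error.
w, p = wald_test(0.30, 0.15)
```

The standard error comes from the variance-covariance matrix of the estimates, so the statistic is unavailable whenever that matrix cannot be computed.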
Application to the GAW16 Framingham Heart Study 50 k data set
We applied our method to the GAW16 Framingham Heart Study 50 k data set. We only used the second generation, i.e., the Offspring Cohort. The founders of each family were included to calculate the principal components. If no founder of a family was available, we randomly chose one individual. The spouses of the Offspring were treated as unrelated individuals. Hypertension was defined based on the data at Examination One. We defined an individual as affected if his/her systolic blood pressure was greater than or equal to 140 mm Hg, or diastolic blood pressure was greater than or equal to 90 mm Hg, or he/she was on medication, and unaffected otherwise.
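The affection-status rule above can be written as a small predicate. This is only a sketch: the function and parameter names are hypothetical, and the example blood-pressure values are made up.

```python
def is_hypertensive(systolic_mm_hg, diastolic_mm_hg, on_medication):
    """Affected if SBP >= 140 mm Hg, or DBP >= 90 mm Hg, or on medication."""
    return (systolic_mm_hg >= 140
            or diastolic_mm_hg >= 90
            or bool(on_medication))

# Made-up examples covering each branch of the rule.
examples = [
    is_hypertensive(140, 80, False),   # systolic at threshold
    is_hypertensive(120, 90, False),   # diastolic at threshold
    is_hypertensive(118, 76, True),    # normal readings, but on medication
    is_hypertensive(139, 89, False),   # just below both thresholds
]
```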
The GAW16 50 k Offspring Cohort Data Set includes 3,850 individuals. Among them, 1,170 individuals were not genotyped and were not used for further analysis. Of the remaining 2,680 individuals, 89 were under the age of 18 and they were also removed because our analyzed phenotype is hypertension. There are 48,028 markers available, of which 6,051 have a missing genotyping rate of over 10% and 8,163 have minor allele frequencies less than 5%. Those 14,214 markers were dropped from any further analysis. Our analysis results were thus based on 2,591 individuals and 33,811 markers.
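The marker filters described above can be sketched as a per-marker check. The sketch assumes genotypes are coded as allele counts 0/1/2 with `None` for a missing call. That coding, the function name, and the toy genotype lists are assumptions, not details from the data set.

```python
def marker_passes_qc(genotypes, max_missing=0.10, min_maf=0.05):
    """Keep a marker unless its missing rate is over 10% or its MAF is below 5%."""
    n = len(genotypes)
    observed = [g for g in genotypes if g is not None]
    if n == 0 or not observed:
        return False
    missing_rate = (n - len(observed)) / n
    if missing_rate > max_missing:
        return False
    # Allele frequency from allele counts: two alleles per observed individual.
    freq = sum(observed) / (2 * len(observed))
    maf = min(freq, 1.0 - freq)
    return maf >= min_maf

# Made-up markers: one acceptable, one with 20% missing calls, one too rare.
acceptable = [0, 1, 2, 1, 0, 1, 0, 0, 1, None]
too_missing = [0, 1, None, None, 1, 0, 1, 2, 0, 1]
too_rare = [0] * 19 + [1]
kept = [marker_passes_qc(m) for m in (acceptable, too_missing, too_rare)]
```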
The performance of the method
We proposed a novel association method that adopts the idea of dealing with residual family correlations, as has been used in a segregation model for binary traits, but without using the usual mixture distribution that is an essential part of segregation analysis. The proposed method also incorporates the marker principal components for controlling the effect of population stratification in family data, as proposed by Zhu et al. One advantage of the proposed method is the flexibility of incorporating different kinds of family correlation structure. Although SEGREG suggested the possibility of non-convergence for many of the markers in this study, when convergence was verified (i.e., there was no difficulty in estimating the variance-covariance matrix of the estimates), type I error was well controlled. It is well known that when the number of parameters to estimate is large, the likelihood function can be flat around the MLE, as we found in our analysis when we added age, sex, and BMI as covariates, and that there is an increased computational burden (a single maximization needed 6 minutes for the full model, but only 3 minutes when age, sex, and BMI were not incorporated, on an Intel Xeon 1.6 GHz cluster).
Although we described our methods using nuclear families only, the methods could be generalized in an obvious way to extended pedigrees. Indeed, the families in the GAW16 Framingham Heart Study 50 k data set Offspring Cohort (the second generation) were not nuclear families, but rather a type of extended pedigree. Hence, our results could be thought of as pertaining to extended pedigrees plus unrelated cases and controls.
Another program, UNPHASED, also analyzes similar types of data, though its focus is on haplotype data. UNPHASED does not handle population stratification directly, but it could do so in a similar manner because its models can incorporate covariates. For instance, one could incorporate the principal components in the model as "confounders" to adjust for population stratification.
We used the first ten principal components in our analysis. If a population is admixed from a relatively small number of ancestral populations, ten principal components might be excessive. On the other hand, if a population is admixed from a relatively large number of ancestral populations, ten principal components might not be enough. However, we can test whether a principal component is significant in the model and drop it if it is not.
We applied the proposed model only to the GAW16 Framingham Heart Study 50 k data set. Because the underlying disease model is unknown, we are unable to evaluate the power of the proposed method. Future studies will include a power comparison with the method proposed by Zhu et al., as well as type I error analysis under different admixed population samples using simulations.
We propose an association method that uses a segregation-analysis-based model to deal with family structure while controlling for population structure. By analyzing a real data set from the GAW16 Framingham Heart Study, we showed that the method performs well in the sense of controlling the type I error rate, whenever we can be sure that the maximization of the likelihood function is successful.
List of abbreviations used
BMI: Body mass index
GAW16: Genetic Analysis Workshop 16
MCMC: Markov-chain Monte Carlo
MLE: Maximum likelihood estimate
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences.
This work was supported by National Institutes of Health grants HL074166 and HL086718 from the National Heart, Lung, and Blood Institute, HG003054 from the National Human Genome Research Institute, RR03655 from the National Center for Research Resources, GM-28356 from the National Institute of General Medical Sciences, and P30CAD43703 from the National Cancer Institute.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
1. Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55: 997-1004. 10.1111/j.0006-341X.1999.00997.x.
2. Bacanu SA, Devlin B, Roeder K: The power of genomic control. Am J Hum Genet. 2000, 66: 1933-1944. 10.1086/302929.
3. Devlin B, Roeder K, Wasserman L: Genomic control, a new approach to genetic-based association studies. Theor Popul Biol. 2001, 60: 155-166. 10.1006/tpbi.2001.1542.
4. Reich DE, Goldstein DB: Detecting association in a case-control study while correcting for population stratification. Genet Epidemiol. 2001, 20: 4-14. 10.1002/1098-2272(200101)20:1<4::AID-GEPI2>3.0.CO;2-T.
5. Pritchard JK, Rosenberg NA: Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet. 1999, 65: 220-228. 10.1086/302449.
6. Zhang SL, Zhu X, Zhao HY: On a semi-parametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genet Epidemiol. 2003, 24: 44-56. 10.1002/gepi.10196.
7. Pritchard JK, Stephens M, Rosenberg NA, Donnelly P: Association mapping in structured populations. Am J Hum Genet. 2000, 67: 170-181. 10.1086/302959.
8. Satten GA, Flanders WD, Yang Q: Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am J Hum Genet. 2001, 68: 466-477. 10.1086/318195.
9. Zhu X, Zhang SL, Zhao HY, Cooper RS: Association mapping, using a mixture model for complex traits. Genet Epidemiol. 2002, 23: 181-196. 10.1002/gepi.210.
10. Chen HS, Zhu X, Zhao HY, Zhang SL: Qualitative semi-parametric test for genetic associations in case-control designs under structured populations. Ann Hum Genet. 2003, 67: 250-264. 10.1046/j.1469-1809.2003.00036.x.
11. Bauchet M, McEvoy B, Pearson LN, Quillen EE, Sarkisian T, Hovhannesyan K, Deka R, Bradley DG, Shriver MD: Measuring European population stratification with microarray genotype data. Am J Hum Genet. 2007, 80: 948-956. 10.1086/513477.
12. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38: 904-909. 10.1038/ng1847.
13. Zhu X, Li S, Cooper RS, Elston RC: A unified association analysis approach for family and unrelated samples correcting for stratifications. Am J Hum Genet. 2008, 82: 352-365. 10.1016/j.ajhg.2007.10.009.
14. Karunaratne PM, Elston RC: A multivariate logistic model (MLM) for analyzing binary family data. Am J Med Genet. 1998, 76: 428-437. 10.1002/(SICI)1096-8628(19980413)76:5<428::AID-AJMG12>3.0.CO;2-O.
15. S.A.G.E. Statistical Analysis for Genetic Epidemiology. [http://darwin.cwru.edu/sage/]
16. Dudbridge F: Pedigree disequilibrium tests for multilocus haplotypes. Genet Epidemiol. 2003, 25: 115-121. 10.1002/gepi.10252.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.