Volume 3 Supplement 7
Detecting population stratification using related individuals
© Hinrichs et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Although identification of cryptic population stratification is necessary for case/control association analyses, it is also vital for linkage analyses and family-based association tests when founder genotypes are missing. However, including related individuals in an analysis such as EIGENSTRAT can result in bias; using only founders or one individual per pedigree results in loss of data and inaccurate estimates of stratification. We examine a generalization of principal-component analyses to allow for the inclusion of related individuals by down-weighting the significance of individual comparisons.
At the heart of all genetic case/control association analyses lies estimation of allele frequencies. For linkage analyses and pedigree-based association analyses, allele frequency estimates are used when a parental genotype is missing. Because cryptic population stratification results in misestimates of allele frequency, this can lead to false positives for any type of analysis with missing founder genotypes [1, 2]. Current methods for identifying and controlling population stratification rely on unrelated individuals. When they are applied to pedigree data, only the founders are analyzed. This suggests that the situation in which detection of population stratification is most needed is the least tractable with current methods.
Several methods for detecting population stratification exist. Two of the most common methods are implemented in STRUCTURE and EIGENSTRAT [3, 4]. The program STRUCTURE uses a Markov-chain Monte Carlo (MCMC) method to identify natural population clusters based on multilocus genotypes. It provides a probability of cluster membership for each sample, which has a very natural interpretation. However, it is too computationally intensive to be used on genome-wide association study (GWAS) data involving hundreds of thousands or millions of markers. To handle this volume of data, a computationally simpler method is required. The program EIGENSTRAT uses a very fast linear algorithm to identify population structure. In particular, it performs a principal-component analysis (PCA) on an M × N matrix X (where M is the number of markers and N is the number of individuals). An eigenvalue decomposition is then performed on the N × N correlation matrix and population membership is inferred from the eigenvectors. One determines how many natural ethnicities are present by examining the sizes of the eigenvalues using a graphical scree analysis or a numeric approach (such as the recently developed "acceleration factor" [5]).
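The PCA step described above can be sketched as follows. This is a minimal illustration, not the EIGENSTRAT implementation: the function names, the simulated data, and the per-marker scaling by sqrt(p(1 - p)) are assumptions made for the example, and missing genotypes are not handled.

```python
import numpy as np

def genotype_pca(X, n_components=2):
    """PCA on an M x N genotype matrix (markers x individuals), coded 0/1/2.

    Assumption: each marker is centred on its mean and scaled by
    sqrt(p * (1 - p)), where p is the sample allele frequency.
    """
    X = np.asarray(X, dtype=float)
    p = X.mean(axis=1, keepdims=True) / 2.0          # allele frequency per marker
    poly = ((p > 0) & (p < 1)).ravel()               # drop monomorphic markers
    Z = (X[poly] - 2.0 * p[poly]) / np.sqrt(p[poly] * (1.0 - p[poly]))
    C = Z.T @ Z / Z.shape[0]                         # N x N matrix between individuals
    evals, evecs = np.linalg.eigh(C)                 # eigh returns ascending order
    order = np.argsort(evals)[::-1][:n_components]   # keep the largest eigenvalues
    return evals[order], evecs[:, order]

def simulate_two_populations(n_markers=500, n_per_pop=20, seed=0):
    """Hypothetical test data: two populations with shifted allele frequencies."""
    rng = np.random.default_rng(seed)
    p1 = rng.uniform(0.1, 0.9, n_markers)
    p2 = np.clip(p1 + rng.normal(0.0, 0.2, n_markers), 0.05, 0.95)
    g1 = rng.binomial(2, p1[:, None], size=(n_markers, n_per_pop))
    g2 = rng.binomial(2, p2[:, None], size=(n_markers, n_per_pop))
    return np.hstack([g1, g2])                       # M x N, population 1 first

X = simulate_two_populations()
evals, evecs = genotype_pca(X)
# The sign of the first eigenvector is expected to differ between the two groups.
print(np.sign(evecs[:, 0]))
```

In this simulation the individuals are unrelated, so the leading eigenvector reflects population membership alone; the concern raised in the text is that relatives add a second source of correlation to the same matrix.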
However, the nature of the eigenvalue decomposition introduces problems when individuals are related. Because biologically related individuals are already genetically correlated, this can bias the decomposition, especially in the presence of a large number of related individuals (such as in large pedigrees). Using only unrelated individuals limits the analysis to either the founders or a sampling of unrelated individuals. Although the founders provide all of the genetic variation present in the subsequent generations and therefore represent all available information, using randomly sampled unrelated individuals results in a loss of information.
This is much more manageable. Further, the eigenvalues for these two are the same and the eigenvectors of the original formulation are simply the product of Y^T and the second set of eigenvectors (followed by normalization).
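The relationship between the two eigenproblems can be checked numerically. The sketch below assumes that the original formulation is the eigenproblem for Y^T Y and the second, smaller one is for Y Y^T, which is what the stated mapping through Y^T implies; the matrix sizes and the function name are hypothetical.

```python
import numpy as np

def eigvecs_via_small_problem(Y):
    """Solve the small problem (Y Y^T) and map its eigenvectors to those of Y^T Y."""
    lam, U = np.linalg.eigh(Y @ Y.T)            # small k x k problem
    V = Y.T @ U                                 # product of Y^T and the eigenvectors
    V /= np.linalg.norm(V, axis=0)              # normalise each column to unit length
    return lam, V

rng = np.random.default_rng(1)
Y = rng.normal(size=(5, 200))                   # hypothetical 5 x 200 matrix

lam_small, V_mapped = eigvecs_via_small_problem(Y)
lam_large, V_direct = np.linalg.eigh(Y.T @ Y)   # large 200 x 200 problem, for comparison

# The 5 largest eigenvalues of the large problem equal those of the small one;
# the remaining 195 are zero up to rounding error.
print(np.allclose(lam_small, lam_large[-5:]))

# Mapped and directly computed eigenvectors agree up to sign.
agreement = np.abs(np.sum(V_mapped * V_direct[:, -5:], axis=0))
print(np.allclose(agreement, 1.0))
```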
For our analyses, we use a weight based on work by McPeek and colleagues [8]. In particular, they demonstrate the use of the kinship matrix to derive the best linear unbiased estimate (BLUE) of allele frequencies in samples of related individuals.
provides the best linear weights to compute allele frequencies for related individuals. In a fully typed pedigree, each founder is given a weight of 1 and all other individuals are given a weight of 0. In any pedigree with a single typed individual, that individual is given a weight of 1. In the simple case of a nuclear pedigree with S children and no genotyped parents, each child is given a weight of 2/(S+1). Note that as the number of typed children increases, the sum of the weights, 2S/(S+1), tends toward 2, which is precisely the number of founders of the pedigree. This generalizes to pedigrees of any size: the total of the weights cannot be larger than the number of founders, because the founders are the only source of genetic material in the pedigree.
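The three cases above can be reproduced numerically. The weight formula itself did not survive in this text, so the sketch below assumes unnormalized weights w that solve (2Φ)w = 1, where Φ is the kinship matrix of the typed individuals and 1 is a vector of ones; this form is chosen only because it gives the weights stated above. The function names are hypothetical.

```python
import numpy as np

def kinship_weights(kinship):
    """Unnormalised weights w solving (2 * Phi) w = 1 (assumed form, see lead-in)."""
    R = 2.0 * np.asarray(kinship, dtype=float)
    return np.linalg.solve(R, np.ones(R.shape[0]))

def sibship_kinship(S):
    """Kinship matrix for S non-inbred full sibs: 1/2 on the diagonal, 1/4 elsewhere."""
    return np.full((S, S), 0.25) + 0.25 * np.eye(S)

# Fully typed trio (father, mother, child): founders get 1, the child gets 0.
trio = np.array([[0.50, 0.00, 0.25],
                 [0.00, 0.50, 0.25],
                 [0.25, 0.25, 0.50]])
print(kinship_weights(trio))

# S sibs with untyped parents: each weight is 2/(S+1), and the total approaches 2.
for S in (1, 3, 10, 100):
    w = kinship_weights(sibship_kinship(S))
    print(S, w[0], w.sum())
```

Under this assumption the sum of the weights in a sample plays the role of the total weight of the genotypes used as a measure of information.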
We compare this method with the standard EIGENSTRAT method using the Framingham Heart Study data. After cleaning, the Framingham Heart Study data consist of 1180 pedigrees, including 418 singletons. The remaining 762 pedigrees have an average of 8.3 genotyped individuals, including 9 pedigrees with more than 50 genotyped individuals. The best standard of comparison would be an analysis using all founder genotypes, but because not all founder genotypes are available, we apply an algorithm to identify the maximal set of unrelated individuals. We consider the resulting population membership as the "gold standard." We also consider the set of singletons and one individual chosen at random from each pedigree. Finally, we consider the full sample with all related individuals using the standard EIGENSTRAT method and our novel method. We then assess to what degree including related individuals influences the standard method and how well the novel method reproduces the "gold standard." We also examine the total weight of all the genotypes as a measure of how much information is used.
We used the full 50k marker set but kept only autosomal SNPs with a minor allele frequency greater than 0.05 and a genotyping rate greater than 99%, leaving 31,068 SNPs. We dropped individuals with more than 5% missing genotypes, leaving 6757 individuals.
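The filtering thresholds above can be expressed as a short sketch. It is an illustration only: the function name, the NaN coding of missing genotypes, and the order of the filters (SNPs first, then individuals on the retained SNPs) are assumptions, and the restriction to autosomal SNPs is omitted because it needs chromosome annotations.

```python
import numpy as np

def qc_filter(G, maf_min=0.05, snp_call_min=0.99, ind_missing_max=0.05):
    """Return boolean keep-masks for individuals and SNPs.

    G: individuals x SNPs array of genotypes coded 0/1/2, with np.nan for missing.
    Assumption: SNPs are filtered first, then individuals on the retained SNPs.
    """
    G = np.asarray(G, dtype=float)
    called = ~np.isnan(G)

    n_called = called.sum(axis=0)
    snp_call_rate = called.mean(axis=0)
    # Allele frequency among called genotypes; guard against fully missing SNPs.
    freq = np.nansum(G, axis=0) / np.maximum(2 * n_called, 1)
    maf = np.minimum(freq, 1.0 - freq)
    snp_keep = (snp_call_rate > snp_call_min) & (maf > maf_min)

    # Fraction of missing genotypes per individual over the retained SNPs.
    ind_missing = 1.0 - called[:, snp_keep].mean(axis=1)
    ind_keep = ind_missing <= ind_missing_max
    return ind_keep, snp_keep

# Hypothetical toy data: 200 individuals, 3 SNPs.
G = np.tile(np.array([0.0, 1.0, 2.0, 1.0])[:, None], (50, 3))
G[:, 1] = 0.0          # monomorphic SNP, fails the minor allele frequency filter
G[7, 2] = np.nan       # one missing genotype, SNP call rate stays above 99%
ind_keep, snp_keep = qc_filter(G)
print(snp_keep, ind_keep.sum())
```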
Number of effective individuals for five samples and scaled principal components
We propose the use of weighted PCA, implemented through a Laplacian matrix, to allow detection of stratification in related individuals. Our results indicate that the methodology developed by McPeek and colleagues to compute allele frequencies in related individuals can be extended to the detection of ethnic stratification. This method uses all available genotypic data, with an effective sample size that approaches the number of founders in the pedigrees, which exceeds that of other methods based on selecting unrelated individuals. Furthermore, we see evidence of bias and outliers when using small subsets of individuals. Using too few individuals may also artificially inflate evidence of stratification. The presence of related individuals in a very large sample appears to have little effect on the stratification analysis, but this might not hold in other circumstances. Moreover, this method has only been tested on a European American sample with a single principal component (probably identifying a continuous population spread such as northern to southern European). Because the Framingham data do not have any obvious discrete clusters, this method must still be tested in a more diverse population.
List of abbreviations used
GWAS: Genome-wide association study
MCMC: Markov-chain Monte Carlo
PCA: Principal component analysis
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. Additional support was obtained from the Urological Research Foundation, from NIH grants K01 AA015572, K25 GM069590, and R03 DA023166, and from grant IRG-58-010-50 from the American Cancer Society.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
- Curtis D, Sham PC: Population stratification can cause false positive linkage results if founders are untyped. Ann Hum Genet. 1996, 60: 261-263. 10.1111/j.1469-1809.1996.tb00430.x.
- Hinds DA, Stokowski RP, Patil N, Konvicka K, Kershenobich D, Cox DR, Ballinger DG: Matching strategies for genetic association studies in structured populations. Am J Hum Genet. 2004, 74: 317-325. 10.1086/381716.
- Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38: 904-909. 10.1038/ng1847.
- Raiche G, Riopel M, Blais JG: Non graphical solutions for the Cattell's scree test. International Annual Meeting of the Psychometric Society, Montreal. 2006, [http://www.er.uqam.ca/nobel/r17165/RECHERCHE/COMMUNICATIONS/]
- Koren Y, Carmel L: Robust linear dimensionality reduction. IEEE Trans Vis Comput Graph. 2004, 10: 459-470. 10.1109/TVCG.2004.17.
- Chatfield C, Collins AJ: Introduction to Multivariate Analysis. 1980, London, Chapman & Hall
- McPeek MS, Wu X, Ober C: Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics. 2004, 60: 359-367. 10.1111/j.0006-341X.2004.00180.x.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.