Volume 3 Supplement 7
Genetic Analysis Workshop 16
Detecting population stratification using related individuals
 Anthony L Hinrichs^{1},
 Robert Culverhouse^{2},
 Carol H Jin^{1} and
 Brian K Suarez^{1, 3}
DOI: 10.1186/1753-6561-3-S7-S106
© Hinrichs et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Abstract
Although identification of cryptic population stratification is necessary for case/control association analyses, it is also vital for linkage analyses and family-based association tests when founder genotypes are missing. However, including related individuals in an analysis such as EIGENSTRAT can result in bias; using only founders or one individual per pedigree results in loss of data and inaccurate estimates of stratification. We examine a generalization of principal-component analyses to allow for the inclusion of related individuals by down-weighting the significance of individual comparisons.
Background
At the heart of all genetic case/control association analyses lies estimation of allele frequencies. For linkage analyses and pedigree-based association analyses, allele frequency estimates are used when a parental genotype is missing. Because cryptic population stratification results in misestimates of allele frequency, this can lead to false positives for any type of analysis with missing founder genotypes [1, 2]. Current methods for identifying and controlling population stratification rely on unrelated individuals. When they are applied to pedigree data, only the founders are analyzed. This suggests that the situation in which detection of population stratification is most needed is the least tractable with current methods.
Several methods for detecting population stratification exist. Two of the most common methods are implemented in STRUCTURE and EIGENSTRAT [3, 4]. The program STRUCTURE uses a Markov-chain Monte Carlo (MCMC) method to identify natural population clusters based on multilocus genotypes. It provides a probability of membership for each sample, which has a very natural interpretation. However, it is too computationally intensive to be used on genome-wide association study (GWAS) data involving hundreds of thousands or millions of markers [4]. To handle this volume of data, a computationally simpler method is required. The program EIGENSTRAT uses a very fast linear algorithm to identify population structure. In particular, it performs a principal-component analysis (PCA) on a matrix X, an M × N matrix (where M is the number of markers and N is the number of individuals). An eigenvalue decomposition is then performed on the N × N correlation matrix and population membership is inferred from the eigenvectors. One determines how many natural ethnicities are present by examining the sizes of the eigenvalues using a graphical scree analysis or a numeric approach (such as the recently developed "acceleration factor" [5]).
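The PCA step described above can be illustrated with a minimal sketch. It is not the EIGENSTRAT implementation: the function name `pca_individuals` and the simple per-marker standardization are assumptions made for illustration, and missing genotypes are assumed to have been handled beforehand.

```python
import numpy as np

def pca_individuals(genotypes, n_components=3):
    """Sketch of PCA on an M x N genotype matrix (markers x individuals).

    genotypes: array of allele counts (0, 1, 2) with no missing values.
    Returns the top eigenvalues and the matching eigenvectors (one row
    per individual) of the N x N matrix of similarities between individuals.
    """
    X = np.asarray(genotypes, dtype=float)
    # Center each marker (row); the scaling below is an illustrative choice.
    X = X - X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    keep = sd[:, 0] > 0            # drop markers with no variation
    X = X[keep] / sd[keep]
    # N x N matrix built from the standardized markers.
    C = X.T @ X / X.shape[0]
    evals, evecs = np.linalg.eigh(C)   # eigh returns ascending order
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], evecs[:, order]
```

Plotting the returned eigenvalues in decreasing order gives the scree plot mentioned above; the columns of the second return value are the per-individual coordinates from which population membership is inferred.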
However, the nature of the eigenvalue decomposition introduces problems when individuals are related. Because biologically related individuals are already genetically correlated, their inclusion can bias the decomposition, especially when a large number of related individuals are present (such as in large pedigrees). Using only unrelated individuals limits the analysis to either the founders or a sampling of unrelated individuals. Although the founders provide all of the genetic variation present in the subsequent generations and therefore represent all available information, using randomly sampled unrelated individuals results in a loss of information.
Methods
This is much more manageable. Further, the eigenvalues for these two are the same and the eigenvectors of the original formulation are simply the product of Y^{T} and the second set of eigenvectors (followed by normalization) [7].
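The relationship between the two eigenproblems can be checked numerically. The sketch below assumes Y has few rows and many columns, so that Y Y^{T} is the small matrix and Y^{T} Y the large one; the function name `eig_via_small_gram` is hypothetical, and the specific Y used in the analysis is not reproduced here.

```python
import numpy as np

def eig_via_small_gram(Y, k):
    """Top-k eigenpairs of the large matrix Y.T @ Y, obtained from the
    small matrix Y @ Y.T (assumes Y has far fewer rows than columns)."""
    Y = np.asarray(Y, dtype=float)
    small = Y @ Y.T                       # rows x rows, cheap to decompose
    evals, U = np.linalg.eigh(small)      # ascending eigenvalues
    order = np.argsort(evals)[::-1][:k]
    evals, U = evals[order], U[:, order]
    # Map each eigenvector of the small problem back through Y.T,
    # then normalize the columns to unit length.
    V = Y.T @ U
    V /= np.linalg.norm(V, axis=0)
    return evals, V
```

This only holds for eigenvectors with nonzero eigenvalues, so k should not exceed the rank of Y.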
For our analyses, we use a weight based on work by McPeek and colleagues [8]. In particular, they demonstrate the use of the kinship matrix to derive the best linear unbiased estimate (BLUE) of allele frequencies in samples of related individuals.
provides the best linear weights to compute allele frequencies for related individuals. In a fully typed pedigree, each founder is given a weight of 1 and all other individuals are given a weight of 0. In any pedigree with a single typed individual, that individual is given a weight of 1. In the simple case of a nuclear pedigree with S children without genotyped parents, each child is given a weight of 2/(S+1). Note that as the number of typed children increases, the sum of the weights, 2S/(S+1), tends toward 2, precisely the number of founders of the pedigree. This generalizes to pedigrees of any size: the total of the weights cannot be larger than the number of founders, since the founders were the only source of genetic material in the pedigree.
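The three cases above can be reproduced with a short numerical sketch. It assumes that the weights are the row sums of the inverse of the relationship matrix (twice the kinship matrix, with no inbreeding); this assumption was chosen because it reproduces the stated weights and is not quoted from the source. The function names are hypothetical.

```python
import numpy as np

def blue_weights(relationship):
    """Weights w solving R w = 1, where R is the relationship matrix
    (assumed here to be twice the kinship matrix) of the typed individuals."""
    R = np.asarray(relationship, dtype=float)
    return np.linalg.solve(R, np.ones(R.shape[0]))

def sibship_relationship(S):
    """Relationship matrix for S full siblings with untyped parents:
    1 on the diagonal, 1/2 between every pair of siblings."""
    return 0.5 * np.eye(S) + 0.5 * np.ones((S, S))

# Fully typed trio: father, mother, child.
trio = [[1.0, 0.0, 0.5],
        [0.0, 1.0, 0.5],
        [0.5, 0.5, 1.0]]

print(blue_weights(trio))                     # founders 1, child 0
print(blue_weights(sibship_relationship(3)))  # each child 2/(3+1)
```

Summing the weights over all typed individuals gives a measure of the effective number of individuals contributed by a pedigree.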
We test this method compared with the standard EIGENSTRAT method using the Framingham Heart Study data. After cleaning, the Framingham Heart Study data consists of 1180 pedigrees, including 418 singletons. The remaining 762 pedigrees have an average of 8.3 genotyped individuals, including 9 pedigrees with more than 50 genotyped individuals. The best standard of comparison would be an analysis using all founder genotypes, but because not all founder genotypes are available, we apply an algorithm to identify the maximal set of unrelated individuals. We consider the resulting population membership as the "gold standard." We also consider the set of singletons and one individual chosen at random from each pedigree. Finally, we consider the full sample with all related individuals using the standard EIGENSTRAT method and our novel method. We then assess to what degree including related individuals influences the standard method and how well the novel method reproduces the "gold standard." We also examine the total weight of all the genotypes as a measure of how much information is used.
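The algorithm used to identify the set of unrelated individuals is not specified here. As one possible illustration only, a greedy heuristic can build a maximal set (one to which no further individual can be added without introducing a related pair); the function name `greedy_unrelated` and the pair-list input are assumptions.

```python
def greedy_unrelated(individuals, related_pairs):
    """Greedy sketch: repeatedly keep the individual with the fewest
    remaining relatives, then discard that individual's relatives.

    individuals: iterable of hashable, sortable IDs.
    related_pairs: iterable of (id_a, id_b) pairs that are related.
    Returns a list of mutually unrelated IDs (maximal, not guaranteed
    to be the largest possible set).
    """
    neighbors = {i: set() for i in individuals}
    for a, b in related_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    remaining = set(neighbors)
    chosen = []
    while remaining:
        # Ties are broken by sorted ID so the result is deterministic.
        pick = min(sorted(remaining),
                   key=lambda i: len(neighbors[i] & remaining))
        chosen.append(pick)
        remaining -= neighbors[pick] | {pick}
    return chosen
```

For a typed trio, this keeps both parents and drops the child, which matches the intuition that founders carry the independent information.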
Results
We used the full 50 k marker set but kept only autosomal SNPs with a minor allele frequency greater than 0.05 and a genotyping rate greater than 99%, for a total of 31,068 SNPs. We dropped individuals with more than 5% missing genotypes, for a total of 6757 individuals.
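The thresholds above can be expressed as a simple filter. This is a sketch under stated assumptions: genotypes are coded as allele counts with NaN for missing values, SNPs are filtered before individuals (the order is not stated in the text), the restriction to autosomal SNPs is assumed to have been applied already, and the function name `qc_filter` is hypothetical.

```python
import numpy as np

def qc_filter(G, maf_min=0.05, call_rate_min=0.99, max_ind_missing=0.05):
    """Filter an M x N genotype matrix (markers x individuals).

    G: float array of allele counts (0, 1, 2) with np.nan for missing.
    Returns the filtered matrix plus boolean masks for kept SNPs and
    kept individuals.
    """
    G = np.asarray(G, dtype=float)
    # Per-SNP genotyping rate and minor allele frequency.
    call_rate = 1.0 - np.isnan(G).mean(axis=1)
    freq = np.nanmean(G, axis=1) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    keep_snp = (maf > maf_min) & (call_rate > call_rate_min)
    G = G[keep_snp]
    # Per-individual missingness over the retained SNPs.
    ind_missing = np.isnan(G).mean(axis=0)
    keep_ind = ind_missing <= max_ind_missing
    return G[:, keep_ind], keep_snp, keep_ind
```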
Number of effective individuals for five samples and scaled principal components
Data set^{a}   Individuals   PC1          PC2      PC3
MaxUnrel       2014          1.0000^{b}   0.5113   0.405
Singletons     418           1.0000       0.7045   0.4808
OnePer         1180          1.0000       0.7074   0.4475
Full           6757          1.0000       0.4002   0.3717
Weighted       2898.7        1.0000       0.4147   0.3652
Discussion
We propose weighted PCA, implemented through a Laplacian matrix, to allow detection of stratification in samples that include related individuals. Our results indicate that the methodology developed by McPeek and colleagues to compute allele frequencies in related individuals can be extended to the detection of ethnic stratification. This method uses all available genotypic data, with an effective sample size that approaches the number of founders in the pedigrees and exceeds that of methods that select unrelated individuals. Furthermore, we see evidence of bias and outliers when using small subsets of individuals. Using too few individuals may also artificially inflate evidence of stratification. The presence of related individuals in a very large sample appears to have little effect on the stratification analysis, but this might not hold in other circumstances. Moreover, this method has been tested only on a European American sample with a single principal component (probably identifying a continuous population spread such as northern to southern European). Because the Framingham data do not have any obvious discrete clusters, the method still must be tested in a more diverse population.
List of abbreviations used
 GWAS: Genome-wide association study
 MCMC: Markov-chain Monte Carlo
 PCA: Principal-component analysis
Declarations
Acknowledgements
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. Additional support was obtained from the Urological Research Foundation and from NIH grants K01 AA015572, K25 GM069590, RO3 DA023166 and IRG5801050 from the American Cancer Society.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/3?issue=S7.
Authors’ Affiliations
References
 1. Curtis D, Sham PC: Population stratification can cause false positive linkage results if founders are untyped. Ann Hum Genet. 1996, 60: 261-263. 10.1111/j.1469-1809.1996.tb00430.x.
 2. Hinds DA, Stokowski RP, Patil N, Konvicka K, Kershenobich D, Cox DR, Ballinger DG: Matching strategies for genetic association studies in structured populations. Am J Hum Genet. 2004, 74: 317-325. 10.1086/381716.
 3. Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.
 4. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38: 904-909. 10.1038/ng1847.
 5. Raiche G, Riopel M, Blais JG: Non graphical solutions for the Cattell's scree test. International Annual Meeting of the Psychometric Society, Montreal. 2006, [http://www.er.uqam.ca/nobel/r17165/RECHERCHE/COMMUNICATIONS/]
 6. Koren Y, Carmel L: Robust linear dimensionality reduction. IEEE Trans Vis Comput Graph. 2004, 10: 459-470. 10.1109/TVCG.2004.17.
 7. Chatfield C, Collins AJ: Introduction to Multivariate Analysis. 1980, London, Chapman & Hall
 8. McPeek MS, Wu X, Ober C: Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics. 2004, 60: 359-367. 10.1111/j.0006-341X.2004.00180.x.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.