- Open Access
Identifying cryptic population structure in multigenerational pedigrees in a Mexican American sample
© Culverhouse et al.; licensee BioMed Central Ltd. 2014
- Published: 17 June 2014
Cryptic population structure can increase both type I and type II errors. This is particularly problematic in case-control association studies of unrelated individuals. Some researchers believe that these problems are obviated in families. We argue here that this may not be the case, especially if families are drawn from a known admixed population such as Mexican Americans. We use a principal component approach to evaluate and visualize the results of three different approaches to searching for cryptic structure in the 20 multigenerational families of the Genetic Analysis Workshop 18 (GAW18). Approach 1 uses all family members in the sample to identify what might be considered "outlier" kindreds. Because families are likely to differ in size (in the GAW18 families, there is about a 4-fold difference in the number of typed individuals), approach 2 uses a weighting system that equalizes pedigree size. Approach 3 concentrates on the founders and the "marry-ins" because, in principle, the entire pedigree can be reconstructed with knowledge of the sequence of these unrelated individuals and genome-wide association study (GWAS) data on everyone else (to identify the position of recombinations). We demonstrate that these three approaches can yield very different insights about cryptic structure in a sample of families.
- Unrelated Individual
- Scree Plot
- Mexican American Family
- Native American Ancestry
- Principal Component Approach
Statistical geneticists should make clear to their colleagues that many preliminary analyses need to be carried out before any formal analysis of the main hypotheses that motivated the study. Results of these preliminary analyses are crucial for deciding which phenotypic variables need to be conditioned on and which genotypes or individuals need to be dropped from the main analysis. These decisions need to be made before the formal analysis so that investigators are not influenced into making biased decisions that support a particular hypothesis.
We believe that family studies of genome-wide sequence data, as well as studies based on unrelated individuals, should routinely examine their data for genetic heterogeneity. An early genome-wide linkage scan for prostate cancer illustrates why this could be of concern: half of the LOD score for the top genome-wide signal (1.4 out of 2.75) was due to just 2 out of the 91 families in the study. Those 2 families were African American, unlike the other 89 families, which were European American or Swedish. This concern is heightened for analyses based on sequence data, where it is likely that causative variants may be found in a small subgroup or even in a single family. In this paper, we present 3 ways to make such an initial evaluation using principal components (PCs) derived from a genome-wide screen. We illustrate these methods using the GAW18 data.
Mexican Americans are descendants of multiple ancestral populations, principally Native Americans, Europeans (primarily from the Iberian Peninsula) and Africans brought to the Americas as part of the slave trade. We note that although this group is referred to as Latino or "Mexican"-Americans in the United States (because they historically have arrived in the US from Mexico), their Native American ancestry can be from Middle- or South-America as well as from the southern US and Mexico.
Two sets of monozygotic twins were identified by the data providers. We dropped one monozygotic twin, at random, from each pair. We received these data after a cleaning algorithm had been applied by the data providers but did not receive the original assessment of the quality of each call. We performed further cleaning to select the highest-quality markers for our principal component analysis (PCA). Complete details can be found in Hinrichs et al. Briefly, we identified markers with high call rates in both the GWAS data and sequencing data that were unambiguously mapped to the genome. We then pruned single-nucleotide polymorphisms (SNPs) to remove those in linkage disequilibrium (r² > 0.5), which resulted in approximately 100,000 SNPs. We evaluated the resulting set of genotypes for Hardy-Weinberg equilibrium (HWE). The Q-Q plot did not reveal any deviations from expectation under the null. The final number of SNPs used here is 92,344.
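The pruning step can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `ld_prune`, the greedy left-to-right order, the `window` of previously kept SNPs, and the use of the squared Pearson correlation of unphased 0/1/2 allele counts as r² are all assumptions made for illustration. Missing genotypes are not handled.

```python
import numpy as np

def ld_prune(genotypes, r2_threshold=0.5, window=50):
    """Greedy LD pruning on an (individuals x SNPs) matrix of 0/1/2 allele counts.

    A SNP is kept only if its squared correlation with each of the last
    `window` SNPs already kept is at or below `r2_threshold`.
    Returns the column indices of the retained SNPs.
    (Hypothetical helper; thresholds and window size are illustrative.)
    """
    genotypes = np.asarray(genotypes, dtype=float)
    kept = []
    for j in range(genotypes.shape[1]):
        col = genotypes[:, j]
        if col.std() == 0:
            # Monomorphic SNP: carries no information for PCA, so skip it.
            continue
        in_ld = False
        for k in kept[-window:]:
            r = np.corrcoef(col, genotypes[:, k])[0, 1]
            if r * r > r2_threshold:
                in_ld = True
                break
        if not in_ld:
            kept.append(j)
    return kept
```

Calling `ld_prune(G)` on a genotype matrix `G` returns the indices of the SNPs that survive; established tools implement more careful variants of the same idea.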
It has become common practice to analyze a GWAS sample of unrelated individuals for cryptic stratification, discarding the outliers. The definition of an outlier, however, is an unresolved issue in statistical analysis. Often, outliers are removed simply by visual inspection. Sometimes a more formal test is performed using, for instance, principles from numerical taxonomy. The question asked by this study is: Are all 20 pedigrees sufficiently homogeneous with regard to ancestry to be analyzed as a group with the same model parameters (e.g., gene frequencies)? Under approximate panmixia, we expect generational regression toward the group mean, especially in large pedigrees. Thus, in general, pedigrees offer more protection against outliers than a sample of unrelated individuals. It is well known, however, that immigrant groups are more likely to randomly mate within their own subgroups during the process of acculturation. Panmixia, with regard to the larger population, better describes the behavior of later generations. We used PCs to determine the extent of clustering and whether any families can be considered outliers.
Our goal was to evaluate structure within the sample of pedigrees rather than to estimate the ancestral contributions from Africans, Europeans, and Asians. To be sensitive to population substructures such as those known to exist in both European and Native American populations, we focused on unsupervised Eigenstrat analyses, including only the sample data.
Given this decision, there remain multiple reasonable ways to derive PCs for the data that address the correlation within the pedigrees. We examined 3 such approaches. First, we used all the data, ignoring pedigree membership. This represents the diversity of the data as a whole but may be distorted by differing pedigree sizes. In the GAW18 data, the smallest genotyped family contained 22 individuals, and the largest consisted of 86 genotyped individuals. The second approach also preserves allele frequencies within families but weights individuals proportionally to the inverse of the pedigree size so that the families contribute equally to the determination of PCs. The third approach concentrates on the set of maximally unrelated individuals. The motivation for this approach is that, in principle, the sequence of all family members can be reconstructed from the sequence of the founders and marry-ins. Dense (and relatively inexpensive) SNP data on the remaining unsequenced members (to allow accurate inference of the location of each meiotic recombination event) can then be used to reconstruct the genotypes of the entire kindred.
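One way to see how the three approaches relate is to treat each as a PCA with different per-individual weights: equal weights for everyone, weights inversely proportional to pedigree size, and weights of 1 for founders and marry-ins and 0 for everyone else. The sketch below is an illustration under stated assumptions, not the Eigenstrat procedure used in the study. The function names `weighted_pca` and `approach_weights` are hypothetical, SNPs are scaled by their weighted standard deviation rather than by the Eigenstrat allele-frequency normalization, and projecting zero-weight relatives onto the axes from the unrelated set is our choice for the sketch.

```python
import numpy as np

def weighted_pca(genotypes, weights, n_components=2):
    """PCA of an (individuals x SNPs) 0/1/2 matrix with per-individual weights.

    Weights set how much each person contributes to the SNP means and to the
    covariance; people with weight 0 are only projected onto the axes.
    Returns (scores for all individuals, share of weighted variance per PC).
    """
    X = np.asarray(genotypes, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mean = w @ X                       # weighted mean allele count per SNP
    Xc = X - mean
    sd = np.sqrt(w @ (Xc ** 2))        # weighted standard deviation per SNP
    keep = sd > 0                      # drop SNPs with no weighted variation
    Z = Xc[:, keep] / sd[keep]
    # The right singular vectors of the weight-scaled matrix are the
    # eigenvectors of the weighted SNP covariance matrix.
    _, s, vt = np.linalg.svd(np.sqrt(w)[:, None] * Z, full_matrices=False)
    scores = Z @ vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

def approach_weights(family_ids, is_unrelated):
    """Build one weight vector per approach (labels are illustrative)."""
    family_ids = np.asarray(family_ids)
    is_unrelated = np.asarray(is_unrelated, dtype=bool)
    _, inverse, counts = np.unique(
        family_ids, return_inverse=True, return_counts=True
    )
    return {
        "all_members": np.ones(len(family_ids)),       # approach 1
        "equal_families": 1.0 / counts[inverse],       # approach 2
        "unrelated_only": is_unrelated.astype(float),  # approach 3
    }
```

With a genotype matrix `G`, a vector of pedigree labels, and a founder/marry-in indicator, `weighted_pca(G, approach_weights(fam, unrel)["equal_families"])` gives scores in which a family of 86 and a family of 22 each carry the same total weight.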
Each of these approaches can give insight into the ancestral structure of pedigrees in a family-based study. We examined the resulting PCs from each of these approaches for the GAW18 families. Because the GAW18 data were not simulated with population substructure in mind, we did not attempt to correlate the differences we found to differences in phenotypes.
Approach 1: Principal components based on the original sample
[Table: Distance from family centroids to centroid of remaining data, in # of SD to center of the rest of the data; only the range 0.1 to 0.8 is recoverable.]
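The distance of a family centroid from the centroid of the remaining data can be computed in several ways, and the exact definition behind the reported values is not recoverable from the text here. The sketch below shows one plausible version, assuming the difference between centroids is scaled on each PC by the standard deviation of the remaining individuals and then combined as a Euclidean norm. The function name `family_centroid_distances` is hypothetical.

```python
import numpy as np

def family_centroid_distances(scores, family_ids):
    """Distance of each family's centroid from the centroid of everyone else.

    `scores` is an (individuals x PCs) array. For each family, the centroid
    difference is divided, PC by PC, by the standard deviation of the
    remaining individuals, and the scaled differences are combined as a
    Euclidean norm. This scaling is an assumption made for illustration.
    """
    scores = np.asarray(scores, dtype=float)
    family_ids = np.asarray(family_ids)
    distances = {}
    for fam in np.unique(family_ids):
        inside = family_ids == fam
        rest = scores[~inside]
        diff = scores[inside].mean(axis=0) - rest.mean(axis=0)
        sd = rest.std(axis=0, ddof=1)  # spread of the remaining data per PC
        distances[fam.item()] = float(np.sqrt(np.sum((diff / sd) ** 2)))
    return distances
```

Families whose value is much larger than the others are candidates for closer inspection; any cut-off for calling a family an outlier remains a judgment call.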
Approach 2: Principal components based on the proportionally weighted families
Approach 3: Principal components based on a maximal set of unrelated individuals
It is well known that the presence of unrecognized stratification can lead to an increase in type I or type II errors in linkage or association analyses when model parameters are misspecified. When confronted with heterogeneity, an investigator interested in performing a linkage analysis has at least two choices. First, homogeneous subsets of the data can be analyzed separately and the resulting statistics combined. A second option is available with most linkage programs. This option requires the recoding of alleles in one subgroup (with frequency estimates appropriate to that group). This tedious procedure allows the entire sample of families to be analyzed together.
When comparing the results from our approaches to a supervised principal component derivation using the YRI, CEU, and CHB+JPT population samples from HapMap, we notice that the oldest member of pedigree 3 lies in the CEU cluster, unlike members of the other families. Because this individual had many descendants, "more European" may explain why pedigree 3 is identified as an outlier by all 3 approaches. It is less clear what history distinguishes families 2 and 5 from the rest. It is possible, although we do not have data to be certain, that their differences relate to substructure within their Native American ancestry (e.g., Zapotec vs. Tlaxcalan).
Family-based methods generally are not immune to difficulties related to cryptic population structure (although some methods, such as the TDT, are). We believe it is important to include an investigation of the potential differences among families at the beginning of analyses, similar to the methods used to identify outlier individuals. Possible responses to the detection of substructure range from removing a family from the analysis to using PCs as adjustment covariates in the analysis or simply using this information when interpreting results from an association test. If an association between a phenotype and a variant is primarily due to a single pedigree (as was found in the GAW17 data), understanding the cryptic structure of the data under one or more of these metrics may prove useful for interpreting the results.
Acknowledgements and declarations
The Genetic Analysis Workshop is supported by National Institutes of Health (NIH) grant R01 GM031575. This work was also supported by NIH grant R21 DA033827. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
- Smith JR, Freije D, Carpten JD, Gronberg H, Xu J, Isaacs SD, Brownstein MJ, Bova GS, Guo H, Bujnovszky P, et al: Major susceptibility locus for prostate cancer on chromosome 1 suggested by a genome-wide search. Science. 1996, 274: 1371-1374. 10.1126/science.274.5291.1371.
- Hinrichs AL, Culverhouse RC, Suarez BK: Linkage analysis merging replicate phenotypes: an application to three quantitative phenotypes in two African samples. BMC Proc. 2011, 5 (suppl 9): S81. 10.1186/1753-6561-5-S9-S81.
- Galanter JM, Fernandez-Lopez JC, Gignoux CR, Barnholtz-Sloan J, Fernandez-Rozadilla C, Via M, Hidalgo-Miranda A, Contreras AV, Figueroa LU, Raska P, et al: Development of a panel of genome-wide ancestry informative markers to study admixture throughout the Americas. PLoS Genet. 2012, 8: e1002554. 10.1371/journal.pgen.1002554.
- Almasy L, Dyer T, Peralta J, Jun G, Fuchsberger C, Almeida M, Kent JW, Fowler S, Duggirala R, Blangero J: Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proc. 2014, 8 (suppl 2): S2.
- Hinrichs AL, Culverhouse RC, Suarez BK: Genotypic discrepancies arising from imputation. BMC Proc. 2014, 8 (suppl 1): S17.
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38: 904-909. 10.1038/ng1847.
- Menozzi P, Piazza A, Cavalli-Sforza L: Synthetic maps of human gene frequencies in Europeans. Science. 1978, 201: 786-792. 10.1126/science.356262.
- Suarez BK, Crouse JD, O'Rourke DH: Genetic variation in North Amerindian populations: the geography of gene frequencies. Am J Phys Anthropol. 1985, 67: 217-232. 10.1002/ajpa.1330670307.
- Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al: Genes mirror geography within Europe. Nature. 2008, 456: 98-101. 10.1038/nature07331.
- Suarez BK, Duan J, Sanders AR, Hinrichs AL, Jin CH, Hou C, Buccola NG, Hale N, Weilbaecher AN, Nertney DA, et al: Genome-wide linkage scan of 409 European-ancestry and African American families with schizophrenia: suggestive evidence of linkage at 8p23.3-p21.2 and 11p13.1-q14.1 in the combined sample. Am J Hum Genet. 2006, 78: 315-333. 10.1086/500272.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.