PREST-plus identifies pedigree errors and cryptic relatedness in the GAW18 sample using genome-wide SNP data
- Lei Sun^{1, 2} and
- Apostolos Dimitromanolakis^{3}Email author
https://doi.org/10.1186/1753-6561-8-S1-S23
© Sun and Dimitromanolakis; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
Abstract
Pedigree errors and cryptic relatedness often appear in families or population samples collected for genetic studies. If not identified, these issues can lead to either increased false negatives or false positives in both linkage and association analyses. To identify pedigree errors and cryptic relatedness among individuals from the 20 San Antonio Family Studies (SAFS) families and cryptic relatedness among the 157 putatively unrelated individuals, we apply PREST-plus to the genome-wide single-nucleotide polymorphism (SNP) data and analyze estimated identity-by-descent (IBD) distributions for all pairs of genotyped individuals. Based on the given pedigrees alone, PREST-plus identifies the following putative pairs: 1091 full-sib, 162 half-sib, 360 grandparent-grandchild, 2269 avuncular, 2717 first cousin, 402 half-avuncular, 559 half-first cousin, 2 half-sib+first cousin, 957 parent-offspring and 440,546 unrelated. Using the genotype data, PREST-plus detects 7 mis-specified relative pairs, with their IBD estimates clearly deviating from the null expectations, and it identifies 4 cryptic related pairs involving 7 individuals from 6 families.
Background
Mis-specified pedigree relationship and cryptic relatedness often occur in family and population data. The potential causes of such errors are numerous, including undocumented nonpaternity, nonmaternity, adoption, mating between relatives, sample duplication or swap. It is well known that such errors, if undetected, can affect the accuracy or power of both linkage and association studies, as well as have adverse effects on other aspects of the analyses such as population stratification [1–6].
Genome-wide marker data can provide accurate information on the genetic relatedness among individuals. For linkage scans, a number of statistical methods have been proposed and implemented, including RELCHECK [1], RELATIVE [7], PEDCHECK [8], SIBERROR [9], PREST [10, 11], GRR [12], and ECLIPSE [13], among others. More recently, PLINK [14] and PREST-plus [5, 6, 15] have been developed for analysing the high-throughput SNP data collected from genome-wide association studies (GWAS) or next-generation sequencing (NGS) experiments.
There are 3 main differences between PREST-plus and PLINK. First, PREST-plus uses the maximum likelihood-based IBD estimation method, which has more statistical power, whereas PLINK relies on the method-of-moments approach, which is computationally more efficient. Second, PREST-plus identifies both pedigree errors and cryptic relatedness in linkage or association studies with family data, population sample, or both, whereas PLINK is primarily suitable for detecting cryptic relatedness in GWAS with population sample. Third, PREST-plus provides a formal hypothesis testing framework, which can be useful when a potential error has been being identified, whereas PLINK provides point IBD estimation only [5, 6]. The Genetic Analysis Workshop 18 (GAW18) data, similar to many emerging large-scale GWAS and NGS studies, include multigeneration large pedigrees. The genetic relationships between individuals in such data are not limited to simple types such as parent-offspring or siblings. Consequently, we focus here on the methodology implemented in PREST-plus [5, 6, 15] and discuss PLINK [14] when relevant.
Methods
Relationship estimation and testing
where ${G}_{m}$ is the genotype at marker m and ${D}_{m}=i$ denotes the number of alleles shared IBD by the pair at that marker. The maximization is efficiently achieved by an application of the expectation-maximization (EM) algorithm [10, 16, 17]. In contrast, PLINK [14] estimates the IBD distribution using the method-of-moments approach, which is less powerful than the likelihood-based method.
Estimation of IBD distribution can often provide sufficient information to identify pedigree errors and cryptic relatedness [5, 6]. However, it is useful to provide statistical evidence beyond point estimation. To this end, we apply the maximum likelihood ratio test (MLRT) to formally evaluate whether the observed genotypes G are compatible with the null putative relationship type, R_{ 0 } [10]. Briefly, $MLRT=\text{log}\left(\phantom{\rule{2.77695pt}{0ex}}\widehat{L}\left(A\right)\phantom{\rule{2.77695pt}{0ex}}\right)-\text{log}\left(\phantom{\rule{2.77695pt}{0ex}}L\left({R}_{0}\right)\phantom{\rule{2.77695pt}{0ex}}\right),$ where $L\left({R}_{0}\right)$ is the likelihood calculated for the hypothesized R_{ 0 }, and $\widehat{L}\left(A\right)\phantom{\rule{2.77695pt}{0ex}})=\text{max}\left\{L\left({R}_{1}\right)\right\},{R}_{1}\in A,$ is the maximum likelihood calculated over a set of alternatives as given in Table 1. Statistical significance of MLRT is then assessed using simulation, because 2*MLRT does not follow the usual Chisq distribution [10]. Efficient implementation of the test to high-throughput genotype data requires pruning of the SNPs so that they are not in linkage disequilibrium (LD) [5, 6].
Data analyses
The GWAS data of GAW18 consist of 959 individuals genotyped at 472,049 SNPs. Among the 959 individuals, 4 are removed for low genotyping rate (plink -MIND >0.8), and 141 individuals are in the "UNREL.txt" file that contains the maximum set of putatively unrelated individuals. Among the 472,049 SNPs, ~50,000 remain after minor allele frequency (MAF) and LD pruning (plink -indep-pairwise 200 50 0.2 -maf 0.05). We then conduct 3 sets of analyses, with analyses 1 and 2 focusing on relationship estimation and analysis 3 performing hypothesis testing.
Analysis 1 detects pedigree errors and cryptic relatedness in the 955 genotyped individuals from the 20 SAFS families using PREST-plus (prest -geno datafamily.ped -map datafamily.map -wped -aped). It estimates the IBD distribution for any pair of individuals within a pedigree (-wped), as well as for any pair of individuals across pedigrees (-aped).
Analysis 2 detects cryptic relatedness in the 141 putatively unrelated individuals, using both PREST-plus (prest -geno dataunrel.ped -map dataunrel.map) and PLINK (plink -file plink -genome).
Analysis 3 performs formal hypothesis testing on the problematic pairs identified in analyses 1 and 2. For computational efficiency, we first randomly select ~2000 SNPs from the set of ~50,000 SNPs. We then obtain the base pair (bp) physical map of the SNPs using build 36 coordinates and their corresponding centimorgan (cM) genetic map using the Rutgers combined linkage-physical map and linear interpolation. Finally, we perform the MLRT test as implemented in PREST-plus (prest -file data2k.ped -map data2k.map -pair fID1 indID1 fID2 indID2 -mlrt -c).
Results
Analysis 1: Relationship estimation within and across the 20 SAFS families
IBD distribution $p=\left({p}_{0},\phantom{\rule{2.77695pt}{0ex}}{p}_{1},\phantom{\rule{2.77695pt}{0ex}}{p}_{2}\right)$ and kinship coefficient $\varphi ={p}_{1}/4+{p}_{2}/2$ for the relationship types (reltype) considered by PREST-plus
reltype coding in PREST-plus | Relationship type (abbreviation) | Distribution of IBD sharing | Kinship coefficient, φ | ||
---|---|---|---|---|---|
p _{ 0 } | p _{ 1 } | p _{ 2 } | |||
11 | MZ-twin (MZ) | 0.000 | 0.000 | 1.000 | 0.50000 |
10 | parent-offspring (PO) | 0.000 | 1.000 | 0.000 | 0.25000 |
1 | full-sib (FS) | 0.250 | 0.500 | 0.250 | 0.25000 |
9 | half-sib+first cousin (HSFC) | 0.375 | 0.500 | 0.125 | 0.18750 |
2 | half-sib (HS) | 0.500 | 0.500 | 0.000 | 0.12500 |
3 | grandparent-grandchild (GPC) | 0.500 | 0.500 | 0.000 | 0.12500 |
4 | avuncular (AV) | 0.500 | 0.500 | 0.000 | 0.12500 |
5 | first cousin (FC) | 0.750 | 0.250 | 0.000 | 0.06250 |
7 | half-avuncular (HAV) | 0.750 | 0.250 | 0.000 | 0.06250 |
8 | half-first cousin (HFC) | 0.875 | 0.125 | 0.000 | 0.03125 |
6 | unrelated (UN) | 1.000 | 0.000 | 0.000 | 0.00000 |
99 | other types (Others) | NA | NA | NA | NA |
Relationship estimation results for clear outliers in Figure 1 identified by analysis 1.
Estimated | ||||||||
---|---|---|---|---|---|---|---|---|
FID1^{a} | IID1^{b} | FID2^{a} | IID2^{b} | reltype^{c} | commark^{d} | p _{ 0 } | p _{ 1 } | p _{ 2 } |
3 | T2DG0300174 | 3 | T2DG0300175 | 1 | 49009 | 0.0000 | 0.0000 | 1.0000 |
4 | T2DG0400281 | 4 | T2DG0400282 | 1 | 48996 | 0.0000 | 0.0000 | 1.0000 |
4 | T2DG0400265 | 4 | T2DG0400266 | 2 | 48994 | 0.358 | 0.4511 | 0.1909 |
21 | T2DG2100946 | 21 | T2DG2100947 | 2 | 48957 | 0.3112 | 0.5566 | 0.1322 |
21 | T2DG2100952 | 21 | T2DG2100966 | 4 | 48949 | 0.9876 | 0.0109 | 0.0015 |
4 | T2DG0400207 | 4 | T2DG0400260 | 6 | 48955 | 0.4759 | 0.5157 | 0.0084 |
4 | T2DG0400207 | 4 | T2DG0400247 | 6 | 47503 | 0.6094 | 0.3723 | 0.0182 |
Analysis 2: Relationship estimation among individuals in the "UNREL.txt" file
Relationship estimation results for clear outliers in Figure 2 identified by analysis 2
PREST-plus estimated | PLINK estimated | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
FID1 | IID1 | FID2 | IID2 | reltype | commark | p _{ 0 } | p _{ 1 } | p _{ 2 } | p _{ 0 } | p _{ 1 } | p _{ 2 } |
9 | T2DG0901244 | 10 | T2DG1000566 | 6 | 48912 | 0.8159 | 0.1735 | 0.0105 | 1.0000 | 0.0000 | 0.0000 |
8 | T2DG0800497 | 9 | T2DG0901244 | 6 | 48957 | 0.8304 | 0.1696 | 0.0000 | 1.0000 | 0.0000 | 0.0000 |
21 | T2DG2100951 | 25 | T2DG2501033 | 6 | 48940 | 0.8174 | 0.1826 | 0.0000 | 0.7713 | 0.2287 | 0.0000 |
4 | T2DG0400207 | 4 | T2DG0400247 | 6 | 47503 | 0.6142 | 0.3673 | 0.0185 | 0.7460 | 0.1972 | 0.0568 |
Analysis 3: Relationship hypothesis testing of problematic pairs
FID1 | IID1 | FID2 | IID2 | null reltype | pvalue | plausible reltype | pvalue | ||
---|---|---|---|---|---|---|---|---|---|
The 7 outliers identified by analysis 1 | |||||||||
3 | T2DG0300174 | 3 | T2DG0300175 | 1 | full-sib | 0 | 11 | MZ-twins | N/A |
4 | T2DG0400281 | 4 | T2DG0400282 | 1 | full-sib | 0 | 11 | MZ-twins | N/A |
4 | T2DG0400265 | 4 | T2DG0400266 | 2 | half-sib | 0 | 9 | half-sib+first cousin | 0.254 |
21 | T2DG2100946 | 21 | T2DG2100947 | 2 | half-sib | 0 | 9 | half-sib+first cousin | 0.432 |
21 | T2DG2100952 | 21 | T2DG2100966 | 4 | avunuclar | 0 | 6 | unrelated | 0.891 |
4 | T2DG0400207 | 4 | T2DG0400260 | 6 | unrelated | 0 | 2 | half-sib | 0.328 |
4 | T2DG0400207 | 4 | T2DG0400247 | 6 | unrelated | 0 | 5 | first cousin | 0.752 |
The 4 outliers identified by analysis 2 | |||||||||
9 | T2DG0901244 | 10 | T2DG1000566 | 6 | unrelated | 0 | 8 | half-first cousin | 0.112 |
8 | T2DG0800497 | 9 | T2DG0901244 | 6 | unrelated | 0.007 | 8 | half-first cousin | 0.673 |
21 | T2DG2100951 | 25 | T2DG2501033 | 6 | unrelated | 0 | 5 | first cousin | 0.633 |
4 | T2DG0400207 | 4 | T2DG0400247 | 6 | unrelated | 0 | 5 | first cousin | 0.712 |
Discussion
The GAW18 "pedigree information was [previously] verified by estimated kinship coefficients, principal component analysis (PCA), and number of mendelian errors between parent and offspring samples." However, no other details are provided and our analyses show that pedigree errors and cryptic relatedness exist in the data.
The results presented here are based on the ~50,000 GWAS SNPs that have MAF greater than 5% and pair-wise LD less than 0.2. Our experience with PREST-plus shows that there is little improvement in estimation accuracy, once more than ~50,000 SNPs were used (typically with MAF >5%). Additional analyses with denser sets of SNPs confirmed this (results not shown). However, substantially more SNPs are needed for PLINK to achieve similar estimation efficiency. Although PREST-plus is more powerful than PLINK, there is a trade-off between statistical power and computational efficiency [5, 6]. For large data sets involving analyzing millions of pairs, PLINK could be used as a screening tool for further analysis with PREST-plus.
Both PREST-plus and PLINK are sensitive to mis-specified allele frequencies, therefore sensitive to population stratification and population admixture, in contrast to some recent work [eg, [18]]. However, results from other GAW18 study groups suggest that population admixture is not a major concern here. Nevertheless, robust relationship estimation and testing methods warrant further research.
The methods considered here focus on global estimation of IBD distribution, which is powerful and efficient to distinguish distinct relationships, for example, full-sibs versus unrelated. However, such global methods are not adequate to distinguish similar relationship types, for example, second-cousin versus unrelated. To this end, the recent local estimation methods [eg, [19]] provide useful research direction.
Conclusions
Pedigree errors and cryptic relatedness often occur in sample despite the best practice in data collection. Genome-wide marker data, collected for linkage or association studies, can provide accurate genealogy information between individuals. Using the GWAS SNP data, PREST-plus analyses the GAW18 sample that had been previously "cleaned," and it identifies 7 clearly mis-specified relative pairs in the 20 SAFS families and 4 cryptic-related pairs in the set of putatively unrelated individuals.
Declarations
Acknowledgements
Genetic Analysis Workshop is supported by NIH grant R01 GM031575. The research of LS is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Institutes of Health Research (CIHR).
The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
Authors’ Affiliations
References
- Boehnke M, Cox NJ: Accurate inference of relationships in sib-pair linkage studies. Am J Hum Genet. 1997, 61: 423-429. 10.1086/514862.PubMed CentralView ArticlePubMedGoogle Scholar
- Voight BF, Pritchard JK: Confounding from cryptic relatedness in case-control association studies. PLoS Genet. 2005, 1: e32-10.1371/journal.pgen.0010032.PubMed CentralView ArticlePubMedGoogle Scholar
- Thornton T, McPeek MS: ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure. Am J Hum Genet. 2010, 86: 172-184. 10.1016/j.ajhg.2010.01.001.PubMed CentralView ArticlePubMedGoogle Scholar
- Price AL, Zaitlen NA, Reich D, Patterson N: New approaches to population stratification in genome-wide association studies. Nat Rev Genet. 2010, 11: 459-463.PubMed CentralView ArticlePubMedGoogle Scholar
- Sun L: Detecting pedigree relationship errors. Statistical Human Genetics:Methods and Protocols. Edited by: Elston R, Satagopan J, Sun S. NewYork. 2012, Humana Press, 25-46.View ArticleGoogle Scholar
- Sun L, Dimitromanolakis A: Identifying cryptic relationships. Statistical Human Genetics : Methods and Protocols. Edited by: Elston R, Satagopan J, Sun S. NewYork. 2012, Humana Press, 47-58.View ArticleGoogle Scholar
- Goring HH, Ott J: Relationship estimation in affected sib pair analysis of late-onset diseases. Eur J Hum Genet. 1997, 5: 69-77.PubMedGoogle Scholar
- O'Connell JR, Weeks DE: Pedcheck: a program for identification of genotype incompatibilities in linkage analysis. Am J Hum Genet. 1998, 63: 259-266. 10.1086/301904.PubMed CentralView ArticlePubMedGoogle Scholar
- Ehm M, Wagner M: A test statistic to detect errors in sib-pair relationships. Am J Hum Genet. 1998, 62: 181-188. 10.1086/301668.PubMed CentralView ArticlePubMedGoogle Scholar
- McPeek MS, Sun L: Statistical tests for detection of misspecified relationships by use of genome-screen data. Am J Hum Genet. 2000, 66: 1076-1094. 10.1086/302800.PubMed CentralView ArticlePubMedGoogle Scholar
- Sun L, Wilder K, McPeek MS: Enhanced pedigree error detection. Hum Hered. 2002, 54: 99-110. 10.1159/000067666.View ArticlePubMedGoogle Scholar
- Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Grr: graphical representation of relationship errors. Bioinformatics. 2001, 17: 742-743. 10.1093/bioinformatics/17.8.742.View ArticlePubMedGoogle Scholar
- Sieberts SK, Wijsman EM, Thompson EA: Relationship inference from trios of individuals in the presence of typing error. Am J Hum Genet. 2002, 70: 170-180. 10.1086/338444.PubMed CentralView ArticlePubMedGoogle Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, et al: Plink: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.PubMed CentralView ArticlePubMedGoogle Scholar
- Dimitromanolakis A, Paterson AD, Sun L: Accurate IBD inference identifies cryptic relatedness in 9 HapMap populations. Abstract no 1768 presented at the Annual meeting of the American Society of Human Genetics. 2009Google Scholar
- Thompson EA: The estimation of pairwise relationships. Ann Hum Genet. 1975, 39: 173-188. 10.1111/j.1469-1809.1975.tb00120.x.View ArticlePubMedGoogle Scholar
- Thompson EA: Pedigree Analysis in Human Genetics. 1986, Baltimore, MD, The Johns Hopkins University PressGoogle Scholar
- Thoronto T, Tang H, Hoffmann TJ, Ochs-Balcom HM, Caan BJ, Risch N: Estimating kinship in admixed populations. Am J Hum Genet. 2012, 91: 122-138. 10.1016/j.ajhg.2012.05.024.View ArticleGoogle Scholar
- Browning SR, Browning BL: Identity by decent between distant relatives:detection and applications. Annu Rev Genet. 2012, 46: 617-633. 10.1146/annurev-genet-110711-155534.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.