Identity-by-descent estimation with population- and pedigree-based imputation in admixed family data
© The Author(s). 2016
Published: 18 October 2016
In the past few years, imputation approaches have been mainly used in population-based designs of genome-wide association studies, although both family- and population-based imputation methods have been proposed. With the recent surge of family-based designs, family-based imputation has become more important. Imputation methods for both designs are based on identity-by-descent (IBD) information. Apart from imputation, the use of IBD information is also common for several types of genetic analysis, including pedigree-based linkage analysis.
We compared the performance of several family- and population-based imputation methods in large pedigrees provided by Genetic Analysis Workshop 19 (GAW19). We also evaluated the performance of a new IBD mapping approach that we propose, which combines IBD information from known pedigrees with information from unrelated individuals.
Different combinations of the imputation methods have varied imputation accuracies. Moreover, we showed gains from the use of both known pedigrees and unrelated individuals with our IBD mapping approach over the use of known pedigrees only.
Our results represent accuracies of different combinations of imputation methods that may be useful for data sets similar to the GAW19 pedigree data. Our IBD mapping approach, which uses both known pedigree and unrelated individuals, performed better than classical linkage analysis.
In the last few years, imputation approaches have been widely used in genetic studies, especially in the context of genome-wide association studies and population-based designs. This is a result of an attractive feature of imputation: it increases genetic information at low cost. Moreover, imputation allows meta-analysis between different studies, especially when different genotyping platforms are used. The main idea of imputation is to (a) genotype or sequence a small subset of individuals on a dense single nucleotide polymorphism (SNP) panel, (b) genotype remaining individuals on a sparse SNP panel, and (c) use statistical methods for imputation, generally based on hidden Markov models, to impute untyped SNPs from the dense panel into the individuals with only sparse (or no) genotyped SNPs.
Imputation approaches have been proposed for both family- and population-based designs. However, the performance of these approaches has not yet been thoroughly evaluated and compared in pedigree data. The availability of the International Haplotype Map Project (HapMap) and the 1000 Genomes Project has allowed population-based imputation to become widespread. However, these databases are not ideal for family-based imputation methods, which require that the dense SNP panel individuals be selected from pedigrees. In addition, imputation in large pedigrees is more challenging, and only quite recently has an approach been proposed that is able to efficiently handle large pedigrees and accurately impute rare variants; it is implemented in the program Genotype Imputation Given Inheritance (GIGI). The search for rare risk variants has made imputation in family-based designs more important, because such variants are enriched in relatively large pedigrees and because linkage analysis may be powerful for initiating rare variant identification. An attractive study design is to (a) perform linkage analysis, (b) perform imputation in the region(s) with linkage signals, and (c) perform association analysis on the imputed data in these regions. It is well established that larger pedigrees are advantageous for this linkage analysis component.
Identity-by-descent (IBD) underpins both linkage analysis and imputation. For linkage analysis, once IBD has been determined, the pedigree structure is no longer needed for subsequent computations. For imputation, IBD within pedigrees or (ancestrally) across individuals is used as a source of correlation between individuals for family- and population-based imputation.
In this study, we evaluated and compared the performance of several family- and population-based imputation approaches in the Mexican American pedigree data provided by Genetic Analysis Workshop 19 (GAW19). We also evaluated the effect on imputation accuracy of the number of individuals selected for sequencing and the way they were selected, randomly or by tailored selection with GIGI-Pick. In addition, we evaluated the performance of a new IBD mapping approach that we propose, which combines IBD information inferred (a) using a sparse SNP panel from known pedigrees, and (b) using a dense SNP panel from unrelated individuals across pedigrees.
Number of SNPs
chr 3: 46,750 Kbp–49,250 Kbp
Mean spacing ~0.64 cM; in linkage equilibrium
MAF > 0.05; genotype completion > 99 %
Selection of dense SNP panel individuals for imputation
Among the 464 sequenced individuals, we chose to treat 100 or 200 of them as if genotyped on the dense SNP panel. In proportion to the number of sequenced individuals in each pedigree, we selected individuals following 2 strategies: (a) random selection or (b) selection via GIGI-Pick (a method that helps make informative choices). For GIGI-Pick, we used the genome-wide coverage option, which selects subjects based on the pedigree structure. For the remaining 364 or 264 individuals, we extracted genotypes of the SNPs present on the SNP chip in our region of interest (~700 SNPs) to form the sparse SNP panel data set.
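The random selection strategy, stratified in proportion to the number of sequenced individuals per pedigree, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the largest-remainder rounding rule, and the toy pedigree labels and sizes are hypothetical and are not taken from the study, and the sketch does not reproduce GIGI-Pick.

```python
import random


def allocate_proportionally(sequenced_per_pedigree, n_select):
    """Split n_select slots across pedigrees in proportion to their
    number of sequenced individuals (largest-remainder rounding, an
    assumed rule; the source does not state how rounding was done)."""
    total = sum(sequenced_per_pedigree.values())
    quotas = {p: n_select * c / total for p, c in sequenced_per_pedigree.items()}
    alloc = {p: int(q) for p, q in quotas.items()}  # floor of each quota
    leftover = n_select - sum(alloc.values())
    # Hand the remaining slots to the pedigrees with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda p: quotas[p] - alloc[p], reverse=True)
    for p in by_remainder[:leftover]:
        alloc[p] += 1
    return alloc


def select_dense_panel(sequenced_ids_by_pedigree, n_select, seed=0):
    """Randomly pick dense-panel individuals within each pedigree."""
    rng = random.Random(seed)
    counts = {p: len(ids) for p, ids in sequenced_ids_by_pedigree.items()}
    alloc = allocate_proportionally(counts, n_select)
    return {
        p: sorted(rng.sample(ids, alloc[p]))
        for p, ids in sequenced_ids_by_pedigree.items()
    }


# Hypothetical toy pedigrees (not the GAW19 pedigrees).
toy = {
    "A": [f"A{i}" for i in range(10)],
    "B": [f"B{i}" for i in range(6)],
    "C": [f"C{i}" for i in range(4)],
}
dense = select_dense_panel(toy, n_select=7)
# Everyone not selected would form the sparse SNP panel group.
sparse = {p: [i for i in ids if i not in dense[p]] for p, ids in toy.items()}
```

With the toy sizes 10, 6, and 4 and 7 slots, the quotas are 3.5, 2.1, and 1.4, so the allocation is 4, 2, and 1.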
BEAGLE–BEAGLE, SHAPEIT–BEAGLE, and SHAPEITped–BEAGLE;
MaCH–MaCH, MaCH–minimac, SHAPEIT–minimac, and SHAPEITped–minimac;
MaCH–MaCHAdmix, SHAPEIT–MaCHAdmix, and SHAPEITped–MaCHAdmix;
IMPUTE2–IMPUTE2, SHAPEIT–IMPUTE2, and SHAPEITped–IMPUTE2.
We performed pedigree-based imputation using GIGI. This method uses inheritance vector (IV) realizations, reflecting the IBD flow in pedigrees, estimated on the sparse SNP panel (MS-2) for all individuals. The MORGAN/gl_auto program was used to obtain these IV realizations. Based on these IV realizations, the dense SNP panel, the meiotic map, the allele frequencies of the dense SNPs, and the pedigree structure, GIGI infers the missing genotypes at untyped SNPs. Note that the first step (IBD estimation) is equivalent to the prephasing step used by population-based approaches.
Combining family- and population-based imputation
We also combined imputation obtained by GIGI and population-based approaches via a flexible framework described elsewhere. We compared 3 versions: (a) GIGI + SHAPEITped–BEAGLE, (b) GIGI + SHAPEITped–IMPUTE2, and (c) GIGI + SHAPEITped–minimac.
As a metric of imputation accuracy, we calculated 2 mean correlation measures for all of the versions described above: (a) ρ1 = sum of correlations/total number of imputed SNPs (given minor allele frequency, MAF, in imputed data > 0), and (b) ρ2 = sum of correlations/total number of imputed SNPs in the reference panel (given MAF in reference panel data > 0). Correlation was computed for each SNP between the true and the imputed genotypes in the sparse SNP panel individuals, and then averaged across all imputed SNPs. Correlation, which implicitly adjusts for the MAF of SNPs, is a better metric of imputation accuracy than concordance, which gives misleading results for rare variants. Other existing measures could also be used, but there is no clear consensus in the literature about which performs best in all situations. In any case, our aim was to compare accuracy between approaches, not to evaluate the accuracy of each approach per se. For this purpose, the correlation measure provides the necessary information for both rare and common variants.
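The two mean correlation measures can be sketched as below. This is one reading of the definitions, not the authors' code: it assumes that the numerator sums Pearson correlations over SNPs that are polymorphic in the imputed data, that SNPs with an undefined correlation contribute zero, and that the two measures differ only in their denominators. All names and toy genotypes are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation, or None if either vector has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def maf(dosages):
    """Minor allele frequency from genotype dosages coded 0, 1, 2."""
    freq = sum(dosages) / (2 * len(dosages))
    return min(freq, 1 - freq)


def mean_correlations(true_geno, imputed_geno, ref_maf):
    """Return (rho1, rho2).

    true_geno, imputed_geno: dict SNP -> dosages in sparse-panel individuals.
    ref_maf: dict SNP -> MAF in the reference (dense) panel.
    """
    total = 0.0
    n_poly_imputed = 0
    for snp, imp in imputed_geno.items():
        if maf(imp) == 0:  # monomorphic in imputed data: excluded from rho1
            continue
        n_poly_imputed += 1
        r = pearson(true_geno[snp], imp)
        if r is not None:  # assumption: undefined correlations add zero
            total += r
    n_poly_ref = sum(1 for m in ref_maf.values() if m > 0)
    rho1 = total / n_poly_imputed if n_poly_imputed else float("nan")
    rho2 = total / n_poly_ref if n_poly_ref else float("nan")
    return rho1, rho2


# Hypothetical toy data: snp1 is imputed perfectly, snp2 is imputed as monomorphic.
true_geno = {"snp1": [0, 1, 2, 1], "snp2": [0, 0, 1, 0]}
imputed_geno = {"snp1": [0, 1, 2, 1], "snp2": [0, 0, 0, 0]}
ref_maf = {"snp1": 0.30, "snp2": 0.02}
rho1, rho2 = mean_correlations(true_geno, imputed_geno, ref_maf)
```

Under this reading, ρ2 can never exceed ρ1 when the sum is positive, because SNPs that are polymorphic in the reference panel but imputed as monomorphic enlarge only the ρ2 denominator.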
We used a set of 1000 IBD graphs [15, 16] that had been realized on the 7 pedigrees by the MORGAN/gl_auto program using the MS-2 subset of SNPs. Using the MORGAN/ibd_haplo program and MS-3, we found that many of the 17 pairs of individuals share IBD in the range 50 to 75 cM on chromosome 3, and that all pairs gave a strong signal of IBD at 69 cM. To infer IBD between the cryptically related individuals, we used a new program, ibd_stitch, which permits jointly consistent location-specific IBD to be inferred among multiple individuals. Using ibd_stitch, 1000 IBD graphs in compact format were realized jointly on the 21 distinct individuals. We wrote new R code to merge the joint 21-individual IBD graphs on the cryptically related individuals with the IBD graphs on the 7 pedigrees. The merging was done in 2 groups (pedigrees 05, 06, 21, 25 and 10, 08, 07), which were chosen so that the merged graph had 2 components with an approximately equal number of individuals, with no cryptically related pairs linking the two groups. The merging was performed at the 351 MS-2 SNP positions to give 1000 IBD graphs on the combined set of 529 individuals. These merged graphs thus included location-specific between-family IBD in addition to the pedigree IBD. Given IBD graphs realized conditional on SNP data, logarithm of the odds (LOD) scores can be computed without further reference to the pedigree structures or SNP data. Additionally, the MORGAN/gl_lods program uses equivalence of IBD graphs across realizations and across locations to ensure that each distinct LOD score contribution is computed once only.
We used the 200 simulated traits of diastolic blood pressure (DBP) with a causal gene (MAP4) at 69 cM (47,892,180 to 48,130,769 bp) on chromosome 3. Trait data were preadjusted for age, sex, and current use of antihypertensive medications, and we used a trait model previously developed for this gene (quantitative trait locus model with parameters defined by the SNP with the biggest contribution to the simulated trait variance). Using the MORGAN/gl_lods program, we computed LOD scores at the MS-2 SNP positions, for all 200 traits, on each of the 7 component families and on the merged IBD graphs that included the between-family IBD. We can thus compare the LOD scores calculated from the IBD graphs containing pedigree data only with those from the IBD graphs of merged pedigree and population IBD.
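As an illustration of how a LOD score at one position might be summarized from many IBD graph realizations, the sketch below averages per-realization likelihood ratios and takes the base-10 logarithm. This Monte Carlo averaging is an assumption about the general approach and is not a description of the MORGAN/gl_lods implementation; the function names and input values are hypothetical. The second function relies only on the standard fact that LOD scores add across independent families.

```python
from math import log10


def lod_from_realizations(log10_lr):
    """log10 of the mean likelihood ratio across realizations.

    log10_lr: per-realization log10 likelihood ratios at one position.
    The maximum is factored out so that large values do not overflow.
    """
    m = max(log10_lr)
    mean_scaled = sum(10 ** (v - m) for v in log10_lr) / len(log10_lr)
    return m + log10(mean_scaled)


def total_lod(per_family_lods):
    """LOD scores from independent families are additive."""
    return sum(per_family_lods)


# Hypothetical values, for illustration only.
lod_one_family = lod_from_realizations([0.0, 1.0])
lod_all_families = total_lod([lod_one_family, 0.3, -0.1])
```

Note that the average is taken on the likelihood-ratio scale, not on the log scale, so a few realizations with strong support can dominate the estimate.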
Random selection of 200 reference/dense SNP panel individuals
(0–0.01]: #SNPs = 4604 SNPs
(0.01–0.15]: #SNPs = 1765 SNPs
(0.15–0.5]: #SNPs = 979 SNPs
GIGI + SHAPEITped-BEAGLE
GIGI + SHAPEITped-minimac
GIGI + SHAPEITped-IMPUTE2
Interestingly, SHAPEITped–minimac outperformed all other population-based approaches for all MAF bins. The gain was most striking for rare variants, where it even slightly outperformed GIGI, by approximately 0.02. For this MAF bin, BEAGLE–BEAGLE performed worst. Note that all population-based approaches performed similarly for common variants. Moreover, as expected, these approaches improved with increasing MAF, unlike GIGI, whose performance slightly decreased. In these admixed pedigrees, our results also showed that MaCH–MaCHAdmix (which allows for admixture) is better than MaCH–MaCH (which does not allow for admixture) for rare and uncommon variants, where the difference in imputation accuracy was approximately 0.1. For common variants, both approaches led to similar results. This result stresses the need, in such data, to use imputation programs that allow for admixture in order to improve imputation accuracy.
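Accuracy summaries by MAF bin, as discussed above, can be produced with a short grouping step. The sketch below uses the bin boundaries (0–0.01], (0.01–0.15], and (0.15–0.5]; treating these as the rare, uncommon, and common classes is an assumption, and the function names and input values are hypothetical.

```python
def maf_bin(maf):
    """Return the label of the half-open MAF bin containing maf, or None."""
    if not 0 < maf <= 0.5:
        return None
    if maf <= 0.01:
        return "(0-0.01]"
    if maf <= 0.15:
        return "(0.01-0.15]"
    return "(0.15-0.5]"


def mean_correlation_by_bin(per_snp):
    """per_snp: iterable of (maf, correlation) pairs, one per imputed SNP."""
    sums, counts = {}, {}
    for maf, corr in per_snp:
        label = maf_bin(maf)
        if label is None:  # monomorphic or invalid MAF: not binned
            continue
        sums[label] = sums.get(label, 0.0) + corr
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


# Hypothetical per-SNP results, for illustration only.
toy_results = [(0.005, 0.2), (0.01, 0.4), (0.10, 0.6), (0.30, 0.9), (0.0, 0.0)]
by_bin = mean_correlation_by_bin(toy_results)
```

Each bin is closed on the right, so a SNP with MAF exactly 0.01 falls in the first bin.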
Surprisingly, MaCH–minimac consistently outperformed MaCH–MaCH. Despite the claim that minimac and MaCH use the same imputation algorithm, our results suggest otherwise and indicate that minimac’s undocumented algorithm may be better. To investigate this result and to determine whether the difference might be a result of possible bias in calling the sequence data (e.g., if sequence calling was performed using part of the minimac algorithm), we simulated sequence data on the same GAW19 pedigrees but used European ancestry data. We performed analysis with MaCH–MaCH, MaCH–minimac, SHAPEITped–minimac, and GIGI. Interestingly, GIGI performed better than all other approaches for rare variants on these data (ρ1 = 0.23, 0.27, 0.39, and 0.54, respectively). In addition, MaCH–minimac performed better than MaCH–MaCH, which means that minimac’s imputation algorithm may still be better than the one used in MaCH. The underperformance of GIGI in the GAW19 data for rare variants, and possibly also for common variants, might be the result of using an admixed population. In fact, when GIGI is not able to impute genotypes using pedigree information, it draws them from the pre-specified population MAF. If these frequencies are inaccurate, poor imputation is likely to result.
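To see why an inaccurate population frequency matters when genotypes are drawn from it, consider the following sketch. It assumes Hardy-Weinberg proportions for the fallback draw, which is an assumption made here for illustration and not a statement about GIGI's internals; the function names and frequencies are hypothetical.

```python
def hwe_genotype_probs(q):
    """Probabilities of carrying 0, 1, or 2 copies of an allele with
    frequency q, assuming Hardy-Weinberg proportions."""
    p = 1.0 - q
    return (p * p, 2.0 * p * q, q * q)


def expected_dosage(q):
    """Expected allele count under Hardy-Weinberg proportions (equals 2q)."""
    probs = hwe_genotype_probs(q)
    return probs[1] + 2.0 * probs[2]


def dosage_bias(q_true, q_assumed):
    """Shift in expected dosage when genotypes are drawn using q_assumed
    although the pedigree's true frequency is q_true."""
    return expected_dosage(q_assumed) - expected_dosage(q_true)


# Hypothetical example: the allele has frequency 0.05 in a pedigree,
# but the pre-specified population frequency is 0.10.
bias = dosage_bias(q_true=0.05, q_assumed=0.10)
```

In this toy case the expected dosage is doubled, from 0.1 to 0.2, which illustrates how a frequency mismatch in an admixed sample could degrade accuracy for rarer alleles.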
In the GAW19 Mexican American pedigrees, there are different amounts of admixture across pedigrees, suggesting the need for using pedigree-specific allele frequencies to improve GIGI’s imputation. To investigate this, we explored GIGI’s accuracy per pedigree depending on the admixture level estimated in Blue et al. However, we did not observe any correlation between admixture in pedigrees and GIGI’s accuracy (results not shown). Another possible explanation is that undetected Mendelian-consistent errors, which were not present in the simulated genotype data, could have led to poor IV estimation and hence poor imputation performance.
We also applied our framework to combine family- and population-based imputation data. We combined imputation results from GIGI and three 2-step population-based imputation approaches: (a) SHAPEITped–BEAGLE, (b) SHAPEITped–minimac, and (c) SHAPEITped–IMPUTE2. Overall, we show a slight improvement in imputation accuracy from using the combined data for rare and uncommon variants, but less for common variants. This trend could be seen in both correlation measures, but especially in ρ2. This measure reflects the total amount of imputation information we can glean from the 2 different sources of correlation in the data. In addition, combining both population- and pedigree-based approaches resulted in an increase in the number of imputed SNPs that are polymorphic in the sample.
Influence of GIGI-Pick
IBD mapping analysis
Discussion and conclusions
We compared the performance of several imputation approaches in pedigree data provided by GAW19 organizers. Also, we proposed and evaluated the performance of a new IBD mapping approach that combines IBD information from both unrelated individuals and individuals related through known pedigrees, in order to identify genes implicated in complex traits. We showed that using the SHAPEIT program for pre-phasing, with its option that handles pedigree structure, along with the imputation program minimac, led to the best imputation performance for both rare and common variants. This population-based imputation approach outperformed GIGI (a family-based imputation method) not only for common variants but also for rare variants. This result was not expected for rare variants, and might be specific to this admixed data set, as indicated by our results from simulated sequence data derived from European samples. Besides this specific result, most of the remaining results observed in GAW19 data were generally consistent with what we observed in our earlier simulations, which suggests that the results may generalize. On the other hand, our new IBD mapping approach shows promise, as it appeared to perform better than classic linkage analysis that uses only known related individuals in pedigrees.
This research was supported by the National Institutes of Health (NIH) grants AG040184, AG005136, AG039700, AG049507, GM046255, and MH094293. The Genetic Analysis Workshops are supported by NIH grant GM031575.
This article has been published as part of BMC Proceedings Volume 10 Supplement 7, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at http://bmcproc.biomedcentral.com/articles/supplements/volume-10-supplement-7. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.
All authors participated in study design and analysis. MS, AQN, FLG, EAT, and EMW drafted the manuscript, and all authors edited and approved the final manuscript.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- 1000 Genomes Project Consortium, Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65.
- International HapMap Consortium, Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, et al. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449(7164):851–61.
- Cheung CY, Thompson EA, Wijsman EM. GIGI: an approach to effective imputation of dense genotypes on large pedigrees. Am J Hum Genet. 2013;92(4):504–16.
- Wijsman EM. The role of large pedigrees in an era of high-throughput sequencing. Hum Genet. 2012;131(10):1555–63.
- Cheung CYK, Marchani E, Wijsman EM. A statistical framework to guide sequencing choices in pedigrees. Am J Hum Genet. 2014;94(2):257–67.
- Matise TC, Chen F, Chen W, De La Vega FM, Hansen M, He C, et al. A second-generation combined linkage physical map of the human genome. Genome Res. 2007;17(12):1783–6.
- Blue E, Cheung C, Glazner C, Conomos M, Lewis S, Sverdlov S, et al. Identity-by-descent graphs offer a flexible framework for imputation and both linkage and association analyses. BMC Proc. 2014;8 Suppl 1:S19.
- Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009;84(2):210–23.
- Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet. 2007;39(7):906–13.
- Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010;34(8):816–34.
- Delaneau O, Marchini J, Zagury JF. A linear complexity phasing method for thousands of genomes. Nat Methods. 2012;9(2):179–81.
- Liu EY, Li MY, Wang W, Li Y. MaCH-Admix: genotype imputation for admixed populations. Genet Epidemiol. 2013;37(1):25–37.
- Thompson E. The structure of genetic linkage data: from LIPED to 1 M SNPs. Hum Hered. 2011;71(2):86–96.
- Saad M, Wijsman EM. Combining family- and population-based imputation data for association analysis of rare and common variants in large pedigrees. Genet Epidemiol. 2014;38(7):579–90.
- Thompson EA. Identity by descent: variation in meiosis, across genomes, and in populations. Genetics. 2013;194(2):301–26.
- Koepke H, Thompson E. Efficient identification of equivalences in dynamic graphs and pedigree structures. J Comput Biol. 2013;20(8):551–70.
- Glazner C, Thompson E. Pedigree-free descent-based gene mapping from population samples. Hum Hered. 2015;80(1):21–35.
- Blue EM, Brown LA, Conomos MP, Kirk J, Nato AQ, Popejoy AB, Raffa J, Ranola J, Thornton T, Wijsman EM. Estimating relationships between phenotypes and subjects drawn from admixed families. BMC Proc. 2015;9 Suppl 8:S50.