Imputing rare variants in families using a two-stage approach
- Samantha Lent†1,
- Xuan Deng†1,
- L. Adrienne Cupples1,
- Kathryn L. Lunetta1,
- CT Liu1 and
- Yanhua Zhou1
© The Author(s). 2016
Published: 18 October 2016
Recent focus on studying rare variants makes imputation accuracy of rare variants an important issue. Many approaches have been proposed to increase imputation accuracy among rare variants, from reference panel selection to combinations of existing methods to multistage analyses. We aimed to bring the strengths of these new approaches together with our proposed two-stage imputation for family data.
Our imputation methods were tested on the region from 46.75Mb to 49.25Mb on chromosome 3. We did quality control based on the proportion of missing genotypes per variant and individual, leaving 495 individuals with 761 genome-wide association studies (GWAS) variants only, 45 with 14,077 sequence variants only, and 419 with both GWAS and sequencing data. All data were prephased using SHAPEIT2 with a duo hidden Markov model algorithm prior to performing imputation. Imputations were performed 100 times, each time masking the sequence data for 1 individual and imputing it from the GWAS data. We used well-imputed genotypes, defined as a probability of greater than 0.9, above 2 different minor allele frequency cutoffs—0.01 and 0.05—from Impute2 as input for Merlin, and compared these results to Impute2 and Merlin separately. The imputed results were evaluated using correlation measurement and the imputation quality score.
Our method improved imputation accuracy, measured by imputation quality score, for variants with minor allele frequency between 0.01 and 0.40, but failed to improve accuracy for variants with minor allele frequency less than 0.01 when we used a minor allele frequency cutoff of 0.01 for the Impute2 results. In contrast, our 2-stage approach with a minor allele frequency cutoff of 0.05 performed the worst of all methods for variants with minor allele frequency between 0.01 and 0.40.
This method gave promising results, but may be further improved by changing the inclusion criteria of Impute2 variants. More analyses are needed on a larger region with different inclusion thresholds to assess the accuracy of this approach.
Although existing population-based genotype imputation methods are very accurate for common variants, with overall best-guess error rates of 5 % to 7 % for the most common methods, they do not perform nearly as well with rare variants. Only 78 % of variants with a minor allele frequency (MAF) between 0.01 and 0.05 in the Illumina 550K panel and 57 % in the Affymetrix 500K panel can be well imputed (r² > 0.7) using BEAGLE.
Most efforts to improve rare variant imputation have focused on how the choice of reference panel affects imputation quality. However, recently Saad et al and Kreiner-Møller et al have proposed methods to improve imputation using multistep procedures. Saad et al proposed using 2 imputation methods independently, 1 population based (BEAGLE) and 1 family based (Genotype Imputation Given Inheritance [GIGI]), and choosing the imputed data from the method with the highest variance in genotype probabilities for each single nucleotide polymorphism (SNP). For instance, if the probabilities for genotypes AA, AB, and BB in an individual are 0, 0, and 1.0, respectively, for BEAGLE and 0, 0.5, and 0.5 for GIGI, Saad et al’s method would choose BEAGLE for that variant, because the larger variance indicates more certainty in the call. Saad et al found that the combined method led to more accurate imputed genotypes than either method separately. Kreiner-Møller et al suggested a 2-step imputation using a local reference panel and the 1000 Genomes reference panel, implemented in MACH/Minimac [4, 5]. In the first step, they imputed the study sample to a densely genotyped local reference panel enriched for rare variants. Next, they used the best-guess genotypes from this imputation as well as the original genotypes to impute the study sample to the 1000 Genomes panel.
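The variance-based choice in the BEAGLE/GIGI example above can be checked with a short calculation. This is only a minimal sketch: it assumes the variance is the population variance taken across the three genotype probabilities of one individual at one variant, and the function name `choose_method` is hypothetical rather than part of BEAGLE, GIGI, or Saad et al's software.

```python
from statistics import pvariance


def choose_method(probs_by_method):
    """Return the method whose genotype probabilities have the largest variance.

    probs_by_method maps a method name to (P(AA), P(AB), P(BB)) for one
    individual at one variant. Using the population variance across the three
    probabilities is an assumption made for this illustration.
    """
    return max(probs_by_method, key=lambda name: pvariance(probs_by_method[name]))


# Probabilities taken from the example in the text.
example = {
    "BEAGLE": (0.0, 0.0, 1.0),
    "GIGI": (0.0, 0.5, 0.5),
}
variances = {name: pvariance(p) for name, p in example.items()}
chosen = choose_method(example)
print(variances, chosen)
```

Under this assumption the BEAGLE probabilities have variance 2/9 and the GIGI probabilities 1/18, so BEAGLE is selected, in line with the description above.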
Our approach combined the strengths of Saad et al and Kreiner-Møller et al. We performed a 2-stage imputation, implementing Impute2 and Merlin sequentially, to test the hypothesis that increasing the density of genotypes in a sequenced reference panel using a population-based imputation before performing a family-based imputation would lead to higher imputation accuracy in a related genome-wide association study (GWAS) panel.
Our sample consisted of 959 Mexican Americans from 20 families. All 959 subjects were genotyped on the Illumina platform, and 464 of these individuals were also sequenced. We removed all SNPs with more than 5 % missing data and all individuals with more than 5 % missing data (N = 45) from the GWAS samples, and limited our analysis to the 46.75 Mb to 49.25 Mb region on chromosome 3. This yielded 914 people with GWAS data and 761 Illumina variants. For the sequenced data, we removed any variant with more than 10 % missing data, leaving 14,077 sequenced variants. All sequenced individuals had less than 5 % missing data. Thus, all 959 individuals were included in the analyses: 495 with GWAS only, 45 with sequencing only, and 419 with both GWAS and sequencing.
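A missingness filter of the kind described above might look like the following sketch. The data layout (a dictionary of per-sample genotype lists with `None` for missing calls), the function names, and the order of operations (variants filtered first, then individuals on the remaining variants) are assumptions for illustration; the text does not state which filter was applied first or which software was used. The toy call uses relaxed thresholds so that the small example actually removes something.

```python
def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def filter_by_missingness(genotypes, max_variant_missing=0.05, max_sample_missing=0.05):
    """Drop variants, then samples, whose missing rate exceeds a threshold.

    genotypes: dict mapping sample ID -> list of calls in a shared variant
    order, with None marking a missing call (hypothetical layout).
    Returns (kept_sample_ids, kept_variant_indices).
    """
    samples = list(genotypes)
    n_variants = len(genotypes[samples[0]])
    kept_variants = [
        j for j in range(n_variants)
        if missing_rate([genotypes[s][j] for s in samples]) <= max_variant_missing
    ]
    kept_samples = [
        s for s in samples
        if missing_rate([genotypes[s][j] for j in kept_variants]) <= max_sample_missing
    ]
    return kept_samples, kept_variants


# Toy data: 3 samples, 4 variants, genotypes coded as minor-allele counts.
toy = {
    "s1": [0, 1, None, 2],
    "s2": [0, None, None, 1],
    "s3": [1, 1, None, 0],
}
kept_samples, kept_variants = filter_by_missingness(
    toy, max_variant_missing=0.5, max_sample_missing=0.25
)
print(kept_samples, kept_variants)
```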
All data were prephased using SHAPEIT2 prior to performing imputation. We used the duo hidden Markov model (duoHMM) algorithm in SHAPEIT, which uses pedigree information from trios to improve phasing and eliminate Mendelian errors. GWAS and sequence data were phased in separate runs.
We performed 100 imputations each with 3 different methods: population-based imputation with Impute2 2.3.1, family-based imputation with Merlin 1.1.2, and a combination of the two [7, 8]. For each of these 100 imputations, we masked the sequence data of 1 individual, using the individual’s GWAS data instead, and imputed the sequenced variants not in the GWAS data. After the imputation, we compared this individual’s imputed genotypes to his or her true sequenced genotypes. We chose which sequenced subjects to leave out by randomly ordering all 419 subject IDs—excluding the 45 participants with sequence data but no GWAS data—and choosing the first 100.
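The selection of masked individuals described above amounts to shuffling the eligible IDs and taking the first 100. A minimal sketch follows; the placeholder IDs, the seed, and the function name `pick_masked_subjects` are assumptions, since the original random ordering is not reported.

```python
import random


def pick_masked_subjects(subject_ids, n_masked=100, seed=0):
    """Randomly order the eligible subject IDs and return the first n_masked."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    order = list(subject_ids)
    rng.shuffle(order)
    return order[:n_masked]


# 419 placeholder IDs standing in for subjects with both GWAS and sequence data.
eligible = [f"ID{i:03d}" for i in range(419)]
masked = pick_masked_subjects(eligible)
print(len(masked), masked[:5])
```

Each selected individual would then be imputed in a separate run, with only that person's sequence data hidden.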
Distribution of family size (table; column header: No. of families)
Summary statistics of correlation and IQS comparing the imputation with dense markers and sparse markers (table; row labels: Masked individuals with GWA, Masked individuals with GWA in LE, Impute with –cluster option)
Finally, for the combined imputation method, we selected the best-guess genotypes for all SNPs with MAF above a cutoff (we tried 2 cutoffs, 0.01 and 0.05) and posterior probability of the best-guess genotype greater than 0.9, and used these genotypes as well as the GWAS SNPs as input for Merlin. Merlin automatically excluded from imputation any variant with Mendelian-inconsistent genotyping errors, but it is possible that Impute2 introduced Mendelian-consistent genotyping errors. However, the 2-stage and Merlin-only results were almost identical for variants with MAFs below the cutoff, which leads us to believe that these potential errors introduced by Impute2 did not negatively affect imputation quality in our sample.
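The selection rule for the second stage can be sketched as below. The data structures and the function name `select_confident_genotypes` are hypothetical, and the sketch assumes the 0.9 threshold is applied per genotype call, with calls below it treated as missing; the text does not say whether the filter was applied per call or per variant. It does not read real Impute2 output or write Merlin input files.

```python
def select_confident_genotypes(posteriors, maf, maf_cutoff=0.01, prob_cutoff=0.9):
    """Keep best-guess genotypes for variants above the MAF cutoff.

    posteriors: dict variant -> dict sample -> (P(AA), P(AB), P(BB)).
    maf: dict variant -> minor allele frequency.
    Returns dict variant -> dict sample -> genotype index (0, 1, 2), or None
    when the best-guess posterior probability is not above prob_cutoff.
    """
    selected = {}
    for variant, by_sample in posteriors.items():
        if maf[variant] <= maf_cutoff:
            continue  # variant is left for the family-based stage to impute
        calls = {}
        for sample, probs in by_sample.items():
            best = max(range(3), key=lambda g: probs[g])
            calls[sample] = best if probs[best] > prob_cutoff else None
        selected[variant] = calls
    return selected


# Toy posteriors for two hypothetical variants and two samples.
toy_posteriors = {
    "var_common": {"a": (0.95, 0.05, 0.0), "b": (0.50, 0.45, 0.05)},
    "var_rare": {"a": (0.99, 0.01, 0.0), "b": (0.99, 0.01, 0.0)},
}
toy_maf = {"var_common": 0.20, "var_rare": 0.005}
stage2_input = select_confident_genotypes(toy_posteriors, toy_maf)
print(stage2_input)
```

In the toy data only `var_common` passes the MAF cutoff of 0.01, and only sample `a` has a best-guess probability above 0.9.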
Table 3. Tabulation of genotypes used for IQS calculation
This statistic adjusts for agreement by chance by subtracting the expected cell counts along the diagonal (which indicates agreement) from the observed proportion of agreement. In cases where the expected agreement is high, such as with variants with low MAFs, the second term in the numerator is higher, thus lowering the Kappa statistic. Lin et al extended this idea to incorporate the uncertainty of imputation by using the posterior probabilities of all 3 genotypes instead of the best-guess genotype, thus allowing the cells in Table 3 to have noninteger values. Cohen’s Kappa and the IQS are equivalent when all cells in Table 3 are integers (ie, when all posterior probabilities are 0 or 1), but differ when there is uncertainty in the imputation. Consequently, IQS is useful for rare variants because, unlike concordance, it accounts for allele frequency and adjusts for chance agreement. Furthermore, IQS can be computed using dosages, which gives more information about imputation quality than best-guess genotypes. Lin et al have compared the performance of IQS and concordance for population-based imputations implemented in Impute2. The authors show that concordance increases with decreased MAF, whereas IQS drops as MAF decreases. The decreasing imputation quality with decreasing MAF is expected, as rare variants do not impute well, making IQS a better measure of imputation quality.
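As an illustration of the chance adjustment described above, the sketch below computes a kappa-style statistic of the form IQS = (Po − Pc) / (1 − Pc), where Po is the observed proportion of agreement and Pc the agreement expected by chance from the table margins. The 3×3 layout (rows for imputed genotypes, columns for true genotypes), the input format, and the function name `iqs` are assumptions; this is not the authors' code.

```python
def iqs(true_genotypes, posteriors):
    """Kappa-style imputation quality score for one variant.

    true_genotypes: list of true genotype indices (0=AA, 1=AB, 2=BB).
    posteriors: list of (P(AA), P(AB), P(BB)) tuples, one per individual.
    Each individual's posterior probabilities are spread over the imputed
    rows of the column for the true genotype, so cells may be noninteger.
    """
    n = len(true_genotypes)
    table = [[0.0] * 3 for _ in range(3)]  # rows: imputed, columns: true
    for g_true, probs in zip(true_genotypes, posteriors):
        for g_imp in range(3):
            table[g_imp][g_true] += probs[g_imp]
    p_obs = sum(table[g][g] for g in range(3)) / n
    row_totals = [sum(table[g]) for g in range(3)]
    col_totals = [sum(table[r][g] for r in range(3)) for g in range(3)]
    p_chance = sum(row_totals[g] * col_totals[g] for g in range(3)) / n ** 2
    if p_chance == 1.0:
        return float("nan")  # undefined when chance agreement is complete
    return (p_obs - p_chance) / (1.0 - p_chance)


# Rare variant: 9 individuals are AA and 1 is AB.
truth = [0] * 9 + [1]
always_aa = [(1.0, 0.0, 0.0)] * 10                 # calls everyone AA
perfect = [(1.0, 0.0, 0.0)] * 9 + [(0.0, 1.0, 0.0)]
print(iqs(truth, always_aa), iqs(truth, perfect))
```

Calling every individual AA gives 90 % concordance here but a score of 0, because the same agreement is expected by chance; only the perfect imputation reaches 1. This is why the adjusted score is more informative than concordance for rare variants.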
Summary of Imputation Quality by MAF (table; recovered method labels: Combined (0.01), Combined (0.05))
Our combined method with a MAF cutoff of 0.01 performed better than either Merlin or Impute2 alone for variants with MAFs between 0.01 and 0.4, and our combined method with a MAF cutoff of 0.05 performed better than either Merlin or Impute2 alone for variants with MAFs >0.05. Because the performance suffers below our MAF cutoffs, this suggests that we should not filter Impute2 results by MAF at all, but filter only by best-guess genotype probability.
One potential limitation of this study is that families with more sequence data were more likely to be selected in our set of 100 individuals. We would expect higher imputation accuracy in these families, as there were more individuals included in the reference panels for imputation. More work needs to be done to determine exactly how much the number and relationships of sequenced family members available affect imputation quality. This was beyond the scope of our project, but may be useful in helping investigators choose which family members to sequence.
It is unclear from these results whether the sequential nature of the imputation increases accuracy. In the future, we should compare our method to a method combining independent results from Merlin and Impute2, both based on best-guess genotype probability and Saad et al’s proposed vote strategy. Furthermore, future studies should be done on a larger region and larger sample size, and potentially include different probability thresholds for the Impute2 results.
Our 2-stage method with a MAF inclusion cutoff of 0.01 for Impute2 results achieved better IQSs than either Impute2 or Merlin alone, and similar correlation values, for variants with MAFs between 0.01 and 0.4. This method could be further improved by including all Impute2 imputed genotypes above a certain quality threshold regardless of MAF. Other probability thresholds should be tested, and this 2-stage method should be compared to results using Merlin and Impute2 independently to examine whether the sequential nature of the procedure increases accuracy above and beyond the increase obtained by combining population- and family-based methods.
The GAW19 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575. SL was supported by the National Institute of General Medicine grant T32 GM074905.
This article has been published as part of BMC Proceedings Volume 10 Supplement 7, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at http://bmcproc.biomedcentral.com/articles/supplements/volume-10-supplement-7. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.
All authors contributed to the design of the overall study. SL and XD conducted all analyses and drafted the manuscript. YZ, LAC, KLL, and CTL provided advice and critically revised the manuscript. All authors read and approved the final manuscript.
The authors declare they have no competing interests.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010;11(7):499–511.
- Li L, Li Y, Browning SR, Browning BL, Slater AJ, Kong X, et al. Performance of genotype imputation for rare variants identified in exons and flanking regions of genes. PLoS Genet. 2011;6(9):e24945.
- Saad M, Wijsman E. Combining family- and population-based imputation data for association analysis of rare and common variants in large pedigrees. Genet Epidemiol. 2014;38(7):579–90.
- Kreiner-Møller E, Medina-Gomez C, Uitterlinden A, Rivadeneira F, Estrada K. Improving accuracy of rare variant imputation with a two-step imputation approach. Eur J Hum Genet. 2015;23(3):395–400.
- 1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56.
- O’Connell J, Gurdasani D, Delaneau O, Pirastu N, Ulivi S, Cocca M, et al. A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet. 2014;10(4):e1004234.
- Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5(6):e1000529.
- Abecasis GR, Cherny SS, Cookson WO, Cardon LR. Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002;30(1):97–101.
- Burdick JT, Chen WM, Abecasis GR, Cheung VG. In silico method for inferring genotypes in pedigrees. Nat Genet. 2006;38(9):1002–4.
- Abecasis GR, Wigginton JE. Handling marker-marker linkage disequilibrium: pedigree analysis with clustered markers. Am J Hum Genet. 2005;77(5):754–67.
- Lin P, Hartz SM, Zhang Z, Saccone SF, Wang J, Tischfield JA, et al. A new statistic to evaluate imputation reliability. PLoS One. 2010;5(3):e9697.
- Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37–46.
- Asimit J, Zeggini E. Rare variant association analysis methods for complex traits. Annu Rev Genet. 2010;44:293–308.