Volume 3 Supplement 7
Assessment of genotype imputation methods
© Biernacka et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Several methods have been proposed to impute genotypes at untyped markers using observed genotypes and genetic data from a reference panel. We used the Genetic Analysis Workshop 16 rheumatoid arthritis case-control dataset to compare the performance of four of these imputation methods: IMPUTE, MACH, PLINK, and fastPHASE. We compared the methods' imputation error rates and the performance of association tests using the imputed data, both when imputing completely untyped markers and when imputing missing genotypes to combine two datasets genotyped at different sets of markers. As expected, all methods performed better for single-nucleotide polymorphisms (SNPs) in high linkage disequilibrium with genotyped SNPs. However, MACH and IMPUTE generated lower imputation error rates than fastPHASE and PLINK. Association tests based on allele "dosage" from MACH and tests based on the posterior probabilities from IMPUTE provided results closest to those based on complete data. However, in both situations, none of the imputation-based tests provided the same level of evidence of association as the complete data at SNPs strongly associated with disease.
Indirect association as a result of linkage disequilibrium (LD) is a key factor in genetic association studies. Because of LD, a disease-susceptibility single-nucleotide polymorphism (SNP) need not be genotyped, as long as it is tagged by a SNP or set of SNPs that are genotyped. This concept has been further exploited by the introduction of methods to impute missing genotypes at untyped markers, based on known genotypes at typed markers and information about LD within the region from a reference panel [1–4]. Such imputation methods can also be applied in the context of combining data across studies with different sets of correlated SNPs genotyped in different studies.
Two recent studies compared the imputation accuracy of several methods [5, 6]; however, these studies did not assess the performance of association tests based on the imputed genotypes. In this paper, we compare the performance of several imputation methods when combining two datasets that have been genotyped at different sets of markers or when completely missing (i.e., "untyped") markers are analyzed. Four commonly used software packages were evaluated: IMPUTE [2], MACH [4], PLINK [7], and fastPHASE [8]. Imputation error rates and performance of association tests using the imputed data were compared. The Genetic Analysis Workshop (GAW) 16 Problem 1 dataset provided by the North American Rheumatoid Arthritis Consortium (NARAC) was used.
The NARAC data consisted of 868 cases of rheumatoid arthritis (RA) and 1194 controls genotyped on the 550 k Illumina SNP chip. Four regions were selected on chromosome 1, each consisting of 30 consecutive SNPs, representing regions with disease association (PTPN22 [9, 10] and PADI4 [11, 12]) and without disease association, and with high or low LD. SNPs deviating from Hardy-Weinberg equilibrium (HWE) (p < 0.001) or with call rates below 95% were removed before analysis.
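The quality-control filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the text does not say which HWE test was used or in which subjects it was applied, so the sketch assumes a 1-df chi-square goodness-of-fit test on all supplied genotypes. The function names, the coding of genotypes as 0/1/2 allele counts, and the use of `None` for a missing call are all assumptions made for the example.

```python
import math

def call_rate(genotypes):
    """Fraction of non-missing genotypes; None marks a missing call (assumed coding)."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def hwe_chisq_pvalue(genotypes):
    """1-df chi-square goodness-of-fit test for HWE on 0/1/2 allele counts.

    Assumption: the paper does not name its HWE test; this is one common choice.
    """
    obs = [sum(1 for g in genotypes if g == k) for k in (0, 1, 2)]
    n = sum(obs)
    if n == 0:
        return 1.0  # nothing observed, nothing to test
    p = (obs[1] + 2 * obs[2]) / (2 * n)  # frequency of the counted allele
    if p == 0.0 or p == 1.0:
        return 1.0  # monomorphic SNP
    exp = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    stat = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2))

def passes_qc(genotypes, min_call_rate=0.95, hwe_alpha=0.001):
    """Keep a SNP if its call rate is at least 95% and its HWE p-value is at least 0.001."""
    return (call_rate(genotypes) >= min_call_rate
            and hwe_chisq_pvalue(genotypes) >= hwe_alpha)
```

A SNP whose genotype counts match Hardy-Weinberg proportions exactly passes, while a SNP with only heterozygotes, or with more than 5% missing calls, is dropped.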
Two scenarios were considered: 1) imputation of "untyped" markers and 2) imputation to combine two datasets.
In scenario 1, a set of genotyped SNPs was removed completely and subsequently imputed for all subjects. LD plots for the regions as well as a list of removed SNPs are provided by Fridley et al. in this volume [13]. For null regions 1 and 2, seven and eight SNPs were removed, respectively. For the PTPN22 region, two datasets were created with four SNPs excluded in addition to either the most strongly associated SNP (rs2476601) or the two SNPs flanking rs2476601. A similar approach was taken for the PADI4 region, with rs6683201 or the two SNPs flanking rs6683201 removed in addition to five other SNPs.
In scenario 2, to represent the combined analysis of data from two studies, cases and controls were randomly assigned to two study populations, resulting in 434 cases and 597 controls per group. Genotypes at 10 randomly selected SNPs from each region were removed for all individuals in the first group. A second, non-overlapping set of 10 random SNPs was deleted in the second group. Thus, in each region, 10 SNPs were genotyped in both cohorts, 10 were genotyped only in cohort 1 and imputed in cohort 2, and 10 were genotyped only in cohort 2 and imputed in cohort 1.
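The two-cohort design can be sketched as below. This is an illustrative reconstruction under stated assumptions rather than the code used in the study: the function names, the dictionary keys, and the fixed random seed are hypothetical, and the split is done separately within cases and within controls because the text reports equal numbers of each per group.

```python
import random

def split_evenly(ids, rng):
    """Shuffle the ids and cut them into two halves."""
    ids = list(ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]

def design_scenario2(case_ids, control_ids, snp_ids, n_masked=10, seed=0):
    """Assign subjects to two cohorts and pick disjoint SNP sets to mask in each.

    Hypothetical helper: names, return layout, and seed are illustration only.
    """
    rng = random.Random(seed)
    cases1, cases2 = split_evenly(case_ids, rng)
    ctrls1, ctrls2 = split_evenly(control_ids, rng)
    # Sampling 2 * n_masked SNPs without replacement and cutting the sample
    # in two guarantees that the two masked sets do not overlap.
    picked = rng.sample(list(snp_ids), 2 * n_masked)
    return {
        "cohort1": cases1 + ctrls1,
        "cohort2": cases2 + ctrls2,
        "masked_in_1": set(picked[:n_masked]),
        "masked_in_2": set(picked[n_masked:]),
    }
```

With 868 cases, 1194 controls, and a 30-SNP region, this yields 434 cases and 597 controls per cohort, 10 SNPs masked in each cohort, and 10 SNPs observed in both.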
Imputation was performed using IMPUTE v 0.4.1, MACH v 1.0.16, fastPHASE v 1.2.3, and PLINK v 0.99. Haplotypes of the 60 HapMap CEU founders were used as the reference data to run IMPUTE, MACH, and PLINK for scenarios 1 and 2, and to run fastPHASE for scenario 1. For fastPHASE, under scenario 2, only the samples from the NARAC data were used. Programs were run with default options, except that, to ensure convergence of MACH, each dataset was run with 150 iterations (the "--rounds 150" option). In addition, the "--dose" option was used with MACH. For imputation of untyped SNPs (scenario 1), the IMPUTE options "-exclude_SNPs file -impute_excluded" were used, while for imputation under scenario 2 the "-pgs" option was used. Full details of the commands used may be obtained from the authors on request.
Our assessment of error rates focused on the proportion of incorrect genotypes obtained by imputing the most likely genotype for each missing value, regardless of the confidence in the imputation. Associations were assessed assuming log-additive allelic effects on RA risk. p-Values were calculated using the complete data and each set of imputed data. In addition, for scenario 2, association analyses using the "non-missing data" (genotypes available for only one group) were performed. Association tests based on imputed data used "allele dose" from MACH (the estimated number of minor alleles, ranging from 0 to 2), the most likely genotypes imputed using fastPHASE and PLINK, and the posterior probabilities from IMPUTE. For IMPUTE, association tests were performed using the accompanying program SNPTEST, with the "-proper -frequentist 1" options.
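The quantities in this paragraph can be illustrated with a short sketch. Several assumptions apply: posterior probabilities are taken to be ordered by the number of minor alleles (0, 1, 2), the function names are hypothetical, and the association test shown is a simple score test of the log-additive logistic model, used here as a stand-in because the text does not state how the tests on MACH, fastPHASE, and PLINK output were computed. It is not a reimplementation of SNPTEST.

```python
import math

def most_likely_genotype(probs):
    """Index (0, 1 or 2 minor alleles) of the largest posterior probability."""
    return max(range(3), key=lambda g: probs[g])

def allele_dose(probs):
    """Expected number of minor alleles, between 0 and 2."""
    return probs[1] + 2.0 * probs[2]

def genotype_error_rate(true_genotypes, posteriors):
    """Proportion of best-guess genotypes that differ from the true genotype."""
    calls = [most_likely_genotype(p) for p in posteriors]
    wrong = sum(c != t for c, t in zip(calls, true_genotypes))
    return wrong / len(true_genotypes)

def additive_score_test(status, dose):
    """Score test of beta = 0 in logit P(case) = alpha + beta * dose.

    status: 0/1 case indicators; dose: genotype counts or fractional doses.
    Returns (chi-square statistic, p-value on 1 df).
    """
    n = len(status)
    ybar = sum(status) / n
    xbar = sum(dose) / n
    u = sum((y - ybar) * x for y, x in zip(status, dose))
    var = ybar * (1 - ybar) * sum((x - xbar) ** 2 for x in dose)
    if var == 0:
        return 0.0, 1.0  # no variation in dose or in status
    stat = u * u / var
    return stat, math.erfc(math.sqrt(stat / 2))
```

Because the score test accepts fractional doses, the same function can be applied to best-guess genotypes and to allele doses, which makes the two kinds of input directly comparable.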
Table: Mean error rates by imputation method and scenario. Rows cover Scenario 1 (imputation of untyped SNPs) and Scenario 2 (imputation to combine two datasets), each also broken down by maximum pairwise r2 (r2 < 0.5 and r2 ≥ 0.5).
Table: Mean (SD) difference in -log10(p-value) between a test of association using complete data and a test of association using the imputed data.
Comparison of p-values from association tests based on the original (complete) data with those based on the imputed data reveals that, for SNPs with small association p-values, the imputed-data p-value tends to be larger than the complete-data p-value, consistent with loss of power. This is especially evident at SNP rs2476601 in PTPN22, which is strongly associated with RA in the complete data. At this SNP, MACH and IMPUTE provided the strongest evidence of association when it was assumed that the SNP had not been genotyped at all (Figure 2), while IMPUTE yielded the smallest p-value when it was assumed that the SNP had been genotyped for half the subjects (Figure 3). In both situations, all imputation-based tests provided substantially less evidence of association than the complete data.
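The comparison of complete-data and imputed-data p-values can be summarized as a mean and standard deviation of differences on the -log10 scale. The sketch below assumes the difference is taken as complete minus imputed, so that positive values mean the imputed data gave weaker evidence; the sign convention and the function names are assumptions, not taken from the source.

```python
import math
import statistics

def neg_log10(p):
    """-log10 of a p-value; larger values mean stronger evidence."""
    return -math.log10(p)

def summarize_p_value_shift(complete_p, imputed_p):
    """Mean and sample SD of -log10(p_complete) minus -log10(p_imputed) across SNPs.

    Assumed sign convention: positive means the imputed data were less significant.
    """
    diffs = [neg_log10(c) - neg_log10(i) for c, i in zip(complete_p, imputed_p)]
    return statistics.mean(diffs), statistics.stdev(diffs)
```

For example, a SNP with a complete-data p-value of 1e-6 and an imputed-data p-value of 1e-4 contributes a difference of 2, that is, a hundredfold loss of significance.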
We compared the performance of four commonly used packages for imputation of missing genotype data as well as subsequent tests of association. A key disadvantage of fastPHASE is that it provides only the most likely genotype, while MACH provides an estimate of allele dose, and IMPUTE and PLINK provide estimates of posterior probabilities of all possible genotypes. In agreement with published studies [5, 6], when imputing the most likely genotype for each missing value, MACH and IMPUTE generated lower overall error rates than the other approaches. As expected, imputation was more accurate for SNPs in higher LD with genotyped SNPs. Our method of calculating the error rate did not take into account whether one or two of the alleles were incorrectly imputed. A measure of imputation accuracy that reflects the number of correctly imputed alleles, or uses the posterior probabilities of possible genotypes, could be considered.
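The alternative accuracy measures suggested here could take the following form. This is one possible formulation, offered as a sketch: the source proposes the idea but gives no formula, so the definitions and function names below are assumptions. Genotypes are coded as 0/1/2 minor-allele counts, so the absolute difference between the true and imputed codes counts the wrongly imputed alleles.

```python
def allelic_error_rate(true_genotypes, called_genotypes):
    """Fraction of alleles imputed incorrectly.

    A heterozygote called as a homozygote counts as one wrong allele;
    opposite homozygotes count as two.
    """
    wrong_alleles = sum(abs(t - c) for t, c in zip(true_genotypes, called_genotypes))
    return wrong_alleles / (2 * len(true_genotypes))

def expected_allelic_error(true_genotypes, posteriors):
    """Posterior-weighted version: expected fraction of wrong alleles.

    posteriors[i][g] is the probability of g minor alleles for genotype i.
    """
    total = 0.0
    for t, probs in zip(true_genotypes, posteriors):
        total += sum(p * abs(t - g) for g, p in enumerate(probs))
    return total / (2 * len(true_genotypes))
```

For true genotypes 0, 1, 2, 2 and imputed genotypes 0, 2, 0, 2, the genotype-level error rate is 0.5, whereas the allele-level rate is 3/8, since the second genotype has one wrong allele and the third has two.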
On average, association tests based on imputed data gave similar results to the test based on the complete ("unknown") data. However, at the strongest association peak, the imputation-based tests were much less significant than the complete-data test, indicating that using imputation methods followed by association testing can severely underestimate significance at association peaks. This finding may be partially due to the fact that the reference haplotypes used for imputation are representative of a population-based sample that is comparable to the control sample. Dense genotyping of a subset of cases and controls from a given study and use of the resulting haplotypes as the reference data may improve the power of association tests based on imputed data. Further investigation of such an approach is warranted. Although imputation-based tests can underestimate the significance at strongly associated SNPs, they can also lead to results more significant than tests for nearby markers that were genotyped and are indirectly associated with the trait. As with any imputation-based analysis, such results should be interpreted cautiously and the region should be further investigated.
All methods performed well for SNPs in high LD with genotyped SNPs. However, MACH and IMPUTE generated lower overall imputation error rates and more reliable association test results than fastPHASE and PLINK. Further investigation of the relative merits of using allele doses or posterior genotype probabilities is warranted. The fact that imputation-based tests can severely underestimate significance at strong association peaks warrants caution in using these methods to exclude SNPs from further follow-up.
List of abbreviations used
GAW: Genetic Analysis Workshop
NARAC: North American Rheumatoid Arthritis Consortium
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. Partial funding for this study was provided by the S.C. Johnson Genomics of Addiction Program at Mayo Clinic (JMB and RT) and by NIH grants HL87660 (JL and MdA) and R01 CA122443 (ELG).
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
1. Servin B, Stephens M: Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet. 2007, 3: e114. 10.1371/journal.pgen.0030114.
2. Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet. 2007, 39: 906-913. 10.1038/ng2088.
3. Nicolae DL: Testing untyped alleles (TUNA)-applications to genome-wide association studies. Genet Epidemiol. 2006, 30: 718-727. 10.1002/gepi.20182.
4. Li Y, Abecasis GR: Mach 1.0: rapid haplotype reconstruction and missing genotype inference [abstract 2290/C]. Am J Hum Genet. 2006, S79: 416.
5. Nothnagel M, Ellinghaus D, Schreiber S, Krawczak M, Franke A: A comprehensive evaluation of SNP genotype imputation. Hum Genet. 2009, 125: 163-171. 10.1007/s00439-008-0606-5.
6. Pei YF, Li J, Zhang L, Papasian CJ, Deng HW: Analyses and comparison of accuracy of different genotype imputation methods. PLoS ONE. 2008, 3: e3551. 10.1371/journal.pone.0003551.
7. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
8. Scheet P, Stephens M: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006, 78: 629-644. 10.1086/502802.
9. Carlton VE, Hu X, Chokkalingam AP, Schrodi SJ, Brandon R, Alexander HC, Chang M, Catanese JJ, Leong DU, Ardlie KG, Kastner DL, Seldin MF, Criswell LA, Gregersen PK, Beasley E, Thomson G, Amos CI, Begovich AB: PTPN22 genetic variation: evidence for multiple variants associated with rheumatoid arthritis. Am J Hum Genet. 2005, 77: 567-581. 10.1086/468189.
10. Begovich AB, Carlton VE, Honigberg LA, Schrodi SJ, Chokkalingam AP, Alexander HC, Ardlie KG, Huang Q, Smith AM, Spoerke JM, Conn MT, Chang M, Chang SY, Saiki RK, Catanese JJ, Leong DU, Garcia VE, McAllister LB, Jeffery DA, Lee AT, Batliwalla F, Remmers E, Criswell LA, Seldin MF, Kastner DL, Amos CI, Sninsky JJ, Gregersen PK: A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. Am J Hum Genet. 2004, 75: 330-337. 10.1086/422827.
11. Plenge RM, Padyukov L, Remmers EF, Purcell S, Lee AT, Karlson EW, Wolfe F, Kastner DL, Alfredsson L, Altshuler D, Gregersen PK, Klareskog L, Rioux JD: Replication of putative candidate-gene associations with rheumatoid arthritis in >4,000 samples from North America and Sweden: association of susceptibility with PTPN22, CTLA4, and PADI4. Am J Hum Genet. 2005, 77: 1044-1060. 10.1086/498651.
12. Worthington J, John S: Association of PADI4 and rheumatoid arthritis: a successful multidisciplinary approach. Trends Mol Med. 2003, 9: 405-407. 10.1016/j.molmed.2003.08.007.
13. Fridley B, McDonnell S, Rabe K, Tang R, Biernacka J, Rider D, Goode E: Single versus multiple imputation of genotypic data. BMC Proc. 2009, 3 (suppl 7): S7. 10.1186/1753-6561-3-s7-s7.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.