Single versus multiple imputation for genotypic data
© Fridley et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Volume 3 Supplement 7
Due to the growing need to combine data across multiple studies and to impute untyped markers based on a reference sample, several analytical tools for imputation and analysis of missing genotypes have been developed. Current imputation methods rely on single imputation, which ignores the variation in estimation due to imputation. An alternative to single imputation is multiple imputation. In this paper, we assess the variation in imputation by completing both single and multiple imputations of genotypic data using MACH, a commonly used hidden Markov model imputation method. Using data from the North American Rheumatoid Arthritis Consortium genome-wide study, the use of single and multiple imputation was assessed in four regions of chromosome 1 with varying levels of linkage disequilibrium and association signals. Two scenarios for missing genotypic data were assessed: imputation of untyped markers and combination of genotypic data from two studies. This limited study involving four regions indicates that, contrary to expectations, multiple imputations may not be necessary.
Due to the growing need to combine data across multiple studies, several analytical tools for imputation and analysis of missing genotypes have been developed and assessed [1–4]. These methods are particularly useful in the context of failed genotyping and combining data across multiple platforms, and recently have been extended to untyped markers using a reference sample [2–4]. Current imputation methods typically rely on single imputation (SI); however, SI ignores the variation in estimation due to the imputation. Therefore, one is unable to determine the variation in association results due to the imputation technique.
An alternative to SI is multiple imputation (MI), in which multiple imputed or "augmented" datasets are created and then analyzed using standard statistical methods and models [5, 6]. In this paper, we compare the use of SI and MI using the software MACH to impute genotype "dosage" between 0 and 2. In a companion Genetic Analysis Workshop (GAW) 16 analysis, we assessed four commonly used imputation packages (MACH, fastPHASE, IMPUTE, PLINK) and concluded that using MACH or IMPUTE led to the lowest imputation error rates, consistent with other reports that MACH and IMPUTE yield similar imputation accuracy [9, 10]. We chose to use MACH rather than IMPUTE for this comparison of SI versus MI because MACH required less memory to run, and we considered it to be more "user-friendly". The comparison of SI and MI was completed using the North American Rheumatoid Arthritis Consortium (NARAC) data. We examine the variation in imputation and its implications for association results.
Analyses under two scenarios were completed; for both scenarios, the "true" genotypes were known. Scenario I mimicked the situation in which completely untyped markers were imputed. In this scenario, a set of SNPs genotyped in the NARAC cohort was selected for removal based on various criteria (e.g., minor allele frequency (MAF), significance, LD), and these SNPs were then imputed in the entire cohort. For both "associated" regions (Figure 1C and 1D), two sets of SNPs were imputed, resulting in a total of six datasets for analysis (two for each associated region, one for each null region). The risk SNP was defined as the SNP with the strongest evidence of association (rs2476601 in PTPN22, rs6683201 in PADI4). In the first set, the risk SNP was imputed; in the second set, the two markers flanking the risk SNP were imputed.
Scenario II mimicked the situation in which two studies genotyped different sets of SNPs: 1/3 of the SNPs were genotyped only in Study I, 1/3 of the SNPs were genotyped only in Study II, and the remaining 1/3 of the SNPs were genotyped in both Study I and Study II. We created the two studies by randomly splitting the NARAC data, ensuring equal numbers of cases and controls in each study. Likewise, the SNPs were randomly chosen to be genotyped in Study I, Study II, or both studies.
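The construction of the two studies in Scenario II can be sketched as follows. This is a minimal illustration in Python, not the authors' code: the function name `split_studies`, the seed, and the reading of "equal numbers" as an even split of cases and of controls between the two studies are all assumptions made for the example.

```python
import random


def split_studies(case_ids, control_ids, snp_ids, seed=0):
    """Randomly split subjects into two studies and SNPs into three panels.

    Hypothetical helper: cases and controls are each divided evenly between
    Study I and Study II; one third of the SNPs is typed only in Study I,
    one third only in Study II, and the remainder in both.
    """
    rng = random.Random(seed)  # arbitrary illustrative seed

    def halve(ids):
        shuffled = list(ids)
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        return shuffled[:mid], shuffled[mid:]

    cases_1, cases_2 = halve(case_ids)
    controls_1, controls_2 = halve(control_ids)

    snps = list(snp_ids)
    rng.shuffle(snps)
    third = len(snps) // 3
    only_1 = set(snps[:third])
    only_2 = set(snps[third:2 * third])
    both = set(snps[2 * third:])

    study_1 = {"cases": cases_1, "controls": controls_1, "snps": only_1 | both}
    study_2 = {"cases": cases_2, "controls": controls_2, "snps": only_2 | both}
    return study_1, study_2
```

Genotypes for SNPs outside a study's `snps` set would then be masked in that study and treated as missing when the two studies are combined for imputation.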
Each of the four regions for both scenarios was analyzed five times using MACH version 1.0.16 with phased HapMap haplotypes for the 60 CEU founder participants as the reference haplotypes. For each region, 150 iterations were used to ensure convergence, and minor allele "dosage" (expected mean genotype) was imputed. The syntax used for running MACH was the following: mach1 -d region.dat -p region.ped -h region.haplos -s region.snps --rounds 150 --greedy --geno --dosage --quality --mask 0.02 --seed 487 > mach.out.
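Repeating the imputation five times amounts to issuing the same MACH command several times. The sketch below builds one argument list per run using only the options quoted above. It assumes that the runs differ in their random seed, which the text does not state; the helper name `mach_command` and the seed values other than 487 are hypothetical.

```python
def mach_command(region, seed, rounds=150):
    """Return the mach1 argument list for one imputation run of a region.

    Only options that appear in the command quoted in the text are used.
    Varying the seed between runs is an assumption of this sketch.
    """
    return [
        "mach1",
        "-d", f"{region}.dat",
        "-p", f"{region}.ped",
        "-h", f"{region}.haplos",
        "-s", f"{region}.snps",
        "--rounds", str(rounds),
        "--greedy", "--geno", "--dosage", "--quality",
        "--mask", "0.02",
        "--seed", str(seed),
    ]


# Hypothetical seeds for five runs; only 487 appears in the text.
for run_seed in (487, 488, 489, 490, 491):
    print(" ".join(mach_command("region", run_seed)))
```

Each argument list can be handed to `subprocess.run`, with standard output captured to a file in place of the `> mach.out` redirect.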
Associations between SNP genotypes and rheumatoid arthritis (RA) risk were then assessed using logistic regression to estimate odds ratios (ORs), 95% confidence intervals (CIs), and p-values. Tests for association assumed an ordinal (log-additive) genotypic effect on RA risk. Inference for parameters from multiple imputations was completed as follows: let θ represent the parameter of interest, and K represent the number of imputed datasets (e.g., K = 5). The overall point estimate of θ, $\bar{\theta}$, is the mean of the K point estimates based on the imputed datasets. The estimated variance of $\bar{\theta}$ is defined as $T = W + (1 + 1/K)B$, where W and B represent the within- and between-imputation variation. Inference for θ is then based on the t-distribution with $df = (K - 1)\left(1 + \frac{K}{K + 1}\frac{W}{B}\right)^2$.
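The pooling step can be illustrated with a short Python sketch of Rubin's standard combining rules, in which the total variance is T = W + (1 + 1/K)B and the degrees of freedom are (K - 1)(1 + W / ((1 + 1/K)B))^2. The function name `pool_estimates` and the input values in the usage lines are invented for illustration and are not results from the NARAC analysis.

```python
import math
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Combine K per-imputation estimates with Rubin's rules.

    estimates: K point estimates (e.g., log odds ratios), one per imputed dataset
    variances: K squared standard errors of those estimates
    Returns (pooled estimate, total variance T, degrees of freedom).
    """
    k = len(estimates)
    if k < 2 or len(variances) != k:
        raise ValueError("need at least two imputations and matching variances")

    theta_bar = mean(estimates)   # overall point estimate
    w = mean(variances)           # within-imputation variance W
    b = variance(estimates)       # between-imputation variance B (divisor K - 1)
    total = w + (1 + 1 / k) * b   # T = W + (1 + 1/K) B

    if b == 0:
        df = math.inf             # no between-imputation variation
    else:
        df = (k - 1) * (1 + w / ((1 + 1 / k) * b)) ** 2
    return theta_bar, total, df


# Illustrative values only (K = 5)
theta_bar, total, df = pool_estimates(
    [0.50, 0.52, 0.48, 0.51, 0.49], [0.01] * 5
)
print(theta_bar, total, df)
```

When B is small relative to W, as in this toy input, T is close to W and the degrees of freedom are large, so the pooled inference is nearly identical to that from a single imputed dataset.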
Association test results were very similar between SI and MI. For Scenario I, the median difference in -log10(p-values) was 0.005 with an interquartile range (IQR) of 0.077, while for Scenario II, the median difference was 0.002 with an IQR of 0.044. Scenario I had slightly greater variation in p-values between SI and MI than Scenario II. Next, we evaluated the variation in imputed genotypes between two imputation runs (run 1 and run 2), summarized by SNP and by subject, for Scenarios I and II. The median difference in imputed genotypes, summarized by SNP, was 0.0002 (IQR = 0.002) and 0.0003 (IQR = 0.001) for Scenarios I and II, respectively. The median difference in imputed genotypes, summarized by subject, was 0.0005 (IQR = 0.011) and 0.0003 (IQR = 0.007) for Scenarios I and II, respectively.
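A summary of this kind can be computed as in the following Python sketch. It assumes that the difference is the signed per-SNP difference in -log10(p) between SI and MI and that the IQR is reported as the width Q3 - Q1; the text does not specify either choice, and the quartile method is simply the default of Python's `statistics.quantiles`. The function name and input values are illustrative.

```python
import math
from statistics import median, quantiles


def summarize_log10p_difference(p_si, p_mi):
    """Median and IQR width of per-SNP differences in -log10(p), SI minus MI."""
    if len(p_si) != len(p_mi):
        raise ValueError("p-value lists must have the same length")
    diffs = [-math.log10(a) + math.log10(b) for a, b in zip(p_si, p_mi)]
    q1, _, q3 = quantiles(diffs, n=4)  # quartiles, default 'exclusive' method
    return median(diffs), q3 - q1


# Illustrative p-values only, not taken from the NARAC analysis
med, iqr = summarize_log10p_difference(
    [0.012, 0.40, 0.0031, 0.77], [0.013, 0.41, 0.0030, 0.75]
)
print(med, iqr)
```

The same pattern applies to the comparison of imputed dosages between two runs, with the per-SNP or per-subject differences in dosage in place of the differences in -log10(p).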
We have demonstrated the use of SI and MI for the imputation of missing genotypes or untyped markers using a reference panel. In doing so, we utilized MACH, a common method that relies on LD and haplotype estimation via a hidden Markov model. A companion GAW16 paper assessed four commonly used imputation packages and concluded that using MACH or IMPUTE led to lower imputation error rates than using fastPHASE or PLINK. Care should be taken to select the most appropriate imputation method as well as to determine whether to use SI or MI.
Another consideration in deciding whether to employ MI is computation time. For the analyses presented, MACH was run on a Beowulf-style Linux cluster with compute nodes running CentOS 4.3 Linux x86-64, allowing 8-16 GB of memory per job. Scenario I run-times (single imputation) ranged from 13 to 18 minutes. However, when MACH used the raw genotype data for the reference samples instead of the phased haplotypes, the run-time increased to more than 30 days [mach1 -d regionPool.dat -p regionPool.ped --rounds 150 --compact --geno --dosage --quality --mask 0.02 --seed 1776 > mach.out]. In contrast, the run-times for Scenario II, based on phased haplotypes, were around 10 minutes, with little variation between the four regions.
Ignoring variation due to imputation results in underestimation of the variance of the parameter estimate, and hence an inflated type I error rate. For imputation of untyped markers (Scenario I), we observed larger variation in results than for imputation of missing genotypes (Scenario II). For Scenario II, we observed small differences in association results based on SI or MI, especially in regions of higher LD. In genome-wide association studies in which SI is often implemented for over two million markers, one appropriate approach is to use SI as the initial analysis and employ MI for any regions of interest detected with SI to assess variation due to imputation. On the basis of this study involving four regions, single imputation is reasonable, especially in regions of high LD where imputed genotype "dosage" is used in the analysis.
GAW: Genetic Analysis Workshop
MAF: Minor allele frequency
NARAC: North American Rheumatoid Arthritis Consortium
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. This work was supported in part by NIH grant R01 CA122443.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.