Volume 3 Supplement 7
Application of imputation methods to the analysis of rheumatoid arthritis data in genome-wide association studies
© Childers et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Most genetic association studies genotype only a small proportion of cataloged single-nucleotide polymorphisms (SNPs) in regions of interest. With catalogs of high-density SNP data (e.g., HapMap) available to researchers today, it has become possible to impute genotypes at untyped SNPs. This in turn allows us to test those untyped SNPs, with the aim of increasing power in association studies. Several imputation methods and corresponding software packages have been developed for this purpose. The objective of our study is to apply three widely used imputation methods and corresponding software packages to data from a genome-wide association study of rheumatoid arthritis from the North American Rheumatoid Arthritis Consortium in Genetic Analysis Workshop 16, to compare the performances of the three methods, to evaluate their strengths and weaknesses, and to identify additional susceptibility loci underlying rheumatoid arthritis. The software packages used in this paper were a program for Bayesian imputation-based association mapping (BIMBAM), a program for imputing unobserved genotypes in case-control association studies (IMPUTE), and a program for testing untyped alleles (TUNA). We found some untyped SNPs that showed significant association with rheumatoid arthritis. A few of these were not located near any typed SNP that was found to be significant and thus may be worth further investigation.
Advances in the understanding of a disease's pathogenesis often lead to improvements in strategy for the prevention, diagnosis, and/or treatment of the disease. Moreover, studies have shown that genetic factors play an important role in the pathogenesis of many complex human diseases. Therefore, improving public health and preventing disease provide sufficient motivation for dissecting the genetic etiology of complex human diseases. The genome-wide association study (GWAS) may be seen as a first step towards such dissections and has drawn considerable attention (with some success) in recent years.
Indeed, many GWAS have identified at least one candidate gene that, considering its biological properties, seems likely to have an effect on the disease. In a typical GWAS, a large number of population samples of cases and controls are genotyped at hundreds of thousands of single-nucleotide polymorphisms (SNPs). However, even at these numbers, the SNPs that are genotyped in a GWAS account for only a small proportion of cataloged SNPs. In particular, it is likely that disease susceptibility variants are not directly assayed. With the availability of a high-density panel of SNPs such as that from HapMap, it is possible to gain additional power by testing untyped SNPs based on data at the genotyped SNPs. Testing untyped SNPs can facilitate the selection of SNPs to be genotyped in follow-up studies and can allow for comparison of findings or joint analysis of data from different studies that use different SNP panels and genotyping platforms.
Several methods have recently been developed, and corresponding software packages implemented, to test untyped SNPs [3–5]. Although these methods differ in the specific strategies used to impute genotypes at untyped SNPs, they generally follow three steps. In the first step, linkage disequilibrium (LD) patterns are dissected and/or haplotypes and their frequencies are inferred from genotypes of reference samples, such as genotypes from the HapMap project. In the second step, genotypes at untyped SNPs are imputed based on genotypes in the observed data and their correlation with typed SNPs in the reference samples. In the final step, association tests are performed on all typed and untyped SNPs. In this paper, we selected three software packages based on imputation methods, namely Bayesian imputation-based association mapping (BIMBAM), imputing unobserved genotypes in case-control association studies (IMPUTE), and testing untyped alleles (TUNA), to analyze data from a GWAS of rheumatoid arthritis (RA) from the North American Rheumatoid Arthritis Consortium (NARAC) provided to Genetic Analysis Workshop 16 (GAW16). These software packages were selected because they are publicly available and can readily perform imputations and association tests on a genome-wide scale. We report our findings, compare the performances of the three programs, and discuss their advantages and disadvantages.
The case-control data were obtained from the NARAC data set provided for GAW16. After duplicated and contaminated samples were removed, the data contain genotypes of 868 cases and 1,194 controls at 545,080 SNPs. Because the three software packages were implemented for autosomes, only SNPs from the 22 autosomes were used. SNPs with minor allele frequency (MAF) less than 0.01 and SNPs with a p-value less than 0.0001 in the Hardy-Weinberg equilibrium test in controls were removed. A total of 515,050 SNPs remained in our analysis. The Phase II genotype data of 60 CEU samples from the HapMap project (http://www.hapmap.org/) were downloaded and used as reference data to impute genotypes at untyped SNPs.
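The two SNP filters described above can be illustrated with a minimal sketch. It assumes genotype counts are already tabulated per SNP, and it uses a 1-df Pearson chi-square test for Hardy-Weinberg equilibrium; the text does not say which HWE test (chi-square or exact) was used or in which samples MAF was computed, so the function names, the choice of test, and the use of all samples for MAF are illustrative assumptions.

```python
import math


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB); assumes n > 0."""
    n = n_aa + n_ab + n_bb
    freq_a = (2 * n_aa + n_ab) / (2 * n)
    return min(freq_a, 1 - freq_a)


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df Pearson chi-square test of Hardy-Weinberg proportions.

    Assumption: the source does not state whether a chi-square or an exact
    test was used; the chi-square form is shown only for illustration.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: departure from HWE cannot be tested
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))


def passes_qc(all_counts, control_counts, maf_min=0.01, hwe_min=1e-4):
    """Keep a SNP if MAF >= 0.01 and the HWE p-value in controls >= 0.0001."""
    return (minor_allele_frequency(*all_counts) >= maf_min
            and hwe_chi2_pvalue(*control_counts) >= hwe_min)
```

For example, `passes_qc((81, 18, 1), (81, 18, 1))` keeps the SNP, while a SNP with control counts `(50, 0, 50)` (no heterozygotes at all) is rejected by the HWE filter.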
BIMBAM uses the methods implemented in fastPHASE to impute the genotypes at untyped SNPs. The Bayes factors (BFs) are computed under linear or logistic regression of phenotypes on genotypes. Specifically, for binary (0/1) phenotypes, the BFs are computed under a logistic regression model, $\operatorname{logit}(\Pr(Y_i = 1)) = \log(\Pr(Y_i = 1)/\Pr(Y_i = 0)) = \mu + a X_i + d\, I(X_i = 1)$, where $Y_i$ denotes the phenotype for individual $i$, $X_i$ denotes the genotype for individual $i$ (coded as 0, 1, or 2), $a$ denotes the additive effect, and $d$ denotes the dominance effect. The BFs are computed under the same priors for $\mu$, $a$, and $d$ as in prior D2.
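To make the genotype coding in the logistic model concrete, the following minimal sketch evaluates the case probability for each genotype. It does not compute Bayes factors and is not BIMBAM code; the function name and the parameter values in the usage note are hypothetical and chosen only for illustration.

```python
import math


def case_probability(genotype, mu, a, d):
    """Pr(Y = 1 | X = genotype) under logit = mu + a*X + d*I(X == 1).

    genotype is coded 0, 1, or 2; a is the additive effect and d the
    dominance effect, which applies to heterozygotes only.
    """
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2")
    eta = mu + a * genotype + d * (1 if genotype == 1 else 0)
    return 1.0 / (1.0 + math.exp(-eta))
```

With hypothetical values `mu = -2.0`, `a = 0.4`, `d = 0.0`, the log-odds increase by 0.4 per copy of the coded allele; a nonzero `d` shifts the heterozygote away from the midpoint of the two homozygotes on the log-odds scale.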
The computer program IMPUTE uses a hidden Markov model to determine the genotype probabilities for each individual in the study at untyped SNPs that are available in reference samples. The Cochran-Armitage trend test for association is then applied to the resulting file using the software SNPTEST. A key feature of IMPUTE is that it can use genotype probabilities rather than deterministic genotypes. The test of additive association was used in our analysis.
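As a rough illustration of how genotype probabilities can feed an additive (trend) test, the sketch below converts each individual's three genotype probabilities into an expected allele dosage and then computes the Cochran-Armitage statistic in its correlation form, N times the squared correlation between dosage and case status. Using the expected dosage is a simplifying assumption made here; it is not a description of how SNPTEST handles genotype uncertainty, and the function names are hypothetical.

```python
import math


def expected_dosage(p_aa, p_ab, p_bb):
    """Expected count of allele B given the three genotype probabilities."""
    return p_ab + 2.0 * p_bb


def trend_test(dosages, phenotypes):
    """Cochran-Armitage trend statistic (N * r^2) and its 1-df p-value.

    dosages: one genotype score (or expected dosage) per individual.
    phenotypes: 1 for cases, 0 for controls, in the same order.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    if sxx == 0 or syy == 0:
        return 0.0, 1.0  # no variation in genotype or phenotype
    stat = n * sxy ** 2 / (sxx * syy)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))
```

With hard genotype calls (dosages exactly 0, 1, or 2) this reduces to the usual trend test with scores 0, 1, 2.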
The imputation-based analysis implemented in TUNA uses a multi-locus measure of LD, similar in interpretation to $r^2$, to determine the best set of genotyped markers that can be used for estimating the genotype frequencies of an untyped SNP in cases and controls separately. The statistical test for association implemented in TUNA aims to find differences in the allele frequencies between cases and controls. The test statistic has an asymptotic chi-square distribution with one degree of freedom under the null hypothesis.
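The general shape of a 1-df test for a difference in allele frequencies between cases and controls can be sketched as follows. This is the textbook two-proportion test for directly observed alleles, shown only as an analogy: it is an assumption of this sketch, not TUNA's published statistic, which works with frequencies estimated from the selected typed markers and may use a different variance. The function name is hypothetical.

```python
import math


def allelic_test(freq_cases, n_cases, freq_controls, n_controls):
    """1-df chi-square test for a difference in allele frequency.

    freq_*: allele frequency in each group; n_*: number of individuals,
    so each group contributes 2 * n alleles.
    """
    alleles_cases = 2 * n_cases
    alleles_controls = 2 * n_controls
    pooled = ((freq_cases * alleles_cases + freq_controls * alleles_controls)
              / (alleles_cases + alleles_controls))
    variance = pooled * (1 - pooled) * (1 / alleles_cases + 1 / alleles_controls)
    if variance == 0:
        return 0.0, 1.0  # allele absent or fixed in both groups
    stat = (freq_cases - freq_controls) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))
```

For instance, with the sample sizes of this study (868 cases, 1,194 controls) and hypothetical frequencies of 0.20 in cases and 0.15 in controls, `allelic_test(0.20, 868, 0.15, 1194)` returns a statistic well above the 1-df 5% critical value of 3.84.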
Simulations were performed based on haplotypes and their frequencies of the CHI3L2 gene from CEU and YRI samples from the HapMap. After removing SNPs with MAF less than 0.05, 25 and 17 SNPs remained for the CEU and YRI samples, respectively. One SNP with MAF of about 0.15 was selected as the disease SNP, and 400 cases and 400 controls were generated based on a genotypic relative risk of 1.0 (to assess the type I error) or 1.5 (to assess power) and a prevalence of 0.10. Genotypes at the disease SNP and at half of the SNPs were removed to test the performance of the imputation methods. The Phase II data of the 60 CEU samples were used as the reference, which allowed us to investigate the performance of the imputation methods when the LD pattern was misspecified (that is, for data simulated from YRI haplotypes). The simulation procedure was repeated 1,000 times.
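The disease-model part of this simulation can be sketched at a single SNP. The sketch below makes several simplifying assumptions that are not stated in the text: Hardy-Weinberg proportions at the disease SNP, a multiplicative model in which each risk allele multiplies the risk by the genotypic relative risk, and direct sampling of genotypes rather than of HapMap haplotypes. All function names are hypothetical.

```python
import random


def genotype_distributions(maf, grr, prevalence):
    """Genotype probabilities (0, 1, 2 risk alleles) in cases and controls.

    Assumes Hardy-Weinberg proportions and a multiplicative risk model.
    """
    q = maf
    hwe = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]
    relative_risk = [grr ** k for k in range(3)]
    # Baseline penetrance chosen so that the overall prevalence matches.
    f0 = prevalence / sum(g * r for g, r in zip(hwe, relative_risk))
    penetrance = [f0 * r for r in relative_risk]
    cases = [g * f / prevalence for g, f in zip(hwe, penetrance)]
    controls = [g * (1 - f) / (1 - prevalence) for g, f in zip(hwe, penetrance)]
    return cases, controls


def simulate_case_control(maf=0.15, grr=1.5, prevalence=0.10,
                          n_cases=400, n_controls=400, seed=0):
    """Draw disease-SNP genotypes for one simulated case-control replicate."""
    rng = random.Random(seed)
    case_dist, control_dist = genotype_distributions(maf, grr, prevalence)
    cases = rng.choices([0, 1, 2], weights=case_dist, k=n_cases)
    controls = rng.choices([0, 1, 2], weights=control_dist, k=n_controls)
    return cases, controls
```

Setting `grr=1.0` gives identical genotype distributions in cases and controls, which corresponds to the null scenario used to assess the type I error; repeating the draw with different seeds mimics the replicated design.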
Table: Results from the analysis of rheumatoid arthritis data. Columns: number of SNPs; number of significant SNPs. Row groups: results before significant SNPs on chromosome 6 were removed; results after significant SNPs on chromosome 6 were removed.
Table: A list of significant untyped SNPs that are not located within 2 Mb of any significant typed SNP. Column: log10(BF) or -log10(p) (rank).
Table: Type I error, power, and difference between MAF from imputed and simulated genotypes. Listed values: 9.6 × 10^-4, 6.2 × 10^-4, 2.43 × 10^-2, 2.20 × 10^-2.
In this paper, we applied three imputation methods, BIMBAM, IMPUTE, and TUNA, to data from a GWAS of RA. All of these methods identified some untyped SNPs that showed significant association with RA. A few of these are not located near any significant typed SNP. This provides some reason for selecting them in follow-up studies to identify novel genes that are associated with RA.
Many imputation-based methods have recently been proposed; thus, it is necessary to compare their strengths and weaknesses. Indeed, each of these methods has certain advantages over the others. For one, TUNA is computationally efficient because it uses only a small number of typed SNPs to estimate the genotype frequencies of the untyped SNPs. Many of the untyped SNPs were not tested by TUNA, which further increases its efficiency: TUNA took only about 12 hours to finish all computations. However, TUNA may miss significant SNPs. BIMBAM uses Bayesian methods, runs efficiently, and allows for the input and output of zipped files rather than large text files. The use of BFs has some advantages over the standard p-value approach, but the results are not easily comparable with p-value-based measures of significance. IMPUTE has the advantage of allowing the user to decide which test is performed and to use imputed genotype probabilities. However, IMPUTE took more than 400 hours to complete our computations. Such a heavy computational demand stands as a roadblock to its general application in GWAS.
Using simulated data, we evaluated the performance of the three methods in terms of the accuracy of imputed genotypes and the power and type I error rate of the subsequent association testing. Our results indicated that when an inappropriate reference sample is used, the power may decrease but the type I error rate is maintained. However, when an appropriate reference sample is used, the genotypes at untyped SNPs were accurately imputed and the power was increased. Thus, in this case, it may be desirable to use these imputation methods. On the other hand, in the analysis of real data, we found that some untyped SNPs were identified as significant with one method but not with any of the other methods. Therefore, results for untyped SNPs from these imputation methods must be used with caution.
List of abbreviations used
BIMBAM: Bayesian imputation-based association mapping
GAW16: Genetic Analysis Workshop 16
GWAS: Genome-wide association study
IMPUTE: Imputing unobserved genotypes in case-control association studies
MAF: Minor allele frequency
MHC: Major histocompatibility complex
NARAC: North American Rheumatoid Arthritis Consortium
TUNA: Testing untyped alleles
This research was supported by grants R01 GM073766 (GG, GK), R01-GM74913 (KZ), R01 GM081488 (NL) from the National Institute of General Medical Sciences and grant T32 HL072757 (DKC) from the National Heart, Lung, and Blood Institute. The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
- Yeager M, Orr N, Hayes RB, Jacobs KB, Kraft P, Wacholder S, Minichiello MJ, Fearnhead P, Yu K, Chatterjee N, Wang Z, Welch R, Staats BJ, Calle EE, Feigelson HS, Thun MJ, Rodriguez C, Albanes D, Virtamo J, Weinstein S, Schumacher FR, Giovannucci E, Willett WC, Cancel-Tassin G, Cussenot O, Valeri A, Andriole GL, Gelmann EP, Tucker M, Gerhard DS, Fraumeni JF, Hoover R, Hunter DJ, Chanock SJ, Thomas G: Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat Genet. 2007, 39: 645-649. 10.1038/ng2022.
- The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449: 851-861. 10.1038/nature06258.
- Marchini J, Howie B: Comparing algorithms for genotype imputation. Am J Hum Genet. 2008, 83: 535-539. 10.1016/j.ajhg.2008.09.007.
- Nicolae DL: Testing untyped alleles (TUNA)-applications to genome-wide association studies. Genet Epidemiol. 2006, 30: 718-727. 10.1002/gepi.20182.
- Servin B, Stephens M: Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet. 2007, 3: e114. 10.1371/journal.pgen.0030114.
- BIMBAM. [http://stephenslab.uchicago.edu/software.html]
- IMPUTE. [http://www.stats.ox.ac.uk/~marchini/software/gwas/impute.html]
- TUNA. [http://www.stat.uchicago.edu/~wen/tuna/]
- Maystadt I, Rezsöhazy R, Barkats M, Duque S, Vannuffel P, Remacle S, Lambert B, Najimi M, Sokal E, Munnich A, Viollet L, Verellen-Dumoulin C: The nuclear factor kappaB-activator gene PLEKHG5 is mutated in a form of autosomal recessive lower motor neuron disease with childhood onset. Am J Hum Genet. 2007, 81: 67-76. 10.1086/518900.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.