Volume 5 Supplement 9
Assessing the impact of missing genotype data in rare variant association analysis
© Mägi et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Human genome resequencing technologies are becoming ever more affordable and provide a valuable source of data about rare genetic variants in the human genome. Such rare variation may play an important role in explaining the missing heritability of complex human traits. We implement an existing method for analyzing rare variants by testing for association with the mutational load across genes. In this study, we use simulated data from the Genetic Analysis Workshop 17 to assess the power of this approach to detect association with simulated quantitative and dichotomous phenotypes and to evaluate the impact of missing genotypes on the power of the analysis. According to our results, the mutational-load-based method for rare variant analysis is relatively robust to call rate and is adequately powered for genome-wide association analysis.
The success of genome-wide association studies (GWAS) in identifying novel loci that contribute to complex human traits has been well publicized. However, despite these successes, much of the genetic component of these traits remains unexplained. Most genotyping products that are used in GWAS have been designed to capture common human genetic variation, and with ever increasing sample sizes and meta-analysis, we might expect to identify associations with common variants with ever decreasing effect size. However, it seems unlikely that the common disease/common variant model will entirely explain the missing heritability of complex traits. One largely unexplored paradigm that may contribute to this unexplained genetic component is a model of multiple rare causal variants, defined here as those having a minor allele frequency (MAF) less than 5%, each with modest effect but residing within the same gene. Such an association has recently been identified between rare variants in the IFIH1 gene and type 1 diabetes.
Until recently, the availability of data appropriate for rare variant association analysis has been extremely limited. However, with improvements in the efficiency of deep resequencing technologies, discovery and analysis of rare variants is becoming increasingly cost-effective in large disease- or population-based cohorts at the level of specific genes or even exome-wide. Furthermore, large-scale whole-genome resequencing efforts, such as the 1000 Genomes Project, continue to make their data available to the research community. These resources are likely to provide near complete catalogs of low-frequency genetic variation and of many other rarer variants in a variety of populations across ethnic groups. These data can provide deep and high-density reference panels, potentially allowing for imputation of rare variants that are not typically directly assayed or otherwise captured by genotyping products in GWAS.
One common approach to the joint analysis of rare variants within the same gene is to focus on their mutational load, searching for accumulations of minor alleles across individuals with the same or similar phenotype [6, 7]. Simulation studies have demonstrated that such an approach has much greater power to detect rare variant associations than traditional single-SNP analyses [6, 7]. However, these studies typically assume no genotyping or sequencing failures, which, particularly for rare variants, may affect the results of downstream analysis. In this study, we undertake simulations to assess the effects of missing genotype data on rare variant association analysis. We use the simulated data from Genetic Analysis Workshop 17 (GAW17), which includes genotypes at exonic rare variants within a subset of genes across the genome, generated from the 1000 Genomes Project. We make use of a simple model of random missing genotypes and evaluate the effect of failure rate on the power of mutational load rare variant association with quantitative and dichotomous traits. Analysis of pilot data from the 1000 Genomes Project shows that the mutant (rare) allele is more difficult to call than the reference (common) allele. To mimic this allele-specific failure rate, we have incorporated into our analysis a more complex model of missing data in which the call rate is determined by the underlying genotype.
Rare variant mutational load analysis
The method models the phenotype of each individual in a generalized linear modeling framework as a function of the individual's mutational load across the rare variants in a gene. In this model, g is the link function, x_i denotes a vector of covariate measurements for the ith individual with corresponding regression coefficients β, and the parameter λ is the expected increase in the phenotype for an individual carrying a full complement of minor alleles at rare variants compared to an individual carrying none. We thus construct a likelihood ratio test of association of the mutational load of rare variants with the phenotype by comparing the maximized likelihoods of two models by means of analysis of deviance: (1) the null model, where λ = 0, and (2) the alternative model, where λ is unconstrained. The contribution of the ith individual to the likelihood is weighted by n_i to allow for differential call rates between samples.
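The model described in this section can be written out explicitly. The following is a reconstruction from the definitions given in the text, not the authors' displayed equation. The intercept α, the count r_i of rare variants at which individual i carries at least one minor allele, and the reading of n_i as the number of rare variants successfully genotyped in individual i are assumptions, chosen to match the general form of the mutational load approach in [7].

```latex
% Assumed form of the mutational load model (reconstruction, see lead-in)
g\left(\mathrm{E}[y_i]\right) = \alpha + \lambda \, \frac{r_i}{n_i} + \boldsymbol{\beta}^{\top} \mathbf{x}_i

% Likelihood ratio (deviance) statistic comparing the alternative model
% (lambda free) with the null model (lambda = 0)
\Lambda = 2 \left[ \ln L\left(\hat{\alpha}, \hat{\lambda}, \hat{\boldsymbol{\beta}}\right)
        - \ln L\left(\tilde{\alpha}, \lambda = 0, \tilde{\boldsymbol{\beta}}\right) \right]
```

Under the null model, a statistic of this kind is usually referred to a chi-square distribution with one degree of freedom, because the two models differ only in the single parameter λ.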
The described method has been implemented using the GRANVIL software, which is freely available for download from http://www.well.ox.ac.uk/GRANVIL. The software can be applied to quantitative traits and dichotomous phenotypes and can adjust for potential confounders as covariates. Users must provide a list of genes, with start and stop positions, together with a map file for variant locations.
The data provided by GAW17 contain genotype data for 697 individuals from the 1000 Genomes Project. Individuals were chosen from different populations of European, Asian, and African origin. Overall, 24,487 variants from 3,205 gene regions were provided, with MAFs in the range 0.07–16.6%. Three normally distributed quantitative traits and a dichotomous disease phenotype were simulated for each individual on the basis of their genotype data. Q1 and Q2 phenotypes were determined by genotypes in 9 and 13 genes, respectively. Q4 was not determined by any variants among the genes provided. Disease liability was generated using a function of Q1, Q2, and Q4 phenotypes in addition to variants in a further 15 genes. Two hundred replicates of data were simulated, each on the basis of the same underlying genotypes and each stored in a separate phenotype file. Full details of the GAW17 data and the simulation approach used to generate the phenotype data are reported elsewhere.
We make use of the simulated GAW17 data to investigate the type I error rate and power of GRANVIL to detect association with quantitative traits Q1, Q2, and Q4 and the dichotomous disease (CC) phenotype. We consider rare variants to have MAF < 5%. GRANVIL gives equal weight to all these rare variants in the gene, irrespective of their potential functional role. We therefore performed two analyses of each replicate of phenotype data: (1) including all rare variants, irrespective of function; and (2) restricting rare variants to those that are nonsynonymous. We used GRANVIL to test for association with the mutational load in each gene containing at least two rare variants. Phenotype data for individual NA07347 were excluded because of extreme deviation from the mean in most replicates. For each analysis, all phenotypes were adjusted for sex, age, and smoking status. GAW17 individuals were ascertained from three major ethnic groups: (a) European origin (European Americans [CEPH], Tuscan); (b) Asian descent (Denver Chinese, Han Chinese, Japanese); and (c) African ancestry (Yoruba and Luhya). Population stratification analysis revealed separate clusters for these major ethnic groups (data not shown). To avoid problems arising from stratification, we thus performed GRANVIL analyses for each ethnic group separately and combined the results for each gene using inverse-variance fixed-effects meta-analysis of the parameter λ, implemented in the GWAMA software.
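The combination of per-group estimates of λ can be illustrated with a minimal sketch of inverse-variance fixed-effects meta-analysis. This is not the GWAMA implementation: the function name and the example estimates and standard errors are hypothetical, and the p-value assumes a normal approximation for the pooled estimate.

```python
import math


def fixed_effects_meta(estimates, std_errors):
    """Pool per-group estimates with inverse-variance weights.

    Returns the pooled estimate, its standard error, the z statistic,
    and a two-sided p-value under a normal approximation (assumption).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    # P(|Z| > |z|) for a standard normal variable
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, z, p_value


# Hypothetical lambda estimates for one gene in the three ethnic groups
lam = [0.8, 1.1, 0.5]
se = [0.4, 0.5, 0.6]
pooled, pooled_se, z, p = fixed_effects_meta(lam, se)
print(f"pooled lambda = {pooled:.3f} (SE {pooled_se:.3f}), p = {p:.3g}")
```

Groups with smaller standard errors receive larger weights, so the pooled estimate is dominated by the groups in which λ is estimated most precisely.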
To assess the effect of genotype call rate on type I error rates and power, we randomly removed rare variant genotypes from individuals to simulate missing data. We considered a range of missing data rates: 0.1%, 0.5%, 1%, 5%, and 10% of all available genotypes. To take account of the possibility of allele-specific failure rates, we also considered a more complex model in which heterozygous and rare homozygous genotypes were more difficult to call. Specifically, we randomly removed 1% of common homozygous genotypes, 5% of heterozygous genotypes, and 10% of rare homozygous genotypes. For each model of missing genotype data, we generated 1,000 replicates of data, each from a randomly selected phenotype file from GAW17.
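The two models of missing data can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce the results: genotypes are assumed to be coded as minor allele counts (0, 1, 2), each genotype is assumed to fail independently with the stated probability, and all function and variable names are hypothetical.

```python
import random

MISSING = None  # marker for an uncalled genotype (assumed coding)

# Failure probability by minor allele count, taken from the text:
# 1% common homozygous, 5% heterozygous, 10% rare homozygous
GENOTYPE_FAILURE = {0: 0.01, 1: 0.05, 2: 0.10}


def mask_at_random(genotypes, rate, rng):
    """Drop each genotype independently with the same probability."""
    return [[MISSING if rng.random() < rate else g for g in row]
            for row in genotypes]


def mask_by_genotype(genotypes, failure, rng):
    """Drop each genotype with a probability set by its allele count."""
    return [[MISSING if rng.random() < failure[g] else g for g in row]
            for row in genotypes]


def load_summary(row):
    """Return (r, n): called variants carrying a minor allele, and
    the number of called variants, for one individual in one gene."""
    called = [g for g in row if g is not MISSING]
    carriers = sum(1 for g in called if g > 0)
    return carriers, len(called)


rng = random.Random(17)
# Toy data: 4 individuals by 6 rare variants in one gene
genotypes = [[0, 0, 1, 0, 0, 0],
             [0, 0, 0, 0, 0, 0],
             [1, 0, 0, 0, 2, 0],
             [0, 0, 0, 1, 0, 0]]
masked = mask_by_genotype(genotypes, GENOTYPE_FAILURE, rng)
for row in masked:
    print(row, load_summary(row))
```

The pair returned by `load_summary` corresponds to the quantities needed to express an individual's mutational load as a proportion of the variants that were actually called.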
Power and type I error rate were both assessed at a Bonferroni-corrected significance threshold of p ≤ 3.86 × 10⁻⁵ (0.05/1,297 genes having at least two rare variants). We assessed power by considering all genes known to be causal for the respective phenotype and calculated the type I error rate by considering all noncausal genes.
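The threshold arithmetic and the way power or type I error rate is estimated from replicates can be made concrete with a short sketch. The function names and the p-values below are hypothetical, and the estimate is assumed to be the proportion of replicates in which a gene reaches the threshold.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold for a family-wise level alpha."""
    return alpha / n_tests


def rejection_rate(p_values, threshold):
    """Proportion of replicates with p <= threshold.

    For a causal gene this estimates power; for a noncausal gene it
    estimates the type I error rate.
    """
    return sum(1 for p in p_values if p <= threshold) / len(p_values)


threshold = bonferroni_threshold(0.05, 1297)
print(f"{threshold:.2e}")  # 3.86e-05

# Hypothetical p-values for one gene across four replicates
replicate_p = [1e-6, 0.01, 3e-5, 0.5]
print(rejection_rate(replicate_p, threshold))  # 0.5
```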
Analysis of all rare variants and analysis of only nonsynonymous rare variants generated qualitatively similar results. We thus present here the results for the analysis of nonsynonymous variants.
One of the key advantages of testing for association of the mutational load within a gene is that we can take account of multiple rare variants simultaneously. Our results demonstrate that we have high power to detect association with rare variants in some of the causal genes for Q1, Q2, and the disease (CC) phenotype. Furthermore, our results suggest that missing genotype data have little effect on power, even with call rates as low as 90%. The mutational load association analysis implemented in GRANVIL weights the contribution of each individual to take account of missing genotypes. There was evidence of increased type I error rates for several noncausal genes, particularly for Q1. However, this reflects long-range linkage disequilibrium between rare variants rather than sensitivity to missing genotype data.
We considered two models of missing genotype data: random failure and an allele-specific model in which heterozygous and rare homozygous genotypes are more likely to be uncalled. Our results were consistent across these two models. This is presumably because, for rare variants, most genotypes are common homozygotes, so the overall call rate is little affected when failure depends on the presence of a minor allele.
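A short calculation illustrates why the two models behave similarly for rare variants. The sketch assumes Hardy-Weinberg genotype proportions, which the text does not state, and uses an illustrative MAF of 1% together with the genotype-specific failure rates of the allele-specific model. The function names are hypothetical.

```python
def genotype_frequencies(maf):
    """Genotype frequencies by minor allele count under
    Hardy-Weinberg proportions (assumption)."""
    q = maf
    p = 1.0 - q
    return {0: p * p, 1: 2.0 * p * q, 2: q * q}


def expected_missing_rate(maf, failure):
    """Overall expected fraction of genotypes lost at one variant."""
    freqs = genotype_frequencies(maf)
    return sum(freqs[g] * failure[g] for g in freqs)


# Failure rates of the allele-specific model described in the text
failure = {0: 0.01, 1: 0.05, 2: 0.10}
rate = expected_missing_rate(0.01, failure)
print(f"{rate:.6f}")  # 0.010801
```

At a MAF of 1%, about 98% of genotypes are common homozygotes, so the overall expected failure rate stays close to the 1% applied to that genotype class even though carriers of the minor allele fail five to ten times as often.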
In this paper, we considered the effect of missing genotype data on the power and type I error rates of a method that tests for association of the mutational load of rare variants within genes. However, sequence and genotyping errors also play an important role in the performance of any association approach for common or rare variants. Analysis of the pilot data from the 1000 Genomes Project suggests greater concordance with HapMap for common homozygous genotypes (more than 99%) than for heterozygous or rare homozygous genotypes (95–98%). The simulated GAW17 data could also be used to assess the effect of a range of sequencing and genotyping error models on the performance of rare variant mutational load analyses.
The results of our analysis of the simulated GAW17 data suggest that the GRANVIL approach for testing association with the mutational load of rare variants within a gene is relatively robust to missing genotype data, occurring either at random or with differential allele-specific failures. Our power to detect association with causal genes was not dramatically affected by call rate. Similarly, the type I error rate for noncausal genes is relatively unaffected by the rate of missing genotypes but is somewhat inflated by the extent of long-range linkage disequilibrium between noncausal genes.
RM is funded by the European Commission under the Marie Curie Intra-European Fellowship. APM acknowledges funding from the Wellcome Trust (grant WT081682/Z/06/Z).
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
1. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty, and challenges. Nat Rev Genet. 2008, 9: 356-369. doi:10.1038/nrg2344.
2. Barrett JC, Cardon LR: Evaluating coverage of genome-wide association studies. Nat Genet. 2006, 38: 659-662. doi:10.1038/ng1801.
3. Nejentsev S, Walker N, Riches D, Egholm M, Todd JA: Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science. 2009, 324: 387-389. doi:10.1126/science.1167728.
4. 1000 Genomes Consortium, Altshuler DL, Durbin RM, Abecasis G, Bentley DR, Chakravarti A, Clark AG, Collins FS, De La Vega FM, Donnelly P, et al: A map of human genome variation from population-scale sequencing. Nature. 2010, 467: 1061-1073. doi:10.1038/nature09534.
5. Howie BN, Donnelly P, Marchini J: A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009, 5: e1000529. doi:10.1371/journal.pgen.1000529.
6. Li B, Leal S: Novel methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83: 311-321. doi:10.1016/j.ajhg.2008.06.024.
7. Morris AP, Zeggini E: An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010, 34: 188-193. doi:10.1002/gepi.20450.
8. Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2. doi:10.1186/1753-6561-5-S9-S2.
9. He Y, Calixte R, Nyirabahizi E, Brennan JS, Jiang Y, Zhang H: A new LASSO and K-means based framework for rare variant analysis in genetic association studies. BMC Proc. 2011, 5 (suppl 9): S116. doi:10.1186/1753-6561-5-S9-S116.
10. Mägi R, Morris AP: GWAMA: software for genome-wide association meta-analysis. BMC Bioinformatics. 2010, 11: 288. doi:10.1186/1471-2105-11-288.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.