- Open Access
A variance component-based gene burden test
BMC Proceedings volume 8, Article number: S49 (2014)
We propose a novel variance component approach for the analysis of next-generation sequencing data. Our method is based on the detection of the proportion of the trait phenotypic variance that can be explained by the introduction of a new variance component that accounts for the local gene-specific departure of the empirical kinship relationship matrix, estimated from single-nucleotide polymorphism (SNP) genotypes, from their theoretical expectation based on the genealogical information in the pedigree. We tested our method with simulated phenotypes and imputed SNP genotypes from the Genetic Analysis Workshop 18 data set. We observed considerable variation in the differences between theoretical and gene-specific kinship estimates that proved to be informative for our test and allowed us to detect the MAP4 causal gene at a genome-wide significance level. The distribution of our test statistic show no inflation under the null hypothesis and results from a random set of genes suggest that the detection of MAP4 is both sensitive and specific. The use of 2 different strategies for the selection of the SNPs used to derive the gene-specific empirical kinship relationship matrices provides us with suggestive evidence that our method is performing as an empirical test of linkage.
Complex phenotypes are thought to be determined by the aggregate effects of many rare causal variations [1–3]. Detection of the true causal variations present in next-generation sequencing data sets [4, 5] is challenging because their faint signals are difficult to separate from background noise. Most of the current analytical methods try to improve the signal-to-noise ratio by reducing the number of statistical tests needed for a significant signal to be detected.
A common approach to alleviate the multiple-testing problem is to collapse, commonly by membership of a variant in a known annotated gene or pathway, the information conveyed by individual variants into a single measure, like a principal component or a weighted rank, that can then be tested . However, a common limitation of many approaches is that the aggregation of the variants into a single measure often involves an arbitrary definition of the directionality of each variant's fixed effects.
We present a novel random-effect-variance component-based approach that uses gene-specific relationship matrices to collapse variants into a per-gene genetic contribution effect.
The Genetic Analysis Workshop 18 (GAW18) data , based on whole genome sequencing data for the odd-numbered chromosomes of 464 individuals released by the T2D-GENES Consortium, was used to test our method. Specifically, we used pedigrees, minor allele-based single-nucleotide polymorphism (SNP) dosages, and the SIMPHEN.1 simulated phenotypes in the GAW18 data set.
Definition of the gene loci
The transcription start site and the stop codon coordinates for the longest transcript associated with a gene were obtained from the UCSC's human genome release 19 (hg19) known gene table.
Gene-specific SNP dosages
To investigate if the procedure used to select the SNPs that were collected on a per-gene locus basis affected our test results, we used 2 different SNP selection approaches: the intragenic and the nonsyn strategies. The intragenic strategy consisted of the selection of all SNPs within the bounds of a gene. The nonsyn strategy consisted of the selection of the subset of intragenic SNPs that were annotated as being nonsynonymous coding changes using ANNOVAR . GAW18 SNP dosages from the imputed genotypes where then collected into separate, gene-specific, dosage files for SNPs selected using the intragenic and nonsyn strategies.
Gene-specific empirical kinship matrices
Gene-specific dosages were transformed into genotypes and processed with KING , a method for relationship inference from large SNP genotype data sets that is robust to population substructure, to produce a gene-specific matrix of empirical kinship coefficients.
Control for unknown population substructure
To control for possible population stratification, principal component loadings were calculated using the prcomp function in R , with data from 117 unrelated individuals for approximately 29,000 haplotype tagging SNPs in low mutual linkage disequilibrium, and then projected onto the full set of genotyped individuals. The first 5 principal components explained 5% of the total phenotypic variance and were added as covariates to our variance component model.
Trait and covariates
We used the simulated phenotypic data at the first exam for the systolic blood pressure (SBP_1) trait. The sex (SEX), age (AGE_1), and smoke (SMOKE_1) status at the first exam phenotypes were introduced as covariates into our variance component model. The Q1 trait was used to assess the distribution of our test statistic under the null hypothesis.
Variance component model
Our method uses gene-specific relationship matrices (GSRMs) to extract the proportion of the trait's variance explained by a single gene as a result of the departure of its localized empirical kinship estimates (EKEs) from their pedigree-derived theoretical kinship expectations (TKEs). A new variance component parameter () was introduced into a standard variance component model
where is the covariance matrix, is the total phenotypic variance; , , and , respectively, represent the proportion of that can be attributed to the residual additive effect of polygenes, a gene-specific effect; and a random environmental effect, , is the TKE kinship matrix, is the EKE kinship matrix, and is the identity matrix. This partitioning of the trait variance was estimated using an extension of the polygenic command from SOLAR  independently for each gene. The significance of each estimate was obtained from a likelihood ratio test against the null model
Because the variance component is tested on its boundary, the likelihood ratio test statistic is distributed as a ½:½ mixture of a 1 degree of freedom (DF) chi-square and a point mass at zero .
We compared the observed gene-specific EKE values obtained from the imputed SNP dosages with the TKE values derived from the pedigree and found substantial differences between them (Figure 1). The negative skew in Figure 1 shows that gene-specific EKE values are larger than their TKE counterparts and it shows that for certain genes individuals appear to be more closely related than expected from their relatedness in the pedigree.
We then performed variance component analyses using GSRMs with intragenic and nonsyn EKE values for 12 of the causal SBP_1 genes in the simulated data set (Table 1) and a random gene sample (Table 2). We detected a clear and significant signal from the MAP4 causal gene using both the intragenic and nonsyn strategies, that reached genome-wide significance (after a conservative Bonferroni correction for 30,000 tests, p <1.6 × 10−6) in the nonsyn (Table 1). The magnitude of the MAP4 signal is strong enough for it to be specifically detected as the top result in a random sample of 100 genes (Table 2). Other causal genes also rank among the top results, but their signals are weaker (Table 2). Figure 2 suggests that our approach has the sensitivity to separate true-positive signals from false-positive ones, as there is no inflation or deflation of the p values that we obtained for the estimates of the gene effects evaluated under the null hypothesis.
We performed variance component analyses using a novel approach to estimate the proportion of the trait phenotypic variance that can be attributed to a single gene. We first collapsed the genotypes from SNP variants into a GSRM that more closely approximates the correlations between related individuals at a gene-specific level. Figure 1 shows that there is substantial variation among genes in terms of the differences between TKE and gene-specific EKE values that had the potential to explain part of the trait variance. Thus, we then obtained gene-specific estimates of the parameter and its significance from SOLAR, using the empirical GSRM.
Our results showed that the gene with the highest effect on the simulated SBP_1 trait was detected at a significance level that surpasses a conservative multiple testing threshold for the p values. Figure 2 shows that our test statistic was not inflated when evaluated under the null hypothesis using the Q1 trait and a random sample of genes. MAP4 was also consistently detected using the intragenic and nonsyn strategies (see Figures 1 and 2), with other causal genes ranking within our first top 10 results. This seems to suggest that our test is sensitive and specific enough for the detection of true-positive signals without enrichment of false-positive ones.
As a consequence of using a different strategy to select the SNPs for the estimation of the empirical GSRM, our results for MAP4 improved. MAP4 results were an order of magnitude less significant for the intragenic than for the nonsyn strategy. We believe that this is the result of rare functional alleles driving the EKE of the GSRM matrices for the nonsyn strategy without the noise introduced by shared noncoding alleles. In effect, the nonsyn GSRM matrices better approximate the gene's probability of identity-by-descent sharing, thus making our test a gene-specific empirical test of linkage that is also robust to the heterogeneity of the causal variants.
Finally, we want to note that our method is not restricted either to a particular measure of genetic identity or to its estimation on a gene-specific basis; identity-by-state and genomic regions, even if they are nonsyntenic , can potentially be used instead.
We were able to obtain encouraging, proof-of-concept results from the application of our method to GAW18 data. We observed differences between the TKEs and their gene-specific empirical estimations. We obtained genome-wide significant results on the SBP_1 simulated trait for MAP4 that seem to indicate that our test is both specific and sensitive enough, and which also suggest that our method is behaving as a gene-specific empirical test of linkage.
Pritchard JK: Are rare variants responsible for susceptibility to complex diseases?. Am J Hum Genet. 2001, 69: 124-137. 10.1086/321272.
Pritchard JK: The allelic architecture of human disease genes: common disease-common variant...or not?. Hum Mol Genet. 2002, 11: 2417-2423. 10.1093/hmg/11.20.2417.
Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, et al: Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009, 461: 272-276. 10.1038/nature08250.
Ng PC, Levy S, Huang J, Stockwell TB, Walenz BP, Li K, Axelrod N, Busam DA, Strausberg RL, Venter JC: Genetic variation in an individual human exome. PLoS Genet. 2008, 4: e1000160-10.1371/journal.pgen.1000160.
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456: 53-59. 10.1038/nature07517.
Dering C, Hemmelmann C, Pugh E, Ziegler A: Statistical analysis of rare sequence variants: an overview of collapsing methods. Genet Epidemiol. 2011, 35 (Suppl 1): S12-S17.
Almasy L, Dyer TD, Peralta JM, Jun G, Fuchsberger C, Almeida MA, Kent JW, Fowler S, Duggirala R, Blangero J: Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proc. 2014, 8 (suppl 2): S2-
Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010, 38: e164-10.1093/nar/gkq603.
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM: Robust relationship inference in genome-wide association studies. Bioinformatics. 2010, 2867-2873. 26
R: A language and environment for statistical computing. R Foundation for Statistical Computing (Vienna, Austria). 2012, R Core Team, ISBN 3-900051-07-0, [http://www.R-project.org/]
Almasy L, Blangero J: Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet. 1998, 62: 1198-1211. 10.1086/301844.
Self SG, Liang K-Y: Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J Am Stat Assoc. 1987, 82: 605-610. 10.1080/01621459.1987.10478472.
Almeida MMM, Peralta JM, Farook V, Puppala S, Kent JW, Duggirala R, Blangero J: Pedigree-based random effect burden tests to screen gene pathways. BMC Proc. 2014, 8 (suppl 2): S100-
T2D-GENE is supported by NIH grants U01 DK085524, U01 DK085501, U01 DK085526, U01 DK085584, and U01 DK085545. The San Antonio Family Heart Study is supported by P01 HL045222; the San Antonio Family Diabetes Study is supported by R01 DK047482; the San Antonio Family Gallbladder Study is supported by R01 DK053889. SOLAR is supported by National Institute of Mental Health grant MH059490. The supercomputing facilities used for this work at the AT&T Genetics Computing Center were supported in part by a gift from the SBC Foundation.
The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
The authors declare that they have no competing interests.
JB designed the overall study; JMP, MA, and JK conducted statistical analyses. JMP drafted the manuscript. All authors read and approved the final manuscript.