Pedigree-based random effect tests to screen gene pathways
- Marcio Almeida†1Email author,
- Juan M Peralta†1, 2,
- Vidya Farook†1,
- Sobha Puppala†1,
- John W KentJr†1,
- Ravindranath Duggirala†1 and
- John Blangero†1
© Almeida et al.; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
The new generation of sequencing platforms opens new horizons in the genetics field. It is possible to exhaustively assay all genetic variants in an individual and search for phenotypic associations. The whole genome sequencing approach, when applied to a large human sample like the San Antonio Family Study, detects a very large number (>25 million) of single nucleotide variants along with other more complex variants. The analytical challenges imposed by this number of variants are formidable, suggesting that methods are needed to reduce the overall number of statistical tests. In this study, we develop a single degree-of-freedom test of variants in a gene pathway employing a random effect model that uses an empirical pathway-specific genetic relationship matrix as the focal covariance kernel. The empirical pathway-specific genetic relationship uses all variants (or a chosen subset) from gene members of a given biological pathway. Using SOLAR's pedigree-based variance components modeling, which also allows for arbitrary fixed effects, such as principal components, to deal with latent population structure, we employ a likelihood ratio test of the pathway-specific genetic relationship matrix model. We examine all gene pathways in KEGG database gene pathways using our method in the first replicate of the Genetic Analysis Workshop 18 simulation of systolic blood pressure. Our random effect approach was able to detect true association signals in causal gene pathways. Those pathways could be easily be further dissected by the independent analysis of all markers.
The whole genome sequencing (WGS) platforms are changing the "game" in several scientific fields by offering a cost-effective way to harvest all genetic information of an individual or populations of interest. Genome-wide association studies (GWAS) approaches have been widely used in the identification of genetic markers associated with phenotypes of interest based on the common variant-common disease assumption . The results of epidemiologic GWAS, while successful in the sheer number of quantitative trait loci that have been localized, have been relatively disappointing in their failure to identify the underlying causal genes for complex diseases like cancer, hypertension, and diabetes [1, 2]. Generally, the significantly associated single-nucleotide polymorphisms (SNPs) are responsible for a very small proportion of these traits' heritability; several sources of this missing heritability have been proposed, with rare variation notable among them. WGS platforms, when applied to a large human pedigree, offer a unique possibility to study rare variant segregation and its role in a phenotype of interest . Rare variants are clear candidates to explain part of the missing heritability paradox, but the low number of copies requires the development of alternative approaches to account for their association . More importantly, rare variants of larger effect are far more likely to rapidly lead to causal gene identification than common variants of very small effect.
A gene pathway is an organized group of genes acting together to perform a specific cellular or physiological function. This high level of organization allows a fast and accurate response to any insult that a cell or tissue would suffer. This information is used for candidate genes annotation and several databases offer pathways information like GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) [4, 5]. The causal component of a gene pathway will involve all phenotypically relevant functional variants of the genes in the pathway. To develop an omnibus test of whether a gene pathway plays a role in influencing a given phenotype, we must aggregate information into a single parameter.
A standard way to examine genetic variation across individuals is to examine genetic relationship matrices (GRMs) that quantify pairwise genetic similarity by the number of shared IBD (identical-by-descendent) or IBS (identical-by-state) alleles. Standard quantitative genetic methods use IBD-derived coefficients to structure pairwise phenotypic covariances to obtain estimates of overall heritability and linkage . Here, we present a new approach that uses a pathway-specific GRM (PSGRM) for each available KEGG pathway to structure a pathway-specific variance component. A given PSGRM is estimated using the information provided by the complete set of sequence variant in or near genes of a specific gene pathway. Using the variance component model implemented in SOLAR, we employ a likelihood ratio test (LRT) of the PSGRM model against a null model only using the expected IBD matrix derived from the pedigree relationships . PSGRM models that deviated from the null model and accounted for a significant portion of the trait's heritability are selected. Genes in those pathways that are shown to potentially harbor functional genetic variants could then be tested independently for causality. We also tested how minor allele frequency (MAF) spectra contribute to the PSGRM estimation. Our approach reduces the number of multiple independent hypotheses tested in a WGS and will reduce the computational burden. The explicit addition of a priori biological knowledge will aid candidate SNP annotation and interpretation of results from a WGS or GWAS study.
Genetic Analysis Workshop 18 data acquisition
Our analyses were performed on the Genetic Analysis Workshop 18 (GAW18) data set using simulated systolic blood pressure as our trait of interest . All analyses were performed using the first simulated phenotype replicate (SIMPHEN.1.csv).
KEGG database acquisition and gene mapping
The latest KEGG database release was obtained (10/2012). The database information was parsed and each gene assigned to its respective pathway. A given gene could be a member of more than 1 pathway. Each gene was mapped using the longest known splicing isoform using the latest RefSeq database (09/2012). For each mapped gene, we defined limits 5 kilobases (kb) upstream of the transcription start site and downstream of the stop codon, and all single-nucleotide variants located inside this region were selected. Using the sequencing data provided for GAW18, we collected all genotypes for those variants in all individuals. All variants within a gene were merged into a single pathway file.
All sequence variation in a gene pathway was used as an input to KING software  for the estimation of the PSGRM. KING uses an efficient algorithm implementation that allows fast kinship estimation for all pairs of individuals. The PSGRMs were calculated using the "robust" algorithm implementation, taking into account the potential for underlying genetic heterogeneity.
Gene pathway burden test
A new variance component parameter was introduced into a standard pedigree-based variance component model , where is the phenotypic covariance matrix, is the total phenotypic variance, , , and respectively represent the proportion of that can be attributed to the residual additive effect of polygenes, the gene pathway-specific variation, and a random environmental effect. Several critical structuring kernels are employed to model the covariances between individuals: is the expected or theoretical kinship matrix integrated over the genome, is the empirically estimated pathway-specific kinship matrix (the PSGRM), and is the identity matrix. Such kernel-based approaches for summarizing the effects of multiple genetic variants have been proposed decades previously  and has grown in popularity recently [10, 11] Maximum likelihood estimates (assuming a multivariate normal probability density) and LRTs of the parameter were obtained using an extension of the polygenic command in SOLAR independently for each gene pathway . The significance of each gene pathways variance component estimate was obtained from a LRT against the null model . Because the variance component is tested on its boundary, the LRT statistic is distributed as a 50:50 mixture of a 1-degree-of-freedom chi-square and a point of mass zero  although this is conservative . For both models, we calculated the heritability of systolic blood pressure using the first simulated phenotypic replicate. We used sex, age, and smoking status as covariates in our model. To control for possible population stratification, we also included as covariates the first 5 genetic principal components calculated from a subset of ~29,000 low, mutual, linkage disequilibrium (LD) SNPs from the GAW18 GWAS data. Those principal components are responsible for 5% of the trait's total variance and were calculated in 117 founder samples using routine prcomp in R  and were then projected to the rest of the genotyped pedigree so as to not incorrectly capture systematic pedigree information.
KEGG database organization and initial processing of GAW18 data
Initially, we downloaded the latest KEGG database release; this database is composed of 223 gene pathways for which we could classify 2428 genes located on the odd-numbered chromosomes. Each gene was mapped using the hg19 human genome assembly as a reference. Using the GAW18 WGS data set, we selected all variants near each gene (see Methods for details) of a gene pathway and all genotypes were collected. To further evaluate the sensitivity of the gene-pathway-specific LRT, we created 3 additional "virtual" pathways: (a) all causal genes used in the simulation, (b) major causal gene MAP4 and 9 randomly chosen genes, and (c) MAP4 plus 49 randomly chosen genes. The gene MAP4 exhibits the largest cumulative effect on the systolic blood pressure (SBP) phenotype and the number of random genes chosen mimics typical sizes of gene pathways.
PSGRM estimation and testing
Likelihood ratio tests of gene pathways.Results of likelihood ratio test using the pathway-specific variance component term . The table only shows pathways with p-values lower than 0.01.
# Causal genes
Analysis of rare variant contribution to PSGRM testing
Tests of calculated using different frequency spectra. Association results of the CAUSAL sets using variants with different MAF spectra. The columns "Complete set", "MAF< 0.05" and "MAF 0.0" represent respectively the use all variants near genes, all variants with MAF< 0.05 and all variants with MAF < 0.01.
WGS allows the complete enumeration of genetic variation. The sensitivity of WGS leads to the detection of a large number of rare variants not present in a standard high-density SNP array. Rare variants represent a partial explanation for the missing heritability paradox as their abundance and their expected genetic effect sizes are much higher than common variants . Large extended pedigrees, like those in the GAW18 cohort, increase the probability of detecting rare variant effects, as variants that are rare in the population may be enriched by segregation in specific lineages . Conventional genome-wide association-style multiple testing of the immense number of rare variants would require the use of stringent significance thresholds and reduce power. These barriers require the development of alternative methods to reduce the dimensionality of analysis. Here we present a hybrid method merging 2 approaches by calculating GRMs in predefined gene sets pathways. The definition of gene sets is flexible and allows the use of other sources of information, or even a candidate genes set defined by previous experiments. The PSGRM approach employed involves the genetic relationship estimation by analyzing local patterns of alleles shared in a gene pathway to reveal latent pathway-specific relationships across pedigrees and unrelated individuals. We focus on IBD estimation but many different IBS-related kernels could be utilized and should work better for studies that only have unrelated individuals focusing on association signal alone. Our IBD-based test is more similar to an extension of classical quantitative trait linkage but using empirical IBD estimates and generalizing them beyond a single genomic location. For the test to have power, there must be differences between the global IBD estimate and the pathway-specific estimate. The proposed single-degree-of-freedom test reduces the computational burden and allows a fast screen of gene pathways. The definition of a 5-kb region near a gene should be enable harvesting of a majority of regulatory elements playing a role in the expression regulation of the target gene; however, future studies should test larger distances also. For any empirically estimated GRM kernel, LD could be a source of bias and new approaches for calculating GRMs accounting for LD interference should be tested in the future .
The analysis of WGS requires the development of new statistical and computational methods. Reduction in the number of hypotheses tested is critically important if reasonably sized samples of families are to identify causal genes. Our simple random-effect-based pathway-specific tests leads to potentially powerful tests of gene pathways that can then be further tested in more detail. The GAW18 simulated phenotypes offer a unique opportunity for methodological comparison and our method exhibited promising statistical features and can be easily extended to any other set of genes defined in an a priori manner.
T2D-GENE is supported by NIH grants: U01 DK085524, U01 DK085501, U01 DK085526, U01 DK085584 and U01 DK085545. The SAFHS is supported by P01 HL045222; the SAFDS is supported by R01 DK047482; the SAFGS is supported by R01 DK053889. SOLAR is supported by NIMH grant MH059490 and the supercomputing facilities used for this work at the AT&T Genetics Computing Center were supported in part by a gift from the SBC Foundation.
The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
- Mohlke KL, Scott LJ: What will diabetes genomes tell us?. Curr Diab Rep. 2012, 12: 643-650. 10.1007/s11892-012-0321-4.PubMed CentralView ArticlePubMedGoogle Scholar
- Gibson G: Rare and common variants: twenty arguments. Nat Rev Genet. 2012, 13: 135-145. 10.1038/nrg3118.PubMed CentralView ArticlePubMedGoogle Scholar
- Blangero J: Localization and identification of human quantitative trait loci: king harvest has surely come. Curr Opin Genet Dev. 2004, 14: 233-240. 10.1016/j.gde.2004.04.009.View ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.PubMed CentralView ArticlePubMedGoogle Scholar
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 1999, 27: 29-34. 10.1093/nar/27.1.29.PubMed CentralView ArticlePubMedGoogle Scholar
- Almasy L, Blangero J: Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet. 1998, 62: 1198-1211. 10.1086/301844.PubMed CentralView ArticlePubMedGoogle Scholar
- Almasy L, Dyer TD, Peralta JM, Jun G, Fuchsberger C, Almeida MA, Kent JW, Fowler S, Duggirala R, Blangero J: Data for Genetic Analysis Workshop 18: Human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proc. 2014, 8 (suppl 2): S2-PubMed CentralView ArticlePubMedGoogle Scholar
- Manichaikul A: Robust relationship inference in genome-wide association studies. Bioinformatics. 2010, 26: 2867-2873. 10.1093/bioinformatics/btq559.PubMed CentralView ArticlePubMedGoogle Scholar
- Hopper JL, Mathews JD: Extensions to multivariate normal models for pedigree analysis. Ann Hum Genet. 1982, 46: 373-383. 10.1111/j.1469-1809.1982.tb01588.x.View ArticlePubMedGoogle Scholar
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X: Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011, 89: 82-93. 10.1016/j.ajhg.2011.05.029.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen H, Meigs JB, Dupuis J: Sequence kernel association test for quantitative traits in family samples. Genet Epidemiol. 2013, 37: 196-204. 10.1002/gepi.21703.PubMed CentralView ArticlePubMedGoogle Scholar
- Self SG, Liang K-Y: Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J Am Stat Assoc. 1987, 82: 605-610. 10.1080/01621459.1987.10478472.View ArticleGoogle Scholar
- Blangero J, Diego VP, Dyer TD, Almeida M, Peralta J, Kent JW, Williams JT, Almasy L, Göring HH: A kernel of truth: statistical advances in polygenic variance component models for complex human pedigrees. Adv Genet. 2013, 81: 1-31.PubMed CentralView ArticlePubMedGoogle Scholar
- Becker RA, Chambers JM, Wilks AR: The New S Language. Wadsworth,. 1988Google Scholar
- Pritchard JK, Cox NJ: The allelic architecture of human disease genes: common disease-common variant...or not?. Hum Mol Genet. 2002, 11: 2417-2423. 10.1093/hmg/11.20.2417.View ArticlePubMedGoogle Scholar
- Blangero J, Kent JW: Characterizing the extent of human genetic variation for performance-related traits. Genetic and Molecular Aspects of Sport Performance. Edited by: Bouchard C, Hoffman EP. Wiley-Blackwell 111 River Street, Hoboken, NJ 07030-5774, USA 2011, 33-45.Google Scholar
- Speed D, Hemani G, Johnson MR, Balding DJ: Improved heritability estimation from genome-wide SNPs. Am J Hum Genet. 2012, 91: 1011-1021. 10.1016/j.ajhg.2012.10.010.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.