Genome-wide joint analysis of single-nucleotide variant sets and gene expression for hypertension and related phenotypes
© The Author(s). 2016
- Published: 18 October 2016
With advances in next-generation sequencing technologies, the study of rare variants in targeted genome regions, or even across the whole genome, has become feasible. Nevertheless, the massive amount of sequencing data brings great computational and statistical challenges for association analyses. Aside from sequencing variants, other high-throughput omic data (eg, gene expression data) have also become available and can be incorporated into association analysis for better modeling and improved power. This motivates the need to develop computationally efficient and powerful approaches to model the joint associations of multilevel omic data with complex human diseases.
A similarity-based weighted U approach is used to model the joint effect of sequencing variants and gene expression. Using a Mexican American sample provided by Genetic Analysis Workshop 19 (GAW19), we performed a whole-genome joint association analysis of sequencing variants and gene expression with systolic blood pressure (SBP), diastolic blood pressure (DBP), and hypertension (HTN) phenotypes.
The whole-genome joint association analysis was completed in 80 min on a high-performance personal computer with an i7 4700 CPU and 8 GB memory. Although no gene reached statistical significance after adjusting for multiple testing, some top-ranked genes attained a high significance level and may be biologically plausible candidates for hypertension-related phenotypes.
The weighted U approach is computationally efficient for high-dimensional data analysis, and is capable of integrating multiple levels of omic data into association analysis. Through a real data application, we demonstrate the potential benefit of using the new approach for joint association analysis of sequencing variants and gene expression.
- Minor Allele Frequency
- Rare Variant
- Omic Data
- Genetic Analysis Workshop
- Gene Expression Similarity
Next-generation sequencing technology provides denser genetic profiles than earlier microarray-based genotyping technology and can effectively capture rare variants with low minor allele frequency (MAF). Driven by advances in sequencing technology and by the limited heritability explained by genome-wide association study (GWAS) findings [2, 3], research focus has shifted toward rare variants associated with common complex diseases. Although these studies hold great promise for finding new genetic variants that predispose to human disease, they also face great challenges, for example, low power to detect rare variants because of their low frequency. The dramatic increase in the number of single-nucleotide variants (SNVs) also raises computational and statistical challenges (eg, multiple testing). One practical strategy is to group multiple SNVs according to known functional information (eg, variants in a gene or a pathway) or location (eg, variants in a fixed-size bin), and to analyze these SNVs jointly [5, 6]. By grouping and testing multiple SNVs, we can aggregate association signals and reduce the number of tests.
Besides SNVs, other omic data, such as gene expression, can also be collected. These intermediate omic data can be integrated into sequencing studies for improved power and better biological interpretation. Whereas conventional analysis links only SNVs or only gene expression to disease phenotypes, the emergence of multilevel data makes it possible to analyze SNVs and other omic data jointly. By using the information more fully, joint analysis has great potential to improve power. Nevertheless, how to efficiently analyze high-dimensional sequencing data together with other omic data remains a challenge.
In this empirical study, we used a similarity-based weighted U approach to jointly model SNVs and gene expression data of 142 unrelated Mexican American samples provided by Genetic Analysis Workshop 19 (GAW19). Using the weighted U approach, we performed a genome-wide joint association analysis, evaluating the association of 17,558 genes with three phenotypes (ie, systolic blood pressure [SBP], diastolic blood pressure [DBP], and hypertension [HTN]).
Under the null hypothesis of no association, phenotypic similarity is unrelated to genetic or gene expression similarity. Because the phenotypic similarity is centered, that is, $E(h(y_i, y_j)) = 0$, the expectation of the U statistic is 0. Under the alternative, phenotypic similarity increases as genetic or gene expression similarity increases. Therefore, positive phenotypic similarities are weighted more heavily and negative phenotypic similarities are weighted more lightly, leading to a positive value of U. Because the U kernel satisfies the finite second moment condition, $E(h^2(y_i, y_j)) < \infty$, and is degenerate (ie, $\mathrm{Var}(E(h(y_i, y_j) \mid y_i)) = 0$), the limiting distribution of U can be approximated as a linear combination of chi-squared random variables with one degree of freedom, and its p value can be obtained by using the Davis method.
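As a minimal sketch of how such a statistic could be computed, the Python code below assumes the form U = (1 / (n(n - 1))) Σ_{i≠j} w_ij h(y_i, y_j), with h(y_i, y_j) = q_i q_j for standardized phenotype scores q and a user-supplied weight matrix w. These forms, the function names, and the simulated data are illustrative assumptions, not the implementation used in the study. The p value here comes from a permutation test, which is swapped in for the asymptotic chi-squared approximation described above.

```python
import numpy as np

def standardized_scores(y):
    """Center and scale the phenotype so pairwise products average close to 0."""
    y = np.asarray(y, dtype=float)
    return (y - y.mean()) / y.std()

def weighted_u(y, weights):
    """Weighted U statistic over all ordered pairs i != j (assumed form)."""
    q = standardized_scores(y)
    n = len(q)
    h = np.outer(q, q)                      # assumed kernel h(y_i, y_j) = q_i * q_j
    off_diag = ~np.eye(n, dtype=bool)       # exclude the i == j terms
    return (weights[off_diag] * h[off_diag]).sum() / (n * (n - 1))

def permutation_p_value(y, weights, n_perm=999, seed=0):
    """One-sided permutation p value; a stand-in for the asymptotic method."""
    rng = np.random.default_rng(seed)
    observed = weighted_u(y, weights)
    null = np.array([weighted_u(rng.permutation(y), weights)
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (n_perm + 1)

# Simulated example: the phenotype depends on a latent value x, and the
# weights measure similarity in x, so similar subjects have similar phenotypes.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x + 0.5 * rng.normal(size=100)
w = np.exp(-(x[:, None] - x[None, :]) ** 2)
u_obs = weighted_u(y, w)
p_val = permutation_p_value(y, w)
```

Permuting the phenotype breaks any link between phenotypic similarity and the weights, which mirrors the null hypothesis stated above. It is slower than an asymptotic approximation but makes no distributional assumption.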
The weighted U approach is also flexible enough to test other hypotheses. In addition to evaluating the joint effect of genetic markers and gene expression (G + T), it can be used to test the genetic effect (G) alone or the gene expression effect (T) alone. For example, we can modify the approach by setting the gene expression similarity to a constant (eg, g(t i ,t j ) ≡ 1) to test the genetic effect.
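The three tests can be sketched as three ways of building the pairwise weight matrix. The code below assumes the joint weight is the product of a genetic similarity f and an expression similarity g, so that fixing either one at 1 leaves a test of the other. The product form, the unweighted identity-by-state similarity, the Gaussian expression similarity, and all function names are assumptions made for illustration.

```python
import numpy as np

def ibs_similarity(genotypes):
    """Unweighted identity-by-state similarity for 0/1/2 genotype codes (assumed f)."""
    g = np.asarray(genotypes, dtype=float)           # n_samples x n_snvs
    diff = np.abs(g[:, None, :] - g[None, :, :])     # n x n x n_snvs
    return 1.0 - diff.mean(axis=2) / 2.0

def expression_similarity(expression):
    """Gaussian similarity on standardized expression values (assumed g)."""
    t = np.asarray(expression, dtype=float)
    z = (t - t.mean()) / t.std()
    return np.exp(-(z[:, None] - z[None, :]) ** 2)

def build_weights(genotypes, expression, test="G+T"):
    """Return the weight matrix for the joint (G+T), genetic (G), or expression (T) test."""
    if test not in {"G+T", "G", "T"}:
        raise ValueError("test must be 'G+T', 'G', or 'T'")
    n = len(expression)
    f = ibs_similarity(genotypes) if "G" in test else np.ones((n, n))
    g = expression_similarity(expression) if "T" in test else np.ones((n, n))
    return f * g
```

Calling `build_weights(genotypes, expression, test="G")` reduces the weights to the genetic similarity alone, which corresponds to setting g ≡ 1 in the text.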
We applied three tests (ie, G + T, G, and T) to 142 unrelated Mexican American samples from the San Antonio Family Heart Study (SAFHS) and the San Antonio Family Diabetes/Gallbladder Study (SAFDGS). All analyses were based on SNVs on the odd-numbered autosomes and gene expression data provided by GAW19. In this study, we grouped SNVs by functional unit (ie, gene) to facilitate the joint modeling of genes and gene expression. We obtained primary and alternative assemblies from Genome Reference Consortium release version 38 (GRCh38) and identified 32,436 gene regions corresponding to 17,264 RNA probes. The number of gene regions exceeds the number of probes because multiple assemblies of one gene can share one nucleotide sequence, and hence the RNA probe designed to capture that sequence. SNVs that were not within or near a gene (±5 kb at both ends) were removed. Gene regions with no SNVs or no RNA probes were also discarded. SNVs with no variation (ie, MAF = 0) were dropped, as were gene regions containing only such SNVs. A total of 6,956,910 SNVs, corresponding to 17,558 gene regions, remained for the joint analysis. The first, second, and third quartiles of the SNV counts in these regions are 115, 205, and 411, respectively. We used SBP, DBP, and HTN measurements at the first examination year as phenotypes, and age, gender, medication use, and smoking status as covariates. To account for population stratification, we performed principal component analysis by using the EIGENSTRAT software. The first 20 principal components (PCs) were included in the analysis to adjust for potential confounding caused by population stratification.
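The SNV filtering steps described above can be sketched as follows. The code assumes a single chromosome, genotypes coded as 0/1/2 alternate-allele counts with no missing values, and gene coordinates supplied as a plain dictionary. The function names and data layout are hypothetical; only the ±5 kb flank and the MAF = 0 rule come from the text.

```python
import numpy as np

FLANK = 5_000  # +/- 5 kb around each gene, as stated in the text

def snvs_in_region(positions, start, end, flank=FLANK):
    """Indices of SNVs lying within the gene or its flanking windows."""
    positions = np.asarray(positions)
    return np.flatnonzero((positions >= start - flank) & (positions <= end + flank))

def minor_allele_frequency(genotypes):
    """MAF per SNV from an n_samples x n_snvs matrix of 0/1/2 allele counts."""
    freq = np.asarray(genotypes, dtype=float).mean(axis=0) / 2.0
    return np.minimum(freq, 1.0 - freq)

def build_gene_sets(positions, genotypes, genes):
    """Map each gene to its polymorphic SNVs; drop genes left with none.

    genes: dict of gene name -> (start, end) on the same chromosome.
    """
    maf = minor_allele_frequency(genotypes)
    gene_sets = {}
    for name, (start, end) in genes.items():
        idx = snvs_in_region(positions, start, end)
        idx = idx[maf[idx] > 0]          # drop SNVs with no variation
        if idx.size > 0:                 # drop regions with no usable SNVs
            gene_sets[name] = idx
    return gene_sets
```

The per-gene SNV counts, and hence quartiles such as those reported above, could then be obtained with `np.percentile([len(v) for v in gene_sets.values()], [25, 50, 75])`.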
Summary of top 5 genes associated with SBP, DBP, and HTN
Further investigation of the top genes suggested that several are biologically plausible candidates for blood pressure regulation. For instance, the product of PDE4A hydrolyzes the second messenger cyclic adenosine monophosphate (cAMP), which plays a crucial role in controlling blood pressure. PHOX2A is also important for the development of the autonomic nervous system, which controls involuntary functions such as heart rate and blood pressure.
The study has certain limitations. Out of 8,348,674 SNVs, 1,391,764 (17%) were unused because they are not in or near any gene. These SNVs could be grouped by physical location and also incorporated into the analysis. We found limited association evidence for single-nucleotide polymorphisms (SNPs) identified in previous GWAS, possibly because of differences in study samples (ie, whites vs. Mexican Americans). Another possibility is that the majority of SNVs in our study are rare (MAF <0.01), whereas previous GWAS focused mainly on common variants (MAF >0.05).
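The unused-SNV count can be checked against the totals reported in the text (8,348,674 SNVs in all, of which 6,956,910 were retained for the joint analysis):

```latex
8{,}348{,}674 - 6{,}956{,}910 = 1{,}391{,}764,
\qquad
\frac{1{,}391{,}764}{8{,}348{,}674} \approx 0.167 \approx 17\%
```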
The analysis of a large number of genes raised the issue of multiple testing, which we addressed with the false discovery rate approach. After adjusting for multiple testing, none of the genes reached statistical significance. By using biological knowledge and statistical tools, we might be able to further reduce the number of tests and increase our chance of detecting an association. For instance, all assemblies of one gene are highly correlated, so we could either use the primary assembly exclusively or adjust p values for multiple correlated tests to better handle the multiple testing issue.
The emerging sequencing data and other omic data provide an invaluable resource for the genetic study of human diseases, yet integrating and modeling these high-dimensional data remains a great challenge. By integrating both sequencing variants and gene expression into the association analysis, the weighted U approach provides a powerful and computationally efficient way to screen for disease-associated genes. By applying the approach to the GAW19 data, we showed that the joint analysis of sequencing variants and gene expression could have some advantages over association analysis that uses only sequencing variants or only gene expression.
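The text does not say which false discovery rate procedure was used. As one common choice, the Benjamini-Hochberg adjustment can be sketched as below; the function name and the example p values are illustrative assumptions.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                              # ascending p values
    scaled = p[order] * m / np.arange(1, m + 1)        # p_(k) * m / k
    # Enforce monotonicity: each adjusted value is the minimum of the
    # scaled values at its own rank and all larger ranks.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

# Four hypothetical gene-level p values.
example = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

A gene would be declared significant at a 5% false discovery rate when its adjusted p value is at most 0.05. With 17,558 gene-level tests, each raw p value is scaled by a large factor, which is consistent with no gene remaining significant after adjustment.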
The Genetic Analysis Workshop is supported by NIH grant R01 GM031575. The authors wish to thank the editor and two anonymous referees for their helpful comments that improved the manuscript.
This article has been published as part of BMC Proceedings Volume 10 Supplement 7, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at http://bmcproc.biomedcentral.com/articles/supplements/volume-10-supplement-7. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.
XT conducted the data analysis and drafted the manuscript. CW helped write the R code for the joint analysis. QL conceived of the study and helped finalize the manuscript. All authors read and approved the final manuscript.
The authors declare they have no competing interests.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11(1):31–46.
2. Pritchard JK. Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet. 2001;69(1):124–37.
3. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.
4. Agne M, Huang CH, Hu I, Wang H, Zheng T, Lo SH. Identifying influential regions in extremely rare variants using a fixed-bin approach. BMC Proc. 2011;5 Suppl 9:S3.
5. Zawistowski M, Gopalakrishnan S, Ding J, Li Y, Grimm S, Zollner S. Extending rare-variant testing strategies: analysis of noncoding sequence and imputed genotypes. Am J Hum Genet. 2010;87(5):604–17.
6. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–21.
7. Nica AC, Dermitzakis ET. Expression quantitative trait loci: present and future. Philos Trans R Soc Lond B Biol Sci. 2013;368(1620):20120362.
8. Wei CS, Li M, He ZH, Vsevolozhskaya O, Schaid DJ, Lu Q. A weighted U-statistic for genetic association analyses of sequencing data. Genet Epidemiol. 2014;38(8):699–708.
9. Davis CS. The distribution of a linear combination of chi-square variables. Biometrics. 1982;38(1):279.
10. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2(12):e190.
11. Omori K, Kotera J. Overview of PDEs and their regulation. Circ Res. 2007;100(3):309–27.
12. Wen G, Wessel J, Zhou W, Ehret GB, Rao F, Stridsberg M, Mahata SK, Gent PM, Das M, Cooper RS, et al. An ancestral variant of Secretogranin II confers regulation by PHOX2 transcription factors and association with hypertension. Hum Mol Genet. 2007;16(14):1752–64.