Volume 8 Supplement 1
Genetic Analysis Workshop 18
A partition-based approach to identify gene-environment interactions in genome wide association studies
- Ruixue Fan^{1},
- Chien-Hsun Huang^{1},
- Inchi Hu^{2},
- Haitian Wang^{3},
- Tian Zheng^{1} and
- Shaw-Hwa Lo^{1}
DOI: 10.1186/1753-6561-8-S1-S60
© Fan et al.; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
Abstract
It is believed that almost all common diseases are the consequence of complex interactions between genetic markers and environmental factors. However, few such interactions have been documented to date. Conventional statistical methods for detecting gene-environment interactions are often based on the linear regression model, which assumes a linear interaction effect. In this study, we propose a nonparametric partition-based approach that is able to capture complex interaction patterns. We apply this method to the real hypertension data set provided by Genetic Analysis Workshop 18. Compared with the linear regression model, the proposed approach identifies many additional variants with significant gene-environment interaction effects. We further investigate one single-nucleotide polymorphism identified by our method and show that its gene-environment interaction effect is, indeed, nonlinear. To adjust for the family dependence of phenotypes, we apply different permutation strategies and investigate their effects on the outcomes.
Background
Genome-wide association studies (GWAS) have successfully discovered many common variants associated with complex diseases, but the single-nucleotide polymorphisms (SNPs) identified so far account for a small proportion of the total heritability in quantitative traits [1]. Increasing evidence shows that gene-environment (G×E) interactions are widely involved in the etiology of complex diseases, including diabetes, cancer, and psychiatric disorders [2, 3]. The investigation of G×E interactions will not only facilitate the identification of novel genes whose marginal effects are undetectable, but also provide insights into disease etiology and hence greatly benefit drug development and personalized therapy.
A conventional approach to detecting G×E interactions is the linear regression model $y={\beta}_{0}+{\beta}_{1}G+{\beta}_{2}E+{\beta}_{3}G\times E+\epsilon $, where G is the genotype of a SNP, E is the environmental factor, $\epsilon $ is a normally distributed random error, and ${\beta}_{3}$ is the coefficient corresponding to the interaction term. If ${\beta}_{3}=0$, the conditional effect of the SNP is constant across different levels of the environmental factor and we conclude that there is no G×E interaction. This model assumes a linear interaction effect: given G, the outcome y is linearly related to E. In practice, however, interaction schemes are likely to be more complicated, so the linear model will probably fail to capture the interaction effect. Therefore, there is a pressing need to develop novel statistical approaches for genome-wide G×E interaction studies. Here we propose a nonparametric partition-based approach to detect G×E interactions and conduct a GWAS for hypertension using the real data set provided by Genetic Analysis Workshop 18 (GAW18). For each SNP, both the linear regression model and the proposed method are used to evaluate its interaction effect with each of the 4 environmental factors: age, gender, smoking status, and medicine (use of antihypertensive medication). We note that, compared with the linear model, the proposed method is able to identify many additional SNPs. We further study the interaction pattern between SNP rs17206492 and medicine, and find that this interaction effect is, indeed, nonlinear. We also investigate different permutation strategies in the presence or absence of pedigree dependence of the phenotype.
Methods
Data set
The GAW18 data set consists of GWAS data and whole genome sequence data with longitudinal phenotypes for hypertension and related traits from Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples (T2D-GENES) Project 2. There are 939 individuals in total, and we include in our analysis only the 849 individuals with both phenotype data and imputed sequence information. Each individual has measurements for up to 4 time points. At each visit, systolic blood pressure (SBP) and diastolic blood pressure (DBP) were measured; covariates including age, use of antihypertensive medication, and current tobacco smoking status were also recorded. Gender and pedigree are known for each subject. Genotypes of odd-numbered chromosomes are provided. In our study, we focused on chromosome 3 as suggested by the workshop organizer for the sake of comparison. Although we had access to the answers for the simulated data set, we used only the real data set in our analysis.
A general framework: a partition-based association measure
where ${n}_{i}$ is the number of subjects in partition i, ${\overline{Y}}_{i}$ is the average of the outcome Y for subjects in partition i, and $\overline{Y}$ and ${s}_{y}^{2}$ are the mean and variance of Y over all subjects. It has been shown that, under the null hypothesis that the partition $\Pi$ has no influence on Y, I converges asymptotically to a weighted sum of ${\chi}_{1}^{2}$ random variables [5]. The measure has higher power than linear or logistic regression models, even in sparse partitions.
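As a minimal sketch of how such a score can be computed from the quantities just defined, the Python function below assumes the form $I=\frac{1}{n{s}_{y}^{2}}\sum_{i}{n}_{i}^{2}{\left({\overline{Y}}_{i}-\overline{Y}\right)}^{2}$. The normalization by $n{s}_{y}^{2}$, the use of the population variance, and the name `influence_score` are assumptions made for illustration, not the authors' stated definition.

```python
from collections import defaultdict
from statistics import mean, pvariance


def influence_score(y, labels):
    """Partition-based score under an ASSUMED normalization:
    I = sum_i n_i^2 * (Ybar_i - Ybar)^2 / (n * s_y^2).

    y      -- list of phenotype values, one per subject
    labels -- list of partition labels, one per subject
    """
    n = len(y)
    y_bar = mean(y)
    s2 = pvariance(y)  # assumption: population variance of Y
    if s2 == 0:
        return 0.0  # a constant phenotype carries no signal

    # Group the phenotype values by partition label.
    cells = defaultdict(list)
    for value, label in zip(y, labels):
        cells[label].append(value)

    # Each partition contributes n_i^2 times its squared mean deviation.
    total = sum(len(v) ** 2 * (mean(v) - y_bar) ** 2 for v in cells.values())
    return total / (n * s2)
```

Because each term is weighted by ${n}_{i}^{2}$, partitions holding very few subjects contribute little unless their means deviate strongly from the overall mean.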
G×E association measure I
Partitions created by genotypic and environmental factors
Genotype | E = 0 | E = 1 | E = 2 | Total |
---|---|---|---|---|
G = 0 | $n_{00}$ | $n_{01}$ | $n_{02}$ | $n_{0.}$ |
G = 1 | $n_{10}$ | $n_{11}$ | $n_{12}$ | $n_{1.}$ |
G = 2 | $n_{20}$ | $n_{21}$ | $n_{22}$ | $n_{2.}$ |
Total | $n_{.0}$ | $n_{.1}$ | $n_{.2}$ | $n_{..}$ |
The significance of I_{G×E} is evaluated by permutation.
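The Discussion describes I_{G×E} as the total score I_{T} minus the larger of the two marginal scores. The sketch below follows that description, taking I_{T} as the score of the partition formed by the genotype-by-environment cells and the marginal scores as those of the partitions by G alone and by E alone. The score normalization and all function names are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pvariance


def influence_score(y, labels):
    # ASSUMED normalization: sum_i n_i^2 (Ybar_i - Ybar)^2 / (n * s_y^2)
    n, y_bar, s2 = len(y), mean(y), pvariance(y)
    if s2 == 0:
        return 0.0
    cells = defaultdict(list)
    for value, label in zip(y, labels):
        cells[label].append(value)
    total = sum(len(v) ** 2 * (mean(v) - y_bar) ** 2 for v in cells.values())
    return total / (n * s2)


def gxe_score(y, g, e):
    """Interaction score: I_T - max(marginal score of G, marginal score of E).

    y -- phenotype values
    g -- genotype partition per subject (0, 1, or 2)
    e -- environmental partition per subject (0, 1, or 2)
    """
    i_total = influence_score(y, list(zip(g, e)))  # up to 9 joint cells
    i_g = influence_score(y, g)                    # partition by genotype only
    i_e = influence_score(y, e)                    # partition by environment only
    return i_total - max(i_g, i_e)
```

For a pure crossover pattern, such as y = [0, 1, 1, 0] over the cells (G, E) = (0, 0), (0, 1), (1, 0), (1, 1), both marginal scores are zero and the interaction score is positive. When y depends on G alone, the score is negative.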
Permutation strategies
We consider 3 permutation strategies in our analysis: global permutation, local permutation, and residual permutation. Let ${y}_{ij}$ denote the phenotype of the $j$th individual in the $i$th pedigree. Global permutation permutes phenotypes over all individuals. Local permutation permutes phenotypes within each pedigree. Residual permutation first computes the residual ${e}_{ij}={y}_{ij}-{\overline{y}}_{i.}$ for each individual, where ${\overline{y}}_{i.}$ is the average phenotype for pedigree i, and then permutes the ${e}_{ij}$ over all subjects to obtain a permuted residual ${e}_{ij}^{*}$ for each individual. The permuted phenotypes are ${y}_{ij}^{*}={\overline{y}}_{i.}+{e}_{ij}^{*}$. Both local permutation and residual permutation assume ${y}_{ij}={\overline{y}}_{i.}+{\epsilon}_{ij}$, where $E\left({\epsilon}_{ij}\right)=0$ and the $\left\{{\epsilon}_{ij}\right\}$ are independent. Residual permutation further assumes that the $\left\{{\epsilon}_{ij}\right\}$ are identically distributed.
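The three strategies can be sketched in Python as follows. The function names, the shared `(y, pedigree, rng)` signature, and the p-value convention `(hits + 1) / (n_perm + 1)` are assumptions for illustration; the text does not state how the permutation p value was computed.

```python
import random
from collections import defaultdict
from statistics import mean


def _members(pedigree):
    """Map each pedigree ID to the indices of its members."""
    groups = defaultdict(list)
    for idx, ped in enumerate(pedigree):
        groups[ped].append(idx)
    return groups


def global_permutation(y, pedigree, rng):
    """Permute phenotypes over all individuals (pedigree is ignored)."""
    permuted = list(y)
    rng.shuffle(permuted)
    return permuted


def local_permutation(y, pedigree, rng):
    """Permute phenotypes within each pedigree only."""
    permuted = list(y)
    for idxs in _members(pedigree).values():
        shuffled = [y[i] for i in idxs]
        rng.shuffle(shuffled)
        for i, value in zip(idxs, shuffled):
            permuted[i] = value
    return permuted


def residual_permutation(y, pedigree, rng):
    """Permute residuals e_ij = y_ij - ybar_i over all subjects,
    then add each subject's own pedigree mean back."""
    groups = _members(pedigree)
    ped_mean = {ped: mean(y[i] for i in idxs) for ped, idxs in groups.items()}
    residuals = [y[i] - ped_mean[pedigree[i]] for i in range(len(y))]
    rng.shuffle(residuals)
    return [ped_mean[pedigree[i]] + residuals[i] for i in range(len(y))]


def permutation_p_value(statistic, y, pedigree, permute, n_perm=1000, seed=0):
    """Share of permuted statistics at least as large as the observed one.
    The +1 correction is an assumed convention, not taken from the text."""
    rng = random.Random(seed)
    observed = statistic(y)
    hits = sum(
        statistic(permute(y, pedigree, rng)) >= observed for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```

Here `statistic` would be a function of the phenotype vector alone, for example an interaction score with the genotype and environmental partitions held fixed.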
Results
Partitions created by environmental factors
Partitions based on the summarized quantities of age, smoking status, or medicine
Partition | By age* | By smoking | By medicine |
---|---|---|---|
0 | 16~33.44 | 0 | 0 |
1 | 33.45~50.30 | 1 | 1 |
2 | 50.31~94.20 | 2, 3, 4 | 2, 3, 4 |
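The mapping in the table above can be written as two small Python helpers. This is a sketch under stated assumptions: each subject is taken to have already been reduced to one summarized age and one summarized count per covariate, and the smoking and medicine values 0 to 4 are read as a count over the up to 4 visits. The function and constant names are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of age partitions 0 and 1, taken from the table above.
AGE_UPPER_BOUNDS = [33.44, 50.30]


def age_partition(age):
    """Return 0, 1, or 2 for a summarized age.

    Ages up to 33.44 map to partition 0, ages up to 50.30 to partition 1,
    and anything larger to partition 2.
    """
    return bisect_left(AGE_UPPER_BOUNDS, age)


def count_partition(count):
    """Return the partition for a summarized smoking or medicine value.

    0 -> partition 0, 1 -> partition 1, and 2, 3, or 4 -> partition 2.
    Reading the value as a count over visits is an assumption.
    """
    if not 0 <= count <= 4:
        raise ValueError("summarized value must lie between 0 and 4")
    return min(count, 2)
```

Collapsing the values 2, 3, and 4 into a single partition gives each environmental factor at most 3 levels, matching the layout of the genotype-by-environment table.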
SNPs with significant G×E interaction effects
Number of significant SNPs with p value less than 7.9 × 10^{−7}* (LRM, linear regression model; PBI, partition-based approach; GP, global permutation; LP, local permutation; RP, residual permutation)
Environmental factor | DBP: LRM | DBP: PBI (GP) | DBP: PBI (LP) | DBP: PBI (RP) | SBP: LRM | SBP: PBI (GP) | SBP: PBI (LP) | SBP: PBI (RP) |
---|---|---|---|---|---|---|---|---|
Age | 0 | 4 | 7 | 3 | 6 | 16 | 33 | 20 |
Smoke | 0 | 6 | 3 | 3 | 0 | 0 | 0 | 0 |
Gender | 0 | 42 | 37 | 36 | 0 | 1 | 1 | 1 |
Medicine | 4 | 80 | 53 | 33 | 1 | 65 | 65 | 57 |
Effect of different permutation strategies
p Values for testing the pedigree dependence of SBP and DBP
Phenotype | ANOVA test | Kruskal-Wallis test |
---|---|---|
SBP | 0.155 | 0.433 |
DBP | 0.000625 | 0.0004226 |
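The test statistics behind such a table can be sketched in plain Python, assuming the groups being compared are pedigrees (one list of phenotype values per pedigree), as the table caption suggests. The functions below return only the one-way ANOVA F statistic and the Kruskal-Wallis H statistic (without tie correction). Turning them into p values requires the F distribution with (k − 1, N − k) degrees of freedom and the ${\chi}^{2}$ approximation with k − 1 degrees of freedom, which a statistics library would supply. The function names are hypothetical.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic; `groups` holds one list per pedigree."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = mean(v for g in groups for v in g)
    means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H statistic, without the correction for ties."""
    pooled = [v for g in groups for v in g]
    ranks = midranks(pooled)
    n_total = len(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * h - 3 * (n_total + 1)
```

A large F or H relative to its reference distribution indicates that phenotype levels differ between pedigrees, which is the situation in which permutation within pedigrees or of residuals becomes relevant.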
Discussion
In this paper, we have proposed a partition-based approach, PBI, to detect G×E interactions; it is nonparametric and model-free. The test statistic is derived from a partition-based measure I, and the interaction information score I_{G×E} is defined as the difference between the total score I_{T} and the maximum of the marginal scores. Intuitively, if the genetic and environmental factors have a strong interaction effect, I_{T} will be far greater than both marginal scores; hence I_{G×E} will be positive and large. If not, I_{T} will be no greater than at least one of the marginal scores, and I_{G×E} will not be positive. Therefore, I_{G×E} evaluates the amount of influence of the G×E interaction on the phenotype.
When applied to the real hypertension data set provided by GAW18, PBI identified many more markers than the traditional linear regression method. Because our approach is model-free, it is able to capture complicated interaction patterns that are difficult to detect with a linear model. The significance of I_{G×E} is evaluated by permutation. Local permutation (LP) and residual permutation (RP) adjust effectively for the family dependence of the phenotype. Although the proposed procedure selects more SNPs than linear regression, there is very little experimental evidence of G×E interactions for hypertension in the current literature against which to verify our findings. Therefore, biological studies will be required to validate our results. Modifications of PBI have successfully identified gene-gene interactions and constructed genetic networks for breast cancer [6] and rheumatoid arthritis [7]. Moreover, PBI can be extended to evaluate the interaction effects between rare variants and environmental factors. Because of the low frequencies of rare variants (<1%), we can apply a gene-based approach by collapsing rare variants in a gene [8–11] and creating partitions based on the collapsed information.
Declarations
Acknowledgements
This research is supported by National Institutes of Health Grant R01 GM070789, GM070789-0551 and by Hong Kong Research Grant Council (642207 and 601312).
The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
References
1. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al: Finding the missing heritability of complex diseases. Nature. 2009, 461: 747-753. 10.1038/nature08494.
2. Andreasen CH, Mogensen MS, Borch-Johnsen K, Sandbæk A, Lauritzen T, Sorensen TI, Hansen L, Almind K, Jorgensen T, Pedersen O, et al: Non-replication of genome-wide based associations between common variants in INSIG2 and PFKP and obesity in studies of 18,014 Danes. PLoS One. 2008, 3: e2872. 10.1371/journal.pone.0002872.
3. Hamza TH, Chen H, Hill-Burns EM, Rhodes SL, Montimurro J, Kay DM, Tenesa A, Kusel VI, Sheehan P, Eaaswarkhanth M, et al: Genome-wide gene-environment study identifies glutamate receptor gene GRIN2A as a Parkinson's disease modifier gene via interaction with coffee. PLoS Genetics. 2011, 7: e1002237. 10.1371/journal.pgen.1002237.
4. Kraft P, Yen YC, Stram DO, Morrison J, Gauderman WJ: Exploiting gene-environment interaction to detect genetic associations. Hum Hered. 2007, 63: 111-119. 10.1159/000099183.
5. Chernoff H, Lo SH, Zheng T: Discovering influential variables: a method of partitions. Ann Appl Stat. 2009, 3: 1335-1369.
6. Lo SH, Zheng T: A demonstration and findings of a statistical approach through reanalysis of inflammatory bowel disease data. Proc Natl Acad Sci U S A. 2004, 101: 10386-10391. 10.1073/pnas.0403662101.
7. Huang CH, Cong L, Xie J, Qiao B, Lo SH, Zheng T: Rheumatoid arthritis-associated gene-gene interaction network for rheumatoid arthritis candidate genes. BMC Proc. 2009, 3 (suppl 7): S75. 10.1186/1753-6561-3-s7-s75.
8. Dering C, Pugh E, Ziegler A: Statistical analysis of rare sequence variants: an overview of collapsing methods. Genet Epidemiol. 2011, 35: S12-S17. 10.1002/gepi.20643.
9. Fan R, Huang CH, Lo SH, Zheng T, Ionita-Laza I: Identifying rare disease variants in the Genetic Analysis Workshop 17 simulated data: a comparison of several statistical approaches. BMC Proc. 2011, 5: S17.
10. Chen G, Wei P, DeStefano AL: Incorporating biological information into association studies of sequencing data. Genet Epidemiol. 2011, 35: S29-S34. 10.1002/gepi.20646.
11. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X: Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011, 89: 82-93. 10.1016/j.ajhg.2011.05.029.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.