- Proceedings
- Open Access
Methods to evaluate rare variants gene-age interaction for triglycerides
- Tony Huayang Gao^{1},
- Jianjun Zhang^{2},
- Diaz Medina Miguelangel^{3} and
- Xuexia Wang^{2}Email author
- Published: 17 September 2018
Abstract
Triglycerides are an important measure of heart health. Although more than 90 genes have been found to be associated to lipids, they only explain 12 to 15% of the variance in lipid levels. Evidence suggests that age may interact with the genetic effect on lipid levels. Existing methods to detect the main effect of rare variants cannot be readily applied for testing the gene environment interaction effect of rare variants, as those methods either have unstable results or inflated Type I error rates when the main effect exists. To overcome these difficulties, we developed two statistical methods: testing of optimally weighted combination of single-nucleotide polymorphism (SNP) environment interaction (TOW-SE) and a variable weight TOW-SE (VW-TOW-SE) to test the gene environment interaction effect of rare variants by grouping SNPs into biologically meaningful SNP-sets (SNPs in a gene or pathway) to improve power and interpretability. The proposed methods can be applied to either continuous or binary environmental variables, and to either continuous or binary outcomes. Simulation studies show that Type I error rates of the proposed methods are under control. Comparing the two methods with the existing interaction sequence kernel association test (iSKAT), the VW-TOW-SE is the most powerful test and the TOW-SE is the second most powerful test when gene environment interaction effect exists for both rare and common variants. The three tests were applied to the GAW20 simulated data, among the five regions in which the main effect of common SNPs was simulated and the gene–age interaction effect was not included. As expected, none of the tests indicated positive results.
Background
Highly heritable triglycerides (TG) [1] are an important measure of heart health. Having excess levels of TG can increase the risk of heart disease. Identified common variants only explain 12% to approximately 15% of the variance in lipid levels [2]. A substantial proportion of lipid heritability is unexplained [3]. This suggests that rare (minor allele frequency [MAF] < 1%) or intermediate variants (0.01 < MAF < 0.05) with potentially larger effect sizes or other mechanisms, such as gene–environment interactions, may play a role in explaining the substantially missing heritability.
Clear evidence shows that lipids vary by age. A handful of lipid loci with age-dependent effects were identified from candidate gene studies and genome-wide association study (GWAS) [4, 5]. However, few of these explored the role of gene–age interaction for rare and intermediate variants in lipid levels. More than 74.6% of variants are rare and intermediate variants [6], which may have a larger effect size than common variants and explain substantial proportions of lipid variance. In this GAW20 study, we attempted to detect the effect of gene–age interactions on TG for rare and intermediate variants with novel statistical methods.
Due to the allelic heterogeneity and the extreme rarity of individual variants [7], most existing methods focus on improving the power of detecting gene–environment (G × E) interactions only for individual markers, especially for common variants, and are not optimal for detecting rare variants. Although there has been interest in multiple-marker analysis by grouping single-nucleotide polymorphisms (SNPs) into biologically meaningful SNP-sets (eg, SNPs in a gene or pathway) to improve power and interpretability, the existing SNP set analysis has focused on testing for the marginal effect of a SNP set [8, 9]. Limited work has been done on testing the interactions between a SNP set and an environmental variable, especially as it pertains to rare variants. Although the SNP-set–based interaction sequence kernel association test (iSKAT) [10] can be applied to detect G × E interactions in rare variants, its power is very restricted and lacks robustness to the shape of the data in many circumstances. Motivated by the need for powerful methods to test G × E interactions for rare variants, we developed two novel methods: testing of optimally weighted combination of SNP environment interaction (TOW-SE) and a variable weight TOW-SE (VW-TOW-SE) to identify G × E interactions for SNP sets of common and/or rare variants in GWAS, exome, or next-generation sequencing data. Our simulation studies show that the Type I error rates of the proposed methods are under control. Comparing the two methods with the iSKAT, and the VW-TOW-SE is the most powerful test, TOW-SE is the second most powerful test when G × E interaction effect exists for both rare and common variants.
The Genetic Analysis Workshop (GAW) 20 “SimulationBestOneRepresentative” data includes TG levels before and after treatment with fenofibrate and genotyped genome-wide SNPs from the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study [11]. We imputed chromosomes 1, 6, 8, 9, 10, and 17, where five major main-effect causal SNPs of TG reside. We applied the proposed methods (TOW-SE and VW-TOW-SE) and iSKAT to the unrelated individuals (sample size n = 246) to test TG susceptible to gene–age interactions on the five imputed genes. The main effect of common SNP was simulated in the five regions. However, the gene–age interaction effect was not included. As expected, using the proposed methods, none of the regions indicated significantly gene–age interaction effects on TG.
Methods
Consider a sample of n individuals. Each individual has been genotyped at M variants in a genomic region (a gene or a pathway). For the i^{th} individual, denote y_{i} as the trait value (continuous or binary); E_{i} as the environmental variable (continuous or binary); G_{i} = (g_{i1}, ⋯, g_{iM}) as the genotypic scores at M variants, where g_{im} ∈ {0, 1, 2} is the number of minor alleles the i^{th} individual has at the m^{th} variant. Z_{i} denotes the potential confounder covariates.
We use the generalized linear model (GLM):
to model the relationship between trait values and G × E interactions E_{i}G_{i}, where f (∙) is a monotone “link” function. For a quantitative trait, f (∙) will be an identity link function. For a binary trait, a logit link function will be used. Coefficients of each term in Eq. (1) are denoted by α_{0}, a, β, Ϛ, and η, respectively. To test for the G × E interaction for a SNP set of M SNPs is equivalent to testing the null hypothesis H_{0} : β = 0 in Eq. (1).
To test H_{o} : β = 0 in equation (1) is equivalent to test H_{o} : β^{∗} = 0 in equation (2). Sha et al. [12] proposed a score test to test H_{o} : β^{∗} = 0 in GLM. However, for rare variants’ SNP–environment interactions, the score test may lose power as a consequence of the sparse data and a large degree of freedom. To increase power by effectively using information from data, we proposed to test the G × E interactions by testing the effect of a weighted combination of SNP–environment interactions, \( {\overset{\sim }{x}}_i={\sum}_{m=1}^M{w}_m{\overset{\sim }{x}}_{im}. \)
It reaches its maximum \( {S}_o\left({w}_1^0,\cdots, {w}_M^0\right)=n\sum \limits_{i=1}^n\left({\overset{\sim }{y}}_i-\overline{\overset{\sim }{y}}\right)\left({\overset{\sim }{x}}_i^0-{\overline{\overset{\sim }{x}}}^0\right)/{\sum}_{i=1}^n{\left({\overset{\sim }{y}}_i-\overline{\overset{\sim }{y}}\right)}^2 \) when \( {w}_m^0=\frac{\sum \limits_{i=1}^n\left({\overset{\sim }{y}}_i-\overline{\overset{\sim }{y}}\right)\left({\overset{\sim }{x}}_{im}-{\overline{\overset{\sim }{x}}}_m\right)}{\sum \limits_{i=1}^n{\left({\overset{\sim }{x}}_{im}-{\overline{\overset{\sim }{x}}}_m\right)}^2} \); \( {\overset{\sim }{x}}_i^0={\sum}_{m=1}^M{w}_m^0{\overset{\sim }{x}}_{im}, \) as rare variants are essentially independent. Thus, \( {w}_m^0 \) is the optimal weight. We define \( {T}_{T- SE}=\sum \limits_{i=1}^n\left({\overset{\sim }{y}}_i-\overline{\overset{\sim }{y}}\right)\left({\overset{\sim }{x}}_i^0-{\overline{\overset{\sim }{x}}}^0\right) \) as the statistic to Test the effect of the Optimally Weighted combination of SNP-Environment interactions (TOW-SE), which is equivalent to \( {S}_o\left({w}_1^0,\cdots, {w}_M^0\right) \) when we use a permutation test to evaluate p-values.
We analytically derive optimal weights for TOW-SE. The optimal weight \( {w}_m^0 \) will put a big weight to SNP–environment interactions that have strong association with the trait of interest and also adjust the direction of the association. Moreover, it will put big weights to SNP–environment interactions with small variations that are often rare variants. iSKAT assigns weights to variants based on the MAFs via a beta function. It put decent nonzero weights for variants with MAF in (0.01, 0.05). If MAFs of causal variants are not in the range of (0.01, 0.05), iSKAT will be less powerful than TOW-SE. TOW-SE targets rare variants and may lose power when testing G × E effects of both rare and common variants.
To test for the G × E interaction of both rare and common variants, we propose variable weight TOW-SE (VW-TOW-SE). We divide variants into rare and common. Let T_{r} and T_{c} denote the test statistic of TOW-SE for rare and common variants, respectively. Let \( {T}_{\lambda }=\lambda \frac{T_r}{\sqrt{\mathit{\operatorname{var}}\left({T}_r\right)}}+\left(1-\lambda \right)\frac{T_c}{\sqrt{\mathit{\operatorname{var}}\left({T}_c\right)}}. \) Denote p_{λ} as the p-value of T_{λ}. The test statistic of VW-TOW-SE is defined as T_{VW − T − SE} = min_{0 ≤ λ ≤ 1}p_{λ}. We will use permutations to evaluate p-values of both T_{T − SE} and T_{VW − T − SE}.
Simulations
We consider two scenarios: (a) with main effect and (b) without main effect in the model (4). When there are no main effects, we set the magnitudes of vector α_{2} = 0.3 for each element and their signs are randomly sampling from (−1, 1). When there are no main effects, we set α_{2} = 0. To evaluate the Type I error, we set β and β^{c} all to 0. To evaluate power, we vary the number of non-zero elements β_{j} in β. We set the magnitude of the nonzero β_{j} as |β_{j}| = c, and increase c from 0.02 to 0.1. Of the β_{j}, 50% are positive. β^{c} is positive and twice the magnitude of β_{j}. The sample size is 2000 for each scenario. P values are estimated by 10,000 permutations. The Type I error rates and power are evaluated using 1000 replicated samples.
GAW20 data analysis
We applied TOW-SE, VW-TOW-SE, and iSKAT to the GAW20 “SimulationBestOneRepresentative” data, which includes TG levels before and after treatment with fenofibrate and genotyped genome-wide SNPs from the GOLDN project [11]. We imputed chromosomes 1, 6, 8, 9, 10, and 17 with minimac2 software [16]. Five major main effect causal SNPs (rs9661059, rs736004 [LYRM4], rs1012116, rs10828412, and rs4399565 [HS3ST3A1]) of TG reside on chromosomes 1, 6, 8, 9, 10, and 17, respectively. The 1000 Genomes project haplotypes integrated phase I served as the reference panel. It includes 1092 individuals. Our analysis was based on 246 unrelated individuals. Both pre-treatment and post-treatment TG values were provided for two visits. We used the log ratio of the pre-treatment mean and the post-treatment mean as the phenotype trait in our G × E interaction analysis. For individuals who did not have two visits, we just used the existing value. The median age is 64 years (range: 28–83 years). We used the centered age in our analysis. The median of the TG ratio is 1.54 (range: 0.72–4.24). We excluded 8 individuals from our analysis because of completely missing post-treatment TG values.
We evaluated the performance of TOW-SE, VW-TOW-SE, and iSKAT by testing G × E interaction effect on TG for the aforementioned 5 regions. Each region consists of 10 SNPs, and the fifth SNP in each region is the major main effect causal SNP of TG. Additionally, we assessed the effect of region size on the power of the tests using the region of rs4399565.
Results
Type I error rates for both rare and common variants in the presence of main effects (top panel) and in the absence of main effects (bottom panel) for n = 2000
α-level | TOW-SE | iSKAT | VW-TOW-SE | |
---|---|---|---|---|
With main effect | ||||
n = 2000 | 0.050 | 0.050 | 0.055 | 0.059 |
0.010 | 0.011 | 0.012 | 0.015 | |
0.001 | 0.000 | 0.001 | 0.000 | |
Without main effect | ||||
n = 2000 | 0.05 | 0.051 | 0.061 | 0.056 |
0.01 | 0.011 | 0.013 | 0.009 | |
0.001 | 0.000 | 0.003 | 0.001 |
Results of testing G × E interaction in the five causal regions using the three methods
Region name (SNP set) | Median of MAF (range) | P _{ TOW-SE} | P _{ iSKAT} | P _{ VW-TOW-SE} |
---|---|---|---|---|
rs736004 | 0.013 (0.001–0.387) | 0.020 | 0.022 | 0.019 |
rs1012116 | 0.015 (0.002–0.167) | 0.024 | 0.030 | 0.021 |
rs4399565 | 0.006 (0.001–0.449) | 0.022 | 0.023 | 0.019 |
rs9551059 | 0.007 (0.001–0.275) | 0.031 | 0.030 | 0.037 |
rs10828412 | 0.003 (0.001–0.394) | 0.100 | 0.099 | 0.095 |
Region size effect analysis for the three methods based on the region of rs4399565
Region size | Median of MAF (range) | P _{ TOW-SE} | P _{ iSKAT} | P _{ VW-TOW-SE} |
---|---|---|---|---|
10 SNPs | 0.006 (0.001–0.449) | 0.022 | 0.023 | 0.019 |
20 SNPs | 0.008 (0.001–0.449) | 0.014 | 0.018 | 0.015 |
30 SNPs | 0.008 (0.001–0.449) | 0.013 | 0.014 | 0.012 |
Discussion
The computation time for TOW-SE and VW-TOW-SE using 10,000 permutations for analyzing 1000 individuals in a region that includes 50 SNPs is 9 s and 20 s, respectively. Suppose a whole genome sequencing data with 13,498,188 SNPs, TOW-SE will take 674 h (28 days) to conduct a whole genome analysis. The effect analysis of the region size suggests that the three methods will perform better when the region size is larger. However, the larger the region size, the higher chance for collinearity to appear in the region, which makes the computation more complex. To minimize the problem of collinearity, we recommend a region size between 10 SNPs and 30 SNPs when we apply the proposed methods to a genome-wide scan.
Conclusions
In summary, we developed two novel statistical methods: TOW-SE and VW-TOW-SE by grouping SNPs into biologically meaningful SNP-sets, which improved power and interpretability. Simulation studies show that the proposed methods yielded well-controlled Type I error rates under all study conditions. When gene environment interaction effect exists for both rare and common variants, VW-TOW-SE is the most powerful test, TOW-SE is the second most powerful test, and iSKAT is the least powerful test.
Declarations
Funding
Publication of this article was supported by NIH R01 GM031575.
Availability of data and materials
The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW), but restrictions apply to the availability of these data, which were used under license for the current study. Qualified researchers may request these data directly from GAW.
About this supplement
This article has been published as part of BMC Proceedings Volume 12 Supplement 9, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcproc.biomedcentral.com/articles/supplements/volume-12-supplement-9.
Authors’ contributions
XW designed the overall study. JZ, MDM, and TG conducted statistical analyses. TG, JZ, MDM, and XW drafted the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Coram MA, Duan Q, Homann TJ, Thornton T, Knowles JW, Johnson NA, Robinson JG. Genome-wide characterization of shared and distinct genetic components that influence blood lipid levels in ethnically diverse human populations. Am J Hum Genet. 2013;92(6):904–16.View ArticleGoogle Scholar
- Willer CJ, Schmidt EM, Sengupta S, Peloso GM, Gustafsson S, Kanoni S, Ganna A, Chen J, Buchkovich ML, Mora S, et al. Discovery and refinement of loci associated with lipid levels. Nat Genet. 2013;45(11):1274–83.View ArticleGoogle Scholar
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindor LA, Hunter DJ, Cho JH. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.View ArticleGoogle Scholar
- Snieder H, Van Doornen LJ, Boomsma DI. The age dependency of gene expression for plasma lipids, lipoproteins, and apolipoproteins. Am J Hum Genet. 1997;60(3):638.PubMedPubMed CentralGoogle Scholar
- Carvalho-Wells AL, Jackson KG, Gill R, Olano-Martin E, Lovegrove JA, Williams CM, Minihane AM. Interactions between age and apoE genotype on fasting and postprandial triglycerides levels. Atherosclerosis. 2010;212(2):481–7.View ArticleGoogle Scholar
- Morrison AC, Voorman A, Johnson AD, Liu X, Yu J, Li A, Muzny D, Yu F, Rice K, Zhu C, et al. Whole-genome sequence-based analysis of a model complex trait, high density lipoprotein cholesterol. Nat Genet. 2013;45(8):899–901.View ArticleGoogle Scholar
- Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–21.View ArticleGoogle Scholar
- Tzeng JY, Zhang D, Chang SM, Thomas DC, Davidian M. Gene-trait similarity regression for multimarker-based association analysis. Biometrics. 2009;65(3):822–32.View ArticleGoogle Scholar
- Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X. Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet. 2010;86(6):929–42.View ArticleGoogle Scholar
- Lin X, Lee S, Wu MC, Wang C, Chen H, Li Z, Lin X. Test for rare variants by environment interactions in sequencing association studies. Biometrics. 2016;72(1):156–64.View ArticleGoogle Scholar
- Irvin MR, Kabagambe EK, Tiwari HK, Parnell LD, Straka RJ, Tsai M, Ordovas JM, Arnett DK. Apolipoprotein E polymorphisms and postprandial triglyceridemia before and after fenofibrate treatment in the genetics of lipid lowering and diet network (GOLDN) study. Circ Cardiovasc Genet. 2010;3(5):462–7.View ArticleGoogle Scholar
- Sha Q, Wang X, Wang X, Zhang S. Detecting association of rare and common variants by testing an optimally weighted combination of variants. Genet Epidemiol. 2012;36(6):561–71.View ArticleGoogle Scholar
- Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, Croteau-Chonka DC. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015;518(7538):179–206.View ArticleGoogle Scholar
- Shi G, Gu CC, Kraja AT, Arnett DK, Myers RH, Pankow JS, Rao DC. Genetic effect on blood pressure is modulated by age. Hypertension. 2009;53(1):35–41.View ArticleGoogle Scholar
- Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78(4):629–44.View ArticleGoogle Scholar
- Fuchsberger C, Abecasis GR, Hinds DA. minimac2: faster genotype imputation. Bioinformatics. 2015;31(5):782–4.View ArticleGoogle Scholar