- Open Access
Relating drug response to epigenetic and genetic markers using a region-based kernel score test
© The Author(s). 2018
- Published: 17 September 2018
In GAW20, we investigated the association of specific genetic regions of interest (ROIs) with log-transformed triglyceride (TG) levels following lipid-lowering medication using epigenetic and genetic markers. The goal was to incorporate kernels for cytosine-phosphate-guanine (CpG) markers and compare the kernels to a purely parametric model. Post-treatment TG levels were investigated for post-methylation data at CpG sites and region-specific SNPs and adjusted for pre-treatment TG levels and age, in independent individuals only (real data: n = 150; simulated data, replicate 84: n = 111). In both data sets, our single-CpG-marker results using kernels and linear regression were in good agreement. In the real data, we investigated the introns of the CPT1A gene previously reported as associated with TG levels as separate ROIs, and were able to find hints of an association of cg17058475 and cg00574958 with post-treatment TG levels. In the simulated data, we investigated a total of 10 regions, in which the 5 causal and 5 non-causal markers lie, respectively, with increased methylation variances, yielding plausible results for the 3 window sizes. Overall, this indicates that kernels for CpG markers are feasible. An interaction regression model for the causal SNP with the nearest CpG marker identified an effect for the SNPs with the three greatest heritabilities simulated. The simulation model assumed full SNP effect only for unmethylated regions decreasing to zero in the case of full methylation. Thus, in the context of a clear candidate setting, interaction between epigenetic and genetic data may enhance information, albeit nominally, even with small sample sizes. Relieving the burden of multiple testing, developing kernels further to analyze data from multiple omics jointly is well warranted.
The human genome is a highly intricate system comprising various genic and gene regulatory elements. Epigenetic intervention turns it into a jungle. High-throughput technologies have been used to profile phenotypes in multiple omics dimensions. In order to dissect complex genomic traits, statistical tools need to handle a multitude of markers both within and across such dimensions. The kernel score test (KST) enables us to test a set of markers for an overall association with a phenotype , such as those markers within a region of interest (ROI). It highly reduces the burden of multiple testing without simply aggregating the data. KST can be applied to common and rare variants, or adjusted for covariates and applied to data of genome-wide association studies (GWAS), epigenome-wide association studies, or sequence data (where it was named the sequence kernel association test ).
Our previous research [3, 4] focused on genetic data and mainly on logistic regression. In the current analysis, we focused on the use of the KST, employing methylation markers to investigate a normally distributed drug response in independent participants. In several regressions we modeled post treatment log-transformed triglyceride (post-lnTG) as a function of epigenetic and/or genetic markers or kernels thereof, adjusting for pre-treatment log-transformed triglyceride (pre-lnTG) and age. We initially investigated the use of KST for epigenetic markers alone and then with genetic markers. We analyzed both the real and simulated data for a selection of candidate ROIs.
The GAW20 data were provided by the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study , a longitudinal family-based study involving 991 participants of European descent to localize novel loci contributing to triglyceride (TG) response in connection with fat loading and fenofibrate treatment. Data were collected at 4 time points: visits 1, 2, 3, and 4. We performed a pre−/post-treatment analysis using only visits 2 and 4, as the time between visits 1, 2 and 3, 4 was only one 1 day and the pre- and post-methylation data were only recorded at visits 2 and 4. To normalize distributions, TGs were log-transformed (pre-lnTG and post-lnTG). In addition, we used post-methylation data, GWAS data, and age.
GAW20 also provided simulated data using a family structure and genotypes identical to those in the original data. The answer sheet we were provided described the simulation model. Post-treatment methylation levels were modeled based on pre-methylation with a higher variation at ten 10 CpG cytosine-phosphate-guanine (CpG) markers than for all others. TGs were influenced by five 5 causal SNPs with decreasing heritability and several polygenes. However, the influence of each of the 5 causal SNPs on log TG change decreased with increasing degree of methylation in that particular ROI. Five of the 10 CpG markers were those close to the causal SNPs. We defined the corresponding region as causal ROI and the other mentioned regions as non-causal ROIs.
Since we were focusing only on independent individuals with available phenotypic and genetic/epigenetic information, we only included exactly one member from each pedigree. Hence, we included 150 individuals from the real data and 111 individuals from the simulated data of replicate 84.
Selection of ROIs
We first investigated models with and without kernels with post-treatment methylation data. To define ROIs for the simulated data, we consulted the GAW20 answers for the 5 causal and 5 noncausal CpG markers and associated SNPs to define 10 ROIs with the boundaries lying 0 kilobase pairs (kbp), 15 kbp, and 3 kbp upstream and downstream of each of these 5 + 5 CpG markers. In the real data, we formulated sets of CpG markers in the CPT1A gene defined by intronic boundaries. All the genomic information is in build hg18. After this we began to investigate genetic markers and used the knowledge of the model for effect simulation.
Regression models and KST
To investigate the association of specific ROIs with post-lnTGs, we employed linear and semi-parametric kernel regressions, all adjusted for age and pre-lnTG.
p Values are not adjusted for multiple testing. The significance level is set to 5%. All analyses were performed in R.
Simulated data: association of 10 candidate CpG markers and their ROIs with post-lnTG adjusted for pre-lnTG and age
KST, Window Size
Simulation data: association of 5 causal SNPs and their nearest CpG marker with post-lnTG, adjusted for pre-lnTG and age
CpG × SNP
Real data: association of sets of CpG markers in 14 introns of the CTP1A gene with post-lnTG, adjusted for pre-lnTG and age
Real data: association of 4 CpG markers in intron 1 of the CPT1A gene with post-lnTG, adjusted for pre-lnTG and age
In this analysis, the small sample size that results from using only independent individuals limits our power. Nevertheless, we were still able to detect nominally significant associations for 3 of the 5 causal SNPs from the simulation employing a model of interaction with their nearest CpG marker. We also found hints for association of cg17058475 and cg00574958 in intron 1 of the CPT1A gene with TGs in the real data by employing KST and the linear regression model. The GOLDN study also reports cg17058475 and cg00574958 are also reported by GOLDN study to be as associated with TGs and cg00574958 as correlated with CPT1A expression . Working on the simulated data, we investigated CpG markers in 10 ROIs. A region several kbp in size contains far fewer CpG markers than SNP markers. We revealed for the KST that the window size 0 kbp is similar to linear regression and higher window sizes are similar to each other, yet different from 0 kbp. As no direct CpG effect was modeled, no additional conclusions can be drawn. However, the application of the kernel proves feasible with CpG markers, and not only with genetic markers [3, 4]. Here the use of the kernel is not crucial, as the effect was only given by a simulation model for the causal SNP and the nearest CpG marker, not involving other markers in the region.
The most common design for a treatment-response study is a cohort design with independent people that requires individuals to take the treatment. This might often be unethical for families as a whole (albeit reasonable in the GOLDN study). As we wanted to see the behavior of kernels and regression in this general context, we opted to focus on unrelated individuals. Our strategy can be adapted for family data. In GAW19, we used KST on family data by introducing a design matrix—ZcT in Eq. (1)—where random effects c caused by familial polygenic background are adjusted for Y and Y = XbT + ZcT h(Z) + ε (for more details, see Malzahn et al. ). Several GAW20 contributions to this volume used the theoretical or the estimated kinship matrix to construct the random effect in a linear mixed model.
Our analysis with multi-omics data in a linear regression interaction model in comparison to single omics data in KST and linear regression framework emphasizes that careful integration of multi-omics data might enable researchers to explain a greater proportion of the variance in complex traits, even in small samples. Consequently, it would seem highly warranted to extend kernels to incorporate multiple types of omics data.
The authors thank Andrew Entwistle for editing the manuscript.
Publication of this article was supported by NIH R01 GM031575. This research was supported by the German Research Foundation grants DFG: RTG 1644 (publication charges); CRG214: BI 576/2; and PsyCourse (SCHU 1603/7–1; FA241/16–1).
Availability of data and materials
The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW), but restrictions apply to the availability of these data, which were used under license for the current study. Qualified researchers may request these data directly from GAW.
About this supplement
This article has been published as part of BMC Proceedings Volume 12 Supplement 9, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcproc.biomedcentral.com/articles/supplements/volume-12-supplement-9.
SY and HB conceived the study and wrote the manuscript. SY, PB, SF, and SP participated in data retrieval and analysis. All authors read and approved the final version of this manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Schaid DJ. Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum Hered. 2010;70:109–31.View ArticleGoogle Scholar
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.View ArticleGoogle Scholar
- Malzahn D, Friedrichs S, Bickeböller H. Comparing strategies for combined testing of rare and common variants in whole sequence and genome-wide genotype data. BMC Proc. 2016;10:269–73.View ArticleGoogle Scholar
- Freytag S, Manitz J, Schlather M, Kneib T, Amos CI, Risch A, Chang-Claude J, Heinrich J, Bickeböller H. A network-based kernel machine test for the identification of risk pathways in genome-wide association studies. Hum Hered. 2016;76(2):64–75.View ArticleGoogle Scholar
- Irvin MR, Zhi D, Joehanes R, Mendelson M, Aslibekyan S, Claas SA, Thibeault KS, Patel N, Day K, Jones LW, et al. Epigenome-wide association study of fasting blood lipids in the genetics of lipid-lowering drugs and diet network study. Circulation. 2014;130(7):565–72.View ArticleGoogle Scholar
- Davies R. Algorithm as 155: the distribution of a linear combination of chi-2 random variables. J R Stat Soc Ser C Appl Stat. 1980;29:323–33.Google Scholar
- Duchesne P, Lafaye De Micheaux P. Computing the distribution of quadratic forms: further comparisons between the Liu-tang-Zhang approximation and exact methods. Comput Stat Data Anal. 2010;54:858–62.View ArticleGoogle Scholar