- Open Access
Methods for detecting methylation by SNP interaction in GAW20 simulation
© The Author(s). 2018
- Published: 17 September 2018
To examine whether single-nucleotide polymorphism (SNP) by methylation interactions can be detected, we analyzed GAW20 simulated triglycerides at visits 3 and 4 against baseline (visits 1 and 2) under 4 general linear models and 2 tree-based models in 200 replications of a sample of 680 individuals. Effects for SNPs, methylation cytosine-phosphate-guanine (CpG) effects, and interactions for SNP/CpG pairs were included. Causative SNPs/CpG pairs distributed on autosomal chromosomes 1 to 20 were tested to examine sensitivity. We also tested noncausative SNP/CpG pairs on chromosomes 21 and 22 to estimate the empirical null. We found reasonable power to detect the main causative loci, with the exact power depending on sample size and strength of effects at the SNP and CpG sites.
DNA methylation is an important epigenetic mark at transcriptional start sites, regulatory elements, repeat sequences, or within a gene . Methylation’s main effect is to silence genes, which are dynamically regulated in expression. Methylation/demethylation confers genome stability, gene expression control, and contributions in biological functions and development .
We analyzed GAW20 simulated methylation and triglyceride (TG) levels at time points before and after “genomethate” treatment. We examined Type I error and power for identifying single-nucleotide polymorphisms (SNPs) by methylation interactions under several methods with different models. We conducted analyses with prior knowledge of solutions to the GAW20 problem. Our goal was to examine the feasibility of detecting gene by methylation interactions.
The GAW20 simulated data contains 200 replicates, each with SNP by methylation interaction effects at 5 main effect SNP/CpG (cytosine-phosphate-guanine) pairs and 100 very-small-effect “background” SNP sites. All simulated causative SNPs were distributed on chromosomes 1 to 20. Chromosomes 21 and 22 were left without any simulated effects, so they could be used for tests under the null distribution. We identified 2267 SNPs with nearby methylation markers on chromosomes 21 and 22, which we used to generate empirical null distributions for identifying SNP by methylation interactions. For full details of the GAW20 simulation, please refer to the data description paper Aslibekyan S, et al. (2017) .
We focused on 4 methods. For the first 2 methods, we developed a set of overlapping models:
Model 1a: Post = β0 + β1SNP + β2CpG + β3SNP ∗ CpG + βC + ε.
This is the generating model for the simulation: it includes SNP main effect, CpG methylation (expressed as (1 − CpG) because less methylation results in more expression) and their interaction effect on posttreatment TGs. Post is the average of log (TG3) and log (TG4); β0 is the intercept; β1SNP is the SNP effect; β2CpG is the methylation effect; β3SNP ∗ CpG is their interaction effect; βC corresponds to a vector of covariates [age, age2, sex, and the average of log(TG1) and log(TG2) before “genomethate” treatment (Pre)]; and ε is the residual.
Model 1b: Post = β0 + β1SNP ∗ CpG + βC + ε
This is a reduced version of model 1a that includes a SNP-by-methylation interaction, but no SNP or methylation main effects. Because there are strong interaction effects at the 5 main simulated SNP-by-CpG effect sites, we wanted to test whether this model had better power because of having fewer terms.
Model 2a: δTG = β0 + β1SNP + β2δ_CpG + β3SNP∗ δ_CpG + βC + ε
This model represents a “reasonable” model that is not the generating model. Because in practice the “true” model is unknown, we also applied a model differing from the generating model. Included are the main effect of the SNP, the main effect δ_CpG (methylation change pre- to posttreatment), and their interaction effect on the δTG pre- to posttreatment [the change of average log (TG) at times 3 and 4 versus average of log (TG) at times 1 and 2]. Here δTG is change of TG (Post – Pre); δ_CpG is the methylation difference between visits 4 and 2; and βC is the beta coefficient for the covariates (age, age2, and sex).
Model 2b: δTG = β0 + β1SNP∗ δ_CpG + βC + ε
This is a reduced version of model 2a, in which only the SNP by δ_CpG methylation interaction effect is included. The rationale for this model was the same as with model 1b.
Method I: General linear models
Method II: Mixed models
Method III: Regression trees
Method IV: Random forests
Random forests predict outcome variables by averaging the predictions of a large number of uncorrelated regression trees. This can allow for better model performance in data sets with smaller sample size or many predictors. For each replicate we created a forest of 10,000 trees to predict Post (models 1a and 1b) based on the pretreatment level, age, sex, causative and background SNPs, and methylation sites (113 predictors). We created this model using the scikit-learn software package .
In addition, heritabilities were estimated at 2 levels: at the full polygenic model using SOLAR (Sequential Oligogenic Linkage Analysis Routines) software  (http://solar-eclipse-genetics.org/) and at the single SNP level, in this case squaring beta coefficient for SNP (where using standardized dosages versus standardized residuals of a normally distributed response variable after adjusting for covariates. The b2 in this case provides an approximate estimate of additive effect) .
To obtain an empirical null distribution for Methods I and II, we identified 2267 SNP–CpG pairs on chromosomes 21 and 22, where there were no simulated causative SNPs. A pair of markers was identified by selecting a SNP, and a CpG that is adjacent but has a higher base-pair position than the SNP. For chromosome 21, we formed 817 pairs from 10,385 SNPs and 4157 CpG sites. For chromosome 22, we formed 1450 pairs from 9464 SNPs and 8381 CpG sites.
Heritability estimates of trait Post (using SOLAR)
Post [average of log (TG3) and log (TG4)]
b2 regression (SAS) from standardized dosage and standardized phenotype
100 background SNPs
b2 regression (SAS) from standardized dosage and standardized phenotype
Proportion of tests with p < 0.05 under several models with proc GLM and proc MIXED
Number of tests
GLM model 1a
GLM model 1b
MIXED model 1a
MIXED model 1b
MIXED model 2a
MIXED model 2b
cg0000036 * rs9661059
cg0004591 * rs1082841
cg0124267 * rs4399565
cg1048095 * rs736004
cg1877239 * rs1012116
Null SNP interactions
Null SNP interactions (w/MAC > 50)
The p values at for the “null” sites on chromosomes 21 and 22 were distributed more or less as expected with Type I error of 0.04 to 0.05, close to nominal α = 0.05, under Methods I and II (see Table 2). However, some caution should be noted here. In an initial run (data not shown), we inadvertently used the sandwich estimator, which produced inflated p values with low minor allele frequency SNPs.
Results from decision tree analysis on Post
Decision tree (importance)
Random forest (importance rank)
The heritability estimated reported suggested the simulated phenotypes had a good inheritance pattern, making it possible to detect causative SNPs and SNP–CpG interactions in our analyses. In examining the frequency of tests with p value of interactions ≤0.05 in 4 models with mixed model analysis (models 1a, 1b, 2a, and 2b for Method II; see Table 2) and 2 models with GLM analysis (models 1a and 1b for Method I; see Table 2), we see reasonable power to detect these effects in this sample. When the exact mechanism is unknown, the results from models 2a and 2b suggest it may be prudent to use a full model with both main and interaction terms, but when the mechanism is well understood, neither the full model (1a) nor the reduced model (1b) was always advantageous. Larger samples are required to detect these effects with these methods after correction for multiple testing. However, these results suggest that, given a sufficient sample, it is possible to detect gene by methylation interactions with the methods used here.
Publication of this article was supported by NIH R01 GM031575 and this study was funded in part by NHLBI R01 HL091357, NHLBI R01 HL104135, NHLBI R01 HL117078, and NIA U01 AG023746.
Availability of data and materials
The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW), but restrictions apply to the availability of these data, which were used under license for the current study. Qualified researchers may request these data directly from GAW.
About this supplement
This article has been published as part of BMC Proceedings Volume 12 Supplement 9, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcproc.biomedcentral.com/articles/supplements/volume-12-supplement-9.
EWD, JH and ATK conceptualized the study, performed analyses, and drafted the manuscript; PL, SJL, JW and CW performed analyses and drafted tables for the manuscript; PA and MAP discussed the study and provided feedback; all authors read and approved the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet. 2012;13:484–92.View ArticleGoogle Scholar
- Wu H, Zhang Y. Reversing DNA methylation: mechanisms, genomics, and biological functions. Cell. 2014;156:45–68.View ArticleGoogle Scholar
- Aslibekyan S, Almasy L, Province MA, Absher DM, Arnett DK. Data for GAW20: genome-wide DNA sequence variation and epigenome-wide DNA methylation before and after fenofibrate treatment in a family study of metabolic phenotypes. BMC Proc. 2018;12(Suppl 9). https://doi.org/10.1186/s12919-018-0114-0.
- Kim K, Timm NH, Timm NH. Univariate and multivariate general linear models: theory and applications with SAS. Boca Raton: Chapman & Hall/CRC; 2007. p. xvii.Google Scholar
- Littell RC. SAS System for Mixed Models. Cary: SAS Institute; 1996.Google Scholar
- Matignon R. SAS Institute: Data Mining Using SAS Enterprise Miner. Hoboken: Wiley-Interscience; 2007. p. xiii.View ArticleGoogle Scholar
- Géron A. Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Sebastopol: O’Reilly Media; 2017.Google Scholar
- Almasy L, Blangero J. Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet. 1998;62:1198–211.View ArticleGoogle Scholar
- Schielzeth H. Simple means to improve the interpretability of regression coefficients. Methods Ecol Evol. 2010;1:103–13.View ArticleGoogle Scholar