CpG-set association assessment of lipid concentration changes and DNA methylation

Epigenome association studies that test a large number of methylation sites suffer from stringent multiple-testing corrections. This study’s goals were to investigate region-based associations between DNA methylation sites and lipid-level changes in response to the treatment with fenofibrate in the GAW20 data and to investigate whether improvements in power could be obtained by taking into account correlations between DNA methylation at neighboring cytosine-phosphate-guanine (CpG) sites. To this end, we applied both a recently developed block-based data-dimension-reduction approach and a region-based variance-component (VC) linear mixed model to GAW20 data. We compared analyses of unrelated individuals with familial data. The region-based VC approach using unrelated (independent) individuals identified the gene LGALS9C as significantly associated with changes in triglycerides. However, univariate tests of individual CpG sites yielded no valid statistically significant results.


Background
Lipid levels can be influenced by drug therapy or lifestyle factors such as diet, physical activity, alcohol consumption, and smoking [1]. Lipid levels are also associated with inherited genetic variants (single-nucleotide polymorphisms [SNPs]), as revealed by several genome-wide association studies [2]. However, DNA sequence variation explains only a small proportion of lipid-level variance [2]. Epigenetic modifications (eg, DNA methylation) alter DNA accessibility and hence can be involved in regulating patterns of gene expression. Through regulation of lipid levels, epigenetic mechanisms may contribute to cardiovascular risk profiles [3,4]. Irvin et al. [3] identified strong association of 4 cytosine-phosphate-guanine (CpG) sites within the CPT1A gene on chromosome 11 with both triglycerides (TGs) and very-low-density lipoprotein C (VLDL-C). Because their analysis examined phenotype associations at each CpG site, a substantial correction for multiple testing was required.
This study's goals were to investigate region-based associations between DNA methylation sites and lipid-level changes in response to the treatment with fenofibrate in the real data set provided by the GAW20 and to investigate whether improvements in power could be obtained by taking into account correlation between DNA methylation at neighboring CpG sites. We conducted 2 complementary block-based association tests that simultaneously accommodated all CpG sites that fell within a genomic region (block) for the purpose of investigating whether taking into account the correlation between DNA methylation at neighboring CpG sites and reducing the number of tests can improve power.
Turgeon et al. [5] have proposed "principal components of explained variation (PCEV)." The PCEV approach integrates, simultaneously, an optimal data-dimension-reduction technique with testing for association. It provides analytical and empirical p value calculations for testing association between a set of correlated variables (eg, methylation profiles of a genomic region) and 1or more variables of interest (eg, high-density lipoprotein [HDL]/ TG).
We contrast PCEV with a variance components (VC) score test method. This is a sequence kernel association test (SKAT)-type association test that decomposes the total variance of a phenotype (eg, HDL/TG) into the variance explained by a block/region-methylation profiles and a residual variance term [6]. Specifically, the model assumes that the phenotypic similarity between subjects is captured by the region-methylation similarity. The VC-score approach significantly reduces the model degrees of freedom compared to standard multivariate regression models.

Methods
Suppose we have observed data {Y, x, C} where Y is an n × p matrix of n subjects for which a block of p variables/phenotypes are measured (eg, methylation values at a genomic region/gene with pCpG dinucleotides), x is an n × 1 vector of an observed trait of interest (eg, HDL phenotype) and C is an n × r matrix where the columns are known confounding factors (eg, age, sex).

PCEV approach
PCEV is a dimension-reduction technique that searches for a linear combination (a principal component) of the columns of Y, y pcev = Y w (w is ap × 1 vector), that maximizes h 2 (w), the ratio of the variance in Y explained by x to the total variance of Y, while taking into account the confounding factors, C. This new score y pcev can then be used as a phenotype in standard statistical models to test for the relationship between Y and x. Searching for y pcev is equivalent to projecting the rows of Y into w, where w is the most relevant direction in p-dimension space to x. A linear relationship between Y and x can be tested by H 01 : corr(y pcev , x) = 0. This test requires the use of the data twice, and therefore a naïve approach for p value calculation will suffer from Type I error inflation. However, the null H 01 is equivalent to testing for H 02 : h 2 (w pcev ) = 0 which uses the data only once. Turgeon et al. [5] derived an analytic test for the null hypothesis H 02 , which was shown to yield the proper Type I error rate.

VC-score approach
In a reverse model where x (eg, HDL) is modeled as the response variable and Y as a design matrix of p predictors, the VC model links x to Y using a linear mixed-effects regression model in which Y has an effect on the variance of x instead of on its mean [6]. This approach was developed to test association between a set of rare variants and a phenotype of interest. However, the test can be adapted easily to handle different types of design matrices, such as methylation from a genomic region of interest. This method can be extended to take into account population and family structures. The family-based VC-score approach is a linear mixed-effects model in which a second random effect for genetic relationships (ie, kinship) is added [7].
Phenotypic, methylation, and covariate data Circulating blood lipids, HDL, TGs, and the methylation profiles were measured at baseline and following 3 weeks of daily treatment with 160 mg of micronized fenofibrate [2]. For this study, we investigated HDL and TG changes among 714 participants for whom pretreatment methylation data were available. Because the PCEV approach has only been implemented for use with independent subjects, analyses using this method were conducted using 242 unrelated individuals. The selection of the maximum set of unrelated individuals from each pedigree was done using a greedy algorithm that used the kinship matrix to sequentially remove related individuals [8]. Log-transformations were performed for TG, as this variable was not normally distributed.
T-cell pre-and posttreatment DNA-methylation at 463,995CpG sites were already normalized using ComBat [3]. These CpG sites were allocated to 22,319 genes. We also included sites located 20 kb up-and downstream of the gene region. Only the CpG sites with gene annotations were evaluated in the analyses; consequently, we analyzed 401,326 CpG sites. Because PCEV works when the block and sample sizes are comparable [5], we divided the largest gene blocks to obtain 22,488 gene regions with no more than 130 CpG sites per block.
We focused on the pretreatment methylation levels to evaluate the effect of individual CpG sites and genes on explaining the observed heterogeneity in response to treatment. To capture unwanted variability in methylation profiles, which could result from variation in cell purity or batch effects, we constructed principal components of genome-wide methylation levels using 2000 randomly sampled probes from all autosomes. The association analyses between pretreatment methylation probes and blood lipid changes were adjusted for age, sex, study center, smoking status, diagnosed metabolic syndrome status, the fast time on the pre-and posttreatment visits, and the top 4 methylation-derived principal components (PCs).

Results
Bonferroni thresholds for significance at a 10% family-wise error rate were established using 401,326 univariate tests and 22,488 CpGset tests. No univariate tests for TG changes passed this threshold. The family-based VC-score approach identified 2 genes (RNMT andMIR130B) as significantly associated with HDL changes. The VC-score approach using unrelated subjects identified the gene LGALS9C as significantly associated with TG changes. Table 1 lists the 5 most significant genes with association to lipid changes identified by univariate, PCEV and VC-score approaches. Among the top 5 genes, both region-based approaches identified NUDCD3 for its relationship with HDL changes among independent participants. There is no overlap among the top 5 genes for TG changes identified by the 3 approaches. Figure 1 shows quantile-quantile (Q-Q) plots for the gene-based p values and the individual CpG-based p values for the HDL changes and TG changes, using data from the 242 unrelated individuals. Under each analysis, adjustments with and without the 4 methylation PCs were compared, revealing that inclusion of these PCs was important in controlling for unknown confounding. Without this extra adjustment, the distribution of p values was very biased away from what would be expected under the null. Figure 2 contrasts the results obtained using the independent participants and the ones

Discussion
In this study, we investigated associations between DNA methylation and lipid-level changes in response to fenofibrate treatment in the GAW20 real data set. The smallest p value found by both the VC-score test for TG differences and the single-CpG association test was at gene LGALS9C. However, as a result of multiple-testing power-loss issues, no single-CpG test passed a Bonferroni-corrected threshold. Simulation studies are necessary for more rigorous power assessments comparing region-based and univariate methods. Furthermore, we did not replicate the results of Irvin et al. [3] for baseline lipids, although this is not surprising as we analyzed lipid-level changes. We focused on linear relationships; however, nonlinear association methods may be more favorable/powerful when the primary interest is lipid-level changes.
In all analyses, we adjusted for available/known confounders and for unknown confounders using 4 PCs calculated based on 2000 CpG probes selected randomly from available DNA methylation on all chromosomes. We also contrasted this with an analysis using PCs calculated from all CpG sites, and found little difference in the results (not shown). The adjustment resulting from the PCs improved the validity of VC-score test results; however, (unusual) deviation of the PCEV test statistic from the null distribution was also indicated. This might be a consequence of the nonrobustness of PCEV to violation of normality assumption or the constant variance assumption of model residuals (errors). Even after normalization, methylation proportions have variances that are small when means are near 0 and 1. This heteroscedasticity might lead to a loss of power [9]. Thus, a transformation such as the logit may help in obtaining test statistics with valid null distributions.
We considered 2 ways to accommodate the relatedness among the participants: to restrict the analysis to unrelated individuals or to use a linear mixed model that takes family structure into account. In contrast to the well-behaved p value distribution for the analysis of unrelated subjects, the family-based VC-score test showed inflation in the Q-Q plots. Hence, the significant results for the 2 genes RNMT and MIR130B may be questionable. Almeida et al. [10] found that heritability of pretreatment DNA methylation was much higher than expected. Our results agree that pedigree-based kinship corrections are insufficient to correct for familial correlations in DNA methylation, and that additional corrections must be considered.
Other region-based association methods may be worth exploring in the future, such as the global analysis of methylation profiles (GAMP) [11] in which the density of methylation values in a region is approximated by B-splines and then the spline coefficients for each individual are used as covariates in association tests. Other strategies to accommodate the family structure in the region-based association tests include MF-KM (multivariate family data using kernel machine regression) [12], a linear mixed model built upon kernel machine regression, and mFARVAT (multivariate family-based rare variant association tool) [13], a quasi-likelihood-basedscore test approach.

Conclusions
The region-based VC approach using unrelated individuals identified the gene LGALS9C as significantly associated with changes in triglycerides. However, univariate tests of individual CpG sites yielded no valid statistically significant results. After correctly accounting for the unknown confounding and subject relatedness, region-based methods show an improvement in power to detect associated genes as compared to single-marker methods.

Funding
Publication of this article was supported by NIH R01 GM031575. This work was also supported by the Ludmer Centre for Neuroinformatics and Mental Health, the Canadian Institutes of Health Research grant # 130344, the Fonds de recherche du Québec -Santé grant # 31110, and CANSSI, the Canadian Statistical Sciences Institute.

Availability of data and materials
The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW), but restrictions apply to the availability of these data, which were used under license for the current study. Qualified researchers may request these data directly from GAW.

About this supplement
This article has been published as part of BMC Proceedings Volume 12 Supplement 9, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcproc.biomedcentral.com/articles/supplements/volume-12-supplement-9.