# Evidence of batch effects masking treatment effect in GAW20 methylation data

- Angelo J. Canty^{1} (corresponding author)
- Andrew D. Paterson^{2, 3}

**Published:** 17 September 2018

## Abstract

Using the real data set from GAW20, we examined changes in the distribution of DNA methylation before and after treatment. Paired analysis of differences in both mean and variance had grossly inflated type 1 error, suggesting either a very large number of changes across the entire epigenome or major non-biological issues, such as batch effects. Separate analysis of Infinium I and II probes indicated differences in the paired *t*-test statistics between these two types of probes. Examination of combined principal components showed that the first and fourth principal components discriminate between the before and after treatment measurements, further evidencing the presence of batch effects that make any conclusions about treatment effect suspect.

## Background

Treatment of CD4+ T cells with fenofibrate results in differences in gene expression and interferon γ protein levels [1], suggesting some of its actions may be mediated by effects on DNA methylation. For the GAW20 data, the Illumina Human Methylation 450 K BeadChip was used to measure methylation in CD4+ T cells before and after 3 weeks of treatment with 160 mg oral fenofibrate. This chip uses two different probe chemistries (Infinium Type I and Infinium Type II) to assess methylation [2]. The two probe types have differing dynamic ranges and target different genomic features [3]. The supplied data gave the normalized methylation proportion (β values) at each of 463,995 cytosine-phosphate-guanine (CpG) sites. Previously, the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study examined the association of change in lipids, before and after treatment, with change in DNA methylation [4]. No genome-wide significant associations were observed. In that analysis, methylation measures before and after treatment were normalized (separately at each time point, stratified for Type I and Type II probes) and adjusted for differences in cell composition using the first four principal components. In our analysis, we examined the differences in methylation, and also adjusted for principal components and change in triglyceride levels to examine changes in the distribution of methylation before and after treatment.

## Methods

Methylation measures were available at both time points for 446 individuals across 140 pedigrees. To avoid complications resulting from relatedness, we selected one individual at random from each of these pedigrees, and so used a sample of *n* = 140 for our analyses. Because the original β values are non-normal we used a logit transformation to get M-values as suggested by Du et al. [5]. We omitted 668 probes that gave infinite means on the logit scale, leaving 463,327 sites for analysis.
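As a minimal sketch of the transformation step, the snippet below converts β values to M-values and drops probes whose mean M-value is not finite. The base-2 logarithm follows the usual M-value convention and is an assumption here, as are the function names and the toy data; the text only states that a logit transformation was used.

```python
import numpy as np

def beta_to_m(beta):
    """Base-2 logit of methylation proportions (assumed M-value definition)."""
    beta = np.asarray(beta, dtype=float)
    # beta = 0 or beta = 1 maps to -inf or +inf; silence the divide warning.
    with np.errstate(divide="ignore"):
        return np.log2(beta) - np.log2(1.0 - beta)

def finite_mean_probes(m_values):
    """Boolean mask of probes (columns) whose mean M-value is finite."""
    return np.isfinite(m_values.mean(axis=0))

# Hypothetical toy data: 3 samples (rows) x 4 probes (columns).
# The third probe contains beta = 0, so its mean M-value is -inf.
beta = np.array([[0.50, 0.20, 0.00, 0.90],
                 [0.40, 0.25, 0.10, 0.80],
                 [0.60, 0.30, 0.05, 0.95]])
m = beta_to_m(beta)
keep = finite_mean_probes(m)
m_kept = m[:, keep]  # probes retained for analysis
```

In this toy example the third probe is dropped, mirroring the omission of probes with infinite means on the logit scale.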

We were interested in looking for evidence of differences in both the mean and variability of the methylation values. For differences in the mean, we used a simple paired *t* test at each site as our primary analysis. To examine differences in the variability we used the Pitman-Morgan test [6]. Theory shows that the covariance between the sum and difference of two random variables is equal to the difference in their marginal variances. This result implies that to test for the equality of variance in a paired setting, we need to test the hypothesis that the correlation between the sum of the pre- and post-treatment methylation values and their difference is equal to zero. We used the usual *t* test of zero correlation between normal random variables for this. Both tests described above rely on the assumption of normality of the underlying M-values. As a sensitivity analysis, we also replaced the two *t* tests with non-parametric tests: the Wilcoxon signed rank test in place of the paired *t* test, and the Spearman’s rank correlation test in place of the *t* test for correlation.
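The identity behind the Pitman-Morgan test, Cov(X + Y, Y − X) = Var(Y) − Var(X), can be checked numerically, and the test itself reduces to a correlation test between the sum and the difference. The sketch below is illustrative only: the function name, the sign convention (post minus pre), and the simulated data are assumptions, not the authors' code.

```python
import numpy as np
from scipy import stats

def pitman_morgan(pre, post):
    """Test equal variances in paired data via corr(sum, difference).

    Returns the t statistic (n - 2 degrees of freedom) and two-sided p value.
    A negative statistic means the post-treatment variance is smaller.
    """
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    s = pre + post
    d = post - pre  # assumed sign convention
    r, p = stats.pearsonr(s, d)
    n = pre.size
    t = r * np.sqrt((n - 2) / (1.0 - r**2))
    return t, p

# Simulated pair for one probe: pre-treatment variance 4, post-treatment variance 2.
rng = np.random.default_rng(0)
n = 140
pre = rng.normal(0.0, 2.0, n)
post = 0.5 * pre + rng.normal(0.0, 1.0, n)

# Sample version of the covariance identity.
cov_sd = np.cov(pre + post, post - pre)[0, 1]
var_diff = np.var(post, ddof=1) - np.var(pre, ddof=1)

t_stat, p_val = pitman_morgan(pre, post)
```

For the non-parametric sensitivity analysis, `stats.spearmanr(s, d)` could replace `stats.pearsonr(s, d)` in the same function.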

To examine the impact of principal components and the change in triglycerides we also recast both the paired *t* test and the correlation test in terms of a standard linear model with the difference in methylation being the response variable. The usual paired *t* test is equivalent to a test of 0 intercept in such a model, and the test of correlation is equivalent to a test of a zero slope for the sum when it is included as a covariate in the model.
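The two equivalences described above can be verified directly. The following sketch fits ordinary least squares with NumPy and compares the resulting t statistics with the paired *t* test and the correlation test; the helper `ols_t` and the simulated data are hypothetical and serve only to illustrate the algebra.

```python
import numpy as np
from scipy import stats

def ols_t(y, X):
    """OLS coefficient t statistics and two-sided p values."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - k)          # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
    t = coef / np.sqrt(np.diag(cov))
    p = 2 * stats.t.sf(np.abs(t), df=n - k)
    return t, p

rng = np.random.default_rng(1)
n = 140
pre = rng.normal(0.0, 1.0, n)
post = 0.6 * pre + rng.normal(0.2, 1.0, n)
d, s = post - pre, pre + post

# Intercept-only model: the intercept t statistic equals the paired t statistic.
t_int, _ = ols_t(d, np.ones((n, 1)))
t_paired = stats.ttest_rel(post, pre).statistic

# Model with the sum as covariate: the slope t statistic equals the
# t statistic of the test of zero correlation between sum and difference.
X = np.column_stack([np.ones(n), s])
t_slope = ols_t(d, X)[0][1]
r = stats.pearsonr(s, d)[0]
t_corr = r * np.sqrt((n - 2) / (1.0 - r**2))
```

Under this formulation, adjusting for principal components or the change in log-triglycerides amounts to appending further columns to `X`.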

## Results

### Difference in mean methylation

[Figure: *t*-test statistic for each of the 463,327 probes analyzed.]

The distribution of the paired *t*-test statistic is asymmetric, suggesting a possible mixture of two distributions. Figure 2 shows a quantile-quantile plot (−log_{10} scale) of the resulting *p* values and a Manhattan plot. There is very clear inflation of Type 1 error (λ = 36.18). In fact, 32.3% of probes had a significant *p* value after Bonferroni correction, as shown by the red horizontal lines on the panels of Fig. 2.
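The inflation factor and the Bonferroni proportion can be computed as in the sketch below. It assumes the common definition of λ (median observed 1-degree-of-freedom chi-square statistic divided by its expected median under the null) and a family-wise level of 0.05, neither of which is stated explicitly in the text; under that assumed level the per-probe threshold for 463,327 probes would be about 1.08 × 10^{-7}. Function names and the simulated *p* values are illustrative.

```python
import numpy as np
from scipy import stats

def genomic_inflation(p_values):
    """Lambda: median observed chi-square(1) statistic over the null median."""
    p = np.asarray(p_values, dtype=float)
    chi2_obs = stats.chi2.isf(p, df=1)  # p value -> chi-square(1) statistic
    return np.median(chi2_obs) / stats.chi2.ppf(0.5, df=1)

def bonferroni_fraction(p_values, alpha=0.05):
    """Proportion of tests with p below alpha divided by the number of tests."""
    p = np.asarray(p_values, dtype=float)
    return np.mean(p < alpha / p.size)

# Simulated p values: uniform under the null, and an artificially inflated set.
rng = np.random.default_rng(2)
p_null = rng.uniform(size=100_000)
p_inflated = p_null ** 4  # pushes p values toward zero

lam_null = genomic_inflation(p_null)          # close to 1 under the null
lam_inflated = genomic_inflation(p_inflated)  # well above 1
frac_sig = bonferroni_fraction(p_inflated)
```

A value of λ near 1 indicates well-calibrated tests, so values such as those reported here point to a systematic shift rather than a handful of true signals.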

We found a very similar distribution of *p* values when using the Wilcoxon test (results not shown) indicating that deviation from normality is not the cause of the excess of small *p* values. We conclude that there are real differences between the observed methylation signals pre- and post-treatment.

### Difference in variance of methylation

The *p* values in the third panel of Fig. 3 again show an excess of small *p* values (λ = 7.61). We found 9982 probes (2.2%) that showed significant differences in standard deviation (SD). Almost all (9807) of these significant probes had higher SD pre-treatment. There was no material change to the results when a non-parametric test was used in place of the *t* test (results not shown).

### Probe type analysis

### Principal component analysis

### Controlling for triglyceride differences

[Figure: −log_{10}(*p* values) before and after adjustment for the difference in log-triglycerides.]

Figure 6 shows that the *p* values after adjustment tend to be less extreme for the test of means, and marginally so for the test of variances. Despite this attenuation of the problem of small *p* values, there still remain 26,371 (5.7%) significant probes for differences in mean and 7856 (1.7%) significant probes for differences in variance. These differences are distributed across every chromosome, as we saw in the data before adjustment. Further adjustment of the tests for the first 4 PCs of the pre-treatment M-values, as well as those for the post-treatment M-values, had minimal impact on the distribution of genome-wide *p* values (results not shown).

## Discussion

We see a very large number of significant differences in mean and variance of the methylation distribution before and after fenofibrate treatment. The distribution of the paired *t* test statistics is asymmetric and varies by probe type. PCs almost completely separate the data from the two visits. We believe that the large differences in the mean and variance of the methylation values seen across the epigenome are unlikely to be caused by the treatment; rather, they suggest systematic batch effects between the processing of the samples from the two visits. If the observed differences really were caused by the treatment then we would expect them to be highly correlated with the major treatment effect, namely difference in the triglyceride level. Adjustment of the tests for differences in log-triglycerides attenuates, but does not remove, the issue of an excess of very small *p* values across the genome. Other authors, such as Bock [7], have also commented on the presence of major batch effects when arrays are processed at different times. Our understanding is that the pre-treatment arrays were all processed and normalized first, and the post-treatment arrays were processed and normalized later. Joint normalization of the original data across both time points may have helped to correct for some of these systematic differences, but the data that would have allowed joint normalization was not available as part of the GAW20 data set used in this analysis.

## Conclusions

It is our view that the batch effects seen in the GAW20 methylation data make it impossible to draw any real conclusions regarding the differences in methylation or their association with other traits. These effects are likely to occur in any longitudinal analysis of methylation, and so care needs to be taken to minimize the effects by processing and normalizing all arrays together for any analysis that will look at changes over time.

## Declarations

### Acknowledgements

The authors thank the organizers of GAW20. The GAW20 real data set used was provided by the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) which is supported by NIH National Heart, Lung, and Blood Institute grants R01 HL104135 and U01 HL72524.

### Funding

Publication of this article was supported by NIH R01 GM031575.

### Availability of data and materials

The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW) but restrictions apply to the availability of these data, which were used under license for the current study. Qualified researchers may request these data directly from GAW.

### About this supplement

This article has been published as part of BMC Proceedings Volume 12 Supplement 9, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcproc.biomedcentral.com/articles/supplements/volume-12-supplement-9.

### Authors’ contributions

AJC was the primary author of the manuscript, conducted the statistical analysis and participated in the workshop. ADP suggested some of the approaches used and edited the manuscript. Both authors have read and approved the final manuscript.

### Ethics approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Authors’ Affiliations

## References

1. Zhang MA, Ahn JJ, Zhao FL, Selvanantham T, Mallevaey T, Stock N, Correa L, Clark R, Spaner D, Dunn SE. Antagonizing peroxisome proliferator-activated receptor α activity selectively enhances Th1 immunity in male mice. J Immunol. 2015;195(11):5189–202.
2. Illumina Inc. Illumina methylation beadchips achieve breadth of coverage using 2 Infinium® chemistries. Illumina Technical Note: Epigenetic Analysis 2005, San Diego, California.
3. Dedeurwaerder S, Defrance M, Calonne E, Denis H, Sotiriou C, Fuks F. Evaluation of the Infinium methylation 450K technology. Epigenomics. 2011;3(6):771–84.
4. Das M, Irvin MR, Sha J, Aslibekyan S, Hidalgo B, Perry RT, Zhi D, Tiwari HK, Absher D, Ordovas JM, et al. Lipid changes due to fenofibrate treatment are not associated with changes in DNA methylation patterns in the GOLDN study. Front Genet. 2015;6:304.
5. Du P, Zhang X, Huang CC, Jafari N, Kibbe WA, Hou L, Lin SM. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. 2010;11:587.
6. Morgan WA. A test for the significance of the difference between the two variances in a sample from a normal bivariate population. Biometrika. 1939;31:13–9.
7. Bock C. Analysing and interpreting DNA methylation data. Nat Rev Genet. 2012;13(10):705–19.