- Open Access
Detection and analysis of CpG sites with multimodal DNA methylation level distributions and their relationships with SNPs
© The Author(s). 2018
- Published: 17 September 2018
DNA methylation levels at cytosine-phosphate-guanine (CpG) sites with multimodal distributions among different samples have been reported recently. One possible explanation for such variability is that genetic variants might affect epigenetic variation. One obvious case is that mutations such as single-nucleotide polymorphisms (SNPs) interrupt CpG sites, resulting in different DNA methylation levels for different genotypes. However, the relationship between genetic variations and epigenetic differences has not been studied thoroughly, partially because of the lack of powerful and robust methods to survey genome-wide CpG sites with multimodal methylation level distributions (mmCpGs). In this article, we develop a Gaussian mixture-model clustering (GMMC)–based approach to systematically detect all mmCpGs across the genome based on the GAW20 data set. In total, 3785 and 3847 mmCpGs have been identified in pre- and posttreatment data sets, respectively. Result analysis shows that approximately 68 to 70% of mmCpGs detected from unrelated individuals either have direct overlaps with SNPs or have associations with nearby SNPs, suggesting a strong correlation between SNPs and mmCpGs. Comparison with an existing approach illustrates that our GMMC-based method is more consistent when the number of samples decreases. In conclusion, mmCpGs may reveal important connections between genetics and epigenetics and they should be carefully identified and evaluated.
DNA methylation is one of the most widely used epigenetic marks and plays an important role in gene regulations, which may result in phenotypic differences among different individuals, as well as phenotypic differences of the same individual, before and after treatments . Although epigenetics is traditionally defined as heritable changes in gene activities that do not involve genetic mutations, recent studies suggest associations exist between genetic variants and differences in DNA methylation levels [2, 3]. Large-scale genome-wide DNA methylation profiling (e.g., using Illumina Infinium Human Methylation450 Beadchip, aka Illumina 450 K), together with genome-wide genotyping assays using single-nucleotide polymorphism (SNP) arrays, enables studies of associations between genetic variations and differences in DNA methylation levels.
Even though many studies have treated DNA methylation levels as a quantitative trait and performed so-called methylation quantitative trait locus analysis, two recent studies [4, 5] have investigated multimodal distributions of methylation levels at cytosine-phosphate-guanine (CpG) sites, primarily as a quality control step to correct methylation signals from Illumina 450 K chips. Daca-Roszak et al.  studied relationships between SNP genotypes and methylation levels of 96 CpG sites from European and Asian populations. They observed multimodal distributions among individual samples for CpG sites with SNPs. However, their study was limited to a very small subset of CpG sites and only considered CpG sites that physically overlapped with SNPs. In another attempt, Andrews et al. developed an interval-based clustering method called Gaphunter to identify CpG sites with multimodal distributions (mmCpGs) , which was implemented in the Bioconductor package minfi . Gaphunter first sorts individual DNA methylation levels of candidate CpG sites and then groups them into clusters with predefined methylation-level thresholds. An optional post-processing step can be used to exclude outlier-driven clusters, which are defined as clusters with smaller sizes relative to the total sample size and the size of the largest cluster. This simple algorithm is fast and works well for moderate-size data sets with little experimental measurement noise of methylation levels. The authors also explored applications of mmCpGs such as probe quality control and population stratification adjustment. However, threshold-based approaches such as Gaphunter are sensitive to noise levels and sample size.
To overcome those limitations, we propose a more generic and more robust clustering method to identify mmCpGs. The method is based on a Gaussian mixture model and we apply it to the GAW20 data sets to identify mmCpGs. We further check the relationships between SNPs and mmCpGs in terms of direct overlaps of their genomic locations, as well as statistical associations between mmCpG clusters and genotypes of SNPs that are physically close to mmCpGs. Analysis result shows that approximately 68 to 70% of mmCpG sites are associated with some SNPs within their 100 kbp neighborhood, suggesting high concordances between mmCpG clusters and individual genotypes. In comparison with Gaphunter, results show that our approach is more robust and more stable than that of Gaphunter.
In this study, we analyzed the genome-wide DNA methylation data before treatment and after treatment, as well as dense SNP genotype data provided by GAW20. There are 995 individuals from 182 families in the pretreatment methylation data set and 530 individuals from 153 families in the posttreatment methylation data set. Among the 1525 individuals, 823 have been genotyped. Of the individuals in the pretreatment dataset, 717 have both methylation and SNP data; in the posttreatment data set, 507 individuals have both methylation and SNP data. We performed mmCpG predictions on all individuals with methylation data, separately for the pretreatment and posttreatment data sets. Because of time limitations, we randomly picked 1 member from each family to assess associations between mmCpGs and genotypes. Association between mmCpGs and SNPs in related individuals will be examined in future studies. The number of CpG sites included is 463,995. The DNA methylation level of each CpG site in an individual is a numeric value between 0 and 1. The SNP array data consists of 718,566 SNPs. The genotype data are defined as 0, 1, or 2, representing the number of copies of the coded allele. Because genotype and DNA methylation data may contain missing value for some SNPs or CpG sites, we only include those individuals who have both genotypes and DNA methylation information when associating cluster labels with genotypes.
Gaussian mixture-model clustering
The assumption is that when methylation levels are affected by genotypes, each distinct genotype corresponds to a different distribution. In a population with different types of genotypes, their methylation levels will exhibit the characteristics of a mixture distribution. In our study, we use the mixtools  to perform GMMC, which will estimate model parameters for each cluster using the Expectation-Maximization algorithm for a given number of clusters. It also provides posterior probabilities of a sample belonging to each of the clusters. An evaluation metric, such as Bayesian information criterion (BIC) and log-likelihood of the mixture-model, can be used for model selection.
Detection of multimodal CpG sites
To determine the best number of clusters, we try different numbers of clusters iteratively and determine the best model using the BIC criteria. More specifically, we apply the following algorithm:
0. Starting with k = 1, calculate BIC1 based on unimodal GMM1. BEST_MODEL = GMM1.
1. If k > MAX_K, stop iteration.
Otherwise, k = k + 1. Apply GMMC with given k components. Obtain BICk and GMMk.
2. If BICk > BICk − 1 + BIC_INC_THRESHOLD, BEST_MODEL = GMMk. Continue to step 1.
Otherwise, stop iteration.
Given the property of our specific application, we set the MAX_K as 3, corresponding to the 3 distinct genotypes of some SNPs that are potentially associated with the mmCpG. The BIC incremental threshold (BIC_INC_THRESHOLD) is used to control the model complexity. A larger number of clusters are meaningful only if it’s the larger number’s BIC is substantially higher than the BIC with a smaller number of clusters. In practice, a higher value of the threshold will allow the method to be less sensitive. We set BIC_INC_THRESHOLD = 100 in our analysis so that our results are more conservative.
Once the model is fixed, each individual is assigned to the cluster for which the posterior probability is highest. Our method also incorporates a postprocessing step that uses several thresholds to filter out low-quality clusters. First, the largest cluster cannot be too big. If the fraction of the largest cluster is greater than 1 − OUT_CUTT, where OUT_CUTT is a user-specified parameter, the mmCpG will be excluded from further analysis. Second, samples within each cluster should have small variance, controlled by a threshold MAX_STD for the maximum allowed standard deviation in each cluster. Finally, we require that cluster centers should be separable from each other, which is controlled using a threshold MIN_MEAN_DIFF. In our study, we set MAX_STD = 0.1 and MIN_MEAN_DIFF = 0.2.
Associating GMM cluster labels with genotypes
To study the relationships between genotypes and mmCpGs, we evaluated genotype data and GMM cluster labels together to assess the strength of associations. In our study, we included all SNPs located less than 50 kb on either side of an mmCpG site. For each pair of a SNP and an mmCpG, we first constructed the contingency table for 3 genotypes and 3 cluster labels. Then a chi-square p value was calculated and corrected by Bonferroni correction for multiple testing. Among all the nearby SNPs around an mmCpG, only the SNP with the minimum p value is considered as the measure of SNP-mmCpG association. Finally, a critical value of 0.001 (after Bonferroni correction for multiple testing) was used to determine if an mmCpG has strong association with at least 1 nearby SNP.
Genome-wide survey of mmCpG sites in the GAW20 data set
Association between mmCpGs and SNPs
Summary of mmCpGs pre- and posttreatment
p ≤ 0.001
p > 0.001
We have proposed a novel GMMC-based method to detect genome-wide mmCpGs generated from Illumina 450 K chips. We applied this method on the GAW20 data set and found that the majority of mmCpGs are associated with SNPs that are either directly overlapping with CpG sites or are in close proximity to CpG sites. Empirical analysis demonstrates that our method is more stable than Gaphunter, a thresholding-based method. The ideas that underlie threshold-based clustering and model-based clustering are quite different. Threshold-based methods such as Gaphunter use a fixed cutoff value to draw a boundary to separate data points, which may not be able to capture characteristics of different clusters. First, the choice of cutoff values is mostly arbitrary. The same cutoff values may not be valid in different data sets or, even worse, they may not be valid for different CpG sites in the same data set, because methylation level distributions of different CpG sites may have different characteristics, that is, some CpG sites have larger gaps between clusters than other CpG sites. Moreover, the cutoff values can be quite sensitive to sample sizes. When the sample size is small, the distribution of DNA methylation levels among individuals will be sparse and the within-cluster distances may be big. Threshold-based methods are prone to false positives. When the sample size is large, the distribution is dense and clusters may have overlaps, making it hard for threshold-based approaches to correctly cluster samples. Unlike threshold-based methods, model-based clustering methods, including our proposed GMMC method, are designed to obtain models that fit the distributions and can naturally capture the characteristics of cluster structures. Consequently, model-based clustering methods usually provide more accurate results. In addition, the GMM can detect clusters with identifiable overlaps.
The current study mainly focused on detection of mmCpGs. Our findings suggest that there might be some connections between genetics and epigenetics. One should not treat mmCpGs as irregularities and filter them out from further analysis. Instead, careful characterization after identification is needed to better understand the biological significance of mmCpGs. In the future, we will investigate the association between mmCpGs and genotypes and explore how these mmCpGs might be related to phenotypes.
A Gaussian mixture-model clustering algorithm was developed and applied on the GAW20 data set to detect CpG sites with multimodual methylation levels. The result of our analysis shows that a large number of detected mmCpGs either directly overlap with SNPs or have strong associations with nearby SNPs, suggesting correlations between genetic mutations and methylation level variations.
This work has been supported in part by NIH grants R03HG008632 and R01DC012380.
Availability of data and materials
The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW), but restrictions apply to the availability of these data, which were used under license for the current study. Qualified researchers may request these data directly from GAW.
About this supplement
This article has been published as part of BMC Proceedings Volume 12 Supplement 9, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcproc.biomedcentral.com/articles/supplements/volume-12-supplement-9.
Both authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Irvin MR, Zhi D, Joehanes R, Mendelson M, Aslibekyan S, Claas SA, Thibeault KS, Patel N, Day K, Jones LW, et al. Epigenome-wide association study of fasting blood lipids in the genetics of lipid-lowering drugs and diet network study. Circulation. 2014;130:565–72.View ArticleGoogle Scholar
- Olsson AH, Volkov P, Bacos K, Dayeh T, Hall E, Nilsson EA, Ladenvall C, Rönn T, Ling C. Genome-wide associations between genetic and epigenetic variation influence mRNA expression and insulin secretion in human pancreatic islets. PLoS Genet. 2014;10(11):e1004735.View ArticleGoogle Scholar
- Smith AK, Kilaru V, Kocak M, Almli LM, Mercer KB, Ressler KJ, Tylavsky FA, Conneely KN. Methylation quantitative trait loci (meQTLs) are consistently detected across ancestry, developmental stage, and tissue type. BMC Genomics. 2014;15:145.View ArticleGoogle Scholar
- Daca-Roszak P, Pfeifer A, Żebracka-Gala J, Rusinek D, Szybińska A, Jarząb B, Witt M, Ziętkiewicz E. Impact of SNPs on methylation readouts by Illumina Infinium HumanMethylation450 BeadChip Array: implications for comparative population studies. BMC Genomics. 2015;16:1003.View ArticleGoogle Scholar
- Andrews SV, Ladd-Acosta C, Feinberg AP, Hansen KD, Fallin MD. “Gap hunting” to characterize clustered probe signals in Illumina methylation array data. Epigenetics Chromatin. 2016;9:56.View ArticleGoogle Scholar
- Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, Irizarry RA. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30(10):1363–9.View ArticleGoogle Scholar
- Benaglia T, Chauveau D, Hunter DR, Young DS. Mixtools: an R package for analyzing mixture models. J Stat Softw. 2009;32:1–29.View ArticleGoogle Scholar