A Bayesian mixed modeling approach for estimating heritability

Nustad, Haakon E.; Page, Christian M.; Reiner, Andrew H.; Zucknick, Manuela; LeBlanc, Marissa

doi:10.1186/s12919-018-0131-z

Volume 12 Supplement 9

Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data

Proceedings
Open access
Published: 17 September 2018

A Bayesian mixed modeling approach for estimating heritability

Haakon E. Nustad^1,2,3^na1,
Christian M. Page^4,5^na1,
Andrew H. Reiner⁴,
Manuela Zucknick⁶ &
…
Marissa LeBlanc⁴

BMC Proceedings volume 12, Article number: 31 (2018) Cite this article

3074 Accesses
9 Citations
Metrics details

Abstract

Background

A Bayesian mixed model approach using integrated nested Laplace approximations (INLA) allows us to construct flexible models that can account for pedigree structure. Using these models, we estimate genome-wide patterns of DNA methylation heritability (h²), which are currently not well understood, as well as h² of blood lipid measurements.

Methods

We included individuals from the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study with Infinium 450 K cytosine-phosphate-guanine (CpG) methylation and blood lipid data pre- and posttreatment with fenofibrate in families with up to three-generation pedigrees. For genome-wide patterns, we constructed 1 model per CpG with methylation as the response variable, with a random effect to model kinship, and age and gender as fixed effects.

Results

In total, 425,791 CpG sites pre-, but only 199,027 CpG sites posttreatment were found to have nonzero heritability. Across these CpG sites, the distributions of h² estimates are similar in pre- and posttreatment (pre: median = 0.31, interquartile range [IQR] = 0.16; post: median = 0.34, IQR = 0.20). Blood lipid h² estimates were similar pre- and posttreatment with overlapping 95% credibility intervals. Heritability was nonzero for treatment effect, that is, the difference between pre- and posttreatment blood lipids. Estimates for triglycerides h² are 0.48 (pre), 0.42 (post), and 0.21 (difference); likewise for high-density lipoprotein cholesterol h² the estimates are 0.61, 0.68, and 0.10.

Conclusions

We show that with INLA, a fully Bayesian approach to estimate DNA methylation h² is possible on a genome-wide scale. This provides uncertainty assessment of the estimates, and allows us to perform model selection via deviance information criterion (DIC) to identify CpGs with strong evidence for nonzero heritability.

Background

Narrow-sense heritability (h²), traditionally estimated using twins or other constrained family relationships, can also be estimated in wider pedigrees using a linear mixed-model approach [1, 2]. This approach is well-established for traits having moderate to high h² [2, 3]. What is less clear is how well these models perform for traits with low h², which is the case for some proportion of cytosine-phosphate-guanine (CpG) methylation sites genome wide. This is of particular interest for h² because it is a ratio of variance components, that is, the proportion of phenotypic variance explained by additive genetic variance. Compared to estimates of the mean, estimates of variance tend to have large uncertainty.

DNA methylation (DNAm) is an epigenetic mark that is implicated in many heritable diseases and traits. DNAm patterns are influenced by the environment, change over lifetime, and exhibit mitotic heritability [4]. There are multiple reports of DNAm h² with CpG-specific estimates ranging from 0 to 1, and with genome-wide mean or median estimates ranging from approximately 15 to 30% [5,6,7].

Any point estimates of h² should ideally be accompanied by an assessment of the uncertainty. A Bayesian mixed model with inference by integrated nested Laplace approximations (INLA) [2, 8] presents an improvement on the traditionally used simple, balanced designs (eg, parent–offspring or twin-based correlations) as it can include the full pedigree structure. Additionally, full posterior inference provides uncertainty for the estimated h², which is not obtainable using frequentist methods, and allows an assessment of whether the true h² is nonzero via comparison of models with and without a genetic component.

In this article, we use three-generation pedigrees from the Genetics of Lipid Lowering Drug and Diet Network (GOLDN) [9] to estimate h² in a fully Bayesian framework. Blood lipid and DNAm measurements were available pre- and posttreatment with a lipid-lowering drug (fenofibrate). We first investigate continuous blood lipid traits with known moderate to high h² and compare the Bayesian estimates with previously published frequentist estimates [10]. Second, we estimate the genome-wide distribution of DNAm h².

The Bayesian mixed-model approach provides posterior distributions and therefore uncertainty for h², the additive genetic variance, and the environmental variance. We further investigate if it is reasonable to estimate h² for each CpG site given the data by performing model selection, to assess if the genetic component should be included at all. When the additive genetic variance is truly zero, a linear mixed model including the family structure will misspecify the model, making any resultant h² estimates untrustworthy. When h² is truly nonzero but smaller than the minimum h² detectable in a given data set (ie, when the study is underpowered), we would expect extremely large standard errors for the h² estimates.

In this article, our aims are to estimate h² for (a) blood lipid traits and (b) genome-wide DNAm using a Bayesian approach, including a model selection step to identify CpG sites with strong evidence for nonzero heritability. We highlight the importance of the model selection step and of quantifying the uncertainty in h² estimates.

Methods

Data

Individuals in the GOLDN study having DNAm and blood lipid measurements, as well as information on gender and age, were included in the analysis [11]. Genome-wide DNAm was measured using the 450 K Infinium array at 463,995 CpG sites and was available for 995 and 530 individuals at pre- and postfenofibrate treatment, respectively. From the original 463,995 sites available, we removed a list of known single nucleotide polymorphism–related CpG sites. Blood lipid levels were available both pre- and posttreatment for 818 and 861 individuals for triglycerides (TGs) and high-density lipoproteins (HDLs), respectively. In our analyses, we use log-transformed TG and HDL (averaged over 2 measurements taken 1 day apart), and simply refer to these transformed average values as TG and HDL from here on.

Standard quality control inspection of the DNAm data showed systematic differences with respect to probe-type chemistry. We therefore normalized the data with respect to the probe type using beta-mixture quantile normalization using the R package wateRmelon [12]. We calculated the M values for each probe and used these in the analysis.

Statistical model

We estimate h² in a Bayesian framework using the INLA package in Rue et al. [8]. Using the following model, we investigate h² for TG, HDL, and DNAm in the GOLDN study pre- and posttreatment:

$$ {y}_i\mid {\eta}_i,{\sigma}_e^2\sim N\left({\eta}_i,{\sigma}_e^2\right)\forall i=1,\dots, n\ \mathrm{with}\ {\eta}_i={\beta}_0+{\boldsymbol{x}}_i^T\boldsymbol{\beta} +{u}_i $$

$$ \boldsymbol{u}\mid {\sigma}_g^2\sim N\left(\mathbf{0},2{\sigma}_g^2\boldsymbol{K}\right)\ \mathrm{with}\ \boldsymbol{u}=\left({u}_1,\dots, {u}_n\right) $$

(1)

where y = (y₁, …, y_n) is the outcome vector, that is, DNAm for a given CpG, TG, or HDL, for n subjects. $ N\left({\eta}_i,{\sigma}_e^2\right) $ is the likelihood function for subject i, where η_i is the linear predictor defining the latent variables of interest and $ {\sigma}_e^2 $ denotes the residual unexplained variance in the model. Fixed effects are the intercept β₀ and β = (β₁, β₂) corresponding to the effects of age (x_1i) and gender (x_2i). Family structure was modeled with u as a random effect, with distribution$ N\left(\mathbf{0},2{\sigma}_g^2\boldsymbol{K}\right) $, where K is the kinship matrix calculated based on pedigree information using the {kinship2} package in R [13], and $ {\sigma}_g^2 $ is the additive genetic variance. We assume default noninformative priors for all hyperparameters for the genome-wide investigation of h² ($ \mathrm{in}\ \mathrm{particular}\ {\sigma}_{\ast}^{-1}\sim Gamma\left(\mathrm{1,0.00005}\right) $), but other priors were also investigated. From this model,$ {h}^2=\frac{\sigma_g^2}{\sigma_g^2+{\sigma}_e^2} $.

With the deviance information criterion (DIC) [14, 15], we can investigate the fit of a model to the data and do model selection. We can compare the DIC of the 2 latent models:

$$ {\eta}_i={\beta}_0+{\boldsymbol{x}}_i^T\boldsymbol{\beta} $$

(2)

$$ {\eta}_i={\beta}_0+{\boldsymbol{x}}_i^T\boldsymbol{\beta} +{u}_i $$

(3)

and use the recommended ΔDIC = DIC₍₂₎ − DIC₍₃₎ > 10 to choose if we should include the genetic component (u_i) or not.

As a result of the setup of INLA, we can only obtain point estimates of h² with this model formulation, but not the marginal posterior distribution. If this is of interest, the following equivalent model formulation must be used instead [2]. The nuisance parameter $ {\sigma}_0^2 $ is introduced to ensure that $ {\sigma}_e^2 $ remains identifiable. It is fixed at a small value, so that it does not absorb much of the variation in the data and thus does not interfere with $ {\sigma}_g^2 $ or $ {\sigma}_e^2 $. We used $ {\sigma}_0^2=4.54\mathrm{e}-5 $, which corresponds to log(σ²₀) = − 10 and was suggested in an earlier publication [2].

$$ {y}_i\mid {\eta}_i,{\sigma}_e^2\sim N\left({\eta}_i,{\sigma}_0^2\right)\forall i=1,\dots, n\ \mathrm{with}\ {\eta}_i={\beta}_0+{\boldsymbol{x}}_i^T\boldsymbol{\beta} +{u}_i+{\varepsilon}_i $$

$$ \boldsymbol{u}\mid {\sigma}_g^2\sim N\left(\mathbf{0},2{\sigma}_g^2\boldsymbol{K}\right)\mathrm{with}\ \boldsymbol{u}=\left({u}_1,\dots, {u}_n\right) $$

$$ \boldsymbol{\varepsilon} \mid {\sigma}_e^2\sim N\left(\mathbf{0},{\sigma}_e^2\boldsymbol{I}\right)\ \mathrm{with}\ \boldsymbol{\varepsilon} =\left({\varepsilon}_1,\dots, {\varepsilon}_n\right) $$

Results

Figure 1 shows the marginal posterior distribution of h² for HDL and TG levels pre- and posttreatment, as well as for the change (post–pre). The change in blood lipid levels can be interpreted as the response to treatment. The model selection step supported the inclusion of u_i in the model and thus estimation of h² for each case. The mode estimates of h² for TG are 0.48, 0.42, and 0.21 for baseline, posttreatment, and change, respectively. Likewise, the mode estimates for h² for HDL are 0.61, 0.68, and 0.10. Based on point estimates alone, HDL appears to be more heritable than TG but their 95% credible intervals overlap. TG is more heritable pretreatment than posttreatment, but the opposite is true for HDL. However, the 95% credibility intervals for h² pre- and posttreatment overlap each other substantially for both traits.

In Fig. 2, the histogram of point estimates of h² of DNAm for 448,040 CpG sites is displayed, for both pre- and posttreatment. Model selection indicates if there is evidence for nonzero h².

For pretreatment, we detect nonzero h² for 425,791 of 448,040 CpG sites (95.0%). For the CpG sites with strong evidence for nonzero h², the mean, median, and interquartile range (IQR) are 0.33, 0.31, and 0.16. In contrast, for posttreatment we detect nonzero h² for 199,027 of 448,040 CpG sites (44.4%) with mean, median, and IQR equal to 0.36, 0.34, and 0.20. Both cases show evidence of zero inflation and a right skewed distribution of DNAm h². Indirectly, we see that GOLDN is underpowered for detecting h² of DNAm less than approximately 10 to 15%.

Figure 2 clearly shows a striking difference in the distribution of genome-wide h² estimates between pre- and posttreatment. Figure 3 shows the direct pairwise comparison of the two h² estimates for DNAm across those CpG sites with strong evidence for nonzero h². For these CpG sites, we observe a moderate Pearson correlation between the baseline and posttreatment heritability estimates of 0.53.

Discussion and conclusions

In this article, we use a Bayesian approach to estimate h² for blood lipid levels and DNAm in extended pedigrees, and show that it is feasible to do genome-wide Bayesian estimation of h².

For blood lipids, our pretreatment point estimates for h² using the Bayesian approach (h²_TG = 0.48 and h²_HDL = 0.61) are comparable to previously reported estimates using other methods [16]. Notably, the 95% credibility intervals are wide (see Fig. 1). Interestingly, the response to treatment (change in blood lipid levels from pre- to posttreatment; see Fig. 1) also appears to be heritable. This may have important implications for personalized medicine, especially if one were able to identify explanatory variables (genetic or environmental) that predict the individual genetic value (u_i).

The genome-wide distribution of DNAm h² (as ascertained by the Illumina Infinium 450 K chip) appears to follow a 2-group mixture model with a h² = 0 component and a nonzero component following a smooth, unimodal right skewed distribution. The mixture proportion is strikingly different in the pre- and posttreatment (see Fig. 2). It is unlikely that treatment with fenofibrate could cause substantial genome-wide changes in DNAm h². It is much more likely that technical differences between the DNAm time points explain the difference [17].

The h² estimates we present in this article are generally similar to earlier genome-wide results in different populations [5, 7]. These previously published studies also show zero inflation, although the reported proportions vary substantially, from 5 to 17%. All show right-skewed distributions, but with slightly different modality. We report the median, h²_pre = 0.31, h²_post = 0.34, in the GOLDN study only for CpG sites with strong evidence for nonzero h² (ie, ΔDIC > 10). This is higher than previously reported, although this is to be expected if previous reports included the zero component in the calculation of the median. With the zero component included, our median estimates decrease to h²_pre = 0.30, h²_post = 0.0036.

A Bayesian model holds some advantages over a frequentist model when estimating h². The most notable difference is that a Bayesian model can estimate the full posterior distribution of the parameters in question. This includes the random effect variance parameters, which allows more direct assessment of the uncertainty of the random-effect estimates—and thus of the h² estimates—than could be achieved with classical mixed-effects models. For example, by comparing posterior credibility intervals for h² pre- versus posttreatment, we can determine if the h² changes are a result of treatment (see Fig. 1). In addition, model selection can be performed in a principled way by DIC or Bayes factors to assess if the data provide sufficient evidence for nonzero h². When the h² is truly zero, the genetic values u_i should not be included in the model, and therefore a model without the genetic values should be favored. Indeed, DIC-based model selection favored the model without random effect in the vast proportion of CpG sites for which h² was estimated to be close to zero.

An important limitation in our study is the lack of information on shared households to estimate the effect of shared environment. This likely results in an overestimation of h². The CpG sites on the 450 K array are a nonrandom selection of CpG sites with a bias toward CpG islands that often have regulatory function. This is important to consider, as an oversampling of biologically important CpG sites could result in a biased estimation for nonspecific genome-wide DNAm h².

In this article, we use a Bayesian approach to estimate h² in extended pedigrees and show the importance of using model selection to determine if there is strong evidence of nonzero h² given the data. Future work should focus on simulation studies comparing the frequentist and Bayesian approaches, with particular focus on traits having low h^2..

References

Almasy L, Blangero J. Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet. 1998;62(5):1198–211.
Article CAS Google Scholar
Holand AM, Steinsland I, Martino S, Jensen H. Animal models and integrated nested Laplace approximations. G3 (Bethesda). 2013;3(8):1241–51.
Article Google Scholar
Ralston SH, Uitterlinden AG. Genetics of osteoporosis. Endocr Rev. 2010;31(5):629–62.
Article CAS Google Scholar
Slatkin M. Epigenetic inheritance and the missing heritability problem. Genetics. 2009;182(3):845–50.
Article Google Scholar
McRae AF, Powell JE, Henders AK, Bowdler L, Hemani G, Shah S, Painter JN, Martin NG, Visscher PM, Montgomery GW. Contribution of genetic variation to transgenerational inheritance of DNA methylation. Genome Biol. 2014;15(5):R73.
Article Google Scholar
Quon G, Lippert C, Heckerman D, Listgarten J. Patterns of methylation heritability in a genome-wide analysis of four brain regions. Nucleic Acids Res. 2013;41(4):2095–104.
Article CAS Google Scholar
Kulkarni H, Kos MZ, Neary J, Dyer TD, Kent JW Jr, Goring HH, Cole SA, Comuzzie AG, Almasy L, Mahaney MC, et al. Novel epigenetic determinants of type 2 diabetes in Mexican-American families. Hum Mol Genet. 2015;24(18):5330–44.
Article CAS Google Scholar
Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J R Stat Soc Series B Stat Methodol. 2009;71:319–92.
Article Google Scholar
Irvin MR, Kabagambe EK, Tiwari HK, Parnell LD, Straka RJ, Tsai M, Ordovas JM, Arnett DK. Apolipoprotein E polymorphisms and postprandial triglyceridemia before and after fenofibrate treatment in the genetics of lipid lowering and diet network (GOLDN) study. Circ Cardiovasc Genet. 2010;3(5):462–7.
Article CAS Google Scholar
Weiss LA, Pan L, Abney M, Ober C. The sex-specific genetic architecture of quantitative traits in humans. Nat Genet. 2006;38(2):218–22.
Article CAS Google Scholar
Das M, Irvin MR, Sha J, Aslibekyan S, Hidalgo B, Perry RT, Zhi D, Tiwari HK, Absher D, Ordovas JM, et al. Lipid changes due to fenofibrate treatment are not associated with changes in DNA methylation patterns in the GOLDN study. Front Genet. 2015;6:304.
Article Google Scholar
Pidsley R, Wong CCY, Volta M, Lunnon K, Mill J, Schalkwyk LC. A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics. 2013;14:293.
Article CAS Google Scholar
Therneau T, Atkinson E, Sinnwell J, Matsumoto M, Schaid D, Mcdonnell S: Package “kinship2” [Online] 2012. Available at: http://cran.r-project.org/web/packages/kinship2/kinship2.pdf.
Spiegelhalter DJ, Best NG, Carlin BR, van der Linde A. Bayesian measures of model complexity and fit. J R Stat Soc Series B Stat Methodol. 2002;64:583–616.
Article Google Scholar
Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis, vol. 2. Boca Raton: Chapman & Hall/CRC; 2014.
Google Scholar
Kathiresan S, Manning AK, Demissie S, D’Agostino RB, Surti A, Guiducci C, Gianniny L, Burtt NP, Melander O, Orho-Melander M, et al. A genome-wide association study for blood lipid phenotypes in the Framingham heart study. BMC Med Genet. 2007;8(Suppl 1):S17.
Article Google Scholar
LeBlanc M, Nustad HE, Zucknick M, Page CM. Quality control for Illumina 450K methylation data in the absence of iDat files using correlation structure in pedigrees and repeated measures. BMC Genet. 2018;19(Suppl 1) https://doi.org/10.1186/s12863-018-0636-5.

Download references

Acknowledgements

Part of this work was performed on the Abel Cluster, owned by the University of Oslo. The authors also acknowledge input from Robert Lyle and Kristina Gervin.

Funding

Publication of this article was supported by NIH R01 GM031575. HEN received funding from the Norwegian Research Counsel. CMP and ML received internal funding from Oslo University Hospital.

Availability of data and materials

The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW) but restrictions apply to the availability of these data, which were used under license for the current study. Qualified researchers may request these data directly from GAW.

About this supplement

This article has been published as part of BMC Proceedings Volume 12 Supplement 9, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcproc.biomedcentral.com/articles/supplements/volume-12-supplement-9.

Author information

Haakon E. Nustad and Christian M. Page contributed equally to this work.

Authors and Affiliations

Department of Medical Genetics, Oslo University Hospital, Kirkeveien 166, 0450, Oslo, Norway
Haakon E. Nustad
Department of Medical Genetics, University of Oslo, Klaus Torgårds vei 3, 0372, Oslo, Norway
Haakon E. Nustad
PharmaTox Strategic Research Initiative, University of Oslo, Sem Sælands vei 3, 0371, Oslo, Norway
Haakon E. Nustad
Oslo Centre for Biostatistics and Epidemiology, Oslo University Hospital, Klaus Torgårds vei 3, 0372, Oslo, Norway
Christian M. Page, Andrew H. Reiner & Marissa LeBlanc
Department of Non-Communicable Disease, Norwegian Institute of Public Health, Marcus Thranes gate 6, 0473, Oslo, Norway
Christian M. Page
Oslo Centre for Biostatistics and Epidemiology, University of Oslo, Sognsvannsveien 9, 0372, Oslo, Norway
Manuela Zucknick

Authors

Haakon E. Nustad
View author publications
You can also search for this author in PubMed Google Scholar
Christian M. Page
View author publications
You can also search for this author in PubMed Google Scholar
Andrew H. Reiner
View author publications
You can also search for this author in PubMed Google Scholar
Manuela Zucknick
View author publications
You can also search for this author in PubMed Google Scholar
Marissa LeBlanc
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

HEN, ML, MZ, and CMP conceived and designed the experiment; HEN, ML, AHR, and CMP generated the results; and HEN, ML, MZ, and CMP interpreted the results. All authors wrote the manuscript and approved the final version.

Corresponding author

Correspondence to Marissa LeBlanc.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Nustad, H.E., Page, C.M., Reiner, A.H. et al. A Bayesian mixed modeling approach for estimating heritability. BMC Proc 12 (Suppl 9), 31 (2018). https://doi.org/10.1186/s12919-018-0131-z

Download citation

Published: 17 September 2018
DOI: https://doi.org/10.1186/s12919-018-0131-z

Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data

A Bayesian mixed modeling approach for estimating heritability