Volume 8 Supplement 1

## Genetic Analysis Workshop 18: Human sequence data in extended pedigrees

# Mixed-effects models for GAW18 longitudinal blood pressure data

- Wonil Chung
^{1}and - Fei Zou
^{1}Email author

**8(Suppl 1)**:S87

https://doi.org/10.1186/1753-6561-8-S1-S87

© Chung and Zou; licensee BioMed Central Ltd. 2014

**Published: **17 June 2014

## Abstract

In this paper, we propose two mixed-effects models for Genetic Analysis Workshop 18 (GAW18) longitudinal blood pressure data. The first method extends EMMA, an efficient mixed-model association-mapping algorithm. EMMA corrects for population structure and genetic relatedness using a kinship similarity matrix. We replace the kinship similarity matrix in EMMA with an estimated correlation matrix for modeling the dependence structure of repeated measurements. Our second approach is a Bayesian multiple association-mapping algorithm based on a mixed-effects model with a built-in variable selection feature. It models multiple single-nucleotide polymorphisms (SNPs) simultaneously and allows for SNP-SNP interactions and SNP-environment interactions. We applied these two methods to the longitudinal systolic blood pressure (SBP) and diastolic blood pressure (DBP) data from GAW18. The extended EMMA method identified a single SNP on Chr5:75506197 (*p*-value = 4.67 × 10^{−7}) for SBP and three SNPs on Chr3:23715851 (*p*-value = 9.00 × 10^{−8}), Chr 17:54834217 (*p*-value = 1.98 × 10^{−7}), and Chr21:18744081 (*p*-value = 4.95 × 10^{−7}) for DBP. The Bayesian method identified several additional SNPs on Chr1:17876090 (Bayes factor [BF] = 102), Chr3:197469358 (BF = 69), Chr15:87675666 (BF = 43), and Chr19:41642807 (BF = 33) for SBP. Furthermore, for SBP, we found a single SNP on Chr3:197469358 (BF = 69) that has a strong interaction with age. We further evaluated the performances of the proposed methods by simulations.

## Background

Genome-wide association studies (GWAS) have been used for examining genetic variants associated with blood pressure and hypertension [1, 2]. Because blood pressure changes over time, it is important to collect multiple blood pressure measurements to study time-dependent genetic effects. Genetic Analysis Workshop 18 (GAW18) data included systolic blood pressure (SBP) and diastolic blood pressure (DBP) measurements from a human whole genome sequencing (WGS) study [3]. The study was longitudinal, and the majority of participants had three measurements collected at approximately 5-year intervals. This paper proposes two mixed-effects models for GAW18 longitudinal SBP and DBP data. The first approach extends the EMMA method [4], an efficient mixed-model association-mapping algorithm. EMMA corrects for population structure and genetic relatedness using a kinship similarity matrix. We replace the kinship similarity matrix in EMMA with an estimated correlation matrix for the dependence structure of the multiple measurements from each individual. With this extended approach, hundreds of thousands or even millions of association tests can be performed efficiently. However, this approach tests only one single-nucleotide polymorphism (SNP) at a time and may have low power to map SNPs that interact with each other. Furthermore, it is not straightforward to tweak EMMA software for testing SNP by time interaction, an important question that can be addressed through longitudinal data. To address these concerns, we developed a Bayesian method based on the composite model space framework of Yi et al [5–7]. The proposed method fits multiple SNPs simultaneously. In addition, it allows for SNP-SNP interactions and SNP-time interactions.

## Methods

### Extended EMMA

where ${y}_{i}={\left({y}_{i1},...,{y}_{i{n}_{i}}\right)}^{T}$ is the ${n}_{i}\times 1$ phenotype vector of individual $i$; ${\mu}_{i}=\mu {1}_{{n}_{i}}$ with $\mu $ being the grand mean and ${1}_{{n}_{i}}$ being the ${n}_{i}\times 1$ vector whose elements are all equal to 1; ${x}_{i}^{\mathbf{\text{e}}}$ is the design matrix corresponding to nongenetic covariates (e.g., time), and ${\beta}^{\mathbf{\text{e}}}$ is the associated nongenetic effects; ${x}_{i}^{g}$ is the numerically coded genotype of individual *i* and ${\beta}^{g}$ is the corresponding SNP effect. In the model, we assume random effect ${u}_{i}~N\left(\mathbf{\text{0}},{\sigma}_{g}^{2}{K}_{i}\right)$ where ${K}_{i}$ is an ${n}_{i}\times {n}_{i}$ matrix, and random error ${e}_{i}~N\left(\mathbf{\text{0}},{\sigma}_{e}^{2}{I}_{{n}_{i}}\right)$. The SNP effect can be tested as ${H}_{0}:{\beta}^{g}=0$ versus ${H}_{1}:{\beta}^{g}\ne 0$ via the likelihood ratio test. For GWAS or WGS data, this test needs to be performed with a large number of SNPs, which can be computationally intensive if we treat ${K}_{i}$s as the unknowns and estimate them jointly with the fixed effects. EMMA [4] is an efficient algorithm originally developed for GWAS data in which samples are potentially structured. EMMA models the structure effect via a similarity matrix. An R package that implements EMMA can either estimate the similarity matrix using genotype data or take any similarity matrix provided by users. We tweak EMMA for our purpose. We provide EMMA with the following similarity matrix $K=diag\left({\widehat{K}}_{1},{\widehat{K}}_{2},...,{\widehat{K}}_{n}\right)$ where ${\widehat{K}}_{i}$s are the estimated correlation matrices from model (1) in which ${\beta}^{g}$ is set to 0. The idea of estimating ${K}_{i}$s this way is not new and has been used in EMMAX [8], a fast version of EMMA. These estimates should be reasonable unless some SNPs have large effects, which is rare for most complex traits.

### Bayesian multiple QTL mapping

where ${x}_{i}\left[=\left({x}_{i}^{\mathbf{\text{e}}},{x}_{i}^{\mathbf{\text{g}}},{x}_{i}^{\mathbf{\text{gg}}},{x}_{i}^{\mathbf{\text{ge}}}\right)\right]$ is the design matrix corresponding to nongenetic factors, *p* putative SNPs, two-way interactions between *p* SNPs (resulting in total of *p*(*p*−1)/2 terms) and other selected SNP-environment interactions (for GAW18 data, we consider *p* SNP-age interactions); $\beta \left[={\left({\beta}^{\mathbf{\text{e}}T},{\beta}^{\mathbf{\text{g}}T},{\beta}^{\text{gg}T},{\beta}^{\mathbf{\text{ge}}T}\right)}^{T}\right]$ is the vector of all fixed effects. We define ${\mu}_{i}$ the same way as in model (1). The random effects u_{
i
} and e_{
i
} are also assumed to follow the same distributions as described in model (1). Model (2) includes the effects of all putative SNPs; thus, the number of such effects can be large. To identify SNPs associated with the trait of interest, we use a Bayesian variable selection procedure in which we use a set of latent binary variables ${\gamma}_{k}$$\left(k=1,...,q\right)$ to indicate which of the $q$ genetic effects (be they main genetic effects, epistasis effects and/or SNP by environment interactions) are associated $\left({\gamma}_{k}=1\right)$ or not associated $\left({\gamma}_{k}=0\right)$ with the trait.

As in model (1), we assume matrix ${K}_{i}$ is known. We apply the Cholesky decomposition to ${K}_{i}$ such that ${K}_{i}={M}_{i}{M}_{i}^{T}$ where ${M}_{i}$ is the ${n}_{i}\times {n}_{i}$ lower triangular Cholesky decomposition matrix of ${K}_{i}$. Then model (2) can be reparameterized as ${y}_{i}={\mu}_{i}+{x}_{i}\beta +{\sigma}_{g}{M}_{i}{b}_{i}+{e}_{i}$ where ${b}_{i}={\left({b}_{i1},...,{b}_{i{n}_{i}}\right)}^{T}~N\left(\mathbf{\text{0}},{I}_{{n}_{i}}\right)$. We use the same prior distributions for $\mu $, $\beta $, $\gamma ={\left({\gamma}_{1},...,{\gamma}_{q}\right)}^{T}$, and ${\sigma}_{e}^{2}$ in Yi et al [7]. We set the prior of ${\sigma}_{g}$ to ${N}^{+}\left({m}_{g0},{s}_{g0}^{2}\right)$, where ${N}^{+}\left({\mu}_{0},{\sigma}_{0}^{2}\right)$ is the positive truncated normal density with mean ${\mu}_{0}$ and variance ${\sigma}_{0}^{2}$, and both ${m}_{g0}$ and ${s}_{g0}^{2}$ are prespecified hyperparameters. The proposed method has been implemented upon the widely used R package, R/qtlbim [9] for these GAW18 longitudinal data.

## Results and discussion

### GAW18 data

The GAW18 data included 849 individuals with both phenotype and imputed genotype data from 20 extended pedigrees. Each sample was measured multiple times on their blood pressures over approximately 5-year intervals. Among these 849 individuals, 139 were genetically unrelated and were measured for age, sex, medication use, smoking status, and blood pressure. Our analysis was restricted to the 139 unrelated individuals. The number of SBP and DBP ranged from one to four for each sample. WGS data provided by the GAW18 data had 8,348,674 SNPs from odd numbered autosomes. All SNPs provided passed the initial quality control checking, but among 2,796,608 SNPs with minor allele frequency (MAF) greater than 0.05, 17,463 of them failed Hardy-Weinberg equilibrium (HWE) test (with *p*-value<0.05/2,796,608, a Bonferroni corrected genome-wide threshold). We removed all SNPs with MAFs less than 0.05 plus those not passing the HWE test, resulting in 2,779,145 SNPs for the subsequent analyses.

To check population outliers and potential population substructure, we generated a subset of SNPs that are not in high linkage disequilibrium (LD) with each other (i.e., ${r}^{2}<0.5$) and performed the multidimensional scaling (MDS) analysis in PLINK [10]. Pairwise scatter plots of the top four MDS scores showed that the 139 individuals are homogeneous in terms of their ethnicities. However, two samples, T2DG0400207 and T2DG0400247, have an estimated IBD value of 0.3 between them, indicating that they are likely related. In our analysis, we retained all 139 samples because the number of putatively related samples is small and their inclusion should have a negligible effect on our analysis results.

^{2}, sex, medication use, and smoking status) were included for analyses. We fitted these data with different covariance matrices in SAS 9.2 and selected the spatial power covariance structure for the downstream analysis based on the AIC criteria. Specifically, we let $cov\left({y}_{ij},{y}_{i{j}^{\prime}}\right)={\sigma}^{2}{\rho}^{{d}_{i,j{j}^{\prime}}}$, where ${d}_{i,j{j}^{\prime}}$ is the time distance between the $j$th and ${j}^{\prime}$th examinations for individual $i$. After obtaining the parameter estimate of $\rho $, $\widehat{\rho}$ from model (1) with ${\beta}^{g}=0$, we substituted the kinship matrix $K$ in EMMA by $K=diag\left({\widehat{K}}_{1},{\widehat{K}}_{2},...,{\widehat{K}}_{n}\right)$ where ${\widehat{K}}_{i}=\left\{{\widehat{\rho}}^{{d}_{i,j{j}^{\prime}}}\right\}$. Figure 1(a) displays the Manhattan plots of the two phenotypes from the extended EMMA model. For SBP, one SNP on Chr5:75506197 ($P=4.67\times 1{0}^{-7}$) reached the genome-wide significance ($p-\text{value}<5\times 1{0}^{-7}$, a cutoff suggested by Burton et al [11]). For DBP, three SNPs on Chr3:23715851 ($p-\text{value}=9.00\times 1{0}^{-8}$), Chr17:54834217 ($p-\text{value}=1.98\times 1{0}^{-7}$) and Chr21:18744081 ($p-\text{value}=4.95\times 1{0}^{-7}$) exceeded the genome-wide significance.

Because of the limited sample size, it is not feasible to include all available SNPs in our Bayesian analysis. For each phenotype, we selected a list of 3000 top-ranked SNPs that are not highly correlated with each other (with correlation < 0.95 to avoid multicollinearity) from the extended EMMA for the Bayesian analysis. We applied this Bayesian method with the same covariates used in the extended EMMA method. For all analyses, the MCMC algorithm ran for $4\times 1{0}^{5}$ iterations after the first 1000 burn-in iterations were discarded. The chain was then thinned for every 40 iterations, yielding $1{0}^{4}$ MCMC samples for the posterior analysis. Based on the posterior inclusion probability of each SNP, the Bayes factor (BF) (see [6, 7] for details) was estimated and used to judge the importance of each SNP. Figure 1(b) shows the Manhattan plots of $2ln\left(BF\right)$ for the combined genetic effects of each SNP, which include the main effects, epistasis effects, and SNP-age interactions. We found several additional SNPs with strong signals (BF $>30$ as suggested by Yandell et al [12]) on Chr1:17876090 (BF = 102), Chr3:197469358 (BF = 69), Chr15:87675666 (BF = 43), and Chr19:41642807 (BF = 33) for SBP. No new SNPs were found for DBP. For SBP, we found one SNP located on Chr3:197469358 (BF = 69) has a strong interaction with age.

### Simulations

*p*-value or BF, we calculated the true and false positive findings as follows: a significant finding is claimed to be a true positive finding if it is located less than 1 Mb from any one of the simulated causal SNPs; otherwise the finding is false. The ROC curves with the false-positive rate less than 0.2 are presented in Figure 2. Intuitively, our two methods that used all available data are more powerful than their corresponding univariate analysis methods that only used the first-time data. Furthermore, the Bayesian method is more powerful than the extended EMMA as expected because (a) the Bayesian model allows for SNP-SNP and SNP-age interactions, which are totally ignored by the extended EMMA, and (b) the Bayesian model jointly model multiple SNPs, but the extended EMMA only tests one SNP at a time.

## Conclusions

In this paper, we developed two mixed-effects models for the GAW18 longitudinal blood pressure data. The first approach extends the EMMA method. We replace the kinship similarity matrix in EMMA with an estimated correlation matrix for dealing with the dependent structure of the repeated measurements. The second approach is a Bayesian method that models multiple SNPs simultaneously and allows for SNP-SNP interactions and SNP-time interactions. The advantages of the Bayesian method have been clearly demonstrated by our simulations. Both methods are currently developed for unrelated samples. The GAW18 data contained extended pedigrees. Ideally, we should use all available data in our analysis. What complicates the analysis on longitudinal pedigree data is that both the correlation structure of the repeated measurements and the familial correlation structure of related individuals should be considered. We are currently extending the two proposed methods for the GAW18 pedigree data. Furthermore, for both our analyses, we assume that the covariance matrix is known up to a constant. For the Bayesian model, this assumption can be relaxed and we are developing a semiparametric approach where the covariance matrix is assumed unknown. We estimate the unknown covariance matrix with a modified Cholesky decomposition [13]. Last, our Bayesian model for GWAS data relies on a set of preselected putative SNPs. How to select a good set of putative SNPs, especially those with low marginal effects but high interactions with other SNPs or environmental factors is challenging and deserves further investigations.

## Declarations

### Acknowledgements

This research was supported in part by National Institutes of Health (NIH) grant R01 GM074175. The GAW18 WGS data were provided by the T2D-GENES (Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples) Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The GAW is supported by NIH grant R01 GM031575.

This article has been published as part of *BMC Proceedings* Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.

## Authors’ Affiliations

## References

- Levy D, Ehret GB, Rice K, Verwoert GC, Launer LJ, Dehghan A, Glazer NL, Morrison AC, Johnson AD, Aspelund T, et al: Genome-wide association study of blood pressure and hypertension. Nat Genet. 2009, 677-687. 41Google Scholar
- Padmanabhan S, Melander O, Johnson T, Di Blasio AM, Lee WK, Gentilini D, Hastie CE, Menni C, Monti MC, Delles C, et al: Genome-wide association study of blood pressure extremes identifies variant near UMOD associated with hypertension. PLoS Genet. 2010, 6: e1001177-10.1371/journal.pgen.1001177.PubMed CentralView ArticlePubMedGoogle Scholar
- Almasy L, Dyer TD, Peralta JM, Jun G, Fuchsberger C, Almeida MA, Kent JW, Fowler S, Duggirala R, Blangero J: Data for Genetic Analysis Workshop 18: Human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proc. 2014, 8 (suppl 2): S2-PubMed CentralView ArticlePubMedGoogle Scholar
- Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, Eskin E: Efficient control of population structure in model organism association mapping. Genetics. 2008, 178: 1709-1723. 10.1534/genetics.107.080101.PubMed CentralView ArticlePubMedGoogle Scholar
- Yi N: A unified Markov chain Monte Carlo framework for mapping multiple quantitative trait loci. Genetics. 2004, 167: 967-975. 10.1534/genetics.104.026286.PubMed CentralView ArticlePubMedGoogle Scholar
- Yi N, Yandell BS, Churchill GA, Allison DB, Eisen EJ, Pomp D: Bayesian model selection for genome-wide epistatic quantitative trait loci analysis. Genetics. 2005, 170: 1333-1344. 10.1534/genetics.104.040386.PubMed CentralView ArticlePubMedGoogle Scholar
- Yi N, Shriner D, Banerjee S, Mehta T, Pomp D, Yandell BS: An efficient Bayesian model selection approach for interacting quantitative trait loci models with many effects. Genetics. 2007, 176: 1865-1877. 10.1534/genetics.107.071365.PubMed CentralView ArticlePubMedGoogle Scholar
- Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E: Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010, 42: 348-354. 10.1038/ng.548.PubMed CentralView ArticlePubMedGoogle Scholar
- Yandell BS, Mehta T, Banerjee S, Shriner D, Venkataraman R, Moon JY, Neely WW, Wu H, Von Smith R, Yi N: R/qtlbim: QTL with Bayesian interval mapping in experimental crosses. Bioinformatics. 2007, 23: 641-643. 10.1093/bioinformatics/btm011.View ArticlePubMedGoogle Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, De Bakker PIW, Daly MJ, et al: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.PubMed CentralView ArticlePubMedGoogle Scholar
- Burton PR, Clayton DG, Cardon LR, Craddock N, Deloukas P, Duncanson A, Kwiatkowski DP, McCarthy MI, Ouwehand WH, Samani NJ, et al: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 661-678. 447Google Scholar
- Yandell BS, Moon JY, Banerjee S, Neely WW, Yi N: QTL analysis using Bayesian interval mapping 2012. [http://www.cran.r-project.org/web/packages/qtlbim/vignettes/qtlbim.pdf]
- Chen Z, Dunson DB: Random effects selection in linear mixed models. Biometrics. 2003, 59: 762-769. 10.1111/j.0006-341X.2003.00089.x.View ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.