Volume 8 Supplement 1

## Genetic Analysis Workshop 18: Human sequence data in extended pedigrees

Proceedings | Open | Published:

# Modeling of multivariate longitudinal phenotypes in family genetic studies with Bayesian multiplicity adjustment

*BMC Proceedings***volume 8**, Article number: S69 (2014)

## Abstract

Genetic studies often collect data on multiple traits. Most genetic association analyses, however, consider traits separately and ignore potential correlation among traits, partially because of difficulties in statistical modeling of multivariate outcomes. When multiple traits are measured in a pedigree longitudinally, additional challenges arise because in addition to correlation between traits, a trait is often correlated with its own measures over time and with measurements of other family members. We developed a Bayesian model for analysis of bivariate quantitative traits measured longitudinally in family genetic studies. For a given trait, family-specific and subject-specific random effects account for correlation among family members and repeated measures, respectively. Correlation between traits is introduced by incorporating multivariate random effects and allowing time-specific trait residuals to correlate as in seemingly unrelated regressions. The proposed model can examine multiple single-nucleotide variations simultaneously, as well as incorporate familyspecific, subject-specific, or time-varying covariates. Bayesian multiplicity technique is used to effectively control false positives. Genetic Analysis Workshop 18 simulated data illustrate the proposed approach's applicability in modeling longitudinal multivariate outcomes in family genetic association studies.

## Background

High-throughput genotyping advances have generated large amounts of genetic data. At the same time, numerous related phenotypes are often collected in genetic studies of complex traits. To understand how genetic variants influence multiple traits, it is necessary to consider correlation among variants and multiple traits jointly [1, 2]. However, most genetic studies have focused on single variant-single trait association. Analysis performed in this fashion faces at least 3 issues. First, it leads to concerns regarding false discoveries. Bonferroni andother types of correction may limit false discoveries, but implementing these methods on correlated traits limits the power to detect true associations. Second, same genetic variants may influence multiple related traits [3]. Thus, modeling traits separately may misrepresent the underlying biology. Lastly, failure to integrate correlation among variants because of linkage disequilibrium (LD) dilutes the association. Longitudinal studies provide an additional challenge because traits measured over time are likely to be correlated.

We propose a Bayesian joint modeling of longitudinally measured multiple traits in family genetic studies. Explicitly considering the correlation structure between multiple traits using multivariate random effects [4] and seemingly unrelated regression techniques [5], this model studies simultaneously multiple single-nucleotide variations (SNVs) and their associations with multiple traits measured longitudinally. Bayesian multiplicity [6] correction controls false positives. Our method uniquely demonstrates the ability of Bayesian methods to account for the complexity of multivariate, longitudinal data in the context of genetic family data while controlling for false positives. Simulated data from Genetic Analysis Workshop 18 (GAW18) illustrate the application of the proposed method. The analysis was performed with knowledge of the simulation model.

## Methods

### Bayesian bivariate model

We consider SNV association models with a longitudinal bivariate quantitative outcome in family studies. Let ${y}_{ijt}^{k}$ be the ${k}^{th}$ outcome of subject $j$ of pedigree $i$ at time $t$. Here $k=1$ or 2 for bivariate outcomes; $i=1,2,\dots ,I$ indexes $I$ families in the study; $j=1,2,\dots ,{n}_{i}$ indexes ${n}_{i}$ subjects in pedigree $i$; and $t=1,2,\dots {m}_{ij}$ indexes time points where outcomes are measured for subject $j$ of pedigree $i$. Here, the number of measurement ${m}_{ij}$ does not need to be equal for different subjects, but we assume the 2 outcomes are measured at the same set of time points for the same individual. We jointly model the outcomes as follows:

Here, $X$ is a vector of subject-specific time-invariant covariates and $Z$ is a vector of time-varying covariates. $SN{V}_{ijg},g=1,2,\dots G,$ is the ${g}^{th}$ SNV of subject $j$ of pedigree $i$, coded as 0, 1, or 2 indicating the number of minor alleles and $G$ is the number of SNVs in the model. ${\beta}^{\prime}\text{s}$ are regression coefficients (including an intercept) for the covariates or genetic variants. ${\gamma}^{\prime}\text{s}$ are SNV-specific indicator variables, which take values 1 or 0, depending on whether the SNV has an effect on the given outcome or not. Let ${p}_{i}=\left\{{p}_{i}^{1},{p}_{i}^{2}\right\}$, a pedigree-specific random effect common to all the individuals in the same family and accounts for correlation within families. Let ${s}_{ij}=\left\{{s}_{ij}^{1},{s}_{ij}^{2}\right\}$, a subject-specific random effect accounting for the correlation between measures taken repeatedly on the same subject. Let ${e}_{ijt}=\left\{{e}_{ijt}^{1},{e}_{ijt}^{2}\right\},$ the residual error. Additionally, ${p}_{i}$, ${s}_{ij}$ and ${e}_{ijt}$ are assumed to be independent, and

This Bayesian bivariate model uses bivariate normal random family and subject effects, and a bivariate normal distribution for the residual errors, as in seemingly unrelated regressions, to jointly model the 2 traits and allow them to be correlated.

### Bayesian multiplicity adjustment

Bayesian multiplicity adjustment is a Bayesian solution to multiplicity problems [6]. Through appropriate choices of priors on model probabilities, this Bayesian solution will not inflate type I error in the face of multiple comparisons. In our model, we use prior distributions on SNV inclusion to provide correction for multiplicity. For each SNV in the model, 2 parameters are present--the inclusion indicator and the regression coefficient. We assume that the inclusion indicators follow a Bernoulli distribution with an unknown inclusion probability $q$, while $q$ follows a Beta distribution, that is, ${\gamma}_{g}^{k}~Bernoulli\left(1,q\right)$ and $q~Beta\left(a,b\right)$. Here, we set $a=1,b=G$, which represents a sceptical prior such that the marginal prior odds of an association is $1/G$[7]. This Beta-Bernoulli type of prior provides an intrinsic multiplicity correction in the face of multiple comparisons. The significance of the Bayesian analysis is the posterior inclusion probability. Unlike traditional *p* values, a higher posterior inclusion probability indicates greater confidence of a true result. For each SNV, we estimate posterior inclusion probability as

Here, $S$ is the model space being visited by the Markov chain Monte Carlo (MCMC); $\widehat{p}\left({M}_{{\gamma}^{k}}|D,S\right)$ is the estimated posterior probability of model ${M}_{{\gamma}^{k}}$, where ${\gamma}^{k}=\left\{{\gamma}_{1}^{k},{\gamma}_{2}^{k},\dots ,{\gamma}_{G}^{k}\right\}$ is a vector of 0's and 1's, indicating which SNVs are in the model or not. ${I}_{{\gamma}_{g}^{k}=1}=1$ if ${\gamma}_{g}^{k}=1$ and 0 otherwise. True- and false-positive rates are based on the median probability model, which is the model that includes variants with a posterior inclusion probability larger or equal to 0.5. We use noninformative, independent,univariate normal priors for all regression coefficients ${\beta}_{0}^{k},{\beta}_{x}^{k},{\beta}_{z}^{k}$, and ${\beta}_{g}^{k}$, and noninformative inverse Wishart priors for variance-covariance matrices ${\text{\Sigma}}_{p},{\text{\Sigma}}_{s}$, and ${\text{\Sigma}}_{e}$.

### Analysis of GAW18 data

Two outcomes, diastolic (DBP) and systolic blood pressure (SBP), were evaluated. Covariates included sex, age, and smoking status. To adjust for antihypertensive medications use, we calculated the mean difference of blood pressure (BP) between observations with medication use (med = 1) and those without medication (med = 0) among observations with hypertension (htn = 1);that is, $(\overline{BP}\left|htn=1,med=1\right)-\left(\overline{BP}\right|htn=1,med=0)$. Observed BP values were adjusted using mean differences for observations with htn = 1 and med = 1 to impute DBP and SBP measures without medication use. These adjusted outcomes were used in the analysis [8].

Included were 849 individuals from 20 pedigrees (average number of subjects per pedigree was 42; range: 21-75). Each individual has 3 observations. Our focus is on SNVs in *MAP4* on chromosome 3. There are 894 SNVs in the gene; 14 influence both DBP and SBP, and 1 influences SBP. Among the 15 causal SNVs, three are common (minor allele frequency [MAF]>0.05); all others are rare. All analyses were based on replication set 1. To investigate effectiveness of Bayesian multiplicity adjustment, a random set of noise variants of size 1, 15, 30, 60, and 90 were added from *MAP4*. Selection of noise variants was truly random and did not account for LD with causal SNVs. After burn-in and thinning every 10^{th} iteration, 50,000 samples were drawn for MCMC simulations.

We compared our method with two other methods. One method is a Bayesian univariate model where the off diagonal of the variance and covariance matrices in equations (1), (2), and (3) are zero, thus the 2 traits are modeled independently. The univariate model can handle longitudinal data and multi-variants, and uses Bayesian multiplicity techniques to adjust for multiple comparisons. The other is the family-based measured genotype approach (MGA), which is a standard approach to analyze family genetic studies and compares polygenic models with or without each SNV as a covariate [9]. MGA models a single outcome and single SNV, and cannot handle longitudinal data from multiple outcomes jointly. Thus, only the first pair of DBP and SBP measures, adjusted for medication use, of each individual was used and modeled separately. Bonferroni correction was used for multiple comparisons after accounting for 105 tests (90 noises and 15 causal).

## Results

Using a threshold of 0.5 for the posterior inclusion probability, the Bayesian bivariate model detected 5 of the 15 SNVs between SBP and DBP with 1 noise variant (Table 1). As the number of noise variants increased, the number of true positives identified was reduced, such that by 90 noise variants, only two causal variants were identified. True positives had relatively low MAFs and large effect sizes (Table 1). False negatives either were rare, had small effect size, or both. Importantly, Bayesian multiplicity adjustment yielded no false positives. This suggests that a lower posterior probability threshold may be appropriate to yield a family-wise error rate of 0.05.

Using the same set of 90 noise variants, MGA detected five causal SNVs (4 for SBP and 1 additional for DBP). However, MGA identified 8 and 6 false positives for DBP and SBP, respectively. The Bayesian univariate model identified 2 causal SNVs and 1 false positive for both DBP and SBP.

As the 90 noise variants were randomly selected, examining the LD structure is also important. Among the true positives, two pairs (SNVs 5 and 8 and SNVs 7 and 10) have a high LD $\left({r}^{2}\ge 0.8\right)$. SNVs 7 and 10 were both identified by the Bayesian bivariate model and MGA, but only SNV 10 was identified by the Bayesian univariate model. SNVs 5 and 8 were identified by MGA, but none was identified by either the Bayesian bivariate or univariate approach. For MGA, six of the false positives of DBP and four of the false positives of SBP had relatively high LD $\left({r}^{2}\ge 0.8\right)$ with identified true positives. After discounting false positives caused by indirect effects, MGA had 2 false positives for both DBP and SBP.

Overall the Bayesian bivariate model resulted in posterior estimates of beta values very similar to the reported effect size from Genetic Analysis Workshop (GAW) data generators (Figure 1). However, estimates of effect sizes were inflated when only 1 of the 2 variants in high LD and having the same effect direction was included in the model during MCMC simulation (e.g. SNVs 7 and 10). The estimated effect size for SNV 8 (true effect size ≈ 0) was also inflated because it is in high LD with SNV 5 (true effect size <0). In addition, for many causal SNVs there was a high posterior exclusion probability (the proportion of times the variant was not included in the model), suggesting that some beta estimates were based on a small number of MCMC runs.

## Discussion

We developed a novel Bayesian model for analysis of multiple longitudinal traits and multi-variant models in family genetic studies. This is a significant advance because previous studies have looked at bivariate models, longitudinal data, or multi-SNV models separately. For the first time, we have developed an analytic approach that can incorporate all of these issues jointly. The model considers bivariate random family and random subject effects to account for correlation between outcomes, family members, and repeated measures of the same individual. The seemingly unrelated regression technique is used to correlate the residuals of the 2 outcomes measured at the same time for the same individual. Inherent in this method is Bayesian multiplicity for the control of false positives. Using the GAW18 simulated data set, we demonstrated the feasibility of the proposed model.

Compared to the Bayesian univariate model in which the 2 outcomes are modeled separately, our method had similar power, but fewer false positives. Compared to MGA, our Bayesian bivariate approach had fewer false positives regardless of LD between causal and noise SNVs. While the noise variants in LD with our causal variants did not result in increased false positives using the Bayesian approach, given the multi-variant nature of the analysis it is possible that the LD in these noise parameters may have reduced power to detect causal effects. Although different from our proposed method in many aspects, MGA represents a standard practice for analysis of family-based genetic studies. This comparison of different approaches is important because it demonstrates that given the low false positives, the Bayesian model is potentially better than current standard practice.

Several other studies have shown improved power of joint modeling of multiple traits over univariate analyses [10–13]. However, we saw a modest reduction in false positives without much increase in power. This may be a result of the fact that although the correlation between SBP and DBP is modest $\left({r}^{2}=0.55\right)$, these 2 outcomes depend on almost exactly the same set of variants; efficiency gains by utilizing a multivariate approach in this case are negligible. In contrast, large efficiency gains are expected when the outcomes depend on different covariate sets [11]. Furthermore, our model does not explicitly model the genetic relationship between individuals; instead, we used a random family effect. Although we do not expect the inclusion of a kinship coefficient matrix to influence the results, future studies should examine this further.

It is important to note that the model inference is based on computationally expensive MCMC simulation. Although we didn't use any special algorithms to efficiently sample the model space and the number of iterations was relatively small given the extremely large model space, the posterior inclusion probabilities became stable very early in the MCMC chain. This suggests that posterior inclusion probabilities can provide valid model inference. Sensitivity analyses with more optimistic priors on inclusion probabilities gave more power while controlling false positives. However, prior knowledge or evidence must support the use of this type of prior.

## Conclusions

In summary, we introduced a Bayesian model for analysis of bivariate quantitative traits measured longitudinally in family genetic studies. This method extends the previous joint modeling methods and permits simultaneous analysis of multiple traits with longitudinal data. Furthermore, this method incorporates multi-variant effects while effectively controlling the false-positive rate.

## References

- 1.
Shriner D: Moving toward system genetics through multiple trait analysis in genome-wide association studies. Front Genet. 2012, 3: 1-

- 2.
Zhu WS, Zhang HP: Why do we test multiple traits in genetic association studies?. J Korean Stat Soc. 2009, 38: 1-10. 10.1016/j.jkss.2008.10.006.

- 3.
Almasy L, Dyer TD, Blangero J: Bivariate quantitative trait linkage analysis: pleiotropy versus co-incident linkages. Genet Epidemiol. 1997, 14: 953-958. 10.1002/(SICI)1098-2272(1997)14:6<953::AID-GEPI65>3.0.CO;2-K.

- 4.
Reinsel G: Multivariate repeated-measurement or growth curve models with multivariate random-effects covariance structure. J Am Stat Assoc. 1982, 77: 190-195. 10.1080/01621459.1982.10477785.

- 5.
Zellner A: An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. J Am Stat Assoc. 1962, 57: 348-368. 10.1080/01621459.1962.10480664.

- 6.
Scott JG, Berger JO: Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Ann Stat. 2010, 38: 2587-2619. 10.1214/10-AOS792.

- 7.
Wilson MA, Iversen ES, Clyde MA, Schmidler SC, Schildkraut JM: Bayesianmodel search and multilevel inference for SNP association studies. Ann Appl Stat. 2010, 4: 1342-1364. 10.1214/09-AOAS322.

- 8.
Tobin MD, Sheehan NA, Scurrah KJ, Burton PR: Adjusting for treatment effects in studies of quantitative traits: antihypertensive therapy and systolic blood pressure. Stat Med. 2005, 24: 2911-2935. 10.1002/sim.2165.

- 9.
Boerwinkle E, Chakraborty R, Sing CF: The use of measured genotype information in the analysis of quantitative phenotypes in man.1. Models and analytical methods. Ann Hum Genet. 1986, 50: 181-194. 10.1111/j.1469-1809.1986.tb01037.x.

- 10.
Liu JF, Pei YF, Papasian CJ, Deng HW: Bivariate association analyses for the mixture of continuous and binary traits with the use of extended generalized estimating equations. Genet Epidemiol. 2009, 33 (3): 217-227. 10.1002/gepi.20372.

- 11.
Teixeira-Pinto A, Normand SL: Correlated bivariate continuous and binary outcomes: issues and applications. Stat Med. 2009, 28: 1753-1773. 10.1002/sim.3588.

- 12.
Verzilli CJ, Stallard N, Whittaker JC: Bayesianmodelling of multivariate quantitative traits using seemingly unrelated regressions. Genet Epidemiol. 2005, 28: 313-325. 10.1002/gepi.20072.

- 13.
Yang F, Tang ZH, Deng HW: Bivariate association analysis for quantitative traits using generalized estimation equation. J Genet Genomics. 2009, 36: 733-743. 10.1016/S1673-8527(08)60166-6.

## Acknowledgements

This work was supported in part by NIH grants 8P20GM103436-12 (DWF), K25AG043546(DWF), NS36695 (LD, LJM), AI070235 (HH, LJM), AI066738 (LJM), HL111459 (LJM, VP), T32-ES10957 (ESA), K12 HD001097-16 (BGK), K01HL103165 (TBM). This paper was also support by an Institutional Clinical and Translational Science Award (LD), NIH/NCATS Grant Number 8UL1TR000077-04. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.

This article has been published as part of *BMC Proceedings* Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.

## Author information

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors' contributions

LD conceived and performed the statistical analysis and drafted the manuscript. LJM contributed to the design of the statistical analysis.HH helped with statistical analysis. BGK, ESA, TBM, DWF, HH, XZ, VP, LK, and LJM helped with the writing of the manuscript. All authors read and approved the final manuscript.

## Rights and permissions

## About this article

#### Published

#### DOI

### Keywords

- Markov Chain Monte Carlo
- Noise Variant
- High Linkage Disequilibrium
- Multiple Trait
- Markov Chain Monte Carlo Simulation