Bivariate genetic association analysis of systolic and diastolic blood pressure by copula models

We conduct genetic association analysis in the subset of unrelated individuals from the San Antonio Family Studies pedigrees, applying a two-stage approach to take account of the dependence between systolic and diastolic blood pressure (SBP and DBP). In the first stage, we adjust blood pressure for the effects of age, sex, smoking, and use of antihypertensive medication based on a novel modification of censored regression. In the second stage, we model the bivariate distribution of the adjusted SBP and DBP phenotypes by a copula function with interpretable SBP-DBP correlation parameters. This allows us to identify genetic variants associated with each of the adjusted blood pressures, as well as variants that explain the association between the two phenotypes. Within this framework, we define a pleiotropic variant as one that reduces the SBP-DBP correlation. Our results for whole genome sequence variants in the gene ULK4 on chromosome 3 suggest that inference obtained from a copula model can be more informative than findings from the SBP-specific and DBP-specific univariate models alone.


Background
A number of genome-wide association studies (GWAS) involving large populations have been conducted to identify genetic variants associated with various single blood pressure (BP) measures: systolic blood pressure (SBP), diastolic blood pressure (DBP), or a linear function of them. Although the correlation between SBP and DBP is high, the results of the GWAS for each separately indicate only partially overlapping sets of variants associated with SBP and DBP. In this report, we model SBP and DBP jointly, taking the association between them into account. Constructing a bivariate model for these two phenotypes can increase the power to detect causal variants for one or both phenotypes, shedding more light onto the complex underlying genetic processes.
We apply copula functions [1] to model the bivariate distribution of SBP and DBP conditional on genetic variants. Copulas are functions used to construct a joint distribution by combining the marginal distributions with a dependence structure, and they allow investigation of the dependence structure between the phenotypes SBP and DBP separately from the marginal distributions. This property of copula models is very useful in identifying genetic variants that explain the dependence between SBP and DBP. It is well known that the Pearson correlation coefficient effectively measures the linear dependence of two random variables coming from a bivariate normal distribution. However, it may not be a good measure for other bivariate distributions where the conditional mean of Y i given Y j is not linear in Y j. Hence, we prefer a nonparametric correlation measure. One frequently used measure based on concordance and discordance is Kendall's tau, which is the probability of concordance minus the probability of discordance. We also use upper and lower tail dependence measures, which measure the level of dependence in the upper-right quadrant tail and lower-left quadrant tail of a bivariate distribution, respectively, as it might be especially interesting to find pleiotropic variants explaining association between high SBP and high DBP or low SBP and low DBP.
Our analysis has two objectives: (a) to investigate the association of some common variants with SBP and DBP under the joint model of SBP and DBP, and (b) to identify pleiotropic variants, which we define as variants that explain the association between SBP and DBP.

San Antonio family studies data
The Genetic Analysis Workshop 18 (GAW18) data set includes 153 unrelated individuals from the San Antonio family studies pedigrees having SBP and DBP measurements at one or more study exams, information regarding current use of antihypertensive medication, and nongenetic covariates sex, age, and current tobacco smoking status at some examinations. To retain all subjects, we imputed missing age values by adding the mean time interval between measurements to the last known age of a subject. Some missing values of smoking status were also imputed by examining the smoking patterns of individuals over the four time points. We later verified that these imputations did not lead to any significant differences in parameter estimates and inference. Among the 153 unrelated individuals with measured phenotype data, 100 have whole genome sequence data. We conducted our genetic analysis on chromosome 3 in this group, considering only the ULK4 gene previously found to be associated with DBP [2]. We analyzed 1771 variants with minor allele frequency (MAF) ≥0.05.

Phenotype definition
Before modeling the joint distribution of SBP and DBP conditional on genetic variants, we first adjusted the observed BPs for the effect of antihypertensive medication and other nongenetic covariates. The GAW18 unrelated pedigree members include hypertensive individuals (i.e, with high BP) and some taking antihypertensive medication (Table 1). Adjusting BP for the effect of BP-lowering medication is crucial when the objective is to identify genes associated with high or low BP. Based on a simulation study comparing several methods [3], the use of a censored regression model conditional on both nongenetic and genetic covariates was recommended, assuming that a treated individual's true "underlying" BP is higher than that observed. For our analysis, we extend their censored regression approach by deriving the maximum likelihood estimate (MLE) of a conditional expectation that is in the form of fitted BP plus a nonnegative adjustment term depending on the observed BP and the nongenetic covariates. It provides a more intuitive adjustment of the BP of treated individuals than, for example, the nonparametric method or assuming that treatment has the same constant effect for each individual as in Tobin et al [3].
At each examination point j j, we separately fitted censored regression models of BP conditional on nongenetic covariates age, sex, and smoking status with medication use as the censoring indicator. After conducting standard residual analysis and model selection, we specified the models as and i = 1, . . . , n j n 1 = 141, n 2 = 97, n 3 = 98, n 4 = 37). This formulation of the age covariates reflects previous findings (see, eg, Ref. [4]) that SBP increases with age whereas DBP decreases after the age of 55 to 60 years, which can be approximated here with the sample mean age. We used the "survreg" function in the "survival" package of R to fit the censored regression models.
For individuals who received antihypertensive medication, we estimate the underlying BP with the MLE of the conditional expectation of BP, given that the observed BP is lower than the true underlying BP. For illustration, under the model (1), the conditional expectation is where z i,j denotes the vector of nongenetic covariates with associated regression parameter γ (j) = γ 0(j) , γ 1(j) , γ 2(j) , γ 3(j) , and f , F are the normal probability density and cumulative distribution functions, respectively, with mean γ (j) z i,j and variance σ SBP,j 2 . The effects of the adjustment are evident in Figure 1, with the adjusted BP of treated individuals always higher than their observed BP. At each of the first 3 examination time points, we obtained residuals from fitting the censored regression models (1) and (2). We disregarded the last examination time because few BP measurements were available (see Table 1). An untreated individual's residual is the difference between observed and fitted BPs; a treated individual's residual is the difference between adjusted and fitted BP. Finally, we averaged the residuals (over j = 1,2,3) separately for SBP and DBP, and took these mean residuals as our adjusted phenotypes.

Bivariate copula modeling
In the second stage, we first constructed the marginal models for our adjusted phenotypes Y 1 and Y 2 given a genetic variant X = x, assuming that the genetic variants X are independent of the nongenetic variants Z. In the marginal models we observed no evidence against the normality assumptions for the error terms. We then used a copula function C to build the bivariate distribution of Y 1 and Y 2 conditional on genetic variants by combining the 2 marginal distribution functions F 1 y 1 |X = x and F 2 y 2 |X = x . More specifically, we consider the bivariate distribution function where F 1 and F 2 are the normal cumulative distribution functions with variances σ 1 2 and σ 2 2 , respectively, and ψ is the vector of copula parameters. To illustrate the approach, consider the 2-parameter copula family with 0 ≤ u 1 , u 2 ≤ 1, and the copula (or dependence) parameters ψ = (ϕ, θ ) , ϕ > 0, θ ≥ 1. To explain the association between Y 1 and Y 2 , we use Kendall's tau (τ ), which is a measure of overall association based on concordance and discordance, and we use lower and upper tail dependence measures (l L , l U , respectively), which explain the amount of dependence between extreme values, and can give more insight in identifying pleiotropic variants. For the copula family in (6), these dependence measures become [1] We obtain MLEs of the marginal parameters β = (β 0 , β 1 , σ 2 ), β = (β 0 , β 1 , σ 2 ) in equation (4) and the copula parameters ψ = (ϕ, θ ) in equation (6) by maximizing the likelihood function [1] with the general optimization software implemented in the nlm function in R. Variance estimates for the MLEs are obtained from the inverse of the observed information matrix.
To address aim (a) concerning the marginal association of a variant with each SBP and DBP under the bivariate model (5), we test the null hypotheses H 0 : α 1 = 0 (vs. H A : α 1 = 0) and H 0 : β 1 = 0 (vs. H A : β 1 = 0) with the large sample Wald test statistic. We expect improved inference under the bivariate model compared to inference obtained by separate analysis of SBP and DBP, which we refer to as the working independence model.
In contrast, for aim (b), which is to identify a variant that explains association between SBP and DBP, the copula model dependence parameters ϕ and θ, and dependence measures (7) are of interest. We compare estimates of Kendall's τ , λ L , and λ U under the full bivariate model (5) that includes the genetic variant with the corresponding estimates obtained under the bivariate model without the variant (ie, the null model with H 0 : α 1 = β 1 = 0). According to the delta method, we construct a confidence interval (CI) for the dependence measures using large-sample standard errors. When the CIs for a given association measure under the null and the full model do not overlap, we conclude that the given variant is pleiotropic. Use of CIs in this way is quite conservative. We also check whether the CI for λ L or λ U under the full model includes 0. Note that the copula model (6) only becomes an independent copula C(u 1 , u 2 ) = u 1 u 2 when θ = 1 and ϕ goes to 0. However, because it is practically impossible to identify all variants, instead of testing independence, we search for variants that reduce the magnitude of the overall dependence measures, such as Kendall's τ.

Results and discussion
For model selection, we note that the Akaike information criterion (AIC) value under the copula model (6) is much lower than the AIC under a bivariate normal model, indicating that the copula model is a better fit. For example, the AIC value under the copula model (6) reported in Table 2 is 1227.6 compared to an AIC value of 1337.8 under the bivariate normal model (not shown). These AICs are comparable to those obtained when conditioning on other variants. The aim (a) results (Table 2) are thus limited to the Wald test p values of the MLE estimates of the coefficients α 1, β 1 in (4) for testing H 0 : α 1 = 0 and H 0 : β 1 = 0 under two models: the working independence model and the bivariate copula model (6) for single-variant analysis. We observed some variants, including less common (0.05 ≤ MAF ≤0.10) and more common (MAF >0.10) variants, that are identified by both models, but the p values for testing H 0 : α 1 = 0 and H 0 : β 1 = 0 under the copula model (with minimum p values 1.7 × 10 −4 and 5.1 × 10 −5 , respectively) are smaller than the p values under the working independence model (with minimum p values 5.5 × 10 −3 and 7.0 × 10 −4 , respectively). This includes variants significantly associated with both BP phenotypes under the joint copula model, although they are not significantly associated at the 1% level with either under the univariate phenotypic models (see Table 2). Overall, the estimated genetic effect sizes are larger and the estimated standard errors are slightly smaller under the bivariate model. Table 3 displays 2 of 10 variants yielding a substantial reduction in point estimates of the upper tail dependence measure λ U under the bivariate model (5) conditional on the variant. Compared to the null model when conditioning on a variant in the gene ULK4, Kendall's tau and lower tail dependence do not differ markedly. We observe that without conditioning on any variant (H 0 : α 1 = β 1 = 0), the 2 phenotypes are moderately correlated with a Kendall's tau estimate of 0.578, and higher lower tail dependence than upper tail dependence (Table  3). Conditioning on the variant at 41,984,243 base-pair position diminishes upper tail dependence measure λ U at 0.01 level of significance (99% CI for λ U includes 0); this variant is also associated with SBP and DBP (see Table 2). Figure 2 illustrates how it achieves a reduction in upper tail dependence. Tail dependence can also be reduced in the absence of strong marginal BP associations. For example, the variant at 41,971,559 base-pair position is only modestly associated with SBP (p value = 0.040) and DBP (p value = 0.055), but the upper tail dependence is reduced from 0.449 obtained under the null model to 0.289 with a CI that includes 0 (Table 3).

Conclusions
In this report, we demonstrate how to model the bivariate distribution of SBP and DBP with copulas and conduct appropriate inference for genetic association. The proposed method is shown to be applicable by considering a single gene, and crude estimates of computation time suggest that it is feasible to process 1 million variants in less than a day, for example, by using one hundred 2.5-GHz cores. Although estimating the bivariate distribution is computationally more intensive than fitting the working independence model, given the high correlation between the phenotypes, a potential advantage is that genetic associations can be detected with higher power under a plausible joint model. Using joint copula models, we were also able to identify candidate variants explaining the upper tail dependence of SBP and DBP. We generally observed strong linkage disequilibrium between variants identified. By conducting joint analyses of multiple variants in moderate linkage disequilibrium, we achieved a much more significant reduction in upper tail dependence (data not shown), and although we observed some reduction in point estimates of lower tail dependence and Kendall's tau, the CIs still overlap with those under the null model. Calling a comparison significant when the CIs fail to overlap is a conservative approach, but it is computationally efficient. As an alternative, a nonparametric bootstrap procedure could be used to estimate the variance of the estimated difference between dependence measures under the null and full models, and to construct an approximate CI. To allow multiple testing adjustments, instead of checking whether the CI for λ L or λ U under the full model includes 0, it would be desirable to test each of the null hypotheses H 0 : ϕ = 0 or H 0 : θ = 1, respectively, to obtain p values [5].
In principle, the extension of our approach to 3 or more quantitative traits is straightforward; however, the copula model (6) may not be ideal in this setting. It involves some restrictions on the association structure, and generally the Gaussian copula is used when there are 3 or more traits. The approach could also be extended to binary traits, but with some caution because there is no unique copula identifying the joint distribution function of discrete variables [6].   , they spread out. Note that the unexplained association is likely to be caused by other unknown variants. If we could have identified all the variants that explain the association between phenotypes, then the plot in the right panel conditioning on these variants would have all dots randomly scattered on the unit square.