Bivariate linear mixed model analysis to test joint associations of genetic variants on systolic and diastolic blood pressure
© Neupane and Beyene; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
Genetic variants that predispose adults and the elderly to high blood pressure are largely unknown. We used a bivariate linear mixed model approach to jointly test the associations of common single-nucleotide polymorphisms with systolic and diastolic blood pressure using data from a genome-wide association study consisting of genetic variants from chromosomes 3 and 9 and longitudinal measured phenotypes and environment variables from unrelated individuals of Mexican American ethnicity provided by the Genetic Analysis Workshop 18. Despite the small sample size of a maximum of 131 unrelated subjects, a few single-nucleotide polymorphisms appeared significant at the genome-wide level. Simulated data, which was also provided by Genetic Analysis Workshop 18 organizers, showed higher power of the bivariate approach over univariate analysis to detect the association of a selected single-nucleotide polymorphism with modest effect. This suggests that the bivariate approach to longitudinal data of jointly measured and correlated phenotypes can be a useful strategy to identify candidate single-nucleotide polymorphisms that deserve further investigation.
High blood pressure is a common disorder in adults and the elderly and is associated with increased risk of cardiovascular diseases and many other morbidities . It is a complex trait that can be influenced by certain genetic makeups, environmental factors, or their interactions. There have been some genetic association studies to identify the effect of genetic component on blood pressure in certain ethnic population. The San Antonio Family Studies (SAFS) is a family-based longitudinal study designed to identify genes associated with high blood pressure in Mexican American population. For Genetic Analysis Workshop 18 (GAW18), the data from SAFS was provided for odd-numbered chromosomes.
In multivariate longitudinal data, multiple response variables are jointly measured over time from the same individuals. Other environmental variables can also change over time and could have been recorded. To understand the relationship between independent and multivariate dependent response variables, we need to take into account the correlation between multivariate responses measured over time. In GAW18 data, the phenotypic data, namely systolic blood pressure (SBP) and diastolic blood pressure (DBP), as well as some environmental data, were measured over time and these phenotypic data from the same individuals are likely to be correlated as they might be regulated by the same genes or common environments. Therefore, we used a bivariate linear mixed-effect model approach to test the association of genetic variants with bivariate phenotypic values adjusting for environmental factors.
Participants, genotyping data, and quality control
Details about sample recruitment can be found in Hunt et al . For GAW18, data from participants of Mexican American ethnicity from 20 large pedigrees, who were followed up for up to 4 times from 1991 to 2011, were provided. However, for our analysis we considered only 157 unrelated members from this cohort, of whom only 142 individuals had both genotypic and phenotypic data.
We analyzed genome-wide association studies (GWAS) data from chromosomes 3 and 9. There were 65,519 single-nucleotide polymorphisms (SNPs) on chromosome 3 and 42,177 SNPs on chromosome 9. We restricted our analysis to common SNPs (minor allele frequency ≥0.05) with genotyping call rates ≥0.95 and Hardy-Weinberg equilibrium p value ≥10−3 and subjects with genotyping call rates ≥0.95. We applied these criteria to those 142 individuals using PLINK.
That is, , where,
. Here, is the Kronecker product. The is the bivariate stochastic processes that not only captures the correlation of measurements within the same subject at multiple times, but also the correlation between 2 traits at the same time for the subject, and has the variance matrix at time and covariance matrix at times and , . Two traits are independent if . Here, is a real matrix chosen such that the eigenvalues of have negative real parts and matrices and are positive semidefinite symmetric . We have and under independence assumption of and ; thus Solution for can be obtain by maximum likelihood or restricted maximum likelihood (REML) approach using multivariate normal likelihood of
After the quality control filtering, we had only participants available for bivariate analysis, who had genotype data and at least 1 measurement of phenotypic data. There were 52,862 SNPs from chromosome 3 and 34,475 SNPs from chromosome 9 that passed the filtering criteria.
For the bivariate linear mixed-effect model fitting for each SNP, genotype (0, 1, or 2 for the number of copies of minor allele) of a SNP was the independent variable of interest. Besides, we a priori selected to include 3 covariates, namely, measurement time in years at an examination since enrollment (which is 0 for the first year of enrollment), baseline age (at enrollment), and repeatedly measured antihypertensive medication use in the model. We also included sex and repeatedly measured smoking status in the model because keeping them, each separately or together, resulted in a better fit (smaller Akaike information criteria [AIC]) of the bivariate model for the 20 SNPs from chromosome 9 selected for model fit assessment (modeling techniques description given in a successive paragraph). We considered autoregressive order-1 ) assumption used for repeated measured analysis and unstructured variance components used for the random-effect analysis to identify the appropriate covariance (or correlation) structure for between and within the 2 phenotypic measurements over time [4, 5]. The assumption did not lead to the noticeably improved model fit but did involve some unrealistic assumptions in bivariate modeling, such as measurement at equal interval of time for both phenotype at a time and for a phenotype over time. Therefore, we assumed unstructured variance components assumption that allows correlations between any 2 measurements for the same phenotype and between 2 phenotypes at a time to vary across subjects.
There were missing follow-up data on blood pressure and other repeatedly measured covariates, where data from 97 subjects were missing in the fourth enrollment period. The available data for the fourth period were not used because discarding them resulted in much better fit for each of those 20 SNPs. Next, 2 subjects had missing data on medication use and smoking in all 3 examinations. Thus, the effective sample size was 131 for real data analysis. No attempt was made to use imputation.
A binary variable "BPTYPE" (blood pressure type) was defined as BPTYPE = 1 for SBP and = 0 for DBP for a subject at a measurement time. Finally, equation (2) was fitted for each SNP, where the repeatedly measured bivariate blood pressure (a maximum of six possible measurements for two phenotypes at three examinations for a subject) was regressed against genotype and the covariates specified above (each regressor was multiplied by BPTYPE in order to perform bivariate analysis) using MIXED procedure in SAS (see Refs. [4, 5] for details of modeling technique and SAS codes). In the analysis using REML, we considered BPTYPE having group effect, patients' ID having subject effects, and measurement time having random effect; that is, time effects on blood pressure varies across individuals.
For the genotype effects on SBP and DBP, we estimated and the corresponding covariance matrix , and being standard errors of and , respectively, and their covariance, per 1 copy increase in minor allele of each SNP, using the bivariate approach. For each SNP, we tested the null hypothesis vs. alternative (ie H1 : at least1, ; that is, the genotype was associated with at least 1 phenotype) using F-test with degrees of freedom , assuming multivariate normality of . We also tested the hypothesis with test statistic assuming large sample approximation . Because the data arose from GWAS, we used a genome-wide significance threshold, α = 7.2 × 10−8 , to adjust for multiple testing problems, which enables us to see if any SNPs from chromosome 3 or 9 achieve this threshold in bivariate analysis.
We assessed the statistical power of the bivariate linear mixed effect model using 200 simulated longitudinal data sets provided by GAW18 organizers. However, we considered the data from only 142 unrelated subjects who had genotypic data and phenotype and covariate information from all 3 examinations. We chose to assess power to detect the association of a common SNP, rs6442089 (from MAP4 gene). The SNP had the effect sizes, β1 = −1.4951 (variation explained ) and β2 = −2.3810 (variation explained = 0.0143%) in simulated SBP and DBP, respectively, per copy increase in minor allele. We used data from 141 subjects as genotype data was missing for the SNP in 1 subject. We employed the same regression analysis model and modeling technique, and assessed the same hypothesis using F and test statistics as in real data analysis above. Power of univariate linear mixed model analysis to detect the effect of the same SNP separately on SBP and DBP . was also assessed using the same regression model and assumption as in bivariate case using F and test statistics. We assessed the power at as there was no issue of multiple testing; however, we also used α = 7.2 × 10−8 to be consistent with real data analysis.
Real data analysis
In our sample data, the mean age (standard deviation) at enrollment was 53.7 (16.0) years, where subjects were 20.3 to 94.2 years old when enrolled. In a graphical inspection, the SBP and DBP data at each and all 3 examinations looked approximately normal.
Top 15 most significant SNPs in the bivariate linear mixed model analysis.
Top 3 p values and corresponding simulation data sets from bivariate and separate univariate mixed model analyses, and p values from other methods in the same simulation data sets.
Simulation data set #
Bivariate model for both SBP and DBP
Univariate model for SBP only
Univariate model for DBP only
In bivariate linear mixed model analysis, we observed associations of a few SNPs from intergenic regions and TTC39B gene with high blood pressure despite the small sample size. The bivariate mixed model framework to the GAW18 simulated data suggested that the bivariate analysis is more powerful than univariate approach to analyze longitudinal data when phenotypes are correlated. An earlier simulation study  also found that bivariate approach is in general more powerful than the univariate analysis when quantitative traits are correlated. High blood pressure is believed to be influenced by hundreds of genes with generally very small to modest effects. So we need statistical method with improved power to detect such associations and bivariate method to longitudinal data can be a useful strategy to identify the list of SNPs that can be followed up for replication or validation of their associations with the phenotypes of interest.
However, we wish to underscore the limitation that our simulation study was not extensive; we just compared the power of the bivariate approach with that of the univariate approach for only 1 SNP. We did not assess the type I error rate of the bivariate method, which could be inflated, especially for a small data set. In fact, the Q-Q plot (see Figure 2) of the joint association p values (see Figure 2) strongly suggests such inflation of the error rate, although the deviation from null distribution could also be an indication of the presence of population substructure or admixture. Also, one needs to be cautious in interpreting our results from the real data analysis as we have a number of limitations in this study, including small sample size, missing observations, uncertainty about the underlying genetic model, selection of an appropriate correlation structure, random effect assumptions, and choice of test statistic. For instance, although we chose a covariance (correlation) structure between AR(1) and UN, AR(1) had a similar or somewhat smaller AIC. However, it assumes that measurements were made at an equal interval over time for each and all phenotypes [4, 5]. But this was an unrealistic assumption in our data because the time interval between the first and second examination ranged from 1.4 to 7.6 years. Next, it also assumes the same correlation between measurements of 2 phenotypes at a time and between any 2 measurements of all the phenotypes for all subjects, which might not be true. Although it involves estimating more parameters, a UN assumption is more flexible, consequently, we employed it our bivariate analysis. A detail investigation of the properties of bivariate method in many realistic scenarios via extensive simulation is warranted before we draw a general conclusion about the usefulness of the bivariate method for the correlated repeatedly measured phenotypes.
A bivariate approach to test associations of genetic variants with multiple phenotypes jointly measured over time from the same individuals could be a useful strategy to identify genetic variants that deserve further investigation as it can exploit the correlation structure between phenotypes at a time and the same phenotype over time.
JB acknowledges grant funding from the Natural Sciences and Engineering Research Council of Canada (NSERC), and The Canadian Institutes of Health Research (CIHR, grant 84392). JB holds the John D. Cameron Endowed Chair in the Genetic Determinants of Chronic Diseases, Department of Clinical Epidemiology and Biostatistics, McMaster University. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
- Lawes CM, Vander Hoorn S, Rodgers A: Global burden of blood-pressure-related disease, 2001. Lancet. 2008, 371: 1513-1518. 10.1016/S0140-6736(08)60655-8.View ArticlePubMedGoogle Scholar
- Hunt KJ, Lehman DM, Arya R, Fowler S, Leach RJ, Goring HH, Almasy L, Blangero J, Dyer TD, Duggirala R, et al: Genome-wide linkage analyses of type 2 diabetes in Mexican Americans: the San Antonio Family Diabetes/Gallbladder Study. Diabetes. 2005, 54: 2655-2662. 10.2337/diabetes.54.9.2655.View ArticlePubMedGoogle Scholar
- Sy JP, Taylor JM, Cumberland WG: A stochastic model for the analysis of bivariate longitudinal AIDS data. Biometrics. 1997, 53: 542-555. 10.2307/2533956.View ArticlePubMedGoogle Scholar
- Thiebaut R, Jacqmin-Gadda H, Chene G, Leport C, Commenges D: Bivariate linear mixed models using SAS proc MIXED. Comput Methods Programs Biomed. 2002, 69: 249-256. 10.1016/S0169-2607(02)00017-2.View ArticlePubMedGoogle Scholar
- Gao F, Thompson P, Xiong C, Miller JP: Analyzing multivariate longitudinal data using SAS. 2012, Carey, NC, SAS Institute, 187-31. paper Accessed on 20 AugustGoogle Scholar
- Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O: SAS for mixed models. 2006, Carey, NC, SAS Institute, 2Google Scholar
- Dudbridge F, Gusnanto A: Estimation of significance thresholds for genomewide association scans. Genet Epidemiol. 2008, 32: 227-234. 10.1002/gepi.20297.PubMed CentralView ArticlePubMedGoogle Scholar
- Yang F, Tang Z, Deng H: Bivariate association analysis for quantitative traits using generalized estimation equation. J Genet Genomics. 2009, 36: 733-743. 10.1016/S1673-8527(08)60166-6.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.