Comparing baseline and longitudinal measures in association studies
BMC Proceedings volume 8, Article number: S84 (2014)
In recent years, longitudinal family-based studies have had success in identifying genetic variants that influence complex traits in genome-wide association studies. In this paper, we suggest that longitudinal analyses may contain valuable information that can enable identification of additional associations compared to baseline analyses. Using Genetic Analysis Workshop 18 data, consisting of whole genome sequence data in a pedigree-based sample, we compared 3 methods for the genetic analysis of longitudinal data to an analysis that used baseline data only. These longitudinal methods were (a) a longitudinal mixed-effects model; (b) analysis of the mean trait over time; and (c) a 2-stage analysis, with estimation of a random intercept in the first stage and regression of the random intercept on a single-nucleotide polymorphism in the second stage. All methods accounted for the familial correlation among subjects within a pedigree. The analyses considered common variants with minor allele frequency of at least 5% on chromosome 3. Analyses were performed without knowledge of the simulation model. The 3 longitudinal methods showed consistent results, which were generally different from those found by using only the baseline observation. The gene CACNA2D3, identified by both longitudinal and baseline approaches, had a stronger signal in the longitudinal analysis (p = 2.65 × 10⁻⁷) than in the baseline analysis (p = 2.48 × 10⁻⁵). The effect sizes from the longitudinal mixed-effects model and the mean trait analysis were larger than those from the 2-stage approach. The longitudinal methods gave stable results that differed from those obtained using a single baseline observation and generally had lower p values.
Longitudinal data analyses are widely used in genome-wide association studies to assess genetic and environmental risk factors and their association with phenotypes of interest [1–3]. They are more complicated than analyses using only baseline measures because subjects are followed over time and change is measured during follow-up. Standard linear regression techniques are not applicable in this setting because of the correlation that exists among the repeated measures per subject. Methods for longitudinal study designs have enabled the investigation of genetic variation influencing trait values over time. In Genetic Analysis Workshop 13, Gauderman et al provided an overview of a wide range of methods for the genetic analysis of longitudinal data in families. They grouped these methods into 2 classes: (a) 2-stage approaches, in which a summary statistic is obtained and then used in genetic analysis, and (b) joint modeling, in which the genetic and longitudinal data are analyzed simultaneously in a single analysis. They argued that a mean-type statistic could provide greater power than a slope-type statistic for detecting a gene effect. Zhu et al performed a genome-wide association analysis in which they identified genes and gene-environment interactions associated with longitudinal traits, using multivariate adaptive splines to analyze the longitudinal data.
In this paper, our main objective is to compare existing methods of longitudinal data analysis with an analysis that uses only 1 baseline measure in association studies. We explore the following longitudinal methods: (a) a longitudinal mixed-effects model; (b) analysis of the mean trait over time; and (c) a 2-stage analysis, with estimation of a random intercept in the first stage and regression of the random intercept on a single-nucleotide polymorphism (SNP) in the second stage. These longitudinal methods use statistics that capture the level of a trait, such as a mean, to detect genetic associations, as opposed to methods that focus on the change in the trait over time, such as a slope. Despite the strengths and integrated approach of a longitudinal mixed model, fitting it is computationally intensive because of its complex structure. Therefore, the main motivation for trying "simpler" alternatives, such as analysis of the mean trait over time and a 2-stage analysis, is to see whether they can serve as substitutes with comparable performance.
Study subjects and phenotype
We used real phenotype data collected in the San Antonio Family Heart Study, including sex, age, year of examination, systolic and diastolic blood pressure, use of antihypertensive medications, and tobacco smoking at up to 4 time points for 939 subjects in 20 pedigrees. Of the 939 participants, 244 attended only 1 exam; for the remaining subjects, the median follow-up time was 11 years, with a median gap between assessments of 5 years. We analyzed 2 continuous traits: systolic blood pressure (SBP) and diastolic blood pressure (DBP). For participants on medication, we imputed both SBP and DBP to mimic what their unmedicated values would be. If a subject was on medication at an exam, we imputed the blood pressure at this exam as the average blood pressure of all observations with higher values among those of the same sex and within ±10 years of the subject's age. We performed a preliminary analysis to select covariates for both SBP and DBP. Variables significantly associated (p < 0.05) with SBP or DBP were selected. For SBP, we adjusted for age, sex, and tobacco smoking. For DBP, we adjusted for age, sex, tobacco smoking, and centered age squared.
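The imputation rule for medicated readings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`sex`, `age`, `bp`, `on_meds`) are hypothetical, the reference set is taken to be all other observations (the text does not say whether medicated observations are excluded), and a reading is left unchanged when no higher reference value exists.

```python
from statistics import mean


def impute_medicated_bp(observations):
    """Return blood pressures with medicated readings imputed.

    Each observation is a dict with the hypothetical keys
    'sex', 'age', 'bp', and 'on_meds'.
    """
    imputed = []
    for idx, obs in enumerate(observations):
        if not obs["on_meds"]:
            imputed.append(obs["bp"])
            continue
        # Reference set: other observations of the same sex, within
        # 10 years of age, whose value exceeds the medicated reading.
        higher = [
            other["bp"]
            for j, other in enumerate(observations)
            if j != idx
            and other["sex"] == obs["sex"]
            and abs(other["age"] - obs["age"]) <= 10
            and other["bp"] > obs["bp"]
        ]
        # Assumption: keep the observed value if no reference exists.
        imputed.append(mean(higher) if higher else obs["bp"])
    return imputed


# Toy data (not from the study): only the first reading is medicated.
exams = [
    {"sex": "F", "age": 50, "bp": 130.0, "on_meds": True},
    {"sex": "F", "age": 55, "bp": 150.0, "on_meds": False},
    {"sex": "F", "age": 45, "bp": 140.0, "on_meds": False},
    {"sex": "F", "age": 48, "bp": 120.0, "on_meds": False},
    {"sex": "M", "age": 50, "bp": 170.0, "on_meds": False},
    {"sex": "F", "age": 75, "bp": 180.0, "on_meds": False},
]
imputed_bp = impute_medicated_bp(exams)
```

In this toy example the medicated reading of 130 is replaced by the mean of the two eligible higher readings (150 and 140); the male subject and the 75-year-old fall outside the matching criteria.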
The genetic data from Genetic Analysis Workshop 18 (GAW18) consisted of whole genome sequence data in a pedigree-based sample with longitudinal phenotype data for hypertension and related traits. A total of 26.8 million SNPs were identified in the 483 individuals. After eliminating 19 outlier individuals who failed to meet SNP quality control criteria such as fractions and ratio of homogeneous and heterogeneous sites and fraction of novel SNPs, 24 million SNPs passed support vector machine and indel proximity filters. Genotype calls cleaned of mendelian errors and dosages were provided for 959 individuals (464 directly sequenced and the rest imputed) for 8,348,674 locations in the genome. A majority of the SNPs were rare variants; 51% had a minor allele frequency (MAF) below 1%. As suggested by GAW18 leaders, all analyses for this current paper were based on 402,985 common variants (MAF ≥5%) of chromosome 3 only, accounting for around one-third of the total number of variants on the chromosome.
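A common-variant filter of the kind described above might look like the following Python sketch. It assumes dosages count copies of one allele on a 0 to 2 scale and uses a naive frequency estimate that ignores relatedness; the function names and toy data are hypothetical and are not part of the GAW18 pipeline.

```python
def minor_allele_frequency(dosages):
    """Naive MAF from per-person dosages (0 to 2 copies of one allele)."""
    freq = sum(dosages) / (2 * len(dosages))
    return min(freq, 1 - freq)


def common_variants(dosage_by_snp, threshold=0.05):
    """Return the SNP identifiers whose MAF is at least the threshold."""
    return [
        snp
        for snp, dosages in dosage_by_snp.items()
        if minor_allele_frequency(dosages) >= threshold
    ]


# Toy dosages (not real data).
toy_dosages = {
    "snp_a": [0, 1, 2, 1, 0, 1, 0, 0, 1, 0],   # MAF 0.30
    "snp_b": [0] * 19 + [1],                    # MAF 0.025, rare
    "snp_c": [2, 2, 2, 2, 1, 2, 2, 1, 2, 2],   # counted allele is major
}
kept = common_variants(toy_dosages)
```

Taking the minimum of the frequency and its complement means the filter does not depend on which allele the dosage happens to count.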
Baseline association analysis
For comparison with the methods that used the longitudinal data, we applied a baseline association analysis that considered only the first observation (baseline) for each person. In addition to adjusting for covariates, we incorporated a familial correlation structure (kinship coefficient matrix) into the model as
$$y_{ij} = \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}_c + \beta_g\, g_{ij} + b_{ij} + \varepsilon_{ij},$$
where i denotes the ith pedigree, and j denotes the jth individual in the ith pedigree. For this individual, $y_{ij}$ denotes the phenotype at baseline, $\mathbf{x}_{ij}$ denotes the covariates at baseline, and $g_{ij}$ denotes the SNP dosage. $\beta_0$ is the fixed intercept, $\boldsymbol{\beta}_c$ is a vector of regression coefficients for the m covariates, and $\beta_g$ is the SNP effect size; $b_{ij}$ is the random intercept for the (i,j)th person. Within each pedigree, the vector $\mathbf{b}_i$ is normally distributed with a mean of 0 and a covariance matrix of $\sigma_b^2 \mathbf{K}_i$ (with $\mathbf{K}_i$ the kinship matrix of pedigree i), contributing a diagonal block for each pedigree to the overall covariance matrix; $\varepsilon_{ij}$ is an error term with a mean of 0 and a variance of $\sigma_\varepsilon^2$. This model was implemented using the lmekin function in the R (version 2.9.2) package "kinship", which employed maximum likelihood methods to estimate parameters.
The notation used in this baseline model applies to the following models where applicable.
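To make the "diagonal block for each pedigree" structure concrete, the Python sketch below assembles the covariance matrix implied by a kinship-structured random intercept plus independent error. It is an illustration only: the relationship matrices are taken to hold twice the kinship coefficients (an assumption about scaling), and the variance values are hypothetical rather than estimates from the study.

```python
def pedigree_covariance(relationship_blocks, var_random, var_error):
    """Assemble the block-diagonal covariance of all subjects.

    relationship_blocks: one square matrix (list of lists) per pedigree.
    var_random: variance of the kinship-structured random intercept.
    var_error: variance of the independent error term.
    """
    n = sum(len(block) for block in relationship_blocks)
    cov = [[0.0] * n for _ in range(n)]
    offset = 0
    for block in relationship_blocks:
        size = len(block)
        for r in range(size):
            for c in range(size):
                cov[offset + r][offset + c] = var_random * block[r][c]
            # Independent error adds to the diagonal only.
            cov[offset + r][offset + r] += var_error
        offset += size
    return cov


# Toy pedigrees: two unrelated parents with one child, and a sibling pair.
trio = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 1.0]]
sib_pair = [[1.0, 0.5], [0.5, 1.0]]
cov = pedigree_covariance([trio, sib_pair], var_random=4.0, var_error=1.0)
```

Entries linking members of different pedigrees stay at zero, which is what allows each pedigree to be handled as its own block.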
To compare with the baseline approach, we considered 3 approaches for longitudinal analyses of these data: (a) longitudinal mixed-effects association analysis, (b) mean measure in longitudinal association analysis, and (c) 2-stage longitudinal association analysis.
Longitudinal mixed-effects association analysis
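We used a random-intercept mixed-effects model with familial correlation structure. The model is:
$$y_{ijt} = \beta_0 + \mathbf{x}_{ijt}^{\top}\boldsymbol{\beta}_c + \beta_g\, g_{ij} + b_{ij} + \varepsilon_{ijt}$$
Here i denotes the ith pedigree, and j denotes the jth individual in the ith pedigree. For this individual, $y_{ijt}$ denotes the trait at time point t; $\mathbf{x}_{ijt}$ denotes the covariates at time t, including time-dependent covariates. The remaining terms are as in the baseline model: $\beta_0$ is the fixed intercept, $\boldsymbol{\beta}_c$ the covariate coefficients, $g_{ij}$ the SNP dosage with effect size $\beta_g$, $b_{ij}$ the kinship-structured random intercept, and $\varepsilon_{ijt}$ the error term. This model was implemented in the R (version 2.15.1) package "pedigreemm", which used the method of restricted maximum likelihood for parameter estimation.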
Mean measure in longitudinal association analysis
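We also considered the mean across all time points as the trait, with correspondingly averaged covariates, as one alternative for longitudinal association analysis. This model is:
$$\bar{y}_{ij} = \beta_0 + \bar{\mathbf{x}}_{ij}^{\top}\boldsymbol{\beta}_c + \beta_g\, g_{ij} + b_{ij} + \varepsilon_{ij}$$
Here i denotes the ith pedigree, and j denotes the jth individual in the ith pedigree. For this individual, $\bar{y}_{ij}$ denotes the mean trait across time, and $\bar{\mathbf{x}}_{ij}$ denotes the covariates, which for time-dependent covariates are the average measures across time. The remaining terms are as in the baseline model: $\beta_0$ is the fixed intercept, $\boldsymbol{\beta}_c$ the covariate coefficients, $g_{ij}$ the SNP dosage with effect size $\beta_g$, $b_{ij}$ the kinship-structured random intercept, and $\varepsilon_{ij}$ the error term.
This model was implemented using the function lmekin in the R (version 2.9.2) package "kinship", using maximum likelihood methods to estimate parameters.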
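The data preparation behind the mean measure approach, collapsing repeated visits to one record per subject, can be sketched as follows. The record layout and field names are hypothetical; the association model itself (lmekin in R) is not reproduced here.

```python
from statistics import mean


def collapse_to_means(records, fields):
    """Average the named numeric fields over each subject's visits.

    records: list of dicts, each with a 'subject' key (hypothetical layout).
    Returns a dict mapping subject -> {field: mean over visits}.
    """
    by_subject = {}
    for rec in records:
        by_subject.setdefault(rec["subject"], []).append(rec)
    return {
        subject: {field: mean(row[field] for row in rows) for field in fields}
        for subject, rows in by_subject.items()
    }


# Toy visits (not real data): subject A has 2 exams, subject B has 1.
visits = [
    {"subject": "A", "sbp": 120.0, "age": 40.0},
    {"subject": "A", "sbp": 130.0, "age": 45.0},
    {"subject": "B", "sbp": 110.0, "age": 60.0},
]
means = collapse_to_means(visits, ["sbp", "age"])
```

A subject with a single exam simply contributes that exam's values, so the collapsed data set has the same shape as the baseline data set and can be analyzed with the same software.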
Two-stage longitudinal association analysis
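Another longitudinal approach employs a 2-stage strategy. In the first stage, a random intercept, $a_{ij}$, representing the level of the trait for each person, was generated from a growth curve model:
$$y_{ijt} = \alpha_0 + \mathbf{x}_{ijt}^{\top}\boldsymbol{\alpha}_c + a_{ij} + e_{ijt}$$
Here i denotes the ith pedigree, and j denotes the jth individual in the ith pedigree. For this individual, $y_{ijt}$ denotes the trait at time point t, and $\mathbf{x}_{ijt}$ denotes the covariates, including time-dependent covariates, with coefficient vector $\boldsymbol{\alpha}_c$. $\alpha_0$ is the fixed intercept of the first stage; $a_{ij}$ is the random intercept; and $e_{ijt}$ is the residual error. As above, the covariance structure of $\mathbf{a}_i$ is $\sigma_a^2 \mathbf{K}_i$, where $\mathbf{K}_i$ is the kinship matrix of pedigree i, which contributes a diagonal block for each pedigree to the overall covariance matrix.
In the second stage, the predicted random intercept $\hat{a}_{ij}$ is treated as the "new" trait and regressed on a SNP as follows:
$$\hat{a}_{ij} = \gamma_0 + \beta_g\, g_{ij} + b_{ij} + \varepsilon_{ij}$$
Here $g_{ij}$ denotes the SNP dosage. $\gamma_0$ is the intercept of the second stage; $\beta_g$ is the SNP effect size; $\varepsilon_{ij}$ is an error term with a mean of 0 and a variance of $\sigma_\varepsilon^2$; $b_{ij}$ is the random intercept that adjusts for the familial correlation of $\hat{a}_{ij}$; and, similarly, the vector $\mathbf{b}_i$ is normally distributed with a mean of 0 and a covariance matrix of $\sigma_b^2 \mathbf{K}_i$, contributing a diagonal block for each pedigree to the overall covariance matrix.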
Gauderman et al pointed out that a mean-based statistic is more powerful for detecting a genetic association than a slope-based statistic (eg, a random slope). We therefore adopted the random intercept of the first stage, rather than a random slope, as the "trait" in the second stage. The first-stage model was implemented using lmekin of the R (version 2.15.1) package "coxme", which could handle more than 1 random effect; the second-stage model was implemented using lmekin of the R (version 2.9.2) package "kinship", which adopted a faster computing algorithm. Both packages used maximum likelihood in parameter estimation.
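The logic of the 2-stage strategy can be illustrated with a deliberately simplified Python sketch. It is not the analysis performed here: it ignores covariates and familial correlation, treats the variance components as known (hypothetical values) instead of estimating them by maximum likelihood, and uses ordinary least squares in the second stage. Under those assumptions, the first stage predicts each subject's random intercept by shrinking the subject mean toward the overall intercept, and the second stage regresses those predictions on SNP dosage.

```python
from statistics import mean


def stage_one_intercepts(traits_by_subject, var_intercept, var_error):
    """Predict a random intercept per subject with known variances."""
    means = {s: mean(v) for s, v in traits_by_subject.items()}
    counts = {s: len(v) for s, v in traits_by_subject.items()}
    # Weight each subject mean by the inverse of its variance,
    # var_intercept + var_error / n, to estimate the fixed intercept.
    weights = {s: 1.0 / (var_intercept + var_error / counts[s]) for s in counts}
    fixed = sum(weights[s] * means[s] for s in counts) / sum(weights.values())
    intercepts = {}
    for s in counts:
        # Shrinkage factor: more visits means less shrinkage toward zero.
        shrink = counts[s] * var_intercept / (counts[s] * var_intercept + var_error)
        intercepts[s] = shrink * (means[s] - fixed)
    return fixed, intercepts


def stage_two_slope(intercepts, dosage_by_subject):
    """Least-squares slope of the predicted intercepts on SNP dosage."""
    subjects = sorted(intercepts)
    x = [dosage_by_subject[s] for s in subjects]
    y = [intercepts[s] for s in subjects]
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


# Toy data (not from the study): 4 subjects with 2 visits each.
traits = {"s1": [100.0, 102.0], "s2": [110.0, 112.0],
          "s3": [120.0, 122.0], "s4": [110.0, 112.0]}
dosages = {"s1": 0.0, "s2": 1.0, "s3": 2.0, "s4": 1.0}
fixed, intercepts = stage_one_intercepts(traits, var_intercept=50.0, var_error=50.0)
slope = stage_two_slope(intercepts, dosages)
```

In this toy example, regressing the raw subject means on dosage would give a slope of 10, whereas the shrunken intercepts give about 6.67. Shrinkage of this kind may be one reason a 2-stage analysis can report smaller effect sizes than an analysis of the mean trait.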
Power and type I error
We conducted power calculations for all 4 methods and evaluated type I error by means of the genomic control value. We chose the variant (chromosome 3: 47956424) in the gene MAP4, the top variant influencing simulated SBP and DBP, as the functional variant for power calculations. To determine power, we tested the null hypothesis that the trait SBP was not associated with the functional variant against the alternative hypothesis that it was associated. Results were considered statistically significant if the p value obtained from an analysis method fell below a predetermined threshold. We set this threshold by dividing the significance level of 0.05 by the approximate number (25,676) of independent SNPs on chromosome 3 to adjust for multiple testing. To obtain this number, we used PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/) to prune out SNPs on chromosome 3 where the pairwise linkage disequilibrium was 0.2 or greater, after which 25,676 SNPs remained. For each of the 4 methods, the estimated power was the proportion of replicates in which the method detected a significant association between the trait and the functional variant.
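The threshold and the power estimate described above amount to simple arithmetic, sketched here in Python. The replicate p values are made up for illustration; only the significance level (0.05) and the number of independent SNPs (25,676) come from the text.

```python
def bonferroni_threshold(alpha, n_independent_tests):
    """Per-test significance threshold after Bonferroni adjustment."""
    return alpha / n_independent_tests


def estimated_power(p_values, threshold):
    """Proportion of replicates whose p value falls below the threshold."""
    hits = sum(1 for p in p_values if p < threshold)
    return hits / len(p_values)


threshold = bonferroni_threshold(0.05, 25676)  # roughly 1.95e-6

# Hypothetical p values for the functional variant, one per replicate.
replicate_p_values = [3.1e-7, 4.0e-4, 1.2e-6, 0.02, 8.5e-7, 2.5e-6, 0.3, 1.0e-8]
power = estimated_power(replicate_p_values, threshold)
```

With these made-up values, 4 of the 8 replicates fall below the threshold, giving an estimated power of 0.5.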
For each of the 4 methods, the genomic control value was used to assess the extent of inflation of type I error, based on the p values of common variants on chromosome 3.
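The text does not give the formula used for the genomic control value. A commonly used definition, assumed here, converts each p value to a 1-degree-of-freedom chi-square statistic and divides the median statistic by the median of that distribution under the null (about 0.4549). A minimal Python sketch under that assumption:

```python
from statistics import NormalDist, median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(p_values):
    """Genomic control value from two-sided p values (each in (0, 1])."""
    normal = NormalDist()
    # A two-sided p value corresponds to z with P(|Z| > |z|) = p, and
    # z squared is the matching 1-df chi-square statistic.
    chi_squares = [normal.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi_squares) / CHI2_1DF_MEDIAN


# Evenly spaced p values mimic a well-calibrated test.
uniform_p = [(i + 0.5) / 1001 for i in range(1001)]
lambda_null = genomic_control_lambda(uniform_p)

# Squaring the p values makes them too small, mimicking inflation.
lambda_inflated = genomic_control_lambda([p ** 2 for p in uniform_p])
```

Values close to 1 suggest no systematic inflation of the test statistics, while values well above 1 suggest inflated type I error.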
Association analysis of real data
For SBP, the top 10 hits from the baseline approach shared no results with those from the 3 longitudinal methods (Table 1). Genes shared among the longitudinal methods included FGF12 and FHIT. The mean measure and 2-stage methods yielded similar results. For DBP, the 3 longitudinal methods yielded consistent results (as shown in Figure 1, right side): the top 10 hits came from the same gene (CACNA2D3 in Table 2; eg, SNP 3_54748234 had a p value of 2.65 × 10⁻⁷), with SNPs nearly reaching the Bonferroni significance threshold. This gene was also found using the baseline method, but with a weaker signal (rank = 2, p = 2.76 × 10⁻⁵ in Table 2).
Power and type I error
Power was computed for the baseline method and the 3 longitudinal methods (Table 3). The 3 longitudinal methods had at least 10.5% higher power than the baseline method. Among the longitudinal methods, the power of the mean measure and 2-stage methods was comparable (41% and 40.5%, respectively) and substantially higher than that of the linear mixed-effects (LME) method (32.5%). None of the 4 methods showed elevated type I error: the genomic control values ranged from about 0.98 to 1.034.
Discussion and conclusions
For both traits, the genes identified by the 3 longitudinal methods were consistent with one another but different from those found with the baseline approach. In terms of computational time, the mean measure and 2-stage methods were more efficient than the LME method. Furthermore, these 2 methods were more powerful than the LME method, so they can act as efficient and powerful "substitutes" for LME. The mean measure method worked as well as the 2-stage method, identifying the same genes. The signals found with the 2-stage method (third row of the Manhattan plots in Figure 1) were almost identical to those found with the LME method, for both SBP and DBP. Therefore, we concluded that the mean measure and 2-stage methods are 2 efficient ways to analyze longitudinal data when the goal is to examine the level of a trait. Only a longitudinal approach, however, can evaluate associations with trends over time.
Zhu W, Cho K, Chen X, Zhang M, Wang M, Zhang H: A genome-wide association analysis of Framingham Heart Study longitudinal data using multivariate adaptive splines. BMC Proc. 2009, 3 (suppl 7): S119. 10.1186/1753-6561-3-s7-s119.
Luan J, Kerner B, Zhao JH, Loos RJ, Sharp SJ, Muthén BO, Wareham NJ: A multilevel linear mixed model of the association between candidate genes and weight and body mass index using the Framingham longitudinal family data. BMC Proc. 2009, 3 (suppl 7): S115. 10.1186/1753-6561-3-s7-s115.
Smith EN, Chen W, Kähönen M, Kettunen J, Lehtimäki T, Peltonen L, Raitakari OT, Salem RM, Schork NJ, et al: Longitudinal genome-wide association of cardiovascular disease risk factors in the Bogalusa Heart Study. PLoS Genet. 2010, 6: e1001094. 10.1371/journal.pgen.1001094.
Gauderman WJ, Macgregor S, Briollais L, Scurrah K, Tobin M, Park T, Wang D, Rao S, John S, Bull S: Longitudinal data analysis in pedigree studies. Genet Epidemiol. 2003, 25 (Suppl 1): S18-S28.
R Development Core Team: R: A language and environment for statistical computing. 2009, R Foundation for Statistical Computing, Vienna, Austria, [http://www.R-project.org]
R Development Core Team: R: A language and environment for statistical computing. 2012, R Foundation for Statistical Computing, Vienna, Austria, [http://www.R-project.org]
Laird NM, Ware JH: Random-effects models for longitudinal data. Biometrics. 1982, 38: 963-974. 10.2307/2529876.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
The research was conducted using Linux Clusters for Genetic Analysis (LinGA). The computing resource was funded by the Robert Dawson Evans Endowment of the Department of Medicine at Boston University School of Medicine and Boston Medical Center, the National Heart, Lung and Blood Institute contract to the Framingham Heart Study (N01-HC-38038), and contributions from individual investigators. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
The authors declare that they have no competing interests.
SW, WG, JN, CA, CTL, and LAC designed the overall study; SW, WG, JN, and CA conducted statistical analyses; and SW, WG, and JN drafted the manuscript. All authors read and approved the final manuscript.