Detecting longitudinal effects of haplotypes and smoking on hypertension using B-splines and Bayesian LASSO
© Xia and Lin; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
The behavior of a gene can be dynamic; thus, if longitudinal data are available, it is important that we study the dynamic effects of genes on a trait over time. The effect of a haplotype can be expressed by time-varying coefficients. In this paper, we use the natural cubic B-spline to express these coefficients that capture the trends of the effects of haplotypes, some of which may be rare, over time; that is, at different ages. More specifically, to capture disease-associated common and rare haplotypes and environmental factors for data from unrelated individuals, we developed a method of time-varying coefficients that uses the logistic Bayesian LASSO methodology and B-spline by setting proper prior distributions. Haplotype and environmental effect coefficients are obtained by using Markov chain Monte Carlo methods. We applied the method to analyze the MAP4 gene on chromosome 3 and have identified several haplotypes that are associated with hypertension with varying effect sizes in the range of 55 to 85 years of age.
The Genetic Analysis Workshop 18 (GAW18) real data are family-based, consisting of cleaned single-nucleotide polymorphism (SNP) genotypes, sex, age at the time of examination, hypertension status, and smoking for up to 4 time points. Data on 157 unrelated individuals are also extracted from the families and made available for analysis. Previous studies have examined more than 50 genes for their associations with hypertension, and the number is growing. Moreover, hypertension is also considered to be age-dependent; the chance of being hypertensive rises with age and the risk after midlife (eg, more than 50 years of age) is considerable. Hence, in this paper, we aim to identify both genetic and environmental factors that are associated with high blood pressure, with the effects potentially varying at different ages.
This contribution concerns a haplotype-based method because haplotype procedures can be more powerful than a single SNP analysis if there are multiple causal variants interacting in cis-fashion, or if only SNPs in linkage disequilibrium with causal SNPs are genotyped. Rare haplotypes can result even when only common SNPs are considered. Thus, novel methods are needed not only to take the varying effects of haplotypes and environmental factors into account, but also to deal with the anomaly of rare variants. The particular environmental factor of interest is smoking, as it has been shown to be a potential risk factor for hypertension; thus, we include smoking in our model in addition to sex and age.
Because the GAW18 data are collected prospectively, we first formulate a prospective likelihood. We then borrow the idea from the logistic Bayesian LASSO (LBL) approach to penalize parameters (regression coefficients) by setting up proper prior distributions. We chose to follow the LBL idea because it has been shown to be capable of detecting rare associated haplotypes, albeit under a retrospective setting with fixed, rather than varying, coefficients. As such, the LBL time-varying coefficient (LBL-tvc) method developed in this paper can be considered a generalization of the original LBL. The haplotype and environmental effect coefficients are obtained by using Markov chain Monte Carlo (MCMC) methods. By using the proper percentiles of the sampled parameters, we can also construct hypothesis tests to determine whether a haplotype or an environmental covariate is associated with the disease.
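The percentile-based test mentioned above can be illustrated with a minimal sketch. The text does not state which percentiles are used, so the 2.5th and 97.5th percentiles (a 95% interval) and the rule "declare an association when the interval excludes 0" are assumptions made for illustration, and the function name `credible_interval_test` is hypothetical.

```python
import numpy as np

def credible_interval_test(draws, level=0.95):
    """Percentile interval for one coefficient from its MCMC draws.

    Returns (lower, upper, significant). `significant` is True when the
    interval excludes 0. The 95% default is an assumption, not taken
    from the paper.
    """
    draws = np.asarray(draws, dtype=float)
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(draws, [tail, 100.0 - tail])
    return float(lower), float(upper), bool(lower > 0.0 or upper < 0.0)
```

For a time-varying coefficient, the same check could be applied to the sampled coefficient curve at each age of interest.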
We considered data from 157 unrelated individuals. Blood pressure measurements, age, and smoking status were available for up to 4 time points. Specifically, 40, 41, 45, and 31 individuals had measurements at 1, 2, 3, and 4 time points, respectively. Binary hypertension status is as defined in the original study: An individual is labeled as hypertensive if the systolic blood pressure is greater than 140 mm Hg, or the diastolic blood pressure is greater than 90 mm Hg, or if the individual is on antihypertensive medication at the time of examination. Individuals with incomplete genotype data for the SNPs under consideration were excluded from the analysis. However, individuals with measurements at fewer than 4 time points were all included because such individuals can be accommodated by our model (see below).
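The hypertension definition above translates directly into a small labeling rule. The sketch below is only an illustration; the function and argument names are hypothetical and do not come from the GAW18 data files.

```python
def is_hypertensive(sbp: float, dbp: float, on_medication: bool) -> int:
    """Return 1 if hypertensive under the definition in the text, else 0.

    sbp and dbp are systolic and diastolic blood pressure in mm Hg;
    on_medication flags antihypertensive medication at the examination.
    """
    return int(sbp > 140 or dbp > 90 or on_medication)
```

Note that the thresholds are strict inequalities, so a reading of exactly 140/90 mm Hg without medication is labeled as not hypertensive.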
Selection of 4 regions in the MAP4 gene
For each SNP, hypertension status (1 if hypertensive) at the first examination is modeled as a function of smoking status, age, and sex at the first examination, together with the genotype at the kth SNP. A chi-square analysis of variance (ANOVA) test is performed and the p value is recorded.
We choose SNPs corresponding to the 4 smallest p values as the anchors of our 4 regions for further analysis. Each chosen SNP and its 4 adjacent SNPs (2 on each side) form a 5-SNP-haplotype block. We used the Hapassoc software (http://cran.r-project.org/web/packages/hapassoc/index.html) to estimate haplotype frequencies, which were then used as the starting values for our MCMC analysis.
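The block construction described above can be sketched as follows. This is an illustration only: the function name `select_blocks` is hypothetical, SNPs are assumed to be indexed in chromosomal order, and the handling of anchors near the ends of the gene (the window is simply truncated) is an assumption that the text does not address.

```python
import numpy as np

def select_blocks(p_values, n_anchors=4, flank=2):
    """Pick the SNPs with the smallest p values and build haplotype blocks.

    Returns a list of (anchor_index, [snp indices in the block]), where
    each block is the anchor plus `flank` adjacent SNPs on each side.
    """
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p, kind="stable")  # ascending p values
    blocks = []
    for k in order[:n_anchors]:
        k = int(k)
        start = max(0, k - flank)            # truncate at the left end
        stop = min(len(p), k + flank + 1)    # truncate at the right end
        blocks.append((k, list(range(start, stop))))
    return blocks
```

With `flank=2` every interior anchor yields a 5-SNP block, matching the construction in the text; neighboring blocks may overlap when two anchors are close together.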
Prospective likelihood formulation
Under a logistic regression model, the log odds that an individual is hypertensive at age t combine four terms: the vector of haplotype effects at age t, multiplied by the design vector that gives the number of copies of each haplotype in Z; the smoking effect, multiplied by the smoking status at age t (age at examination); the interaction effect, multiplied by the interaction of smoking and haplotype; and the sex effect, multiplied by the sex.
The likelihood function is now completely specified in terms of the parameter vector Ψ.
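To make the idea of a time-varying coefficient concrete, the following minimal sketch expresses each haplotype effect as a cubic B-spline in age and plugs it into a logistic model. Several details are assumptions rather than the paper's specification: the knot placement over the 55 to 85 year range, the use of a plain clamped cubic B-spline instead of a natural spline, the names `KNOTS`, `bspline_basis`, and `hypertension_prob`, and the lumping of the smoking, interaction, and sex terms into a single `offset`.

```python
import numpy as np

# Assumed clamped knot vector on the age range 55 to 85 (illustrative only).
KNOTS = np.array([55, 55, 55, 55, 65, 75, 85, 85, 85, 85], dtype=float)

def bspline_basis(t, knots, degree=3):
    """Evaluate all B-spline basis functions at t (Cox-de Boor recursion)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    knots = np.asarray(knots, dtype=float)
    # Degree 0: indicators of the half-open knot intervals.
    B = np.zeros((len(t), len(knots) - 1))
    for j in range(len(knots) - 1):
        B[:, j] = (knots[j] <= t) & (t < knots[j + 1])
    # Close the last non-empty interval so t == knots[-1] is covered.
    last = np.max(np.nonzero(knots[:-1] < knots[1:])[0])
    B[t == knots[-1], last] = 1.0
    for d in range(1, degree + 1):
        B_new = np.zeros((len(t), len(knots) - d - 1))
        for j in range(len(knots) - d - 1):
            left = knots[j + d] - knots[j]
            right = knots[j + d + 1] - knots[j + 1]
            if left > 0:
                B_new[:, j] += (t - knots[j]) / left * B[:, j]
            if right > 0:
                B_new[:, j] += (knots[j + d + 1] - t) / right * B[:, j + 1]
        B = B_new
    return B  # shape (len(t), len(knots) - degree - 1)

def hypertension_prob(age, hap_counts, spline_coefs, offset=0.0):
    """P(hypertensive at `age`) with haplotype effects beta(age) = B(age) @ coefs.

    hap_counts: copies of each haplotype (length H).
    spline_coefs: array of shape (n_basis, H), one spline per haplotype.
    offset: stands in for the remaining (smoking, interaction, sex) terms.
    """
    basis = bspline_basis(age, KNOTS)[0]   # (n_basis,)
    beta_t = basis @ spline_coefs          # haplotype effects at this age
    eta = offset + beta_t @ hap_counts     # linear predictor (log odds)
    return float(1.0 / (1.0 + np.exp(-eta)))
```

Because the basis functions sum to 1 at every age within the knot range, setting all spline coefficients of a haplotype to the same value recovers an ordinary fixed-coefficient logistic model, which is one way to see the time-varying model as a generalization.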
MCMC estimation of parameters
We follow the LBL methodology for estimating the parameters. A double exponential distribution with mean 0 is set as the prior distribution for each regression coefficient, and the intensity parameter of that prior is given a gamma distribution to control shrinkage. Uninformative priors are set for the haplotype frequencies and the inbreeding coefficient. We use MCMC methods to sample the parameters from the appropriate posterior distributions. If it is feasible to sample directly from the conditional distribution of a parameter, then we use the Gibbs sampler; otherwise, we use the Metropolis-Hastings algorithm with an appropriate proposal distribution.
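The mix of Gibbs and Metropolis-Hastings updates can be illustrated on a toy logistic regression with a double exponential prior. This is a simplified sketch, not the authors' sampler: it omits haplotype frequencies, the inbreeding coefficient, and the spline structure; the gamma hyperparameters `a` and `b`, the random-walk proposal, and its step size are all assumptions. The intensity parameter has a gamma full conditional, so it is drawn directly (Gibbs), whereas the coefficients do not have a standard conditional and are updated by Metropolis-Hastings.

```python
import numpy as np

def log_posterior(beta, lam, X, y):
    """Logistic log likelihood plus double exponential log prior on beta."""
    eta = X @ beta
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    logprior = beta.size * np.log(lam / 2.0) - lam * np.sum(np.abs(beta))
    return loglik + logprior

def gibbs_lambda(beta, a, b, rng):
    """Draw lam | beta ~ Gamma(shape = a + p, rate = b + sum |beta_j|)."""
    rate = b + np.sum(np.abs(beta))
    return rng.gamma(shape=a + beta.size, scale=1.0 / rate)

def mh_beta(beta, lam, X, y, step, rng):
    """One sweep of random-walk Metropolis-Hastings, one coefficient at a time."""
    beta = beta.copy()
    for j in range(beta.size):
        proposal = beta.copy()
        proposal[j] += step * rng.normal()
        log_ratio = (log_posterior(proposal, lam, X, y)
                     - log_posterior(beta, lam, X, y))
        if np.log(rng.uniform()) < log_ratio:  # symmetric proposal
            beta = proposal
    return beta

def run_sampler(X, y, n_iter=2000, a=1.0, b=1.0, step=0.3, seed=0):
    """Alternate the two updates and return the sampled coefficients."""
    rng = np.random.default_rng(seed)
    beta, lam = np.zeros(X.shape[1]), 1.0
    draws = np.empty((n_iter, X.shape[1]))
    for it in range(n_iter):
        beta = mh_beta(beta, lam, X, y, step, rng)
        lam = gibbs_lambda(beta, a, b, rng)
        draws[it] = beta
    return draws
```

In practice an initial portion of the draws would be discarded as burn-in before computing percentiles of the sampled coefficients.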
Significant haplotypes and their effect estimates in 4 regions of the MAP4 gene on chromosome 3
The longitudinal nature of the GAW18 data calls for methodology that is able to take the correlated measurements into account. Furthermore, a great deal of information remains to be mined from the common SNP data collected in genome-wide association studies. To this end, we have proposed LBL-tvc, a logistic regression model, to handle the correlated measurements over 4 time points. LBL-tvc considers the effects of haplotypes, which can be rare even if all the underlying SNPs are common. Application of LBL-tvc to the MAP4 gene yielded results that are consistent across all 4 regions of the gene. As one may expect, an associated haplotype confers risk only once an individual reaches the age of 55 to 60 years, when hypertension typically strikes. The results further demonstrate the utility of the methodology for its ability to detect the effects of rare associated haplotypes.
To evaluate the performance of LBL-tvc, we carried out a preliminary simulation study with effects mimicking those seen in the real data. More specifically, the simulation model considers a 5-SNP-haplotype block in which there are 5 common haplotypes and 2 rare haplotypes, with 1 of each type being associated with the hypertensive status. The strength of association across the age range of 20 to 90 years varies in a fashion similar to the pattern in the fitted real data. We also included an interaction effect between smoking and the common risk haplotype. Affection and smoking status are simulated at 4 time points for 250 individuals. The results, based on 100 replications, show that the type I error is well controlled, and that the power for detecting the common haplotype effects in the mid-age range is high (>90%). The power for the rare haplotype effect is much lower (approximately 50%), although still reasonable. The power for detecting the haplotype-smoking interaction is also very high (>90%); we note, however, that the power would likely be much smaller had the interaction been with a rare haplotype. Overall, the simulation results are encouraging and to some extent validate our findings in the real data. Nevertheless, further investigation is needed to fully evaluate the properties of the method.
Because MCMC is applied for estimating the parameters, the procedure is computationally intensive. For example, analysis of each simulation replicate on a 5-SNP-haplotype block with 250 individuals and data on 4 time points as described above took about 35 minutes to complete. Therefore, our method should be used primarily for follow-up studies of genes or regions of interest. In our real data analysis, we simply use a prescreening procedure to find single SNP signals and form haplotypes with the 4 neighboring SNPs. This construction of haplotype blocks is somewhat arbitrary. An alternative would be to select additional SNPs based on linkage disequilibrium plots.
The authors would like to acknowledge the NIH grant that supports GAWs and the GAW18 data providers. This work was supported in part by NIH grant 1R03CA171011-01. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
- Dickson ME, Sigmund CD: Genetic basis of hypertension: revisiting angiotensinogen. Hypertension. 2006, 48: 14-20. 10.1161/01.HYP.0000227932.13687.60.
- Vasan RS, Beiser A, Seshadri S, Larson MG, Kannel WB, D'Agostino RB, Levy D: Residual lifetime risk for developing hypertension in middle-aged women and men: the Framingham Heart Study. JAMA. 2002, 287: 1003-1010.
- Halperin RO, Gaziano JM, Sesso HD: Smoking and the risk of incident hypertension in middle-aged and older men. Am J Hypertens. 2008, 21: 148-152. 10.1038/ajh.2007.36.
- Biswas S, Lin S: Logistic Bayesian LASSO for identifying association with rare haplotypes and application to age-related macular degeneration. Biometrics. 2012, 68: 587-597. 10.1111/j.1541-0420.2011.01680.x.
- Weir BS: Genetic Data Analysis II. 1996, Sunderland, MA: Sinauer Associates
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.