Genetic association analysis for common variants in the Genetic Analysis Workshop 18 data: a Dirichlet regression approach
- Osvaldo Espin-Garcia^{1, 2},
- Xiaowei Shen^{1, 2},
- Xin Qiu^{1},
- Yonathan Brhane^{3},
- Geoffrey Liu^{4, 5} and
- Wei Xu^{1, 5}Email author
https://doi.org/10.1186/1753-6561-8-S1-S70
© Espin-Garcia et al.; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
Abstract
We propose a genetic association analysis using Dirichlet regression to analyze the Genetic Analysis Workshop 18 data. Clinical variables, arranged in a longitudinal data structure, are employed to fit a multistate transition model in which the transition probabilities are served as a response in the proposed analysis. Furthermore, a gene-based association analysis via penalized regression is implemented using the markers at a single-nucleotide polymorphism level that we previously identified via nonpenalized Dirichlet regression.
Keywords
Background
Genetic association analyses have had tremendous successes in recent years; however, most of these analyses were based on binary or continuous responses. Thus we propose a multivariate response vector indicating probabilities of transitions to predefined hypertensive states. This enables us to reflect the inherent uncertainty involved in the probability that a patient will transfer to a given state.
An important feature of our approach is the incorporation of prehypertension as an intermediate state. As Winegarden argues, prehypertension blood pressure in young patients helps predict the development of hypertension [1].
Methods
Definition of response
We defined a response summarizing the phenotype information into a vector that will be used in a genetic association analysis. The response is defined as a 3-dimensional vector of probabilities $y=\left({y}_{1},{y}_{2},{y}_{3}\right),{\sum}^{{y}_{j}}=1\left({\mathsf{\text{y}}}_{1},{\mathsf{\text{y}}}_{2},{\mathsf{\text{y}}}_{3}\right),{\sum}^{{\mathsf{\text{y}}}_{\mathsf{\text{j}}}}=1$, such that each element measures the probability of a transition to a blood pressure level (normotensive, prehypertensive, or hypertensive) given the previous level.
The analysis was done without the knowledge of the underlying simulation model and we used the real phenotype data only.
Data quality control
Data quality control was performed in PLINK [2]. We only considered the data from chromosome 3 for analysis. We used a call rate for individuals of 95%, a Hardy-Weinberg disequilibrium test at a significance level of 1 × 10^{−6}, and a missing rate of 95% for each marker. Markers with a minor allele frequency of at least 5% were retained for analysis. Additionally, all individuals' time points with at least 1 missing clinical variable were excluded.
Multistate transition model
We describe hypertension, our trait of interest, using a 3-state model based on recorded blood pressure levels for each individual at each examination. The states are defined as follows: normal blood pressure (state 1) when the systolic blood pressure is less than 120 mm Hg and diastolic blood pressure is less than 80 mm Hg; prehypertension (state 2) when the blood pressure level is not in state 1, the systolic blood pressure is less than 140 mm Hg, and the diastolic blood pressure is less than 90 mm Hg; and hypertension (state 3) for all other cases. Also, if a patient used antihypertensive medication, the state assigned at that examination is hypertension (state 3) regardless of the recorded blood pressure levels. Once the states are defined, we consider a multistate transition model; it is important to note that all 9 transitions are possible.
where $\left\{{S}_{i}\left(r\right),r=1,2,\dots \right\}$ and $\left\{{x}_{i}\left(r\right)=\left({x}_{i1}\left(r\right),\dots ,{x}_{ip}\left(r\right)\right),r=1,2,\dots \right\}$ denote the observed state and the covariates for subject i at the ${r}^{\mathsf{\text{th}}}$ examination respectively.
This model takes advantage of the longitudinal data structure and the definition of the response follows naturally. To estimate the transition probabilities, we fit a multinomial regression model, based on covariates (gender, smoking status and age) and the state at the previous examination.
besides, $1={y}_{ill}+{\sum}_{j=1,j\ne l}^{3}{y}_{ilj}={y}_{ill}\left(1+{\sum}_{j=1,j\ne l}^{3}\mathsf{\text{exp}}\left({z}_{il}{\gamma}_{lj}\right)\right)$, where ${z}_{il}=\left({x}_{i}\left(l\right),time\right)$ is the observed vector of covariates for subject i plus a categorical variable denoting the effect of examination time in the model (and possible interactions), and ${\gamma}_{lj}$ is the vector of coefficients for the corresponding multinomial regression model.
Therefore, the response for subject i is a row taken from $\mathsf{\text{TP}}{\mathsf{\text{M}}}_{i}$ and is determined by conditioning on the patient's last available observed state and covariates.
Dirichlet regression
Once the response is modeled our objective is to determine whether there is an association between it and the genotypes. We assess this association using Dirichlet regression [4], which suits this response structure. The advantage of this approach lies in its tractability in dealing with the proposed response. It also allows a more comprehensive understanding of the genetic effect on the expression of hypertension, and therefore in its possible interpretation. For instance, if a signal was detected for a marker, it would suggest an association between the marker and the transition of blood pressure states jointly rather than a single level. Therefore, the Dirichlet approach is more informative in the sense of explaining the plausibility of each defined state.
where ${\lambda}_{j}\left({s}_{i}\right)={\lambda}_{ij}>0,\Lambda \left({s}_{i}\right)={\Lambda}_{i}={\sum}_{j=1}^{3}{\lambda}_{j}\left({s}_{i}\right)$ and $\Gamma \left(\cdot \right)$ is the gamma function.
where $M$ is the number of covariates included in the model and ${\mathit{\beta}}_{j}$ is the vector of regression coefficients that explains the effects (in log scale) of the covariates on the j^{th} component.
Considering the above, 2 models are analyzed:
Model 1 (M1): $\mathsf{\text{log}}\left({\lambda}_{ij}\right)={\alpha}_{j}^{M1}+{\beta}_{j}^{M1}{g}_{i}^{k}$(base model)
Model 2 (M2):$\mathsf{\text{log}}\left({\lambda}_{ij}\right)={\alpha}_{j}^{M2}+{\beta}_{j}^{M2}{g}_{i}^{k}+\mathbf{\text{FA}}{\mathbf{\text{M}}}_{i}{\delta}_{j}^{M2}$ (adjusted model)
Here ${g}_{i}^{k}$ represents the number of copies of the minor allele on the k^{th} single-nucleotide polymorphism (SNP) for the i^{th} individual under an additive genetic model; $\mathbf{\text{FA}}{\mathbf{\text{M}}}_{i}$ is the i^{th} row of contrast matrix for the pedigree number considered as a categorical variable and ${\mathit{\theta}}_{j}^{h}={\left({\alpha}_{j}^{h},{\beta}_{j}^{h},{{\delta}^{h}}_{j}^{t}\right)}^{t}$ is the vector of regression coefficients on the j^{th} component. (Note ${\mathit{\delta}}_{j}^{M1}=0)$.
Our interest in these models lies in the potential genetic effect of each marker on the proposed response. To assess this, Wald statistics were used to test the null hypothesis of no association between each SNP and the response, ${\mathsf{\text{H}}}_{0}:\mathit{\beta}=0$ (vs.${\mathsf{\text{H}}}_{A}:\mathsf{\text{not}}\phantom{\rule{0.3em}{0ex}}{\mathsf{\text{H}}}_{0}$), $\mathit{\beta}=\left({\beta}_{1},{\beta}_{2},{\beta}_{3}\right)$.
Gene-based association
where $l\left(\mathit{\eta};\mathit{Y},\mathit{G}\right)$ represents the log-likelihood of a dirichlet distributed sample with response matrix $Y={\left({y}_{1}^{t},\dots ,{y}_{n}^{t}\right)}^{t};$ $G={\left({g}_{1}^{t},\dots ,{g}_{n}^{t}\right)}^{t}$ (or $G:\mathbf{\text{FAM}}$ for M2) is the design matrix, ${\mathit{g}}_{i}=\left(1,{g}_{i}^{1},\dots ,{g}_{i}^{p}\right);$ $p$ denotes the number of markers considered for variable selection; $k$ is the number of states; $\mathit{\eta}$ is the regression coefficients vector; $c$ and $\lambda $ are parameters for the penalized regression; and ${\u2225{\eta}_{.l}\u2225}_{2}={\left({\sum}_{j=1}^{k}{\eta}_{lj}^{2}\right)}^{1/2}$ and are the penalty norms. It is important to note that when $c=1$ we have a ridge regression penalty, whereas when $c=0$ we have a lasso penalty. We implement the variable selection for the penalized dirichlet regression using R code provided on the Statistical Genetics and Genomics Laboratory at the University of Pennsylvania webpage [5].
Results
Data quality results
The Genetic Analysis Workshop 18 (GAW18) data consists of 855 individuals with genotype and phenotype information. As a result of missing data, transition probabilities are estimated for only 835 individuals. Of these, 43 are removed because of low call rate. The overall call rate for the remaining 792 individuals is 99.82%.
The Genome Wide Association Study (GWAS) data includes 65,519 SNPs for chromosome 3, of which 59 are excluded because it is not possible to reliably obtain position information for these markers. The remaining 65,460 SNPs are considered for data quality control. Because of a low genotyping rate, 114 markers are removed; none are excluded by the Hardy-Weinberg equilibrium test; and 13,011 markers are removed because of low minor allele frequency. The remaining 52,357 markers are considered for analysis.
Analysis results
Selected transition models
Transition | Model |
---|---|
1→$j$ | $\mathsf{\text{log}}\left({y}_{1j}/{y}_{11}\right)={\gamma}_{1j0}+\left({\gamma}_{1j1}{x}_{sex}+{\gamma}_{1j2}{x}_{smoke}+{\gamma}_{1j3}{x}_{age}\right)*time\phantom{\rule{2.77695pt}{0ex}}j=2,3$ |
2→$j$ | $\mathsf{\text{log}}\left({y}_{2j}/{y}_{22}\right)={\gamma}_{2j0}+{\gamma}_{2j1}{x}_{sex}+{\gamma}_{2j2}{x}_{smoke}+{\gamma}_{2j3}{x}_{age}+time\phantom{\rule{2.77695pt}{0ex}}j=1,3$ |
3→$j$ | $\mathsf{\text{log}}\left({y}_{3\mathsf{\text{j}}}/{y}_{33}\right)={\gamma}_{3l0}+{\gamma}_{3j1}{x}_{sex}+{\gamma}_{3j2}{x}_{smoke}+{\gamma}_{3j3}{x}_{age}\phantom{\rule{0.3em}{0ex}}j=1,2$ |
Association analysis results
SNP | Gene | MA | MAF (%) | pValue (M1) | pValue (M2) |
---|---|---|---|---|---|
rs12492830 | PCCB | C | 7.22 | 1.1 × 10^{−7} | 3.9 × 10^{−8} |
Comparison of penalized regression under different levels of $c$
No. of parameters selected (iterations for convergence) | |||||||
---|---|---|---|---|---|---|---|
Data | # SNPs | $c=$ | 0 | 0.3 | 0.5 | 0.7 | 1 |
GWAS | 22 | 10 (260) | 10 (148) | 8 (135) | 7 (163) | 7 (174) | |
GENO | 607 | 42 (427) | 35 (199) | 24 (176) | 25 (136) | 19 (115) |
Discussion
The present work implements a multistate transition model that conveniently accommodates the longitudinal data structure. Whether the information contained by the available clinical variables is sufficient for predicting the hypertensive state is debatable, however.
Although the adjusted model (M2) is an improvement over the base model (M1), neither of the described models accounts for correlation between individuals nor heteroscedasticity. One way to possibly overcome this is to incorporate a latent variable into the model. Such an extension follows.
Model 3 (M3):$\mathsf{\text{log}}\left({\lambda}_{ij}\right)={\alpha}_{j}^{M3}+{\beta}_{j}^{M3}{g}_{i}^{k}+{u}_{i}$ where ${u}_{i}$ is the i^{th} element of a vector $\mathit{u}$ that follows a $\mathsf{\text{MVN}}\left(\mathbf{0},\mathbf{\text{K}}\right)$ distribution; here $\mathbf{\text{K}}$ is twice the estimated kinship matrix. In this case, however, the estimation of the parameters of interest, ${\beta}_{j}$, is not straightforward. Further research of this methodology is warranted.
With respect to the penalized regression, to avoid an arbitrary selection of c,a cross-validation method could be implemented.
Conclusions
We propose a methodology that conveniently uses the longitudinal data structure to define a probabilistic outcome, which, we believe, explains hypertension in a more suitable way. Dirichlet regression provides an interesting approach that, along with other more common responses, can be successfully used in genetic association analysis. Our model finds a statistically significant marker at the standard significant level for GWAS, which is noteworthy, considering that it is often difficult to find association. Moreover, when the penalized method is used on the GENO data we are able to find significant markers in addition to those have already found using GWAS data.
Declarations
Acknowledgements
In a personal communication, 2012, Daniel Merico (TCAG/Sick Kids) provided Gene annotation files. OEG thanks Ian Jones for his valuable support in the manuscript review. Thanks to the reviewers for their comments and suggestions. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
Authors’ Affiliations
References
- Winegarden CR: From "Prehypertension" to Hypertension? Additional Evidence. Ann Epidemiol. 2005, 15: 720-725. 10.1016/j.annepidem.2005.02.010.View ArticlePubMedGoogle Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, et al: PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.PubMed CentralView ArticlePubMedGoogle Scholar
- Kalbfleisch JD, Lawless JF: The analysis of panel data under a Markov assumption. J Am Statist Assoc. 1985, 80: 863-871. 10.1080/01621459.1985.10478195.View ArticleGoogle Scholar
- Campbell G, Mosimann JE: Multivariate methods for proportional shape. ASA Proceedings of the Section on Statistical Graphics. 1987, Washington, DC, 10-17.Google Scholar
- Chen J, Li H: Variable selection for Dirichlet-multinomial regression for identifying covariates that are associated with microbiomes. Ann Appl Stat. 2013, 7: 418-442. 10.1214/12-AOAS592.View ArticleGoogle Scholar
- Venables WN, Ripley BD: Modern Applied Statistics with S. 2002, New York, Springer, 4View ArticleGoogle Scholar
- Maier MJ: DirichletReg: Dirichlet Regression in R. Ver. 0.4-0, [http://dirichletreg.r-forge.r-project.org]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.