Genetic association analysis for common variants in the Genetic Analysis Workshop 18 data: a Dirichlet regression approach

We propose a genetic association analysis using Dirichlet regression to analyze the Genetic Analysis Workshop 18 data. Clinical variables, arranged in a longitudinal data structure, are employed to fit a multistate transition model in which the transition probabilities are served as a response in the proposed analysis. Furthermore, a gene-based association analysis via penalized regression is implemented using the markers at a single-nucleotide polymorphism level that we previously identified via nonpenalized Dirichlet regression.


Background
Genetic association analyses have had tremendous successes in recent years; however, most of these analyses were based on binary or continuous responses. Thus we propose a multivariate response vector indicating probabilities of transitions to predefined hypertensive states. This enables us to reflect the inherent uncertainty involved in the probability that a patient will transfer to a given state.
An important feature of our approach is the incorporation of prehypertension as an intermediate state. As Winegarden argues, prehypertension blood pressure in young patients helps predict the development of hypertension [1].

Definition of response
We defined a response summarizing the phenotype information into a vector that will be used in a genetic association analysis. The response is defined as a 3-dimensional vector of probabilities y = y 1 , y 2 , y 3 , y j = 1 y 1 , y 2 , y 3 , y j = 1, such that each element measures the probability of a transition to a blood pressure level (normotensive, prehypertensive, or hypertensive) given the previous level. The analysis was done without the knowledge of the underlying simulation model and we used the real phenotype data only.

Data quality control
Data quality control was performed in PLINK [2]. We only considered the data from chromosome 3 for analysis. We used a call rate for individuals of 95%, a Hardy-Weinberg disequilibrium test at a significance level of 1 × 10 −6 , and a missing rate of 95% for each marker. Markers with a minor allele frequency of at least 5% were retained for analysis. Additionally, all individuals' time points with at least 1 missing clinical variable were excluded.

Multistate transition model
We describe hypertension, our trait of interest, using a 3-state model based on recorded blood pressure levels for each individual at each examination. The states are defined as follows: normal blood pressure (state 1) when the systolic blood pressure is less than 120 mm Hg and diastolic blood pressure is less than 80 mm Hg; prehypertension (state 2) when the blood pressure level is not in state 1, the systolic blood pressure is less than 140 mm Hg, and the diastolic blood pressure is less than 90 mm Hg; and hypertension (state 3) for all other cases. Also, if a patient used antihypertensive medication, the state assigned at that examination is hypertension (state 3) regardless of the recorded blood pressure levels. Once the states are defined, we consider a multistate transition model; it is important to note that all 9 transitions are possible.
Our interest in transition models lies in estimating of the transition probabilities as defined in Kalbfleisch and Lawless [3] which are given by where {S i (r) , r = 1, 2, . . .} and x i (r) = (x i1 (r) , . . . , x ip (r)), r = 1, 2, . . . denote the observed state and the covariates for subject i at the r th examination respectively. This model takes advantage of the longitudinal data structure and the definition of the response follows naturally. To estimate the transition probabilities, we fit a multinomial regression model, based on covariates (gender, smoking status and age) and the state at the previous examination.
To get expressions for y il = y il1 , y il2 , y il3 , l = 1, 2, 3, we consider a generalized logit model of the form is the observed vector of covariates for subject i plus a categorical variable denoting the effect of examination time in the model (and possible interactions), and γ lj is the vector of coefficients for the corresponding multinomial regression model.
Thus, a transition probability matrix (TPM) is defined for each individual as follows Therefore, the response for subject i is a row taken from TPM i and is determined by conditioning on the patient's last available observed state and covariates.

Dirichlet regression
Once the response is modeled our objective is to determine whether there is an association between it and the genotypes. We assess this association using Dirichlet regression [4], which suits this response structure. The advantage of this approach lies in its tractability in dealing with the proposed response. It also allows a more comprehensive understanding of the genetic effect on the expression of hypertension, and therefore in its possible interpretation. For instance, if a signal was detected for a marker, it would suggest an association between the marker and the transition of blood pressure states jointly rather than a single level. Therefore, the Dirichlet approach is more informative in the sense of explaining the plausibility of each defined state.
To relate the genetic information and the defined response under a Dirichlet regression approach, the likelihood given each individual's vector of covariates, s i, is (·) is the gamma function. The parameters, λ j (s i ), are defined in terms of a linear predictor using a logarithm link, where M is the number of covariates included in the model and β j is the vector of regression coefficients that explains the effects (in log scale) of the covariates on the j th component.
Considering the above, 2 models are analyzed: Here g k i represents the number of copies of the minor allele on the k th single-nucleotide polymorphism (SNP) for the i th individual under an additive genetic model; FAM i is the i th row of contrast matrix for the pedigree number considered as a categorical variable and θ h j = (α h j , β h j , δ h t j ) t is the vector of regression coefficients on the j th component. (Note δ M1 j = 0). Our interest in these models lies in the potential genetic effect of each marker on the proposed response. To assess this, Wald statistics were used to test the null hypothesis of no association between each SNP and the response, H 0 : β = 0 (vs.H A : not H 0 ), β = (β 1 , β 2 , β 3 ).

Gene-based association
Once we identify significant SNPs through the genetic association analysis as described above, we proceed to perform the analysis at a gene level. To achieve this, we propose a penalized regression. Including all the markers simultaneously, this penalized method aims to select those SNPs with higher association. The analysis is done on those candidate genes that contain at least 1 significant marker that has already been determined.

Variable selection on the SNPs is assessed via a penalized likelihood of the form
where l (η; Y, G) represents the log-likelihood of a dirichlet distributed sample with response matrix G = (g t 1 , . . . , g t n ) t G = (g t 1 , . . . , g t n ) t (or G : FAM for M2) is the design matrix, g i = (1, g 1 i , . . . , g p i ); p denotes the number of markers considered for variable selection; k is the number of states; η is the regression coefficients vector; c and λ are parameters for the penalized regression; and η .l 2 = k j=1 η 2 lj 1/2 and are the penalty norms. It is important to note that when c = 1 we have a ridge regression penalty, whereas when c = 0 we have a lasso penalty. We implement the variable selection for the penalized dirichlet regression using R code provided on the Statistical Genetics and Genomics Laboratory at the University of Pennsylvania webpage [5].

Data quality results
The Genetic Analysis Workshop 18 (GAW18) data consists of 855 individuals with genotype and phenotype information. As a result of missing data, transition probabilities are estimated for only 835 individuals. Of these, 43 are removed because of low call rate. The overall call rate for the remaining 792 individuals is 99.82%.
The Genome Wide Association Study (GWAS) data includes 65,519 SNPs for chromosome 3, of which 59 are excluded because it is not possible to reliably obtain position information for these markers. The remaining 65,460 SNPs are considered for data quality control. Because of a low genotyping rate, 114 markers are removed; none are excluded by the Hardy-Weinberg equilibrium test; and 13,011 markers are removed because of low minor allele frequency. The remaining 52,357 markers are considered for analysis.

Analysis results
The parameter estimates for the transition models are obtained using R [6]. To test examination time effect; likelihood ratio tests are performed in which the null model considers only the available clinical variables. Table 1 presents the final transition models.
After the response is estimated, models M1 and M2 are fit using R [7] for each available SNP. Figure 1 displays the Manhattan plots for the p values that result from testing the null hypotheses of no association between the markers and the response. The graphs show that only 1 marker under M2 is significant at the standard significance level for GWAS (5 × 10 −8 ). Interestingly, the same marker is the most significant marker under M1, although it is not significant at the standard threshold. This suggests that the adjustment for family incorporated in M2 accounts for the family structure in the data. Also, the proposed methodology demonstrates consistency in that the same marker proves to be the most significant under both models. Table 2 summarizes these findings.
Once significant markers were identified, a gene-level association analysis is performed using the penalized regression described above for different levels of c (0, 0.3, 0.5, 0.7, and 1). The analysis is conducted utilizing both the GWAS and the dosage imputed genotypes (GENO) information as the explanatory variables. Figure 2 shows the penalized regression results for the gene containing the significant SNP (rs12492830) for c = 0.5 only. This level of c is a blended penalty function, equally weighting the ridge and lasso penalties. Table 3 shows the results for different levels of c under M2 for gene PCCB.

Discussion
The present work implements a multistate transition model that conveniently accommodates the longitudinal data structure. Whether the information contained by the available clinical variables is sufficient for predicting the hypertensive state is debatable, however.
Although the adjusted model (M2) is an improvement over the base model (M1), neither of the described models accounts for correlation between individuals nor heteroscedasticity. One way to possibly overcome this is to incorporate a latent variable into the model. Such an extension follows.
Model 3 (M3):log λ ij = α M3 j + β M3 j g k i + u i where u i is the i th element of a vector u that follows a MVN(0, K) distribution; here K is twice the estimated kinship matrix. In this case, however, the estimation of the parameters of interest, β j, is not straightforward. Further research of this methodology is warranted.
With respect to the penalized regression, to avoid an arbitrary selection of c,a cross-validation method could be implemented.

Conclusions
We propose a methodology that conveniently uses the longitudinal data structure to define a probabilistic outcome, which, we believe, explains hypertension in a more suitable way. Dirichlet regression provides an interesting log y 1j /y 11 = γ 1j0 + γ 1j1 x sex + γ 1j2 x smoke + γ 1j3 x age * time j = 2, 3 2 j log y 2j /y 22 = γ 2j0 + γ 2j1 x sex + γ 2j2 x smoke + γ 2j3 x age + time j = 1, 3 3 j log y 3j /y 33 = γ 3l0 + γ 3j1 x sex + γ 3j2 x smoke + γ 3j3 x age j = 1, 2     approach that, along with other more common responses, can be successfully used in genetic association analysis. Our model finds a statistically significant marker at the standard significant level for GWAS, which is noteworthy, considering that it is often difficult to find association. Moreover, when the penalized method is used on the GENO data we are able to find significant markers in addition to those have already found using GWAS data.