Volume 2 Supplement 4

## Selected Proceedings of Machine Learning in Systems Biology: MLSB 2007

- Proceedings
- Open Access
- Published:

# A marginalized variational bayesian approach to the analysis of array data

*BMC Proceedings***volume 2**, Article number: S7 (2008)

## Abstract

### Background

Bayesian unsupervised learning methods have many applications in the analysis of biological data. For example, for the cancer expression array datasets presented in this study, they can be used to resolve possible disease subtypes and to indicate statistically significant dysregulated genes within these subtypes.

### Results

In this paper we outline a marginalized variational Bayesian inference method for unsupervised clustering. In this approach latent process variables and model parameters are allowed to be dependent. This is achieved by marginalizing the mixing Dirichlet variables and then performing inference in the reduced variable space. An iterative update procedure is proposed.

### Conclusion

Theoretically and experimentally we show that the proposed algorithm gives a much better free-energy lower bound than a standard variational Bayesian approach. The algorithm is computationally efficient and its performance is demonstrated on two expression array data sets.

## Background

Unsupervised clustering methods from machine learning are very appropriate in extracting structure from biological data sets. There has been extensive work in this direction using hierarchical clustering analysis [1], *K*-Means clustering [2] and self-organizing maps [3]. Bayesian methods are an effective alternative since they provide a mechanism for inferring the number of clusters. They can easily incorporate priors which penalise over-complexed models which would fit to noise and they allow probabilistic confidence measures for cluster membership. In this paper, we focus on Bayesian models which use Dirichlet priors. Examples of these models include Latent Dirichlet Allocation [4] (LDA) for use in text modeling and Latent Process Decomposition (LPD) [5] for analysis of microarray gene expression datasets. One appealing feature of the latter models is that each data point can be stochastically associated with multiple clusters. One approach to model inference is to use methods such as Markov Chain Monte Carlo and Gibbs sampling. However, for the large datasets which occur in many biomedical applications these methods can be too slow for certain tasks such as model selection. This motivates our interest in computationally efficient variational inference methods [4–6].

Typically, these inference methods posit that all the latent variables and model parameters are *independent* of each other (i.e. a fully factorized family) which is a strong assumption. In this paper we propose and study an alternative inference method for LPD, which we call marginalized variational Bayesian (MVB). In this approach the latent process (cluster) variables and model parameters are allowed to be *dependent* on each other. As we will show in the next section, this assumption is made feasible by marginalizing the mixing Dirichlet variables, and then performing inference in the reduced variable space. This new approach to constructing an LPD model theoretically and experimentally provides much better free-energy lower bounds than standard a variational Bayes (VB) approach [6, 7]. Moreover, the algorithm is computationally efficient and converges faster, as we demonstrate with experiments using expression array datasets.

## Methods

### The LPD probabilistic model

We start by recalling LPD [5]. Let *d* index samples, *g* the genes (attributes) and *k* the soft clusters (samples are represented as combinatorial mixtures over clusters). The numbers of clusters, genes and samples are denoted $\mathcal{K}$, $\mathcal{G}$, and $\mathcal{D}$ respectively. For each data *E*_{
d
}, we have a multiple process (cluster) latent variable *Z*_{
d
}= {*Z*_{
dg
}: *g* = 1,..., $\mathcal{G}$} where each *Z*_{
dg
}is a $\mathcal{K}$-dimensional unit-basis vector, i.e., choosing cluster *k* is represented by *Z*_{dg, k}= 1 and *Z*_{dg, j}= 0 for *j* ≠ *k*, otherwise. Given the mixing coefficient *θ*_{
d
}, the conditional distribution of *Z*_{
d
}is given by $p({Z}_{d}|{\theta}_{d})={\displaystyle {\prod}_{g,k}{\theta}_{dk}^{{Z}_{dg,k}}}$. The conditional distributions, given the latent variables, is given by $p({E}_{d}|{Z}_{d},\mu ,\beta )={\displaystyle {\prod}_{g,k}{[\mathcal{N}({E}_{dg}|{\mu}_{gk},{\beta}_{gk})]}^{{Z}_{dg,k}}}$, where $\mathcal{N}$ is the Gaussian distribution with mean *μ* and precision *β*.

Now we introduce conjugate priors over parameters *θ*, *μ*, *β*. Specifically, we choose *p*(*θ*_{
d
}) = Dir(*θ*_{
d
}|*α*), and $p(\mu )~{\displaystyle {\prod}_{g,k}\mathcal{N}({\mu}_{gk}|{m}_{0},{v}_{0})}$, and *p*(*β*) distributed as ∏_{
gk
}Γ(*β*_{
gk
}|*a*_{0}, *b*_{0}) where Γ is defined by $\Gamma (x|{a}_{0},{b}_{0})={x}^{{a}_{0}-1}\mathrm{exp}\left(-\frac{x}{b}\right)/{b}_{0}^{{a}_{0}}\Gamma ({b}_{0})$. We assume the data is i.i.d. and let Θ = {*μ*, *β* }. The joint distribution is given by

One can easily see that the marginal likelihood *p*(*E*|Θ) is the same as that in [5]. It is important to note that, in standard Gaussian mixture models [8], each data point is only related with a $\mathcal{K}$-dimensional latent variable which restricts the data to being in one cluster. Instead, in LPD each data point *E*_{
d
}is associated with multiple latent variables *Z*_{
d
}= {*Z*_{
dg
}: *g* = 1,..., $\mathcal{G}$}, and thus *E*_{
d
}is stochastically associated with multiple clusters.

### Marginalized variational Bayes

In this section we describe a marginalized variational Bayesian approach for LPD. The target of model inference is to compute the posterior distribution *p*(*θ*, *Z*, Θ |*E*) = *p*(*E*, *θ*, *Z*|Θ)*p*(Θ)/*p*(*E*). Unfortunately, this involves computationally intensive estimation of the integral in the evidence *p*(*E*). Hence, we approximate the posterior distribution in a *hypothesis family* whose element are denoted by *q*(*θ*, *Z*, Θ).

The standard variational bayesian method [7, 10] uses the equality:

Our optimization target is to maximize the free-energy: ${\mathbb{E}}_{q}\left[\mathrm{log}\frac{p(E,\theta ,Z|\Theta )p(\Theta )}{q(\theta ,Z,\Theta )}\right]$ which, equivalently, minimizes the KL-divergence. One standard way is to choose the hypothesis family in a factorized form *q*(*θ*, *Z*, Θ) = *q*(*θ*)*q*(*Z*)*q*(Θ). In this setting, the free-energy lower bound (2) for the likelihood can be written as:

In this paper we study an alternative approach motivated by [9] which only marginalizes the latent variable *θ* and do variational inference only with respect to the leftover latent variable *Z*. In essence, we assume that the latent variables *θ* can be dependent on *Z*, Θ and the hypothesis family is chosen in the form of *q*(*θ*, *Z*, Θ) = *q*(*θ* |*Z*, Θ)*q*(*Z*)*q*(Θ). Since the distribution *q*(*θ*|*Z*, Θ) is arbitrary, let it be equal to $p(\theta |E,Z,\Theta )=\frac{p(E,\theta ,Z,\Theta )}{p(E,Z,\Theta )}$. Putting this into equation (2) and observing that $\frac{p(E,\theta ,Z|\Theta )}{p(\theta |E,Z,\Theta )}p(E,Z|\Theta )$ gives

Therefore, it is sufficient to maximize the lower bound

Observe that log $\frac{p(E,Z|\Theta )}{q(Z)}\ge {\displaystyle \int q(\theta )\mathrm{log}\frac{p(E,\theta ,Z|\Theta )}{q(Z)q(\theta )}}d\theta $. Consequently, we see that

As mentioned above, since *θ* can be dependent on *Z*, Θ, marginalized VB (MVB) yields a tighter lower bound for the likelihood than the standard VB approach in [6], thus potentially yielding better clustering results.

### Model inference and learning

We now turn our attention to the derivation of updates for marginalized VB following the inference methodology [7, 10]. For simplicity, let the posterior distribution *q*(*Z*), *q*(*μ*), *q*(*β*) be indexed by parameters. Specifically, we assume that $q(Z)={\displaystyle {\prod}_{d,g,k}{r}_{dg,k}^{{Z}_{dg,k}}}$, $q(\mu )={\displaystyle {\prod}_{g,k}\mathcal{N}({\mu}_{gk}|{m}_{gk},{v}_{gk})}$, and *q*(*β*) = ∏_{g, k}Γ(*β*_{
gk
}|*a*_{
gk
}, *b*_{
gk
}). Correspondingly, the free-energy lower bound ℒ(*q*(*Z*), *q*(Θ)) in equation (6) becomes a variational functional over these parameters, and hence we use ℒ(*R*, *μ*, *β*) later on. The model inference can be summarized by the following coordinate ascent updates.

Let *Z*^{\dg}denote the random variables excluding *Z*_{
dg
}. For any *d*, *g* let Θ and *Z*^{\dg}be fixed, then we take the functional derivative of the free-energy ℒ(*q*(*Z*), *q*(Θ)) w.r.t. *q*(*Z*_{
dg
}) and obtain the update:

For the updates for *q*(Θ), we obtain

Marginalizing out *θ* in (1) yields

Estimating the expectations of the log likelihoods in equations (8) and (9), we derive the variational EM-updates as follows. Details are postponed to the Appendix.

**E-step**: using equation (8) and denoting the digamma function by *ψ*, we have

where *N*_{dg, k}is given by 0.5(*ψ*(*a*_{
gk
})+log *b*_{
gk
}) - 0.5*a*_{
gk
}*b*_{
gk
}((*E*_{
dg
}- *m*_{
gk
})^{2} + ${v}_{gk}^{-1}$) and *r*_{dg, k}should be normalized to one over *k*.

**M-step**: using equation (9):

We pursue the above iterative procedure until convergence of the lower bound ℒ(*R*; Θ) whose evaluation is given in the Appendix. Since *Z*_{dg, k}determines the cluster for the observed data point *E*_{
d
}at attribute *g* and *r*_{dg, k}is its expectation, we intuitively assign data *E*_{
d
}to cluster arg max{∑_{
g
}*r*_{dg;k}: *k* = 1,...., $\mathcal{K}$}. We can also do model selection over the number of clusters based on a free energy lower bound of the marginalized VB. Experiments in the next section show that this approach is reasonable.

## Results

We ran marginalized VB on three data sets. The first was the wine data set from the UCI Repository [11]: this has 178 samples and each sample has 13 features. This data set was chosen for the purpose of validating the proposed method since there are 3 distinct clusters present (derived from 3 cultivars). As more biologically relevant examples we then selected two cancer expression array datasets. The first of these was a lung cancer data set [12] consisting of 73 samples and 918 features. The second was a leukemia data set [13] with 90 samples and 500 features. All the data sets were normalized to zero mean and unit variance and the hyper-parameters *m*_{0}, *v*_{0}, *a*_{0}, and *b*_{0} were chosen to have the same values in both standard VB and marginalized VB. Since the datasets are normalized and *m*_{0}, *v*_{0} are hyper-parameters of the Gaussian prior distribution over the mean for the data, it is reasonable to choose *m*_{0} = 0, *v*_{0} = 1. For similar reasons, given *a*_{0}, *b*_{0} are hyper-parameters of the Gamma prior distribution over the precision (inverse variance) of the data and the mean of a Gamma distributed random variable is *a*_{0}*b*_{0}, we chose *a*_{0} = 20 and *b*_{0} = 0.05 throughout these experiments.

First we compared the free energy lower bound of marginalized VB and standard VB based on 30 random initialization. In Figure 1 (top row) we observe an improvement in the free energy as a function of iteration step, for marginalized VB over standard VB. In analogy to standard VB, marginalized VB can determine the appropriate number of soft clusters by estimating the free energy bound given by equation (6) in contrast to the hold-out cross-validation procedure for a maximum likelihood approach to LPD [5]. To investigate the effectiveness of this approach to model selection, free energies were averaged over 20 runs from different random initializations. As shown in Figure 1 (middle row), marginalized VB performed well in determining the correct number of clusters (three) in the UCI wine data set. For the cancer array datasets, the peak in the averaged free energy is less marked with an indication of six soft clusters for the leukemia data set and seven clusters for the lung cancer data set.

In the bottom row of Figure 1, we see that marginalized VB shows quite promising clustering results using the normalized ∑_{
g
}*r*_{dg, k}: these peaks indicate the confidence in the allocation of the *d*th sample to the *k*th cluster and accord well with known classifications. The lung cancer dataset of Garber *et al* [12] (middle column, Figure 1) consisted of 73 gene expression profiles from normal and tumour samples with the tumours labelled as squamous, large cell, small cell and adenocarcinoma. The samples are in the order in which they are presented in the original paper [12] with the dashed lines showing their original principal sample groupings. As with Garber *et al* [12] we identified seven clusters in the data with the adenocarcinoma samples falling into three separate clusters with strong correlation with clinical outcomes. For their ordering (which we follow) samples 1–19 belong to adenocarcinoma cluster 1, samples 20–26 belong to adenocarcinoma cluster 2, samples 27–32 are normal tissue samples, samples 33–43 are adenocarcinoma cluster 3, samples 44–60 are squamous cell carcinomas, samples 61–67 are small cell carcinomas and samples 68–73 are from large cell tumours.

As our last example, we applied the proposed MVB method to an oligonucleotide microarray dataset from 360 patients with acute lymphoblastic leukemia (ALL) from Yeoh *et al* [13]. ALL is known to have a number of subtypes with variable responses to chemotherapy. In many cases fusion genes are implicated in the genesis of the disease. For the Yeoh *et al* [13] dataset, samples were drawn from leukemias with rearrangements involving *BCR-ABL*, *E2A-PBX1*, *TEL-AML1*, rearrangements of MLL gene, hyperdiploid karyotope (more than 50 chromosomes) and T lineage leukemias (*T-ALL*). The free energy is plotted in Figure 1 (right column, middle row) with a peak suggesting 6 subtypes. The dashed lines represent the original demarcations of groups according to known genetic rearrangement. Samples 1–15 are *BCR-ABL*, 16–42 are *E2A-PBX1*, 43–106 *Hyperdiploid* > 50, 107–126 *MLL*, 206–248 *T-ALL*, 249–327 *TEL-AML1*, 328–335 *Group23* and 127–205 are labelled as *Others*. Some groupings, such as *E2A-PBX1*, are very distinct. However, the overall groupings are not as well defined as with lung cancer.

## Conclusion

We have proposed an efficient variational Bayesian inference method for LPD probabilistic models. By allowing the variables to be dependent on each other, the method can provide more accurate approximation than standard VB. Also, the method provides a principled approach to model selection via the free energy bound. Promising clustering results were also reported on lung cancer and leukemia data sets. Extensions of this method to semi-supervised clustering will be reported elsewhere.

## Appendix

In this appendix we derive the EM-updates and free energy bound for MVB.

### Derivation of updates

Noting that, for any *d*, *g*, ∑_{
k
}*Z*_{dg, k}= 1 and denoting the number of features by $\mathcal{G}$ we obtain from equation (10):

Since $\Gamma \left({\alpha}_{k}+{\displaystyle {\sum}_{{g}^{\prime}\ne g}{Z}_{d{g}^{\prime},k}+{Z}_{dg,k}}\right)={\left({\alpha}_{k}+{\displaystyle {\sum}_{{g}^{\prime}\ne g}{Z}_{d{g}^{\prime},k}}\right)}^{{Z}_{dg,k}}\Gamma \left({\alpha}_{k}+{\displaystyle {\sum}_{{g}^{\prime}\ne g}{Z}_{d{g}^{\prime},k}}\right)$, putting this observation into the log *p*(*E*, *Z*|Θ) yields:

where *constant terms* are independent of *Z*_{dg, k}. Hence, substituting this into equation (8) we conclude that

To estimate the expectation of the Normal distribution, we use the following observations (e.g. [7]) for the Gamma and Normal distributions:

and

Consequently, simple manipulation yields:

equals, up to a constant term:

We also use approximating methods [9] to estimate log ${\mathbb{E}}_{q\backslash dg}\left[\mathrm{log}({\alpha}_{k}+{\displaystyle {\sum}_{{g}^{\prime}\ne g}{Z}_{d{g}^{\prime},k}})\right]$. For this purpose, we observe, for any positive random variable *x*, that

and ${\mathbb{E}}_{q\backslash dg}\left[{\displaystyle {\sum}_{{g}^{\prime}\ne g}{Z}_{d{g}^{\prime},k}}\right]={\displaystyle {\sum}_{{g}^{\prime}\ne g}{r}_{d{g}^{\prime},k}}$, $\text{Var}\left({\displaystyle {\sum}_{{g}^{\prime}\ne g}{Z}_{d{g}^{\prime},k}}\right)={\displaystyle {\sum}_{{g}^{\prime}\ne g}{r}_{d{g}^{\prime},k}}(1-{r}_{d{g}^{\prime},k})$. Plugging the above observations into equation (17) yields the desired E-step updates.

For the updates for *q*(Θ), the updates are essentially the same as those in [6, 7] since the associated terms with variables with Θ in ${\mathbb{E}}_{q\backslash \mu}\left[\mathrm{log}p(E,Z|\Theta )\right]$[log *p*(*E*, *Z*|Θ)] are exact the same, that is, Θ only appears in the Normal distribution. Hence, noting that the product of two Gamma (Normal) distributions is a Gamma (Normal) distribution, we can obtain, from equations (16) and (9), the M-step updates.

### Free energy bound

The free-energy lower bound of marginalized VB is defined by equation (6):

From the fact that Γ(*x* + 1) = *x*Γ(*x*) for any *x* > 0, we know that $\Gamma ({\alpha}_{k}+{\displaystyle {\sum}_{g}{Z}_{dg,k}})=\Gamma ({\alpha}_{k}){\displaystyle {\prod}_{g=1}^{\mathcal{G}}{({\alpha}_{k}+{\displaystyle {\sum}_{j=g+1}^{\mathcal{G}}{Z}_{dj,k}})}^{{Z}_{dg,k}}}$, where we use the convention ${\sum}_{j=\mathcal{G}+1}^{\mathcal{G}}=0$. Putting this equation into the expression (10) of log likelihood, we obtain:

Since we used the convention ${\sum}_{j\ge \mathcal{G}+1}{Z}_{dj,k}=0$, ${\mathbb{E}}_{q}\left[\mathrm{log}\left({\alpha}_{k}+{\displaystyle {\sum}_{j\ge \mathcal{G}+1}{Z}_{dj,k}}\right)\right]=\mathrm{log}{\alpha}_{k}$. It remains to estimate the term ${\mathbb{E}}_{q}\left[\mathrm{log}\left({\alpha}_{k}+{\displaystyle {\sum}_{j\ge g+1}{Z}_{dj,k}}\right)\right]$ for *g* = 1,..., $\mathcal{G}$ - 1. To this end, we use the approximation (18) again to get:

Consequently, we conclude:

where the convention ${\sum}_{g\ge \mathcal{G}+1}^{\mathcal{G}}=0$ is used again.

In addition,

For the KL divergences, we have:

and

## References

- 1.
Eisen MB, et al: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863.

- 2.
Tavazoie S, et al: Systematic determination of genetic network architecture. Nature Genetics. 1999, 22: 281-285. 10.1038/10343.

- 3.
Tamayo P, et al: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA. 1999, 96: 2907-2912. 10.1073/pnas.96.6.2907.

- 4.
Blei DM, Ng AY, Jordan MI: Latent Dirichlet Allocation. Journal of Machine Learning Research. 2003, 3: 993-1022. 10.1162/jmlr.2003.3.4-5.993.

- 5.
Rogers S, Girolami M, Campbell C, Breitling R: The Latent Process Decomposition of cDNA Microarray Datasets. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2005, 2: 143-156. 10.1109/TCBB.2005.29.

- 6.
Carrivick L, Campbell C: A Bayesian approach to the analysis of micoarray datasets using variational inference. Journal submission. 2007

- 7.
Attias H: A variational Bayesian framework for graphical models. Advances in Neural Information Processing Systems. 2002, 12: 209-215.

- 8.
Bishop CM: Pattern recognition and machine learning. 2006, (Series: Information Science & Statistics), Springer

- 9.
Teh YW, Newman D, Welling M: A collapsed variational bayesian inference algorithm for latent dirichlet allocation. Advances in Neural Information Processing Systems. 2006, 19: 1353-1360.

- 10.
Jordan MI, Ghahramani Z, Jaakkola T, Saul LK: An introduction to variational methods for graphical models. Machine Learning. 1999, 37: 183-233. 10.1023/A:1007665907178.

- 11.
Blake CL, Newman DJ, Hettich S, Merz CJ: UCI repository of machine learning databases. 1998, [http://www.ics.uci.edu/~mlearn/mlrepository.html]

- 12.
Garber E, et al: Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci U S A. 2001, 98: 13784-13789. 10.1073/pnas.221451398.

- 13.
Yeoh E-J, et al: Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002, 1: 133-143. 10.1016/S1535-6108(02)00032-6.

## Acknowledgements

This work is supported by EPSRC grant EP/E027296/1.

This article has been published as part of *BMC Proceedings* Volume 2 Supplement 4, 2008: Selected Proceedings of Machine Learning in Systems Biology: MLSB 2007. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/2?issue=S4.

## Author information

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors' contributions

YY and CC conceived the method and drafted the manuscript. PL and YY coded the algorithm and implemented the experiments. All authors read and approved the final manuscript.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

#### Published

#### DOI

### Keywords

- Free Energy
- Latent Dirichlet Allocation
- Model Inference
- Soft Cluster
- Lung Cancer Data