Identification of functional modules based on transcriptional regulation structure

Background Identifying gene functional modules is an important step towards elucidating gene functions at a global scale. Clustering algorithms mostly rely on co-expression of genes, that is group together genes having similar expression profiles. Results We propose to cluster genes by co-regulation rather than by co-expression. We therefore present an inference algorithm for detecting co-regulated groups from gene expression data and introduce a method to cluster genes given that inferred regulatory structure. Finally, we propose to validate the clustering through a score based on the GO enrichment of the obtained groups of genes. Conclusion We evaluate the methods on the stress response of S. Cerevisiae data and obtain better scores than clustering obtained directly from gene expression.


Background
An important step in analyzing gene functions is to cluster genes according to their expression patterns. Such clusters can then be analyzed in several ways, for example by assigning unannotated genes to the majority function of each cluster's genes (see [1] for a review).
However, this approach has several limitations. On the one hand, genes of similar expression patterns may not necessarily have the same or similar functions; on the other hand, genes with related functions may not show close correlation in their expression patterns. For example, a transcription factor can activate some genes and repress others in the same pathway.
The principal assumption of this paper is that unsupervised clustering of genes on the basis of similar regulators (activators/inhibitors) should assemble functional co-regulated groups of genes. To compute a similarity measure between genes as a function of inferred regulators of these genes, we use the output of a data mining algorithm called LICORN [2], that infers cooperative regulation relations from expression data only. The resulting similarity matrix between genes is considered as the adjacency matrix of a weighted graph. Clustering is then performed to find functional modules of genes in the network.
To objectively evaluate clustering, we use Gene Ontology to determine if the obtained clusters can be associated with terms of the Biological Process ontology. The strength of such an association is given by a p-value from an Hypergeometric test. We compare different clusterings by calculating a score based on the p-values which becomes greater when the significant p-values are smaller and more numerous.
In section Methods, we introduce our model of gene regulation and briefly describe a data mining algorithm for inferring large-scale cooperative gene regulation. We then propose a similarity measure for the genes based on the inferred regulator sets and define the score of a clustering. Finally, in section Results and discussion, we evaluate our system on a yeast data set.

Cooperative regulation networks
Let us denote by the set of genes with a known or putative regulation activity and as the set of genes without such an activity. The input of the mining method is a discretised expression matrix for genes of . Each expression value can take the value -1 (under-expressed), 0 (normal), or 1 (over-expressed). A gene regulatory network (GRN) associated with a target gene g is a pair (A, I), where is a co-activator set, and is a co-inhibitor set. The set of GRNs for all target genes can also be seen as a bipartite graph where the top layer contains regulators, the bottom layer contains target genes, and edges code for a regulatory interaction between regulators and target genes, each edge being labelled with a regulatory mode (i.e., activator or inhibitor). The regulation relations we are interested in are combinatorial: each target gene has a number of activators and/or inhibitors. Activators on one side and inhibitors on the other side are aggregated in our model through an extended logical AND, i.e., a regulator set S (activator or repressor) is over-expressed (resp. under-expressed) if and only if all the regulators in S are over-expressed (resp. under-expressed). Finally, we describe in Figure 1 a discrete function called Regulatory Program RP, which, given the combined states of activators A and inhibitors I of g in a sample s computes s (A, I), the estimated state of g in s. The main features of our regulation model are therefore the explicit representation of activation and repression relationships for a given target gene, and the representation of co-operative transcriptional regulation.

Learning algorithm
We have recently proposed [2] an original, scalable technique called LICORN for deriving co-operative regulations, in which many co-regulators act together to activate or repress a target gene. LICORN uses an original heuristic approach to accelerate the search for an appropriate structure for the regulation network. It first computes frequent co-regulator sets, i.e., regulator sets that frequently occur together as over (1) or under (-1)-expressed in the discretised expression matrix. This is done by using an extension of the Apriori algorithm [3] to handle both 1 and -1 supports (The x-support of a co-regulator C in the three-valued expression matrix is the set of samples that include all the regulators of C with the state x).
From this representation, a limited subset of candidate coregulator sets is then associated with each gene. The learning algorithm looks for each gene for regulator sets which have a high "overlap" with the target gene. Intuitively, the overlap constraint checks the size of the intersection between supports of the target gene and a given candidate co-regulator set. A candidate activator set for a target gene The regulatory program g is frequently over-expressed when g is over-expressed or frequently under-expressed when g is under-expressed. On the opposite, a candidate repressor set for a target gene g is frequently over-expressed when g is under-expressed and vice-versa. This search can be efficiently performed because of the property of anti-monotonicity of the overlap constraint with respect to set inclusion. Then, once a limited number of candidate activator and inhibitor sets have been obtained, exhaustive search for the best gene regulatory network can be performed. Finally, a permutation-based procedure is used for selecting statistically significant regulation relations. We have shown in [2] that the co-operative regulation patterns inferred by LICORN cannot be identified by clustering or pairwise methods, and are only partly revealed by constrained Bayesian or decision tree-based techniques, such as those used in previous studies [4,5].

Identification of functional co-regulation modules
Partial overlap of the regulator sets for a set of target genes can be used as an alternative measurement of the distance between genes.

Computation of the co-regulation matrix
We design the co-regulation matrix by using a similarity measure defined as follows: let λ ∈ [0, 1] and (g 1 , g 2 ) be two genes. The similarity between g 1 and g 2 is defined by where |A| and |I| are respectively the number of activators and inhibitors of both g 1 and g 2 , |AI| is the number of regulators which activate one gene and inhibit the other and |TF| is the number of transcription factors regulating at least one of the genes. This similarity considers two genes as being far appart (ϕ(g 1 , g 2 ) = 0) if they do not share any regulators. Two genes are considered most similar if their set of activators and inhibitors are exactly the same (ϕ(g 1 , g 2 ) = 1). In intermediate situations, λ represents the weight given to common regulators which have opposite effects.

Clustering
To cluster genes from the similarity matrix, we use the MCL algorithm [6,7]. That algorithm, based on the fluxes φ λ ( , ) , g g A I AI TF 1 2 = + + Score and number of p-values for λ varying from 0 to 1

Figure 2
Score and number of p-values for λ varying from 0 to 1.
in a graph, is well suited to weighted graphs and does not require any prior knowledge about the number of clusters. Moreover, it does not require any initial conditions and is therefore reproducible. The inflation parameter of the algorithm is fixed to 1.8, as suggested in [8].

Mapping to GO-terms
To assess the functional significance of obtained clusters, and suggest putative functions for genes with unknown functions, we calculate the enrichment of gene ontology (GO) [9].
To determine the over-represented GO terms in each cluster, we apply the R package GOstats [10] with a p-value cut-off of 5% and the biological process ontology. For each cluster C, we obtain a set T C of GO terms over-represented in C with a rate of 5% and a set of associated p-values {p t , t ∈ T C }.
We define the score of the clustering by where C t is the set of genes of C associated with the GO term t. Parameters c min and c max allow us to avoid clusters that are too small, which don't have a biological meaning, as well as too big ones, which don't have any functional unity.

Results and discussion
As a proof of concept, we used gene expression data sets for S. Cerevisiae. The Gasch data set [11] measures the response of yeast to 173 stress conditions for 6152 genes. We used a set of 237 known and putative transcription factors.
We applied LICORN and retained only those GRNs (gene regulatory networks) identified as significant with a 5% FDR level (see [2] for details). 2041 GRNs (of 5703 GRNs) were identified as significant. The structural organization of the learned GRNs has been shown to be consistent with recent advances [12] concerning the characterization of topological transcriptional network features in yeast and provide the first evidence of the relevance of inferred GRNs.
In order to choose the parameter λ for the similarity matrix, we computed the matrices and the associated clusterings for several values of λ and compared their scores with parameters c min = 5 and c max = 200. Figure 2 shows that the best one is obtained for λ = 0.1.
For λ = 0.1, the clustering gives 30 clusters among which one is too big to be considered (407 genes) and 3 have less than 5 genes. Table 1 gives the best GO term association Table of the ten best clusters among the 59 ones obtained for λ = 0.3 ranked by their best association with a GO term. For each of them, the best associated GO term and the corresponding p-value are given, as well as the size of the cluster and the biological process associated to the GO term.
for the 10 best of them when ranked according to their best p-value. The biological evaluation of these clusters is ongoing.
The cluster number 15 that appears on the second line of Table 1 is in fact associated to 32 GO terms with a p-value lower than 1e -07, most of those terms being related to phosphorylation or triphosphate metabolic process. Moreover, five genes of that cluster belong to the 167 genes having no associated GO term, namely the genes YLR296W, YDR215C, YBL044W, YIR040C and YPR027C. All of them appear in the Entrez gene database but without known functions.
We have finally validated our method by comparing clustering performances based on other similarity matrices. We therefore have computed from the original expression data matrices of euclidian distance, partial correlation [13] and mutual information [14]. To compare clustering results with the same number of clusters, we used the hierarchical clustering method AGNES [15] to cluster the genes in 20, 30, 40, and 50 groups. Figure 3 shows the scores for those three methods as well as for ours with λ = 0.1. It clearly shows that inferring the regulatory network from LICORN preprocessing improves the score of the clustering and provide more biologically relevant clusters.

Conclusion
The problem of discovering functional modules from expression data is both biologically important and computationally challenging. From a biological perspective, identifying members of functional modules is the first step toward understanding the regulatory network of the cell. We provide here an alternative way for constructing gene modules: genes are clustered in the same module if they share many regulators, as they have been inferred by LICORN from gene expression data. We expect this way of clustering will discover modules that are out the scope of classical co-expression clustering techniques. From a computational perspective, one of the key challenges is dealing with over-fitting as the number of data samples is so small.
Comparison of the clustering based on LICORN with existing methods