BiCluE - Exact and heuristic algorithms for weighted bicluster editing of biomedical data
BMC Proceedings volume 7, Article number: S9 (2013)
Abstract
Background
The explosion of biological data has dramatically reshaped today's biology research. The biggest challenge for biologists and bioinformaticians is the integration and analysis of large quantities of data to provide meaningful insights. One major problem is the combined analysis of data of different types. Bicluster editing, a special case of clustering that partitions two different types of data simultaneously, can be used in several biomedical scenarios. However, the underlying algorithmic problem is NP-hard.
Results
Here we contribute BiCluE, a software package designed to solve the weighted bicluster editing problem. It implements (1) an exact algorithm based on fixed-parameter tractability and (2) a polynomial-time greedy heuristic based on solving the hardest part, edge deletions, first. We evaluated its performance on artificial graphs. Afterwards, as an example, we applied our implementation to real-world biomedical data, in this case GWAS data. BiCluE works on any kind of data that can be modeled as (weighted or unweighted) bipartite graphs.
Conclusions
To our knowledge, this is the first software package solving the weighted bicluster editing problem. BiCluE as well as the supplementary results are available online at http://biclue.mpi-inf.mpg.de.
Introduction
Background
The enormous amount of available (sequential) data from laboratories around the world has greatly shifted the focus of biologically motivated studies. For instance, GenBank, the largest database of genes, now stores over 197,000,000 sequences from more than 380,000 organisms [1]. UniProtKB/Swiss-Prot provides a database containing more than 53,000 annotated sequences, extracted and integrated from 205,244 published references, and the Protein Data Bank (PDB) has incorporated over 78,400 molecule structures. Integrating, processing and analyzing large quantities of data from various sources has become the main challenge in modern bioinformatics. The need for carefully designed computational models and methodology is increasing rapidly, in order to discover novel interrelations and gain further insights. In our study, we focus on exact and heuristic algorithms that cluster data of different types simultaneously, i.e. so-called "bicluster editing". A software package named BiCluE, containing an exact algorithm and a heuristic algorithm, is available for download at http://biclue.mpi-inf.mpg.de. We test and evaluate BiCluE on artificially generated data. Afterwards, we demonstrate its applicability to real-world Genome-Wide Association Study (GWAS) data. A GWAS examines the genetic variants (genotypes) of one species, searching for associations with a certain phenotypic trait. In a typical GWAS project, millions of SNPs are investigated and statistical tests are performed to verify significant associations with the phenotypes. This is a typical example of a bipartite data type, i.e. two types of data objects and a measurement that finds relations between the instances of the two types. Traditional analysis of GWAS data associates only one pair of SNP and trait/disease in one statistical test. This methodology tends to produce false positives and false negatives and to overlook the joint effects of moderate-risk SNPs.
Therefore, we changed this strategy, forming "group to group" associations rather than the traditional "one to one" relations (Figure 1). In our previous study [2], a preliminary version of the exact algorithm was implemented. Now we bring forward a newly designed heuristic, which substantially shortens the running time without significantly compromising accuracy. A software package was implemented to integrate both algorithms and make them more convenient to use. Both algorithms were applied to the GWAS data set and were capable of dealing with all but two of the 415 problem instances; these two were too large (> 3,500 nodes). 86 putative new associations were discovered, and we hope our results can serve as a guide for further investigations in the wet lab. In other words, we seek to explain the joint effect of multiple genotypes on multiple phenotypes by "virtually" adding/removing associations such that bicliques emerge in the underlying bipartite graph. Note that we chose GWAS data as an intuitive real-world example of data that can be modeled as bipartite graphs. Since BiCluE can be applied to many different biomedical data types (see Results section), the focus of this paper lies on the exact fixed-parameter algorithm for weighted bicluster editing and on the edge deletion heuristic.
Cluster editing and bicluster editing
Clustering is a classical task in bioinformatics and computational biology. It partitions a data set into different clusters such that elements within a cluster are more similar to each other than to objects belonging to different clusters, according to a certain criterion. Various clustering methods are used in every field of biological studies, including functional genomics, protein structure/sequence analyses and almost all types of network analyses (e.g., transcription regulatory networks, protein-protein interaction networks) [3]. Some specific types of clustering were designed for different scenarios. For instance, gene expression data under different conditions, which can be modeled as a bipartite graph [4], is hardly suitable for standard clustering methods. Instead, one would like to cluster genes and conditions simultaneously such that we see a consistent "behavior", i.e. so-called biclustering.
Clustering and biclustering are very similar problems and therefore share similar strategies. One common approach is to compute a pairwise similarity matrix and to choose a similarity threshold for constructing the corresponding similarity graph. Such graphs are built as follows: (1) the vertices of the graph represent the objects (for instance, genes or conditions), and (2) an edge between two vertices is drawn if their similarity score is above a certain threshold [5]. We call two arbitrary vertices u and v "similar" when the score is above this threshold, written as u ~ v. However, the resulting graph is not necessarily transitive, meaning that for three arbitrary vertices u, v and w, u ~ v and u ~ w do not necessarily imply v ~ w. As a result, we aim to convert the preliminarily constructed graph into a transitive graph, which is a disjoint union of cliques, with minimal costs (a minimal number of edge deletions/insertions, for instance). This problem is named "cluster editing". The formal problem statement follows:
Let V be the set of vertices (objects) to be clustered and uv be an unordered pair of elements in V, i.e., $\{u, v\} \in \binom{V}{2}$. We then define the similarity between two vertices as a symmetric function $s : \binom{V}{2} \to \mathbb{R}$. A given threshold is then used to decide whether u and v are "similar" (if s(uv) ≥ threshold) or "not similar" (if s(uv) < threshold). Let E = {uv : u ~ v} denote the edge set of the similarity graph. In this study, self-loops are not permitted.
The graph is called transitive, if it satisfies any of the three equivalent conditions below:

Every set of three vertices $uvw \in \binom{V}{3}$ satisfies: uv ∈ E and vw ∈ E ⇒ uw ∈ E.

G contains no induced path on three vertices.

Every connected component of G is a clique. (A clique is a complete graph.)
Given an input graph G = (V, E), one asks whether G can be transformed into a transitive graph $G' = (V, E')$ by inserting and deleting edges. Each insertion or deletion incurs a penalty that depends on s(uv). Let $cost(G \to G') = s(E \setminus E') - s(E' \setminus E)$ denote the cost function, where $s(F)$ is the sum of $s(uv)$ over all $uv \in F$. Our task is to find a $G'$ such that $cost(G \to G')$ is minimized.
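The cost function can be made concrete with a small sketch. The Python below is illustrative only: the names `editing_cost` and `is_transitive`, the dictionary representation of s, and the convention that s is shifted so that the threshold is 0 (edges have positive weight, non-edges negative weight) are assumptions made for this example and are not taken from the BiCluE implementation.

```python
from itertools import combinations

def editing_cost(s, edges_before, edges_after):
    """Cost of turning edge set E into E': s(E minus E') - s(E' minus E).

    s            : dict mapping frozenset({u, v}) -> similarity
                   (assumed shifted so that the threshold is 0)
    edges_before : set of frozensets, the edge set E
    edges_after  : set of frozensets, the edge set E'
    """
    deleted = edges_before - edges_after    # edges removed from G
    inserted = edges_after - edges_before   # edges added to G
    return sum(s[e] for e in deleted) - sum(s[e] for e in inserted)

def is_transitive(vertices, edges):
    """True if no three vertices induce a path (exactly two of three edges)."""
    for u, v, w in combinations(vertices, 3):
        present = sum(frozenset(p) in edges for p in ((u, v), (v, w), (u, w)))
        if present == 2:
            return False
    return True

# Toy instance: a-b and b-c are similar, a-c is not.
s = {frozenset("ab"): 3.0, frozenset("bc"): 1.0, frozenset("ac"): -2.0}
E = {e for e, w in s.items() if w > 0}
```

In this toy instance, deleting bc costs 1.0 while inserting ac costs 2.0, so the deletion is the cheaper way to reach a transitive graph.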
Bicluster editing, similar to "cluster editing", is a mathematical model of "biclustering" problems and also serves as a strategy for solving them. The graphs are built in the same way, with vertices referring to the entities and edges representing similarities. However, the resulting graph must be a bipartite graph. Bipartite graphs are special graphs satisfying the following criteria: (1) the vertices of the graph are divided into two subsets, and (2) edges can only be defined between vertices belonging to different subsets.
We consider a bipartite graph G = (V, E) transitive if it satisfies any of the following equivalent conditions:

For an arbitrary subset of four vertices, $uvwx \in \binom{V}{4}$, where u, w belong to the same subset and v, x belong to the other, we have uv ∈ E, wv ∈ E and wx ∈ E ⇒ ux ∈ E.

No path of 4 vertices can be found, i.e., for each $uvwx \in \binom{V}{4}$, where u, w belong to the same subset and v, x belong to the other, we have $|E \cap \{uv, wv, ux, wx\}| \neq 3$.

G is a union of disjoint bicliques (i.e. complete bipartite graphs).
Bicluster editing is similar to its counterpart, cluster editing: We transform a given bipartite graph into a union of disjoint bicliques by edge insertions and deletions with minimal costs for these modifications. The definition of cost\left(G\phantom{\rule{0.1em}{0ex}}\to {G}^{\prime}\right) is the same. Note that bicluster editing, though related to biclustering (see [6]), is different in concept, methodology and biomedical applicability.
Problem statement
The weighted bicluster editing problem is defined as follows: Given an undirected bipartite graph G = (V, E, s), where s is a similarity function $s : \binom{V}{2} \to \mathbb{R}$, let $G'$ be a union of disjoint bicliques. Find one or all $G'$ such that $cost(G \to G')$ is minimized.
The input to our algorithm is a graph G = (V, E, s), with a similarity function s that maps each pair uv to a real number, and a similarity threshold. E denotes the set of edges: $E = \{v_{1}v_{2} : s(v_{1}v_{2}) > \text{threshold}\}$. The algorithm outputs a set of edited edges $E^{*}$ and a cost $c^{*} = cost(G \to (V, (E \setminus E^{*}) \cup (E^{*} \setminus E)))$.
We assume that the input graph consists of only one single connected component since we can apply the algorithms on each connected component separately, without loss of generality. An optimal solution of the bicluster editing problem would never join separate components, since we can always find a cheaper solution where all separated components remain separated [7].
In this study, we use "P4" as the short form of "a path of 4 vertices". As mentioned above, a bipartite graph is transitive if and only if it contains no P4. Let B(G) denote the set of all P4s, i.e. $B(G) = \left\{ uvwx \in \binom{V}{4} : |E \cap \{uv, wv, ux, wx\}| = 3 \right\}$. G is transitive if and only if B(G) = ∅.
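The definition of B(G) translates directly into a brute-force enumeration. The following Python is a minimal sketch under assumptions: the names `find_p4s` and `is_bitransitive`, the explicit `left`/`right` partition arguments and the tuple representation of edges are chosen for illustration only. It checks every pair of vertices from one side against every pair from the other, so it is meant to clarify the definition rather than to be efficient.

```python
from itertools import combinations

def find_p4s(left, right, edges):
    """Return all conflict quadruples of a bipartite graph.

    left, right : the two vertex subsets of the bipartition
    edges       : set of (l, r) pairs with l in left and r in right
    A quadruple (u, w, v, x) with u, w in `left` and v, x in `right`
    is a P4 exactly when 3 of the 4 possible edges are present.
    """
    p4s = []
    for u, w in combinations(sorted(left), 2):
        for v, x in combinations(sorted(right), 2):
            present = sum((a, b) in edges for a in (u, w) for b in (v, x))
            if present == 3:
                p4s.append((u, w, v, x))
    return p4s

def is_bitransitive(left, right, edges):
    """The graph is a disjoint union of bicliques iff B(G) is empty."""
    return not find_p4s(left, right, edges)

# Path s1 - t1 - s2 - t2: three of the four possible edges exist.
left, right = {"s1", "s2"}, {"t1", "t2"}
path_edges = {("s1", "t1"), ("s2", "t1"), ("s2", "t2")}
```

Calling `find_p4s(left, right, path_edges)` reports the single conflict; inserting the edge `("s1", "t2")` or deleting any one of the three edges empties B(G).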
Previous studies and results
It has been proven that both the unweighted and the weighted bicluster editing problems are NP-hard [8]. Although many studies have focused on cluster editing [3, 5, 9], the study of bipartite transitive graph projection is far from complete. An algorithm based on graph modular decomposition for unweighted bicluster editing was developed by F. Protti et al., with a time complexity of $O(4^{k} + |V| + |E|)$ [10]. Later J. Guo et al. improved the running time to $O(3.24^{k} + |E|)$ by a refined branching strategy [7]. However, we still lack algorithms for solving weighted bicluster editing problem instances, which cover most real-life cases.
Fixed-parameter algorithm
Fixed-parameter algorithms and fixed-parameter tractability were first introduced by Downey and Fellows in the 1990s as a methodology for solving NP-hard problems more efficiently [11]. An NP-hard problem is called "fixed-parameter tractable" if it can be solved with a running time that increases polynomially with the input size and exponentially or worse only with the parameter k, namely $O(f(k) \cdot |I|^{c})$, where |I| is the input size and c is a constant. Moreover, f must be a function that depends only on k. Niedermeier gives a more detailed introduction to the theory and applications of fixed-parameter algorithms and fixed-parameter tractability [12].
Our contributions
Here, we present BiCluE, a Java software package that deals with weighted bicluster editing. BiCluE implements an exact algorithm based on fixed-parameter tractability theory and a new, faster heuristic algorithm based on optimal edge deletion estimation. We regard the parameter k of the fixed-parameter algorithm as the cost of edge modifications. Given a problem instance, the exact algorithm finds the optimal solution with cost at most k, if there is such a solution. Assuming $|s(uv)| > 1$ for all possible u, v, our exact algorithm finishes in $O(4^{k})$ time, checking whether there is an optimal solution or not, while the heuristic algorithm needs $O(|E| \cdot (|E| + |V|^{2}) + |V|^{3})$ time to output an "approximate" optimal solution.
Although we focus on the new algorithms here, we also evaluated our BiCluE approach on artificially generated graphs. Afterwards, we show by example that BiCluE can be applied to real biomedical data by means of two different GWAS data sets. In this intuitive setting of a bipartite graph we used our weighted bicluster editing algorithms to scan for new putative associations that can be deduced from the resulting "group to group" relations. We will discuss the new findings and hope that our results can serve as guidelines for further statistical investigations and wet lab studies.
Methods
Fixed-parameter algorithm
Our fixed-parameter algorithm contains two important parts: Data Reduction and Branching Strategy.
Data Reduction: The data reduction is a preprocessing step of the fixed-parameter algorithm that reduces the instance size by removing those parts of the problem instance that do not need to be repaired and therefore do not need to be considered in the following steps. We first treat each connected component as an individual input. Then the algorithm checks whether each component is already a biclique. If this is the case, the algorithm removes the whole component from the input and outputs it as a part of the solution. This procedure finishes within $O(|V| + |E|)$ time.
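The data reduction step can be sketched as follows. This Python is an illustration under assumptions, not the BiCluE code: the function names, the set-based graph representation and the edge-counting test for bicliques are hypothetical choices, and the sketch is not tuned to reach the linear running time stated above.

```python
from collections import defaultdict

def connected_components(vertices, edges):
    """Split an undirected graph into connected components (iterative DFS)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(comp)
    return components

def is_biclique(component, left, edges):
    """A connected bipartite component is a biclique iff it has |L|*|R| edges.

    component and left are sets; edges is a set of (l, r) pairs.
    """
    comp_left = component & left
    comp_right = component - left
    n_edges = sum(1 for a, b in edges if a in component and b in component)
    return n_edges == len(comp_left) * len(comp_right)

def reduce_instance(left, right, edges):
    """Return (solved, remaining): components that are already bicliques
    and components that still need editing."""
    solved, remaining = [], []
    for comp in connected_components(sorted(left | right), edges):
        (solved if is_biclique(comp, left, edges) else remaining).append(comp)
    return solved, remaining
```

Only the components returned in `remaining` would then be passed on to the branching step.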
Branching Strategy: The branching strategy refers to a search tree procedure that searches for and edits the P4s using edge insertions and deletions. There are 4 possibilities to convert a P4 into bicliques: removing one of the three edges, resulting in two bicliques, or completing the P4 with one edge insertion (Figure 2). More specifically:
Suppose uvwx is an arbitrary P4 and let (uv), (wv), (wx) be the three edges in the P4. The following four cases are checked recursively as shown in Figure 2a:

Insert ux by setting the weight of ux to "permanent" (Figure 2b)

Delete uv by setting the weight of uv to "forbidden" (Figure 2c)

Delete wv by setting the weight of wv to "forbidden" (Figure 2d)

Delete wx by setting the weight of wx to "forbidden" (Figure 2e)
The search tree procedure starts when a P4 is located. Four branches are created for one P4 in the search tree; each represents one of the editing possibilities. We then recursively visit the four branches one by one, performing the corresponding edge insertion or deletion and updating k to $k' = k - (\text{insertion or deletion cost})$. We implement the whole algorithm in a recursive manner. If the edit of a certain branch leads to $k' < 0$, the corresponding branch is skipped. The algorithm stops when the entire tree has been visited, and returns the optimal solution found. This branching strategy has a worst-case running time of $O(4^{k})$.
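The four-way branching can be sketched as a short recursive function. This is a minimal illustration under assumptions rather than the BiCluE implementation: the names `first_p4` and `branch`, the dictionary `s` holding a weight for every left-right pair, and the convention that a weight above 0 means "edge" (threshold shifted to 0) are hypothetical. "Permanent" and "forbidden" are modeled as plus and minus infinity, so an edited pair is never edited again.

```python
import math
from itertools import combinations

PERMANENT, FORBIDDEN = math.inf, -math.inf

def first_p4(left, right, s):
    """Return some P4 as (edges_present, missing_edge), or None if none exists."""
    for u, w in combinations(left, 2):
        for v, x in combinations(right, 2):
            pairs = [(u, v), (u, x), (w, v), (w, x)]
            present = [p for p in pairs if s[p] > 0]
            if len(present) == 3:
                missing = next(p for p in pairs if s[p] <= 0)
                return present, missing
    return None

def branch(left, right, s, k):
    """Search tree: return the minimum editing cost <= k, or None if none exists.

    left, right : lists of vertices; s maps every (l, r) pair to a weight.
    """
    conflict = first_p4(left, right, s)
    if conflict is None:
        return 0.0                      # already a union of bicliques
    present, missing = conflict
    best = None
    # One insertion branch and three deletion branches.
    for pair, new_weight in [(missing, PERMANENT)] + [(p, FORBIDDEN) for p in present]:
        cost = abs(s[pair])
        if math.isinf(cost) or cost > k:
            continue                    # pair is fixed, or k' would drop below 0
        old = s[pair]
        s[pair] = new_weight
        sub = branch(left, right, s, k - cost)
        s[pair] = old                   # undo the edit before the next branch
        if sub is not None and (best is None or cost + sub < best):
            best = cost + sub
    return best
```

For a single P4 whose missing edge has weight -1 and whose three edges have weights 2, 3 and 2, the sketch returns 1.0 for any budget k of at least 1, corresponding to the insertion branch.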
Edge deletion heuristics
In the fixed-parameter algorithm, we aim to repair all the P4s to make G transitive. Each repair is either an edge insertion or an edge deletion. The difficult part of the problem is to correctly locate the edge deletions, since the edge deletions determine the number of resulting disjoint bicliques. Therefore, it would be beneficial to find the most promising positions of edge deletions first. Edge insertions can then easily be carried out by inserting all the edges required to make each connected component transitive. This is the main idea behind our edge deletion heuristic algorithm.
We define a function to score the edge removal candidates and greedily delete the edge with the highest score in each step, until further deletions do not improve the solution. For each P4 uvwx (where (uv), (wv), (wx) ∈ E), we define the deviation from transitivity of G, D(G), as:
The score of an edge deletion is computed as follows: Let uv be an arbitrary edge in G = (V, E, s). $G' = (V, E \setminus \{uv\}, s')$ is G after the removal of uv, where $s'(xy) = s(xy)$, except $s'(uv) = -\infty$ (uv set to "forbidden"). Then we define:
as the transitivity improvement of edge uv where s(uv) is the cost of edge deletion.
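As a hedged illustration of what such definitions could look like, the block below gives one possible form of D(G) and of the transitivity improvement. It is an assumption modeled on greedy heuristics for weighted cluster editing, in which each conflict contributes the cheapest single edit that resolves it; it is not confirmed by the text above and should not be read as the authors' exact definition.

```latex
% Assumed form (not confirmed by the surrounding text):
% each P4 uvwx with edges uv, wv, wx and missing edge ux contributes
% the cost of its cheapest repair.
D(G) \;=\; \sum_{uvwx \,\in\, B(G)} \min\{\, s(uv),\; s(wv),\; s(wx),\; -s(ux) \,\}

% Assumed transitivity improvement of deleting uv:
% the drop in deviation, minus the price s(uv) paid for the deletion.
\Delta_{uv}(G) \;=\; D(G) \;-\; D(G') \;-\; s(uv)
```

Under this reading, an edge is a good deletion candidate when removing it resolves conflicts worth more than its own weight.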
The edge deletion heuristic algorithm consists of three functions: REMOVE_CULPRIT(G), TRANSITIVE_CLOSURE_COST(G) and EDGE_DEL_MAIN(G). REMOVE_CULPRIT(G) returns the edge with the highest transitivity improvement ($\arg\max_{uv \in E} \Delta_{uv}(G)$) and removes it from G; TRANSITIVE_CLOSURE_COST(G) returns the total cost of all edge insertions required to convert G into a biclique, assuming G is connected; EDGE_DEL_MAIN(G) is the main function of the edge deletion heuristic.
The first invocation of REMOVE_CULPRIT(G) can be finished in $O(|E| \cdot |V|^{2})$ time, since each $\Delta_{uv}(G)$ can be computed in $O(|V|^{2})$ time: only those P4s that contain uv need to be considered. Subsequent calls require $O(|V|^{2})$ time to update the scores of the edges that were influenced by the deletion of uv, and finally $O(|E|)$ time to find the edge with the maximum score. This results in a total running time of $O(|V|^{2} + |E|)$ per subsequent call. TRANSITIVE_CLOSURE_COST(G) sums up the cost for a transitive closure, with a running time of $O(|V|^{2})$.
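Of the three functions, TRANSITIVE_CLOSURE_COST(G) is fully determined by its description and is easy to sketch. The Python below is an illustration under assumptions: the lower-case function name, the per-side vertex lists and the convention that a weight above 0 means the edge exists (threshold shifted to 0) are choices made for this example, not details of the BiCluE code.

```python
def transitive_closure_cost(comp_left, comp_right, s):
    """Total insertion cost to turn one connected component into a biclique.

    comp_left, comp_right : the component's vertices on each side
    s : dict mapping (l, r) -> weight, weight > 0 meaning the edge exists
        (threshold assumed to be 0, so a missing edge has weight <= 0)
    """
    cost = 0.0
    for l in comp_left:
        for r in comp_right:
            w = s[(l, r)]
            if w <= 0:          # edge is missing and must be inserted
                cost += -w      # inserting costs |s(lr)|
    return cost
```

The double loop visits every left-right pair once, which matches the quadratic bound given above. A pair that has been set to "forbidden" (weight minus infinity) makes the cost infinite, signalling that the component cannot be closed.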
EDGE_DEL_MAIN(G) returns a solution object, containing the edge modifications and the costs needed for converting the input graph into a transitive one. We keep the assumption that G is connected. The pseudocode of the EDGE_DEL_MAIN(G) is in the appendix.
Our edge deletion heuristic removes a specific edge at most once across all recursions. Checking for connected components requires $O(|E| + |V|)$ time and REMOVE_CULPRIT(G) requires $O(|V|^{2} + |E|)$ time. TRANSITIVE_CLOSURE_COST(G) takes $O(|V|^{2})$ time for each connected component. Therefore, the total running time of our algorithm is $O(|E| \cdot (|E| + |V|^{2}) + |V|^{3})$.
Results
Artificial graphs
The artificial graphs were generated as follows: Initially we generate graphs consisting of n vertices; afterwards m vertices are picked (m ∈ [1, n]) and defined to be in one biclique. Then the same procedure is carried out on the remaining n − m vertices until there is no vertex left. This random graph generating process gives us a graph consisting of a random number of clusters of random sizes. The edge weights are obtained from two different Gaussian distributions $N(\mu_{intra}, \sigma_{intra}^{2})$ and $N(\mu_{inter}, \sigma_{inter}^{2})$. The former distribution is used to generate weights for edges within the pre-designed biclusters and the latter for "inter-bicluster edges", i.e. the edges between vertices of different biclusters. We chose μ and σ carefully such that the generated graphs are "almost transitive" bipartite graphs. In our case, $\mu_{intra} = 21$, $\mu_{inter} = -21$, $\sigma_{intra} = \sigma_{inter} = 18$. Thus the probability of finding a "mistake edge" (an edge between vertices in different biclusters or a missing edge between vertices in the same bicluster) is about 0.123 for each pair.
The performance of the two algorithms on artificial graphs is shown in Table 1. All running time measurements were averaged over 5 repeats on graphs of the same size but with different edge structures. The fixed-parameter algorithm achieves very small running times on small and medium-sized graphs, yet as the graphs grow, its performance suffers, reflecting the NP-hardness of the underlying problem. When the number of vertices exceeds 40, the fixed-parameter algorithm cannot finish within reasonable time. On the other hand, the edge deletion heuristic requires significantly less time than the fixed-parameter algorithm on bigger graphs. In terms of costs, the performance of the edge deletion heuristic is almost as good as that of the fixed-parameter algorithm. In summary, the heuristic finds solutions that are almost equally good but is significantly faster. In Figure 3 we plot the running times of the fixed-parameter algorithm and the edge deletion heuristic against the graph component complexity (we define graph complexity as $|V| \cdot |E|$). The fixed-parameter algorithm is faster for small components, but its running times explode with growing input graph sizes. The edge deletion heuristic performs better in terms of running time for medium-sized components without much worse accuracy, i.e. without much higher modification costs.
We also generated 20 random graphs for different probabilities of "inter-edges" and "intra-missing-edges" (see above) ranging from 0.1 to 0.4 (5 repeats for each probability). The average graph size was |V| = 40. The solutions were computed within one hour for all graphs. The box plots in Figure 4 show the variation of running time and costs for these graphs. With higher "mistake probabilities", the costs are expected to be higher as well. For running times, however, we do not see an increasing trend. Hence, our methods seem to be generally robust for graphs of different structures and complexities.
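The generation procedure and the quoted mistake probability can be checked with a short sketch. The Python below is illustrative and rests on assumptions: the function names, the explicit `left`/`right` vertex lists (the text does not say how vertices are assigned to the two sides) and the use of 0 as the edge threshold are hypothetical. With the stated parameters, the probability that a Gaussian weight lands on the wrong side of 0 evaluates to roughly 0.12, in line with the value given above.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Probability that an intra-bicluster weight falls below 0 (a missing edge);
# by symmetry the same value holds for a spurious inter-bicluster edge.
p_mistake = normal_cdf(0.0, mu=21.0, sigma=18.0)

def random_bicluster_graph(left, right, rng,
                           mu_intra=21.0, mu_inter=-21.0, sigma=18.0):
    """Draw random biclusters, then Gaussian weights for every left-right pair."""
    pool = list(left) + list(right)
    rng.shuffle(pool)
    label, cluster_id = {}, 0
    while pool:
        m = rng.randint(1, len(pool))   # size of the next bicluster
        for v in pool[:m]:
            label[v] = cluster_id
        pool = pool[m:]
        cluster_id += 1
    s = {(l, r): rng.gauss(mu_intra if label[l] == label[r] else mu_inter, sigma)
         for l in left for r in right}
    return label, s
```

A call such as `random_bicluster_graph(["l0", "l1"], ["r0", "r1"], random.Random(0))` returns the planted bicluster labels together with a weight for every pair, so the planted partition can later be compared with the computed one.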
Genome wide association studies
To demonstrate the applicability of BiCluE to real biomedical data, we applied our software to a very intuitive bipartite graph data set: GWAS data. It was retrieved from two sources: (1) an online available database [13], containing 56,412 significant SNP associations for 87 different diseases/traits, and (2) the National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies, an online catalog of SNP-trait associations from published GWASs, with 5,476 unique SNPs and 526 different diseases [14]. We defined edge weights as s(uv) = −log(P) (base-10 logarithm), where P is the p-value of the given association. We used a threshold of 0.05, corresponding to −log(0.05) = 1.301 in our graph-based model.
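The weighting scheme amounts to a one-line transformation of each p-value. The sketch below is illustrative only: the function names and the toy associations (including the placeholder SNP and trait identifiers) are made up for this example and do not come from either database.

```python
import math

def association_weight(p_value):
    """Edge weight s(uv) = -log10(P) for a SNP-trait association."""
    return -math.log10(p_value)

def build_edges(associations, p_threshold=0.05):
    """Keep associations whose weight exceeds the weight of the threshold.

    associations : dict mapping (snp, trait) -> p-value
    Returns (weights, edges): all weights, and the pairs treated as edges.
    """
    cutoff = association_weight(p_threshold)          # about 1.301
    weights = {pair: association_weight(p) for pair, p in associations.items()}
    edges = {pair for pair, w in weights.items() if w > cutoff}
    return weights, edges

# Made-up p-values, for illustration only.
toy = {("rsA", "trait1"): 1e-8, ("rsB", "trait1"): 0.2, ("rsA", "trait2"): 0.01}
weights, edges = build_edges(toy)
```

Here the associations with p-values 1e-8 and 0.01 become edges, while the one with p-value 0.2 (weight of about 0.7) stays below the cutoff.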
The data from two different databases were not merged due to incompatibility of terminology. The resulting graphs contain 415 connected components in total (136 from Johnson's dataset and 279 from the NHGRI dataset).
Both the fixed-parameter algorithm and the edge deletion heuristic were applied separately to each connected component. 413 of 415 components were solved within 24 hours (99.5% of all the components). The remaining two graph components were too big to be solved within 24 hours (one from the NHGRI dataset with |V| = 3,609 and one from Johnson's dataset with |V| = 50,161). Both BiCluE algorithms identified 86 putative associations, which were not detected as significant in the two GWAS datasets. Table 2 shows the distribution of the new associations and their corresponding diseases/traits. We found 11 new targets to be tested and evaluated for "Conduct disorder (case status)" and "Ischemic Stroke", followed by "Atrial fibrillation/atrial flutter" and "Permanent tooth development" (10 new candidate associations). For the details of the putative associations, please refer to Additional File 1. Note that our predictions depend on the user-given similarity threshold, as do those of any other clustering tool. A full analysis of all the putative associations is beyond the scope of this study. Nevertheless, we examined previous reports and literature for further support. The SNPs rs2548145 and rs3930234, which were discovered in our study as putatively associated with "Alcoholism (alcohol dependence factor score)", have been reported as associated with "Alcoholism (alcohol use disorder factor score)" [15]. Moreover, the SNP rs13376333, which was studied and found to be associated with "Atrial fibrillation" [16], was found to be putatively associated with "Atrial Flutter" and "Ischemic stroke". Another interesting result is related to the trait of "tooth development": the SNPs rs9674544 and rs1956529, reported as related to "primary tooth development" [17], were found to be associated with "permanent tooth development" in our study.
Although the analyses above cannot replace experimental verification as the "ultimate" validation, they demonstrate that BiCluE tends to cluster related traits together in one group, which further supports the plausibility of the putative associations. However, final wet lab examination is still necessary and indispensable, though beyond the scope of this paper. Here, our focus is on the new BiCluE algorithms that solve the weighted bipartite graph cluster editing problem. Like any other data partitioning method, BiCluE needs further statistical testing and parameter adjustment, which is highly application-specific and needs to be done with a certain intuition regarding the nature of the real-world data sets.
Discussion and conclusion
Here, we presented BiCluE, a software package dedicated to solving (weighted) bicluster editing problems. It offers a fixed-parameter algorithm and an edge deletion heuristic. We showed that BiCluE is able to solve medium-sized bicluster editing problems within reasonable times. The running times of the fixed-parameter algorithm explode when the input size exceeds a certain value (40 vertices), while the edge deletion heuristic still works well for larger graphs.
We demonstrated BiCluE's ability to cluster biomedical data with publicly available GWAS data sets. All but two instances (99.5%) were solved. We found 86 putative new associations. These newly discovered associations might be useful as guidelines for further wet lab studies. Since deleting/inserting edges (associations between a phenotype and a SNP) does not directly affect the association of other SNPs with that phenotype, we implicitly assume a certain degree of independence between SNPs, which might not hold. However, when a set of SNPs is highly connected to a set of phenotypes, it is likely that we may neglect inter-SNP dependencies, since we concentrate on inserted edges (Table 2) emerging from the "group-to-group" relationship in the bipartite graph.
Moreover, there are plenty of other potential applications of BiCluE. In the future, we will apply BiCluE to identify genetic variants that are responsible for certain bacterial life styles, a task that will require simultaneous clustering of both genes and species. We will investigate more such applications in the future.
Further investigation will focus on improving the performance of the fixed-parameter algorithm. The counterpart of bicluster editing on general graphs, cluster editing, has been extensively studied. Thus it might be interesting to compare the two problems, applying the ideas and techniques developed for cluster editing to bicluster editing in order to achieve better running times.
Implementation
BiCluE is implemented in Java 1.6 with support for parallel multi-core computing. All measurements for the evaluations were taken on compute clusters with 78 computing nodes, each consisting of 2 × Intel Xeon E5430 2.66 GHz (quad-core) CPUs and 16 GB RAM.
References
Benson DA, Boguski MS, Lipman DJ, Ostell J: GenBank. Nucleic Acids Res. 2011, 39: D32-D37. 10.1093/nar/gkq1079.
Sun P, Guo J, Baumbach J: Integrated simultaneous analysis of different biomedical data types with exact weighted bicluster editing. J Integr Bioinform. 2012, 17:
Wittkop T, Emig D, Lange S, Rahmann S, Albrecht M, Morris JH, Böcker S, Stoye J, Baumbach J: Partitioning Biological data with transitivity clustering. Nat Methods. 2010, 7 (6): 419-420. 10.1038/nmeth0610-419.
Aluru S: Handbook of Computational Molecular Biology. 2004, Boca Raton: Chapman & Hall/CRC Computer & Information Science Series
Wittkop T, Emig D, Truss A, Albrecht M, Böcker S, Baumbach J: Comprehensive cluster analysis with Transitivity Clustering. Nat Protocols. 2011, 6 (3): 285-295. 10.1038/nprot.2010.197.
Prelić A, Bleuler S, Zimmermann P, Wille A, Buhlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics. 2006, 22 (9): 1122-1129. 10.1093/bioinformatics/btl060.
Guo J, Hüffner F, Komusiewicz C, Zhang Y: Improved algorithms for bicluster editing. In TAMC'08: Proceedings of the 5th international conference on Theory and applications of models of computation. 2008, Berlin, Heidelberg: Springer Verlag
Amit N: The bicluster graph editing problem. PhD thesis. 2004, Tel Aviv University School of Mathematical Sciences
Guo J: A more effective linear kernelization for cluster editing. Theor Comp Sc. 2009, 410: 718-726. 10.1016/j.tcs.2008.10.021.
Protti F, da Silva MD, Szwarcfiter JL: Applying modular decomposition to parameterized bicluster editing. IWPEC 2006 LNCS. Edited by: Bodlaender HL, Langston MA. 2006, Heidelberg: Springer
Downey RG, Fellows MR: Parameterized Complexity. 1999, Springer
Niedermeier R: Invitation to Fixed-Parameter Algorithms. 2006, Oxford University Press
Johnson AD, O'Donnell CJ: An open access database of genome-wide association results. BMC Med Genet. 2009, 10: 6
Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA: Potential etiologic and functional implications of genome wide association loci for human diseases and traits. PNAS. 2009, 10: 1073
Heath AC, Whitfield JB, Martin NG, Pergadia ML, Goate AM, Lind PA, McEvoy BP, Schrage AJ, Grant JD, Chou YL, Zhu R, Henders AK, Medland SE, Gordon SD, Nelson EC, Agrawal A, Nyholt DR, Bucholz KK, Madden PA, Montgomery GW: A quantitative-trait genome-wide association study of alcoholism risk in the community: findings and implications. Biol Psychiatry. 2011, 70 (6): 513-518. 10.1016/j.biopsych.2011.02.028.
Ellinor PT, Lunetta KL, Glazer NL, Pfeufer A, Alonso A, Chung MK, Sinner MF, de Bakker PI, Mueller M, Lubitz SA, Fox E, Darbar D, Smith NL, Smith JD, Schnabel RB, Soliman EZ, Rice KM, Van Wagoner DR, Beckmann BM, van Noord C, Wang K, Ehret GB, Rotter JI, Hazen SL, Steinbeck G, Smith AV, Launer LJ, Harris TB, Makino S, Nelis M, Milan DJ, Perz S, Esko T, Kottgen A, Moebus S, Newton-Cheh C, Li M, Mohlenkamp S, Wang TJ, Kao WH, Vasan RS, Nothen MM, MacRae CA, Stricker BH, Hofman A, Uitterlinden AG, Levy D, Boerwinkle E, Metspalu A, Topol EJ, Chakravarti A, Gudnason V, Psaty BM, Roden DM, Meitinger T, Wichmann HE, Witteman JC, Barnard J, Arking DE, Benjamin EJ, Heckbert SR, Kaab S: Common variants in KCNN3 are associated with lone atrial fibrillation. Nat Genet. 2010, 42 (3): 240-244. 10.1038/ng.537.
Pillas D, Hoggart CJ, Evans DM, O'Reilly PF, Sipila K, Lahdesmaki R, Millwood IY, Kaakinen M, Netuveli G, Blane D, Charoen P, Sovio U, Pouta A, Freimer N, Hartikainen AL, Laitinen J, Vaara S, Glaser B, Crawford P, Timpson NJ, Ring SM, Deng G, Zhang W, McCarthy MI, Deloukas P, Peltonen L, Elliott P, Coin LJ, Smith GD, Jarvelin MR: Genome-wide association study reveals multiple loci associated with primary tooth development during infancy. PLoS Genet. 2010, 6 (2): e1000856. 10.1371/journal.pgen.1000856.
Acknowledgements
All authors wish to thank the Cluster of Excellence for Multimodal Computing and Interaction of the German Research Foundation for financial support.
This article has been published as part of BMC Proceedings Volume 7 Supplement 7, 2013: Proceedings of the Great Lakes Bioinformatics Conference 2013. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/7/S7.
Funding
Funded by DFG Cluster of Excellence, MMCI.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
PS designed and implemented the algorithm. JG and JB supervised the whole work. All authors contributed equally to the manuscript.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Sun, P., Guo, J. & Baumbach, J. BiCluE - Exact and heuristic algorithms for weighted bicluster editing of biomedical data. BMC Proc 7 (Suppl 7), S9 (2013). https://doi.org/10.1186/1753-6561-7-S7-S9
DOI: https://doi.org/10.1186/1753-6561-7-S7-S9
Keywords
 Bipartite Graph
 Exact Algorithm
 Input Graph
 GWAS Data
 Edge Deletion