Data
Biological network data was retrieved from the Pathway Commons database [6] as bulk downloads of available triple relationships in SIF-RDF format. While the Pathway Commons database is designed as a repository for numerous public pathway databases, we choose to retrieve data from the Reactome [7] (Neuronal System, Signal Transduction, and Metabolism components), curated NCI PID [8] and HumanCyc [9] data sets.
GeneWeaver addresses cross-species data harmonization by assigning a unique identifier to clusters associated with biological components, such as genes or transcripts. An extensive, cross species repository of identifier clusters allows each node to be mapped onto a known GeneWeaver identifier, such as the UniProt URI's [10] available from Pathway Commons. Pathways of interest were decomposed from native RDF relationships to lists of associated nodes, representing connected biological components [Figure 1]. The process for transposing RDF to a pairwise graph structure required the application of external reference lists of pathways of interest to seed the graph search. For each pathway, the resulting condensed graphs only contained relationships between genes that were either (a) directly connected or (b) connected through a relationship that could be interpolated as a contiguous uninterrupted path between two genes. A unique edge id is assigned to these GeneWeaver translated node-node associations, creating a list of edges, each associated with a pathway network of interest. Lists of edges operate as components in the bipartite graph analysis toolset, as previously described [3, 11].
Analysis
Rapid bipartite analysis allows for the real time discovery of maximal biclques in very large bipartite graphs. After decomposing pathway networks into discretized bipartite graphs, we can efficiently identify hierarchical relationships of conserved edges among lists of edges [Figure 2]. This allows a user to identify common connected components in a variety of networks, spanning species, network types, and independently derived lists of nodes.
Visualization
Graphs are built from the edges in each maximal biclique. Because each set of edges and its consequent graph is most likely not completely connected, the largest connected component of each graph is calculated. To find the largest connected component of each graph, a simple breadth first search algorithm is used. This algorithm runs in linear time in terms of the number of vertices and edges since the graph is represented by an adjacency list. The algorithm finds one connected component at a time by doing a breadth first search on each node that has not been found in a previous search, until there are no more unknown nodes.
The graphs are then hierarchically arranged, placing the largest connected component of the maximal biclique at the top level of the graph and all connected components submaximal intersection on lower levels. In order to aid in visualization, each biclique is assigned a color and the nodes in this graph are colored according to the highest hierarchical graph to which they belong. This creates a key for the relevance of specific nodes in a network. Before finding the largest connected component of the largest graph, the edges of graphs within the two levels below the graph are added to make the root graph have interpolated edges.