We demonstrate our tool on a TIM protein-family application. These proteins play an important role in efficient energy production and can be found in nearly every organism, including animals, fungi, plants, and bacteria. In this section we report on our experience using the tool for this application. We follow with a formal evaluation by two structural biologists, expanded from our IEEE BioVis Data Contest Visualization Award-winning submission. Finally, we report the feedback from the contest organizers.
TIM protein-family exploration
The application examines the scTIM protein (Saccharomyces cerevisiae triosephosphate isomerase), a member of the TIM family that was mutated towards the family consensus: a number of amino acids in the sequence were replaced by the most common residue found at that location in the TIM family. The resulting amino acid sequence is dTIM. Unfortunately, dTIM is functionally defective: one or more of the modifications made to scTIM caused the protein to lose its metabolic transport properties. Identifying which modifications caused the loss of functionality is an interesting open research problem.
For this application, we obtained the scTIM PDB file, the TIM family sequence data, and the alignment information from the Battelle Center for Mathematical Medicine, through http://www.biovis.net. We used the tool to fetch 28 additional PDB files from RCSB, and to generate more than 620 further PDB files from the provided sequence data. We used the database backend to link PDB and FASTA IDs for preprocessing, and added data from ModBase and UniProt.
Using our tool, we start by identifying the differences between the dTIM and scTIM sequences. The two sequences differ in 49 subsequences, which together encompass 104 residues that were modified, inserted, or deleted in the creation of dTIM. By selecting some or all of these residues in the protein sequence viewer, we can highlight their locations on both 3D structures (Figure 4). We can pan, zoom, and rotate the structures to more closely examine the distribution of these alterations on the protein structure. We can also adjust the rendering properties of the structure.
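As an illustration of how such differing subsequences could be counted, the following minimal sketch groups the mismatching columns of two pre-aligned sequences into contiguous runs. It is not the tool's implementation: the function name `changed_runs` and the toy sequences are hypothetical, and the two sequences are assumed to be gap-padded to equal length.

```python
from typing import List, Tuple


def changed_runs(ref: str, mut: str) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, 0-based and end-exclusive,
    over which two aligned sequences differ."""
    if len(ref) != len(mut):
        raise ValueError("sequences must be aligned to the same length")
    runs: List[Tuple[int, int]] = []
    start = None  # first column of the run currently being scanned
    for i, (a, b) in enumerate(zip(ref, mut)):
        if a != b:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:  # close a run that reaches the last column
        runs.append((start, len(ref)))
    return runs


# Toy aligned pair (hypothetical, not the scTIM/dTIM data); '-' marks a gap.
ref = "MARTF-VGGNWK"
mut = "MGKTFFVGGN-K"
runs = changed_runs(ref, mut)
n_changed = sum(end - start for start, end in runs)
print(len(runs), "runs covering", n_changed, "columns")
```

The number of runs corresponds to the count of differing subsequences, and the summed run lengths to the count of altered residues.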
To determine which models from the TIM family are most similar to the original scTIM, we use the trend-image view in the lower panel. In Figure 4 we can quickly see, for example, that only a few sequences share the same residue as scTIM at position 142. Going a step further, selecting any of the sorting modes from the menu allows the family members to be compared against scTIM. For example, when sorting by common residues, we find that TPIS_HAEDU, the TIM protein homolog found in the bacterial species Haemophilus ducreyi, shares the greatest number of residues with scTIM. Selecting a particular coloring method displays specific information for each residue.
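Sorting by common residues can be sketched as counting, for each family member, the alignment columns in which it carries the same residue as the reference, and ranking by that count. This is a simplified illustration under the assumption of a shared, gap-padded alignment; the function names and the toy family are hypothetical and do not reproduce the tool's sorting code.

```python
from typing import Dict, List, Tuple


def shared_residues(ref: str, other: str, gap: str = "-") -> int:
    """Count columns where both sequences carry the same (non-gap) residue."""
    return sum(1 for a, b in zip(ref, other) if a == b and a != gap)


def rank_by_shared(ref: str, family: Dict[str, str]) -> List[Tuple[str, int]]:
    """Rank family members by shared-residue count, most similar first.
    Ties are broken alphabetically by name to keep the order deterministic."""
    scored = [(name, shared_residues(ref, seq)) for name, seq in family.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy data (hypothetical names and sequences).
reference = "MARTYFVGGN"
family = {
    "homolog_A": "MARTFFVGGN",
    "homolog_B": "MPRKFFVAGN",
    "homolog_C": "M-RTYFIGGN",
}
ranking = rank_by_shared(reference, family)
for name, count in ranking:
    print(name, count)
```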
Manipulating the vertical selection paddle allows us to explore subsequences of the full TIM sequence. Distribution information about the residues in the highlighted subsequence is displayed below the trend image and shows the most common amino acid in the TIM family at each sequence index. A nearly empty bar in the residue viewer implies that very few members of the TIM family share the same residue as scTIM at that index, making that residue an ideal candidate for mutation towards the family consensus.
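The per-index distribution described above can be approximated as follows: for each alignment column, find the most common family residue and the fraction of family members that share the reference residue. This is a sketch under stated assumptions; the function name, the 0.5 threshold, and the toy sequences are hypothetical, and gaps are simply ignored when computing fractions.

```python
from collections import Counter
from typing import List, Tuple


def column_profile(ref: str, family: List[str], gap: str = "-") -> List[Tuple[str, str, float]]:
    """For each column return (reference residue, family consensus residue,
    fraction of non-gap family residues equal to the reference residue)."""
    profile = []
    for i, ref_res in enumerate(ref):
        column = [seq[i] for seq in family if seq[i] != gap]
        if not column:  # column is all gaps in the family
            profile.append((ref_res, gap, 0.0))
            continue
        counts = Counter(column)
        consensus = counts.most_common(1)[0][0]
        share = counts[ref_res] / len(column)
        profile.append((ref_res, consensus, share))
    return profile


# Toy data (hypothetical).
reference = "MKAL"
family = ["MDAL", "MDGL", "MDAL", "MKAL"]
profile = column_profile(reference, family)

# Columns where the reference disagrees with the consensus and few members
# share its residue: the "nearly empty bars" (threshold is an assumption).
candidates = [i for i, (res, cons, share) in enumerate(profile)
              if res != cons and share < 0.5]
print(candidates)
```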
Manipulating the horizontal selection paddle allows us to further explore the individual TIMs in the family, with a fish-eye lens expanding the selected row to more clearly show the residue sequence and coloring. Right-clicking on a selected row allows us to load the structure of that specific TIM into the structure view. If this TIM is unfamiliar to the user, a number of reference databases can be accessed.
In terms of limitations, while the trend image provides a scalable approach to viewing large amounts of sequence data, finding a particular sequence in a protein family remains a challenge. Similarly, attempting to encode too much information in the color schemes results in an overload of colors, rendering the trend image unreadable and ineffective. Reducing the number of colors restores readability to the view, at the cost of removing some information from the trend image.
Structural biologist feedback
Two senior structural biology researchers (co-authors DK and TT) provided feedback and testing throughout the software development process. They also provided the following example workflow through our system.
In this evaluation session, the researchers sought to explore the mutations in the BioVis Data Contest dataset. Given their structural biology background, the researchers began their analysis by loading and interacting with the 3D structures of the dTIM and scTIM proteins. Their interaction focused on searching for the residues that make up the active site and the protein-protein interface. In their estimation, these two sites were likely candidates for the location of functional mutations.
The researchers next selected the key residues in the 3D Viewer. This action highlights those residues in both the Trend Image Panel and the Protein Sequence Viewer, as shown in Figure 5. Using the Protein Sequence Viewer, the researchers identified which of these residues had changed in the conversion from scTIM to dTIM. For each changed residue, the researchers returned to the 3D Viewer to inspect its structure and examine its interactions. At this stage the researchers did not use the remaining genomic data, as they were unsure of how best to use this information for the purpose of the contest. Under these circumstances, they identified and proposed two relevant mutations as the most likely candidates: Y101 (to E100) and D81 (to P80).
However, using the Trend Image Panel, the researchers were able to identify several further matching sequences. Although during the evaluation session the trend alignment and the residue numbering within a sequence were slightly off due to insertions and deletions in the sequence (later accounted for and corrected in the software), the most senior researcher was able to identify the same set of candidate mutations as captured in the case study above.
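The numbering offset mentioned above arises because an alignment column index and a residue number within a sequence diverge as soon as a gap is introduced. A minimal sketch of one way to keep the two in sync is shown below; it is not the correction made in the software, and the function name and toy sequences are hypothetical.

```python
from typing import List, Optional


def column_to_residue_number(aligned: str, gap: str = "-", first: int = 1) -> List[Optional[int]]:
    """Map each alignment column to the residue number within the sequence,
    or None where the sequence has a gap in that column."""
    numbers: List[Optional[int]] = []
    next_number = first
    for ch in aligned:
        if ch == gap:
            numbers.append(None)
        else:
            numbers.append(next_number)
            next_number += 1
    return numbers


# Toy aligned pair (hypothetical): an insertion in one sequence shifts its
# residue numbers relative to the other from that column onward.
seq_a = "MA-RTY"
seq_b = "MAKR-Y"
num_a = column_to_residue_number(seq_a)
num_b = column_to_residue_number(seq_b)
print(num_a)
print(num_b)
```

In this toy example, column 3 holds residue 3 of the first sequence but residue 4 of the second, so the same aligned position carries different residue numbers in the two sequences.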
In the researchers' assessment, it "would be possible to come up with some reasonable hypotheses without using [this] tool, but it would definitely take more time." In the default workflow, the researchers believe they would start by building a homology model for dTIM using Modeller and then align this model with the known structure for scTIM within PyMOL, followed by proposing a list of mutations that could be relevant, for example those localized to the active site. However, this approach would not leverage the information of the protein sequence family. To access this type of information without our tool, one could create a multiple sequence alignment using any of a number of online servers, then load the result in an alignment viewer such as JalView [12], and then go back and forth between the structures in PyMOL and the alignment viewer, in order to refine the previous list of mutations. However, in the researchers' opinion, this alternative approach would be tedious. As such, they particularly appreciated our tool's integration of the capabilities of existing spatial and sequence viewers, along with other useful functionality built within our software, and the speedup to such workflows provided by our approach.
BioVis contest organizer feedback
Feedback from the BioVis 2013 conference organizers further confirmed the ability of our tool to successfully identify the dysfunctional protein mutations. The experts hypothesized that the most harmful mutations to the protein existed in the active site, the area of the protein that is responsible for its function. Since dTIM was created by combining the mutations of its 640 family proteins, the experts first inspected the Trend Image Panel to gauge how scTIM differs from the rest of its family. When sorted by the weighted edit distance between scTIM and its protein family members, the trend image exposed six distinct residue locations where dTIM varied from scTIM but was consistent with the rest of its family. In our assessment, these locations are (A)58(G), (K)107(D), (E)138(L), (L)146(V), (A)22(R), and (L)218(V). Of these, residues 107, 138, and 146 are almost fully conserved throughout the entire family but differ between scTIM and dTIM. While residue 138 looks promising, since it is very frequent across the entire family, closer inspection shows that a mutation did not occur between dTIM and scTIM. In contrast, residue 58 is highly conserved throughout the TIM family and is also mutated from scTIM to dTIM. Finally, the last two mutations are symmetrical in position, at an offset of 22 from either end of the sequence; these residues are also highly conserved in the TIM family, but not between scTIM and dTIM, which indicates they are not responsible for the loss of functionality.
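For readers unfamiliar with the idea, a weighted edit distance can be sketched as a standard dynamic-programming edit distance in which the substitution cost depends on the pair of residues. The weights used by our similarity metric are not reproduced here: the cost function `toy_cost`, the hydrophobic grouping, and the insertion/deletion cost below are illustrative assumptions only.

```python
from typing import Callable

# Hypothetical residue grouping used only to make substitution costs differ.
HYDROPHOBIC = set("AVLIMFWC")


def toy_cost(x: str, y: str) -> float:
    """Illustrative substitution cost: 0 for identical residues, 0.5 within
    the same (toy) class, 1.0 across classes."""
    if x == y:
        return 0.0
    same_class = (x in HYDROPHOBIC) == (y in HYDROPHOBIC)
    return 0.5 if same_class else 1.0


def weighted_edit_distance(a: str, b: str,
                           sub_cost: Callable[[str, str], float],
                           indel_cost: float = 1.0) -> float:
    """Edit distance with a pair-dependent substitution cost, computed
    row by row so only two rows of the table are kept in memory."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,            # delete ca
                curr[j - 1] + indel_cost,        # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Toy family (hypothetical), sorted by distance to a toy reference.
reference = "MAKL"
family = {"x": "MVKL", "y": "MDKL", "z": "MAKL"}
order = sorted(family, key=lambda n: weighted_edit_distance(reference, family[n], toy_cost))
print(order)
```

Sorting the rows of an alignment by such a distance places the sequences closest to the reference next to it, which is what makes columns that deviate only in the reference stand out visually.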
Further examining the 3D structure of the four remaining residues (excluding the two symmetric ones) from the candidate list above, we notice that they lie in or just outside of the active site. Again, this is where most chemical reactions occur, since the active site is the binding site of molecules. This observation brings us full circle to the earlier structural biologist feedback: the structural biology experts initially suspected that the most damaging mutations would lie in the active site, and that reverting these mutations could restore functionality.
In terms of tool features, the trend image and its sorting capabilities--based on our proposed metrics of similarity--were greatly appreciated by the IEEE BioVis domain experts evaluating the tool. A closer examination through the use of the 3D Model Viewer provided evidence that the residues identified above were part of the active site of the protein. This finding matched the initial hypothesis of where the most critical mutations existed, and demonstrates the benefits of combining spatial and non-spatial information in a single tool.
Without the aid of our tool, each sequence would have to be collected and aligned to determine the areas of residue variance. Once these were found, individual models of dTIM and scTIM would have to be examined to locate where the mutations occur on the 3D structure. By using our tool, the experts were able to quickly identify the residue mutations and link them to the 3D model with a single action. Each expert biologist who tested our tool noted the ease of interaction between the different data representations, such as the sequence alignment and the 3D model. These features, coupled via the linked data views, proved very efficient for the task of identifying protein mutations. With the insights gained, the locations where the sequence needs to be repaired were quickly identified.