An important aspect of studying the relationship between protein sequence, structure and function is the molecular characterization of the effect of protein mutations. To understand the functional impact of amino acid changes, the multiple biological properties of protein residues have to be considered together.
Here, we present a novel visual approach for analyzing residue mutations. It combines different biological visualizations and integrates them with molecular data derived from external resources. To show various aspects of the biological information on different scales, our approach includes one-dimensional sequence views, three-dimensional protein structure views and two-dimensional views of residue interaction networks as well as aggregated views. The views are linked tightly and synchronized to reduce the cognitive load of the user when switching between them. In particular, the protein mutations are mapped onto the views together with further functional and structural information. We also assess the impact of individual amino acid changes by the detailed analysis and visualization of the involved residue interactions. We demonstrate the effectiveness of our approach and the developed software on the data provided for the BioVis 2013 data contest.
Our visual approach and software greatly facilitate the integrative and interactive analysis of protein mutations based on complementary visualizations. The different data views offered to the user are enriched with information about molecular properties of amino acid residues and further biological knowledge.
Understanding and predicting the effect of amino acid mutations on the structure and function of a protein is still a challenging problem despite recent advances [1, 2]. In the case of multiple sequence changes, it is even more difficult to distinguish the mutations with a significant effect from the ones without. Many approaches that tackle this problem have been presented in the last couple of years as reviewed in [3–8]. Computational methods such as the well-known SIFT tool use evolutionary conservation derived from a multiple sequence alignment to predict that mutations of highly conserved residues have a considerable impact on function. Other methods such as the well-established PolyPhen2 tool combine sequence features with structural and physico-chemical protein properties to assess the effect of a mutation. A notable disadvantage of most tools is that they do not give the user fine-grained control over the set of features used for the prediction, and the results are often difficult to interpret. In addition, those tools cannot easily cope with the speed at which new information on sequences, structures, and functions is made publicly available.
Thus, the BioVis contest selected this area of research for the 2013 data challenge. The organizers asked how protein function depends on the underlying protein sequence and whether it is possible to predict the effect of sequence changes. They also encouraged the use of visualization and data integration as the key to solving the problem. In particular, given the sequence of a functionally defective triosephosphate isomerase mutant (dTIM) and its parent, the yeast triosephosphate isomerase (scTIM), the task was to identify the mutations that abolish its function.
For our entry to the BioVis 2013 data contest challenge, we focused on improving the integrative visualization of a wide variety of available information on sequences, structures and functions. Our objective was to provide the biological data for a manual visual analysis and interactive exploration by the user in an integrated fashion by making it accessible through a small number of carefully designed, linked views. In this way, the user is able to generate hypotheses based on a specific view (e.g. of the protein structure) in the context of the other linked views and the provided data. As there are many biological aspects of protein sequence mutations that might affect protein structure and function, we developed visualizations that provide different levels of detail and enriched them by mapping additional data onto the graphical representations. We aimed at a generic solution that is suitable for a wide range of proteins and will support a comprehensive analysis of the impact of mutations for a large class of sequence changes. This was accomplished by a visual analytics approach integrating several software tools into a prototypic implementation freely available at the RINalyzer webpage.
As detailed below, we applied our approach to the data provided for the BioVis 2013 data contest. For this proof-of-concept study, we assessed the sequence changes between scTIM and dTIM by different visualizations of the protein structure together with further functional and structural information and by an exploratory analysis based on the complementary network views for both sequences.
General concept and views
To offer the available information to the user on different levels of abstraction and to support interactive synchronized exploration (Figure 1), we have carefully selected suitable visualizations as described in the following:
First, we use the standard representations of the three-dimensional (3D) structure and sequence of proteins as provided by UCSF Chimera [12, 13] because sequence changes and their impact on the structure might give valuable insight. UCSF Chimera offers a variety of tools that support the interactive crosstalk between sequences and structures, affording advanced exploration of multiple sequence alignments, comparison of structures and incorporation of user-specific data. In particular, the user can study the amino acid changes between two sequences and their locations on the corresponding protein structures. It is also possible to construct a structure-based sequence alignment from the superposition of two structures. This deep integration of sequences and structures is further complemented by a multitude of molecular graphics features.
Second, we apply the RINerator tool to create a two-dimensional (2D) residue interaction network (RIN) from the protein structure and visualize the RIN with the help of RINalyzer within the Cytoscape platform. Such a network representation is very useful to demonstrate the impact of mutations at the detailed residue interaction level by highlighting the changes of local interactions as well as long-range interaction paths, e.g. indirect interactions between residues.
Third, we offer less complex, aggregated overviews that focus on functional or structural subunits like secondary structure elements and illustrate the location and distribution of the mutations on the protein structure. In particular, we utilize the cartoon view as provided by the Pro-origami web service. The main advantage of this view is that it gives a clear depiction of the chain and the secondary structure elements, while it leaves out the exact spatial location and the interrelations between those elements, which are provided by the other more detailed views. As the visual mapping from a RIN to the corresponding cartoon might be difficult for the user, a network representation that shows the RIN together with aggregated secondary structure elements can be created as an intermediate visualization.
Fourth, we extract additional structural and functional information from external databases and map these data as visual cues onto the visualizations. Functional residue annotations such as protein domain localization as well as binding and catalytic sites are important for identifying mutations that could have a direct impact on the function of the protein because they are in or near such sites. Structural properties of residues such as hydrophobicity, solvent accessible surface area, and polarity are used to characterize their potential effect on protein structure and function. Last but not least, evolutionary conservation information is crucial for distinguishing between residue changes in conserved regions (which are less tolerant of sequence changes) and those in variable regions.
Finally, the linkage between the different views is maintained by several mechanisms. Regarding the interactive exploration, we propagate the selection of elements in one view to the others. We synchronize orientation and location between RINs and structures using a special layout algorithm that we developed for this purpose. In particular, we want to ensure a consistent use of information mapping and similar cues over all views. All of the above is accomplished by adapting and extending our plugins RINalyzer and structureViz to integrate the freely available software tools Cytoscape, UCSF Chimera, and Pro-origami into a prototypic implementation (Figure 2). Download links and further documentation can be found at the RINalyzer webpage.
RIN view and layout
The residue interaction networks (RINs) are generated by RINerator from a 3D protein structure as described previously and shown as standard network visualization within Cytoscape using RINalyzer [14, 18]. In this visualization, network nodes represent amino acid residues and edges depict non-covalent residue interactions. To transfer the spatial localization information of the mutations from the structure view to the network view, we replaced the previous force-directed layout algorithm with a more appropriate stress minimization variant (Figures 1 and 3).
The new layout method is distance-based, i.e., allows specifying distances between the residues. During the layout computation, it minimizes the weighted mean square error between the given distances for pairs of residues and the geometric distance in the layout with an emphasis on local accuracy. The layout is initialized using a projection of the 3D coordinates on a 2D plane based on the UCSF Chimera view perspective. To allow for a flexible representation of the residue network and, at the same time, to preserve the user's spatial orientation using the fixed projection coordinates, we compute the stress as a balanced combination of both and increase the priority for the latter over the course of the optimization. In order to emphasize the secondary structure, the distance error weights are larger for distances between residues within the same secondary structure element. Alternatively, the layout method can prioritize certain distances based on user-defined edge weights that represent additional structural or functional information.
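The objective described above can be sketched as a blended stress function. This is only an illustration under stated assumptions, not the layout code in RINalyzer: the function name `combined_stress`, the weighted sum (rather than a normalized mean), and the single blend weight `alpha` are all hypothetical choices made here.

```python
import math


def combined_stress(pos, target_dist, weights, anchors, alpha):
    """Blend pairwise distance stress with deviation from anchor positions.

    pos         -- current 2D layout, {node: (x, y)}
    target_dist -- desired distances, {(node_i, node_j): d_ij}
    weights     -- error weights per pair (assumed larger for residue pairs
                   in the same secondary structure element); default 1.0
    anchors     -- fixed 2D projection of the 3D coordinates, {node: (x, y)}
    alpha       -- hypothetical blend weight in [0, 1] for the anchor term
    """
    dist_term = 0.0
    for (i, j), d in target_dist.items():
        w = weights.get((i, j), 1.0)
        dist_term += w * (math.dist(pos[i], pos[j]) - d) ** 2
    # Penalize nodes that drift away from their projected coordinates.
    anchor_term = sum(math.dist(pos[n], anchors[n]) ** 2 for n in pos)
    return (1.0 - alpha) * dist_term + alpha * anchor_term


anchors = {"A": (0.0, 0.0), "B": (3.0, 4.0)}
target = {("A", "B"): 5.0}
moved = {"A": (0.0, 0.0), "B": (6.0, 8.0)}
print(combined_stress(anchors, target, {}, anchors, 0.5))
print(combined_stress(moved, target, {}, anchors, 0.5))
```

An optimizer would evaluate such a function repeatedly while changing `alpha` between iterations, which is one way to shift priority between the two terms over the course of the optimization.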
The aggregated views are intended to give the user a quick overview on the mutation locations with respect to specific known structural or functional regions. While it would be possible to map additional information directly onto the network representation, the RIN might become quite complex for the user. Thus, we utilize views that aggregate regions based on secondary structures, protein domain information, or functional annotations. These views serve as an intermediate visualization when switching between the 3D structure view and the 2D RIN view.
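As a rough illustration of such an aggregation, the sketch below collapses residue nodes into groups (for example secondary structure elements) and counts the interactions between groups. The function name `aggregate_network`, the group labels, and the toy edges are hypothetical and do not reflect the actual implementation.

```python
from collections import Counter


def aggregate_network(edges, group_of):
    """Collapse residue-level edges into counts of edges between groups."""
    aggregated = Counter()
    for a, b in edges:
        ga, gb = group_of[a], group_of[b]
        if ga != gb:  # interactions inside one group disappear on collapse
            aggregated[tuple(sorted((ga, gb)))] += 1
    return dict(aggregated)


# Toy input: five residues assigned to a helix, a strand, and a loop.
group_of = {1: "H1", 2: "H1", 3: "E1", 4: "E1", 5: "L1"}
edges = [(1, 2), (2, 3), (1, 4), (4, 5)]
print(aggregate_network(edges, group_of))
```

The same function works for any grouping, such as protein domains or functional annotations, because it only depends on the mapping from residue to group label.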
The simple cartoon view provided by the Pro-origami web service reduces the complex 3D protein structure to the essential secondary and super-secondary structure information and presents it with an easily readable layout (Figure 1). Pro-origami provides SVG images, which are enriched with further information in the form of highlighted regions of interest such as the localization of mutated residues. As Pro-origami can decompose proteins into domains, we can also obtain a combined representation of secondary structure and protein domains within the cartoon view.
The representation of protein structures as RINs enables network comparison and alignment to explore the differences between parent and mutant structures further. Besides the comparison of two networks or structures side-by-side, we provide a comparison network view based on the alignment of the underlying sequences (Figure 4). In this view, each node represents a pair of aligned residues and two nodes are connected if the corresponding residues have a non-covalent interaction in either of the two compared RINs. We use visual cues to highlight interactions that were gained or lost upon amino acid change, and we score the fraction of such interactions for each residue to quantify the mutational effect on protein structure and function.
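A minimal sketch of this comparison is given below, assuming that both RINs have already been re-indexed by alignment column so that a node identifier denotes one pair of aligned residues. The function name `compare_rins` and the exact scoring formula (changed interactions divided by all interactions of a node in the union network) are assumptions for illustration.

```python
def compare_rins(parent_edges, mutant_edges):
    """Label each interaction as kept, lost, or gained and score each node."""
    parent = {frozenset(e) for e in parent_edges}
    mutant = {frozenset(e) for e in mutant_edges}
    status = {}
    for edge in parent | mutant:
        if edge in parent and edge in mutant:
            status[edge] = "kept"
        elif edge in parent:
            status[edge] = "lost"
        else:
            status[edge] = "gained"
    totals, changed = {}, {}
    for edge, label in status.items():
        for node in edge:
            totals[node] = totals.get(node, 0) + 1
            if label != "kept":
                changed[node] = changed.get(node, 0) + 1
    # Fraction of a node's interactions that differ between the two RINs.
    scores = {n: changed.get(n, 0) / totals[n] for n in totals}
    return status, scores


status, scores = compare_rins([(1, 2), (2, 3)], [(1, 2), (3, 4)])
print(status[frozenset((2, 3))], status[frozenset((3, 4))])
print(scores)
```

Using `frozenset` for edges makes the comparison independent of the order in which the two endpoints are listed.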
Furthermore, to distinguish more likely from less likely mutations, we integrated the amino acid substitution scores from the BLOSUM62 matrix in RINalyzer and assigned a score to each mutated residue in the comparison network. These scores can be used to highlight sequence changes with a potentially stronger impact on the protein.
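The scoring step can be illustrated with a small lookup. The sketch below hard-codes only four BLOSUM62 entries to stay self-contained; a real analysis would load the full matrix from a library or file. The names `BLOSUM62_EXCERPT`, `substitution_score`, and `rank_substitutions` are hypothetical, and the example substitutions are toy input rather than mutations from the contest data.

```python
# Four entries of BLOSUM62; the full matrix covers all amino acid pairs.
BLOSUM62_EXCERPT = {
    ("I", "L"): 2,
    ("D", "E"): 2,
    ("K", "R"): 2,
    ("D", "L"): -4,
}


def substitution_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a symmetric substitution score for residues a and b."""
    key = tuple(sorted((a, b)))
    if key not in matrix:
        raise KeyError(f"pair {a}/{b} is not in this excerpt")
    return matrix[key]


def rank_substitutions(substitutions):
    """Order substitutions from least to most likely (lowest score first)."""
    return sorted(substitutions, key=lambda s: substitution_score(s[0], s[1]))


print(rank_substitutions([("L", "I"), ("L", "D"), ("E", "D")]))
```

Low or negative scores mark substitutions that are rarely observed in related proteins, so they are natural candidates for highlighting.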
An important component of our visual analytics approach is the mapping of available knowledge onto the visualized sequences and structures. The availability of this information in an easily accessible way while the user works with the different views should facilitate the biological knowledge discovery considerably. This is accomplished by importing the relevant data as node attributes in Cytoscape, which automatically associates them with the RIN and the protein structure. An additional benefit of this integration is that it enables the use of the built-in Cytoscape functionality to create filters based on the imported data and to highlight the residue nodes with attribute values within a given range, e.g. with high or low conservation scores (see Figure 5).
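The range-based highlighting can be pictured with plain dictionaries standing in for node attributes. This sketch does not use the Cytoscape API; the function name `select_by_range`, the attribute names, and all values are made up for illustration, with conservation on the 1 to 9 grade scale used by ConSurf.

```python
def select_by_range(node_attrs, attribute, low, high):
    """Return nodes whose attribute value lies within [low, high]."""
    return sorted(
        node
        for node, attrs in node_attrs.items()
        if attribute in attrs and low <= attrs[attribute] <= high
    )


# Hypothetical attribute table: node name -> imported values.
node_attrs = {
    "res1": {"conservation": 9, "mutated": True},
    "res2": {"conservation": 3, "mutated": True},
    "res3": {"conservation": 8, "mutated": False},
    "res4": {"mutated": True},  # no conservation grade available
}

conserved = select_by_range(node_attrs, "conservation", 8, 9)
conserved_and_mutated = [n for n in conserved if node_attrs[n]["mutated"]]
print(conserved, conserved_and_mutated)
```

Combining two such selections, as in the last step, mirrors the kind of question asked in the analysis: which mutated residues are also highly conserved.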
Therefore, in addition to the data given in the contest, we generated or retrieved data from multiple external sources to enrich our visualizations. The following information is regarded as potentially useful for protein analysis:
Family conservation. ConSurf-DB provides pre-computed profiles of evolutionary sequence conservation.
Residue interactions. The RINerator package creates a network of non-covalent residue interactions such as contacts and hydrogen bonds for any 3D protein structure.
Functional sites. Active and binding site information is retrieved manually from UniProtKB.
Domain annotation. Protein domain information is obtained from the SCOP online resource.
Structural properties. Data for the solvent accessible surface area, secondary structure, hydrophobicity, and other structural properties is retrieved automatically from UCSF Chimera.
The data used to enrich our visualizations is mapped as visual cues like color, shape, or line stroke in the network view and transferred to the other views where possible. Furthermore, the differences caused by the mutations can be highlighted by such cues in all visualizations.
We decided to control most visual properties via user-adjustable options with reasonable defaults. For example, different node shapes are used to distinguish the mutated residues in both the parent and the defective protein (Figure 3). Additionally, several visual styles are offered that map different functional and structural information on the views so that the user sees the distribution of corresponding values for the whole protein. Dark colors usually correspond to significant values such as strong hydrophobicity, large solvent accessible surface area or high number of changed residue interactions (Figure 4). For evolutionary conservation, the pink-to-turquoise coloring as applied by ConSurf-DB is used (Figure 5).
The visual cues are particularly useful for illustrating the changes in residue interactions due to the mutations in the comparison network view generated from the alignment of the respective sequences in UCSF Chimera. Residue interactions that are either lost or gained upon mutation are highlighted by differently colored and shaped lines (Figure 4). Residues that cannot be aligned are depicted by nodes with different node borders.
Linkage and coordination of views
To ease the user's cognitive load when switching between different views and tools, we link them in multiple important ways. For an interactive exploration, we implemented a global selection concept, that is, the selection of elements in one view leads to the immediate selection of their corresponding representatives in all other views. Our linkage concept also ensures the consistent use of information mapping and similar cues over all views, particularly, regarding the usage of colors.
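The global selection concept resembles a simple publish-subscribe pattern. The sketch below is a toy model under that assumption: the class `SelectionHub`, the view names, and the residue identifiers are hypothetical, and the actual communication between Cytoscape and UCSF Chimera is not shown.

```python
class SelectionHub:
    """Forward a selection made in one view to all other registered views."""

    def __init__(self):
        self._views = {}

    def register(self, name, on_select):
        """Register a view by name with a callback that applies a selection."""
        self._views[name] = on_select

    def select(self, source, residue_ids):
        """Propagate the selection to every view except the one it came from."""
        selection = frozenset(residue_ids)
        for name, on_select in self._views.items():
            if name != source:
                on_select(selection)


hub = SelectionHub()
received = {}
hub.register("structure", lambda sel: received.update(structure=sel))
hub.register("network", lambda sel: received.update(network=sel))
hub.register("cartoon", lambda sel: received.update(cartoon=sel))
hub.select("network", ["res12", "res95"])
print(sorted(received))
```

Skipping the source view avoids echoing a selection back to the view in which the user made it.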
Further coordination is achieved due to the synchronized orientation and location of the graphical representations in the different views. For instance, the user can freely explore the 3D structure within the UCSF Chimera window, e.g. by rotating the protein structure. The network view can then be adjusted according to the new orientation of the rotated structure by applying the 3D-structure based RIN layout described above.
In order to implement the full linkage between Cytoscape and UCSF Chimera, we made use of their new software versions. We also ported the plugins RINalyzer and structureViz to work with Cytoscape 3, which also allowed us to link them closely. For example, while the direct communication between Cytoscape and UCSF Chimera is handled by structureViz, the structure-based layout algorithm is implemented in RINalyzer and invokes structureViz to retrieve the current spatial coordinates.
Results and discussion
Visual analytics approach
Our visual analytics approach assists the user's reasoning about the biological impact of mutations by interactive visualizations of sequence and structure information enriched with additional biological knowledge such as evolutionary sequence conservation and functional annotations. To show the different aspects of the data, we combine the well-known 3D structure view and the one-dimensional sequence view with the 2D RIN view. In addition, we create simplified network representations to enable the user to focus on certain biological aspects, e.g. protein domains, secondary structure elements, and functional annotations.
Besides the sequence that is given as input, a variety of information is available that can be used to interpret the functional effects of sequence changes. This includes sequence conservation, which might point to highly conserved regions responsible for some function, protein domain information, functional annotations (e.g. on molecular binding), structural properties such as hydrophobicity and solvent accessible surface area, and already known mutations and their impact. We incorporate a number of sources for such information in our approach as described above and map the data mainly as visual cues on top of the graphical representations of the protein structure and the RINs. In addition, we make use of the network representation provided by RINalyzer as well as the Cytoscape analysis capabilities to facilitate data exploration by filtering and combining the available information on individual residues.
Furthermore, to present sequence changes on the structure and residue interaction level simultaneously, we provide both a single cumulative view and two separate views of the parent and the defective mutant side-by-side. While a single view facilitates the identification of changed sites, the dual view solution allows the user to identify the structural impact of the changes, for example, lost residue interactions might alter the protein structure.
A general analysis workflow is presented in Figure 6. Normally, the user starts with one or more experimentally determined protein structures and retrieves or generates RINs for them. In case only sequences are available, external tools for predicting the 3D structure could be used instead. External data such as evolutionary conservation and functional annotations need to be prepared in a format compatible with Cytoscape and the RIN specifications. Then the data is loaded by the user into Cytoscape and UCSF Chimera. Further views such as the secondary structure cartoon, the aggregated secondary structure network or the comparison network can be created from within Cytoscape. The sequences of the structures can be displayed and manipulated from within UCSF Chimera. Functional annotations and evolutionary conservation have to be imported manually into Cytoscape as node attributes of the RINs, while structural properties can be retrieved automatically from the protein structures currently opened in UCSF Chimera. These data can then be applied to create the visual cues and semi-automatically propagate them to the different views. Finally, by browsing and filtering the data in Cytoscape and UCSF Chimera, the user can identify relevant amino acids, in particular, mutated residues with a potentially strong effect on the protein function. Even if the visual analysis does not immediately reveal the functional consequences of mutations, our software will provide the user at least with very useful biological indications for the molecular analysis and further experiments.
Contest use case
In the following, the effectiveness of our integrative visual analytics approach is illustrated with the help of a typical use case based on the data provided for the BioVis 2013 data contest. For the specific case in which a functionally defective dTIM sequence is given together with its yeast scTIM parent sequence and structure, we perform a comprehensive assessment of the structural and functional impact of the sequence mutations and highlight the differences between the sequences in complementary views.
For scTIM, we retrieved the 3D structure from the RCSB Protein Data Bank [PDB:2YPI] and downloaded the precomputed RIN from the RINdata web service. Since there is no experimentally resolved protein structure of dTIM, we used the SCWRL Server at BIC-JCSG with default settings and the parent structure as template to generate a three-dimensional model. A RIN for the defective mutant was created from the modeled structure by our RINerator package.
External data such as functional annotations, conservation information and structural properties was parsed and imported as attributes in Cytoscape to allow for mapping the data as visual cues on the network and structure views. The UCSF Chimera sequence tool was used to view, align and explore the parent and defective TIM sequences. Based on the sequence alignment, the nodes representing mutated residues were depicted as diamonds instead of circles (Figure 3). Especially mutations of residues buried in the structure or close to the functional sites might have a relatively strong impact on protein stability and function. Different node coloring schemes were prepared to map the different types of structural and functional information. This allowed us to identify relevant mutations with functional effects.
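How mutated positions follow from a pairwise alignment can be sketched in a few lines. This is a generic illustration, not the UCSF Chimera sequence tool: the function name `classify_alignment` is hypothetical and the two aligned strings are toy sequences, not scTIM or dTIM.

```python
def classify_alignment(parent_aln, mutant_aln, gap="-"):
    """Split alignment columns into substitutions, insertions, and deletions.

    Positions are 1-based and count residues only, not gap characters.
    """
    if len(parent_aln) != len(mutant_aln):
        raise ValueError("aligned sequences must have equal length")
    substitutions, insertions, deletions = [], [], []
    p_pos = m_pos = 0
    for p, m in zip(parent_aln, mutant_aln):
        if p != gap:
            p_pos += 1
        if m != gap:
            m_pos += 1
        if p == gap and m == gap:
            continue
        if p == gap:
            insertions.append((m_pos, m))  # residue present only in mutant
        elif m == gap:
            deletions.append((p_pos, p))  # residue present only in parent
        elif p != m:
            substitutions.append((p_pos, p, m))
    return substitutions, insertions, deletions


subs, ins, dels = classify_alignment("MKV-LA", "MRVEL-")
print(subs, ins, dels)
```

The substituted parent positions are the nodes that would receive a distinct shape in the network view, while insertions and deletions correspond to residues without an aligned partner.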
In the default secondary structure-colored view, we observed that most mutations are located on the surface of the protein, i.e., in helices (51 out of 100) and loops (45 out of 100), rather than in the interior consisting of strands (only 4) (Figure 3). The conservation-colored view indicated that residues in the protein exterior tend to be more variable in contrast to the ones in the interior where the active site of the enzyme is located (Figure 5). Thus, we could conclude from the visualizations that most mutations are located in more variable regions on the surface of the protein. The few mutated residues with strong conservation (F11, L13, Q82, I83, I109, K134, K135, L174, A175, D180, A212, N213, V226) therefore stand out as candidates that might be responsible for the functional deficit of the mutant (Figure 5).
Since scTIM functions as a dimer, another important aspect is the binding interface between the two monomers. We used RINalyzer to extract the residue interactions of the interface and visualized them in a separate network view. As can be seen in Figure 7, 9 out of the 69 residues are mutated (L13, S16, T45, S71, N78, Q82, V86, H103, F108). These changes might impair the dimer formation and thus affect the function of scTIM. Residue L13 is particularly interesting as it is both conserved and in the dimer interface. A similar analysis can be performed with other functional sites. For instance, we found that none of the residues in the active or substrate binding site (N10, K12, H95, E165) are mutated. However, 24 residues possess direct non-covalent interactions with functionally important residues and thus could have a severe impact on their function if mutated. This is the case for the residues F11, L13, and C41, and this observation is further strengthened by the fact that the first two of them are conserved as described above.
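Finding mutated residues that directly contact functional sites is a neighbor lookup in the RIN. The sketch below shows the idea on a toy network; the function name `mutated_neighbors` and all node labels are hypothetical and do not correspond to actual scTIM interactions.

```python
def mutated_neighbors(edges, functional, mutated):
    """Return mutated residues that directly interact with a functional residue."""
    functional = set(functional)
    neighbors = set()
    for a, b in edges:
        if a in functional:
            neighbors.add(b)
        if b in functional:
            neighbors.add(a)
    neighbors -= functional  # functional residues are reported separately
    return sorted(neighbors & set(mutated))


# Toy network: "site1" and "site2" stand in for active site residues.
edges = [("site1", "r1"), ("site1", "r2"), ("site2", "r3"),
         ("r3", "r4"), ("site1", "site2")]
print(mutated_neighbors(edges, ["site1", "site2"], ["r2", "r3", "r4"]))
```

Residue `r4` is mutated but only reaches a functional site through `r3`, so it is excluded; extending the lookup to two steps would capture such indirect interactions.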
The comparison network view provided further information about the location and nature of the mutations (Figure 8). From the overall distribution of red and green edges that indicate changes of non-covalent interactions, it is apparent that many mutations lead to a large number of differences primarily on the protein surface. Additionally, the active site residues form different interactions with their neighbors in the parent compared with the mutant structure. Furthermore, there is an insertion (E156 in dTIM) and a deletion (A30 in scTIM) in the dTIM sequence in contrast to the parent sequence according to the sequence alignment in UCSF Chimera. However, they are not close to the active site or the dimer binding interface and thus the functional effect is difficult to judge. Finally, the residue nodes in Figure 8 are colored according to the fraction of interactions they gained or lost upon mutation. When combining this information with the conservation scores mapped to the node border colors, particularly interesting mutations can be found. Mutations with the largest change of local residue interactions are highlighted in Figure 8 (A30, S31, I32, E34, N35, L68, N78, K89, S100, V154 in scTIM and E156 in dTIM), and the mutated residue with a high conservation score (N78) is especially conspicuous.
By combining the different views and data in an interactive fashion, it became possible to pinpoint a number of residue mutations as candidates for having a pronounced effect on the enzymatic activity of dTIM. Further experimental validation will be needed to determine which mutated residues have to be replaced in the mutant by amino acids from the parent to rescue functionality. Other structural properties such as hydrophobicity, solvent accessible surface area or polarity can also be mapped onto the RIN view to characterize mutations with particular properties. Another strategy described in our previous work would be the application of network topology analysis of the RIN for the detection of important residues.
We presented a novel approach for the integrative visual analysis of protein sequence mutations. We extended several existing software tools and combined different visualizations in such a way that biological information can be exchanged between them and additional external data can be included. We also devised a new layout algorithm for the RINs provided by the RINalyzer app in Cytoscape. Additionally, we created a new aggregation network view, improved and enriched the existing comparison network view, incorporated an interface to the Pro-origami web service, and fully utilized the interface to the UCSF Chimera tool through the structureViz app.
In the future, to assess the usefulness and effectiveness of our approach and to improve the current implementation, we intend to collect more user feedback. This will result in a comprehensive evaluation of which visual cues are best suited for gaining insight into the impact of mutations, how they should be best mapped onto the sequence, structure, and network representations, and how they should be integrated into the visual layout. Another issue is the aggregation of network regions to reduce the visual complexity as only some of them might be of actual interest to assess the potential impact of mutations. In this way, patterns of mutations with specific functional consequences might become more apparent, in particular, when multiple proteins are analyzed.
We also plan to improve the software integration of the different tools such that our approach can be realized in an automated fashion. This includes better synchronization over linked views and automated retrieval of external data.
Hecht M, Bromberg Y, Rost B: News from the protein mutability landscape. Journal of Molecular Biology. 2013, 425 (21): 3937-3948.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Research. 2003, 13 (11): 2498-2504.
Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology. 1995, 247 (4): 536-540.
Rose PW, Bi C, Bluhm WF, Christie CH, Dimitropoulos D, Dutta S, Green RK, Goodsell DS, Prlić A, Quesada M, Quinn GB, Ramos AG, Westbrook JD, Young J, Zardecki C, Berman HM, Bourne PE: The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Research. 2013, 41 (D1): 475-482.
We gratefully acknowledge the dataset provided by Thomas Magliery and Brandon J. Sullivan at The Ohio State University for the purpose of the BioVis 2013 contest. NTD was partially funded by a Boehringer Ingelheim Fonds travel grant, and her research was also conducted in the context of the DFG-funded Cluster of Excellence for Multimodal Computing and Interaction. KK was financially supported by Australian Research Council Linkage grant H2814 A4421, Tom Sawyer Software and NewtonGreen Technologies, JHM by NIGMS P41-GM103311, MW by the Australian Research Council Discovery Project grant DP110101390, and MA by the projects GANI MED and BioTechMed-Graz.
Publication costs were covered by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco, the School of Information Technologies at The University of Sydney (Tom Sawyer ARC Grant), and the Max Planck Society.
This article has been published as part of BMC Proceedings Volume 8 Supplement 2, 2014: Proceedings of the 3rd Annual Symposium on Biological Data Visualization: Data Analysis and Redesign Contests. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S2
Authors and Affiliations
Max Planck Institute for Informatics, 66123, Saarbrücken, Germany
Nadezhda T Doncheva
University of California, San Francisco, 94143-2240, San Francisco, USA
Nadezhda T Doncheva & John H Morris
The University of Sydney, 2006, Sydney, Australia
Monash University, 3145, Melbourne, Australia
EURAC research, 39100, Bolzano, Italy
Francisco S Domingues
University Medicine Greifswald, 17475, Greifswald, Germany
Graz University of Technology, 8010, Graz, Austria
The authors declare that they have no competing interests.
NTD and KK drafted the paper. NTD, KK, MA, and FSD were involved in the design of the project. NTD, KK, JHM, and MW carried out the implementation. All authors edited, read and approved the manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.