Integrative visual analysis of protein sequence mutations
An important aspect of studying the relationship between protein sequence, structure and function is the molecular characterization of the effect of protein mutations. To understand the functional impact of amino acid changes, the multiple biological properties of protein residues have to be considered together.
Here, we present a novel visual approach for analyzing residue mutations. It combines different biological visualizations and integrates them with molecular data derived from external resources. To show various aspects of the biological information on different scales, our approach includes one-dimensional sequence views, three-dimensional protein structure views and two-dimensional views of residue interaction networks as well as aggregated views. The views are linked tightly and synchronized to reduce the cognitive load of the user when switching between them. In particular, the protein mutations are mapped onto the views together with further functional and structural information. We also assess the impact of individual amino acid changes by the detailed analysis and visualization of the involved residue interactions. We demonstrate the effectiveness of our approach and the developed software on the data provided for the BioVis 2013 data contest.
Our visual approach and software greatly facilitate the integrative and interactive analysis of protein mutations based on complementary visualizations. The different data views offered to the user are enriched with information about molecular properties of amino acid residues and further biological knowledge.
Keywords: Secondary Structure Element, Conservation Score, Solvent Accessible Surface Area, Triosephosphate Isomerase, Residue Interaction
Understanding and predicting the effect of amino acid mutations on the structure and function of a protein is still a challenging problem despite recent advances [1, 2]. In the case of multiple sequence changes, it is even more difficult to distinguish the mutations with a significant effect from the ones without. Many approaches that tackle this problem have been presented in the last couple of years as reviewed in [3, 4, 5, 6, 7, 8]. Computational methods such as the well-known SIFT tool use evolutionary conservation derived from a multiple sequence alignment to predict that mutations of highly conserved residues have a considerable impact on function. Other methods such as the well-established PolyPhen2 tool combine sequence features with structural and physico-chemical protein properties to assess the effect of a mutation. A notable disadvantage of most tools is that they do not provide the user with fine-grained control over the set of features used for the prediction, and the results are often difficult to interpret. In addition, those tools cannot easily cope with the speed at which new information on sequences, structures, and functions is made publicly available.
Thus, the BioVis contest selected this area of research for the 2013 data challenge. The organizers posed the question of how protein function depends on the underlying protein sequence and whether it is possible to predict the effect of sequence changes. They also encouraged the use of visualization and data integration as the key to solving the problem. In particular, given the sequence of a functionally defective triosephosphate isomerase mutant (dTIM) and its parent, the yeast triosephosphate isomerase (scTIM), the task was to identify the mutations that abolish its function.
For our entry to the BioVis 2013 data contest challenge, we focused on improving the integrative visualization of a wide variety of available information on sequences, structures and functions. Our objective was to provide the biological data for a manual visual analysis and interactive exploration by the user in an integrated fashion by making it accessible through a small number of carefully designed, linked views. In this way, the user is able to generate hypotheses based on a specific view (e.g. of the protein structure) in the context of the other linked views and the provided data. As there are many biological aspects of protein sequence mutations that might affect protein structure and function, we developed visualizations that provide different levels of detail and enriched them by mapping additional data onto the graphical representations. We aimed at a generic solution that is suitable for a wide range of proteins and will support a comprehensive analysis of the impact of mutations for a large class of sequence changes. This was accomplished by a visual analytics approach integrating several software tools into a prototype implementation freely available at the RINalyzer webpage [11].
As detailed below, we applied our approach to the data provided for the BioVis 2013 data contest. For this proof-of-concept study, we assessed the sequence changes between scTIM and dTIM by different visualizations of the protein structure together with further functional and structural information and by an exploratory analysis based on the complementary network views for both sequences.
General concept and views
First, we use the standard representations of the three-dimensional (3D) structure and sequence of proteins as provided by UCSF Chimera [12, 13] because sequence changes and their impact on the structure might give valuable insight. UCSF Chimera offers a variety of tools that support the interactive crosstalk between sequences and structures, affording advanced exploration of multiple sequence alignments, comparison of structures and incorporation of user-specific data. In particular, the user can study the amino acid changes between two sequences and their locations on the corresponding protein structures. It is also possible to construct a structure-based sequence alignment from the superposition of two structures. This deep integration of sequences and structures is further complemented by a multitude of molecular graphics features.
Second, we apply the RINerator tool to create a two-dimensional (2D) residue interaction network (RIN) from the protein structure and visualize the RIN with the help of RINalyzer within the Cytoscape platform. Such a network representation is very useful to demonstrate the impact of mutations at the detailed residue interaction level by highlighting the changes of local interactions as well as long-range interaction paths, e.g. indirect interactions between residues.
Third, we offer less complex, aggregated overviews that focus on functional or structural subunits like secondary structure elements and illustrate the location and distribution of the mutations on the protein structure. In particular, we utilize the cartoon view as provided by the Pro-origami web service. The main advantage of this view is that it gives a clear depiction of the chain and the secondary structure elements, while it leaves out the exact spatial location and the interrelations between those elements, which are provided by the other more detailed views. As the visual mapping from a RIN to the corresponding cartoon might be difficult for the user, a network representation that shows the RIN together with aggregated secondary structure elements can be created as an intermediate visualization.
Fourth, we extract additional structural and functional information from external databases and map these data as visual cues onto the visualizations. Functional residue annotations such as protein domain localization as well as binding and catalytic sites are important for identifying mutations that could have a direct impact on the function of the protein because they are in or near such sites. Structural properties of residues such as hydrophobicity, solvent accessible surface area, and polarity are used to characterize their potential effect on protein structure and function. Last but not least, evolutionary conservation information is crucial for distinguishing between residue changes in conserved (less tolerable of sequence changes) or variable regions.
RIN view and layout
The new layout method is distance-based, i.e., it allows the distances between residues to be specified. During the layout computation, it minimizes the weighted mean square error between the given distances for pairs of residues and the geometric distance in the layout with an emphasis on local accuracy. The layout is initialized using a projection of the 3D coordinates on a 2D plane based on the UCSF Chimera view perspective. To allow for a flexible representation of the residue network and, at the same time, to preserve the user's spatial orientation using the fixed projection coordinates, we compute the stress as a balanced combination of both and increase the priority for the latter over the course of the optimization. In order to emphasize the secondary structure, the distance error weights are larger for distances between residues within the same secondary structure element. Alternatively, the layout method can prioritize certain distances based on user-defined edge weights that represent additional structural or functional information.
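The following Python sketch illustrates the general idea of such a distance-based layout under simplifying assumptions; it is not the RINalyzer implementation. It uses plain gradient descent on a weighted squared distance error combined with a pull toward the projected start coordinates, whose share grows linearly during the optimization. The function names, the weight values (2.0 within a secondary structure element, 1.0 otherwise), the step size, the anchor schedule, and the toy input are all hypothetical.

```python
import math


def stress(pos, target, weights):
    """Weighted squared error between layout distances and target distances."""
    total = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            d = math.dist(pos[i], pos[j])
            total += weights[i][j] * (d - target[i][j]) ** 2
    return total


def sse_weights(sse_labels, same=2.0, other=1.0):
    """Larger error weight for residue pairs in the same secondary structure element."""
    n = len(sse_labels)
    return [[same if sse_labels[i] == sse_labels[j] else other for j in range(n)]
            for i in range(n)]


def layout(anchor, target, weights, n_iter=300, step=0.02, max_anchor=0.5):
    """Gradient descent on (1 - a) * stress + a * squared distance to the anchors.

    'anchor' holds the 2D projection used for initialization; the anchor
    share 'a' rises linearly from 0 to max_anchor (an assumed schedule).
    """
    pos = [list(p) for p in anchor]
    n = len(pos)
    for it in range(n_iter):
        a = max_anchor * it / max(n_iter - 1, 1)
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = max(math.dist(pos[i], pos[j]), 1e-9)  # guard against zero distance
                coef = 2.0 * weights[i][j] * (d - target[i][j]) / d
                for k in range(2):
                    grad[i][k] += (1.0 - a) * coef * (pos[i][k] - pos[j][k])
            for k in range(2):
                grad[i][k] += a * 2.0 * (pos[i][k] - anchor[i][k])
        for i in range(n):
            for k in range(2):
                pos[i][k] -= step * grad[i][k]
    return pos


# Hypothetical toy input: three residues whose projection is a flattened triangle,
# while all target distances equal 1.0.
anchor = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.2)]
target = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
weights = sse_weights(["H1", "H1", "L1"])
result = layout(anchor, target, weights)
```

In this toy case the third residue moves away from the other two, so the layout distances approach the targets while the positions stay close to the initial projection.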
The aggregated views are intended to give the user a quick overview of the mutation locations with respect to specific known structural or functional regions. While it would be possible to map additional information directly onto the network representation, the RIN might become quite complex for the user. Thus, we utilize views that aggregate regions based on secondary structures, protein domain information, or functional annotations. These views serve as an intermediate visualization when switching between the 3D structure view and the 2D RIN view.
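As a rough illustration of such an aggregation, the sketch below collapses residue nodes into one node per secondary structure element and counts the interactions between elements and the mutations within each element. The function name, the residue identifiers, and the element labels are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def aggregate_by_sse(residue_to_sse, edges, mutated):
    """Collapse residues into SSE nodes; count inter-SSE edges and mutations per SSE."""
    group_edges = defaultdict(int)
    for a, b in edges:
        ga, gb = residue_to_sse[a], residue_to_sse[b]
        if ga != gb:  # interactions inside one element vanish in the aggregated view
            group_edges[tuple(sorted((ga, gb)))] += 1
    mutation_counts = defaultdict(int)
    for residue in mutated:
        mutation_counts[residue_to_sse[residue]] += 1
    return dict(group_edges), dict(mutation_counts)


# Hypothetical toy network: helix H1, strand S1, loop L1.
residue_to_sse = {"r1": "H1", "r2": "H1", "r3": "S1", "r4": "S1", "r5": "L1"}
edges = [("r1", "r2"), ("r1", "r3"), ("r2", "r4"), ("r4", "r5")]
mutated = ["r2", "r5"]
group_edges, mutation_counts = aggregate_by_sse(residue_to_sse, edges, mutated)
```

The edge counts could serve as edge widths and the mutation counts as node colors in an aggregated network; the same grouping works for domains or functional annotations by swapping the mapping.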
The simple cartoon view provided by the Pro-origami web service reduces the complex 3D protein structure to the essential secondary and super-secondary structure information and presents it with an easily readable layout (Figure 1). Pro-origami provides SVG images, which are enriched with further information in the form of highlighted regions of interest such as the localization of mutated residues. As Pro-origami can decompose proteins into domains, we can also obtain a combined representation of secondary structure and protein domains within the cartoon view.
Furthermore, to distinguish more or less likely mutations, we integrated the amino acid substitution scores from the BLOSUM62 matrix in RINalyzer and assigned a score to each mutated residue in the comparison network. Each score can be used to highlight sequence changes with a stronger impact on the protein.
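A minimal sketch of this kind of scoring is given below. It hard-codes only a small excerpt of the BLOSUM62 matrix to stay self-contained; a real analysis would load the full matrix. The mutation tuples are hypothetical and do not correspond to the contest data, and the function names are illustrative.

```python
# Excerpt of BLOSUM62 (symmetric); keys are alphabetically sorted residue pairs.
BLOSUM62_EXCERPT = {
    ("D", "E"): 2,
    ("D", "L"): -4,
    ("F", "Y"): 3,
    ("I", "L"): 2,
    ("I", "V"): 3,
    ("K", "R"): 2,
}


def substitution_score(parent_aa, mutant_aa):
    """Look up the substitution score; raises KeyError for pairs outside the excerpt."""
    return BLOSUM62_EXCERPT[tuple(sorted((parent_aa, mutant_aa)))]


def rank_mutations(mutations):
    """Return (label, score) pairs, least likely substitution first."""
    scored = [(f"{p}{pos}{m}", substitution_score(p, m)) for p, pos, m in mutations]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical mutations as (parent residue, position, mutant residue).
mutations = [("I", 1, "V"), ("L", 2, "D"), ("K", 3, "R")]
ranked = rank_mutations(mutations)
```

Low or negative scores mark substitutions that are rarely observed in aligned protein blocks, so they are natural candidates for stronger highlighting.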
Therefore, in addition to the data given in the contest, we generated or retrieved data from multiple external sources to enrich our visualizations. The following information is regarded as potentially useful for protein analysis:
Family conservation. ConSurf-DB provides pre-computed profiles of evolutionary sequence conservation.
Residue interactions. The RINerator package creates a network of noncovalent residue interactions such as contacts and hydrogen bonds for any 3D protein structure.
Functional sites. Active and binding site information is retrieved manually from UniProtKB.
Domain annotation. Protein domain information is obtained from the SCOP online resource.
Structural properties. Data for the solvent accessible surface area, secondary structure, hydrophobicity, and other structural properties is retrieved automatically from UCSF Chimera.
The data used to enrich our visualizations is mapped as visual cues like color, shape, or line stroke in the network view and transferred to the other views where possible. Furthermore, the differences caused by the mutations can be highlighted by such cues in all visualizations.
We decided to control most visual properties via user-adjustable options with reasonable defaults. For example, different node shapes are used to distinguish the mutated residues in both the parent and the defective protein (Figure 3). Additionally, several visual styles are offered that map different functional and structural information on the views so that the user sees the distribution of corresponding values for the whole protein. Dark colors usually correspond to significant values such as strong hydrophobicity, large solvent accessible surface area or high number of changed residue interactions (Figure 4). For evolutionary conservation, the pink-to-turquoise coloring as applied by ConSurf-DB is used (Figure 5).
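The mapping from a numeric residue property to a color can be sketched as a simple linear interpolation in which larger values yield darker colors. The function name, the default color endpoints, and the value range below are assumptions for illustration and are not taken from the described software.

```python
def value_to_hex(value, vmin, vmax, light=(255, 255, 255), dark=(0, 0, 128)):
    """Interpolate linearly from a light color (vmin) to a dark color (vmax)."""
    if vmax == vmin:
        t = 0.0
    else:
        t = min(max((value - vmin) / (vmax - vmin), 0.0), 1.0)  # clamp to [0, 1]
    rgb = [round(lo + t * (hi - lo)) for lo, hi in zip(light, dark)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Hypothetical example: number of changed interactions per residue, range 0 to 10.
no_change = value_to_hex(0, 0, 10)
many_changes = value_to_hex(10, 0, 10)
```

Values outside the given range are clamped, so outliers do not distort the color scale.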
The visual cues are particularly useful for illustrating the changes in residue interactions due to the mutations in the comparison network view generated from the alignment of the respective sequences in UCSF Chimera. Residue interactions that are either lost or gained upon mutation are highlighted by differently colored and shaped lines (Figure 4). Residues that cannot be aligned are depicted by nodes with different node borders.
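Conceptually, lost and gained interactions are set differences between the edge sets of the two networks once residues are identified by their aligned positions. The sketch below assumes undirected edges given as pairs of aligned position numbers; the function names and the toy edge lists are hypothetical.

```python
from collections import Counter


def compare_interactions(parent_edges, mutant_edges):
    """Classify undirected interactions as lost, gained, or kept upon mutation."""
    parent = {frozenset(edge) for edge in parent_edges}
    mutant = {frozenset(edge) for edge in mutant_edges}
    return {"lost": parent - mutant, "gained": mutant - parent, "kept": parent & mutant}


def changed_per_residue(comparison):
    """Count lost plus gained interactions for every residue involved."""
    counts = Counter()
    for edge in comparison["lost"] | comparison["gained"]:
        counts.update(edge)
    return counts


# Hypothetical toy networks on aligned positions 1 to 5.
parent_edges = [(1, 2), (2, 3), (3, 4)]
mutant_edges = [(2, 1), (3, 4), (4, 5)]
comparison = compare_interactions(parent_edges, mutant_edges)
changes = changed_per_residue(comparison)
```

The per-residue counts could then drive a node coloring by number of changed interactions.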
Linkage and coordination of views
To ease the user's cognitive load when switching between different views and tools, we link them in multiple important ways. For an interactive exploration, we implemented a global selection concept, that is, the selection of elements in one view leads to the immediate selection of their corresponding representatives in all other views. Our linkage concept also ensures the consistent use of information mapping and similar cues over all views, particularly, regarding the usage of colors.
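A global selection concept of this kind can be sketched with a small publish-subscribe hub: each view reports user selections to the hub, which forwards them to all other registered views. The class and method names are hypothetical and do not reflect the Cytoscape or UCSF Chimera APIs.

```python
class SelectionHub:
    """Forwards a selection made in one view to all other registered views."""

    def __init__(self):
        self._views = []

    def register(self, view):
        self._views.append(view)
        view.hub = self

    def broadcast(self, residues, source):
        for view in self._views:
            if view is not source:  # the source view already shows the selection
                view.apply_selection(residues)


class View:
    """Stand-in for a sequence, structure, or network view."""

    def __init__(self, name):
        self.name = name
        self.hub = None
        self.selected = set()

    def user_select(self, residues):
        self.selected = set(residues)
        if self.hub is not None:
            self.hub.broadcast(self.selected, source=self)

    def apply_selection(self, residues):
        self.selected = set(residues)  # copy so views do not share one set


hub = SelectionHub()
sequence_view, structure_view, network_view = View("sequence"), View("structure"), View("network")
for v in (sequence_view, structure_view, network_view):
    hub.register(v)
structure_view.user_select({"res11", "res13"})
```

Excluding the source view from the broadcast avoids feedback loops when each view also emits events on programmatic selection changes.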
Further coordination is achieved due to the synchronized orientation and location of the graphical representations in the different views. For instance, the user can freely explore the 3D structure within the UCSF Chimera window, e.g. by rotating the protein structure. The network view can then be adjusted according to the new orientation of the rotated structure by applying the 3D-structure based RIN layout described above.
In order to implement the full linkage between Cytoscape and UCSF Chimera, we made use of their new software versions. We also ported the plugins RINalyzer and structureViz to work with Cytoscape 3, which also allowed us to link them closely. For example, while the direct communication between Cytoscape and UCSF Chimera is handled by structureViz, the structure-based layout algorithm is implemented in RINalyzer and invokes structureViz to retrieve the current spatial coordinates.
Results and discussion
Visual analytics approach
Our visual analytics approach assists the user's reasoning about the biological impact of mutations by interactive visualizations of sequence and structure information enriched with additional biological knowledge such as evolutionary sequence conservation and functional annotations. To show the different aspects of the data, we combine the well-known 3D structure view and the one-dimensional sequence view with the 2D RIN view. In addition, we create simplified network representations to enable the user to focus on certain biological aspects, e.g. protein domains, secondary structure elements, and functional annotations.
Besides the sequence that is given as input, a variety of information is available that can be used to interpret the functional effects of sequence changes. This includes sequence conservation, which might point to highly conserved regions responsible for some function, protein domain information, functional annotations (e.g. on molecular binding), structural properties such as hydrophobicity and solvent accessible surface area, and already known mutations and their impact. We incorporate a number of sources for such information in our approach as described above and map the data mainly as visual cues on top of the graphical representations of the protein structure and the RINs. In addition, we make use of the network representation provided by RINalyzer as well as the Cytoscape analysis capabilities to facilitate data exploration by filtering and combining the available information on individual residues.
Furthermore, to present sequence changes on the structure and residue interaction level simultaneously, we provide both a single cumulative view and two separate views of the parent and the defective mutant side-by-side. While a single view facilitates the identification of changed sites, the dual view solution allows the user to identify the structural impact of the changes, for example, lost residue interactions might alter the protein structure.
Contest use case
In the following, the effectiveness of our integrative visual analytics approach is illustrated with the help of a typical use case based on the data provided for the BioVis 2013 data contest. For the specific case in which a functionally defective dTIM sequence is given together with its yeast scTIM parent sequence and structure, we perform a comprehensive assessment of the structural and functional impact of the sequence mutations and highlight the differences between the sequences in complementary views.
For scTIM, we retrieved the 3D structure from the RCSB Protein Data Bank [23] [PDB:2YPI] and downloaded the precomputed RIN from the RINdata web service. Since there is no experimentally resolved protein structure of dTIM, we used the SCWRL Server at BIC-JCSG with default settings and the parent structure as template to generate a three-dimensional model. A RIN for the defective mutant was created from the modeled structure by our RINerator package.
External data such as functional annotations, conservation information and structural properties were parsed and imported as attributes in Cytoscape to allow for mapping the data as visual cues on the network and structure views. The UCSF Chimera sequence tool was used to view, align and explore the parent and defective TIM sequences. Based on the sequence alignment, the nodes representing mutated residues were depicted as diamonds instead of circles (Figure 3). In particular, mutations of residues buried in the structure or close to the functional sites might have a relatively strong impact on protein stability and function. Different node coloring schemes were prepared to map the different types of structural and functional information. This allowed us to identify relevant mutations with functional effects.
In the default secondary structure-colored view, we observed that most mutations are located on the surface of the protein, i.e., in helices (51 out of 100) and loops (45 out of 100), rather than in the interior consisting of strands (only 4) (Figure 3). The conservation-colored view indicated that residues in the protein exterior tend to be more variable in contrast to the ones in the interior where the active site of the enzyme is located (Figure 5). Thus, we could conclude from the visualizations that most mutations are located in more variable regions on the surface of the protein. Consequently, the mutated residues that are nevertheless strongly conserved (F11, L13, Q82, I83, I109, K134, K135, L174, A175, D180, A212, N213, V226) might be responsible for the functional deficit of the mutant (Figure 5).
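The filtering step behind such a candidate list can be sketched as follows: mutated positions are read from a pairwise alignment and kept only if their conservation grade reaches a threshold. The sketch assumes ConSurf-style grades from 1 (variable) to 9 (conserved), alignment columns that coincide with residue numbers, and a threshold of 8; the sequences, grades, and function names are hypothetical toy values, not the TIM data.

```python
def mutated_positions(parent, mutant):
    """Return (position, parent residue, mutant residue) for aligned, differing columns."""
    if len(parent) != len(mutant):
        raise ValueError("aligned sequences must have equal length")
    return [
        (i + 1, p, m)
        for i, (p, m) in enumerate(zip(parent, mutant))
        if p != m and p != "-" and m != "-"  # skip gap columns
    ]


def conserved_mutations(mutations, grades, min_grade=8):
    """Keep mutations at positions whose conservation grade is at least min_grade."""
    return [f"{p}{pos}" for pos, p, _ in mutations if grades.get(pos, 0) >= min_grade]


# Hypothetical toy alignment and conservation grades.
parent_seq = "MKTAYIA"
mutant_seq = "MRTAYLA"
grades = {2: 9, 6: 3}
mutations = mutated_positions(parent_seq, mutant_seq)
candidates = conserved_mutations(mutations, grades)
```

The labels follow the convention used in the text (parent residue followed by position), so the output can be compared directly with a list such as the one above.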
By combining the different views and data in an interactive fashion, it became possible to pinpoint a number of residue mutations as candidates for having a pronounced effect on the enzymatic activity of dTIM. Further experimental validation will be needed to determine which mutations have to be replaced in the mutant by amino acids from the parent to rescue functionality. Other structural properties such as hydrophobicity, solvent accessible surface area or polarity can also be mapped onto the RIN view to characterize mutations with particular properties. Another strategy described in our previous work would be the application of network topology analysis of the RIN for the detection of important residues.
We presented a novel approach for the integrative visual analysis of protein sequence mutations. We extended several existing software tools and combined different visualizations in such a way that biological information can be exchanged between them and additional external data can be included. We also devised a new layout algorithm for the RINs provided by the RINalyzer app in Cytoscape. Additionally, we created a new aggregation network view, improved and enriched the existing comparison network view, incorporated an interface to the Pro-origami web service, and fully utilized the interface to the UCSF Chimera tool through the structureViz app.
In the future, to assess the usefulness and effectiveness of our approach and to improve the current implementation, we intend to collect more user feedback. This will result in a comprehensive evaluation of which visual cues are best suited for gaining insight into the impact of mutations, how they should be best mapped onto the sequence, structure, and network representations, and how they should be integrated into the visual layout. Another issue is the aggregation of network regions to reduce the visual complexity as only some of them might be of actual interest to assess the potential impact of mutations. In this way, patterns of mutations with specific functional consequences might become more apparent, in particular, when multiple proteins are analyzed.
We also plan to improve the software integration of the different tools such that our approach can be realized in an automated fashion. This includes better synchronization over linked views and automated retrieval of external data.
We gratefully acknowledge the dataset provided by Thomas Magliery and Brandon J. Sullivan at The Ohio State University for the purpose of the BioVis 2013 contest. NTD was partially funded by a Boehringer Ingelheim Fonds travel grant, and her research was also conducted in the context of the DFG-funded Cluster of Excellence for Multimodal Computing and Interaction. KK was financially supported by Australian Research Council Linkage grant H2814 A4421, Tom Sawyer Software and NewtonGreen Technologies, JHM by NIGMS P41-GM103311, MW by the Australian Research Council Discovery Project grant DP110101390, and MA by the projects GANI MED and BioTechMed-Graz.
Publication costs were covered by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco, the School of Information Technologies at The University of Sydney (Tom Sawyer ARC Grant), and the Max Planck Society.
This article has been published as part of BMC Proceedings Volume 8 Supplement 2, 2014: Proceedings of the 3rd Annual Symposium on Biological Data Visualization: Data Analysis and Redesign Contests. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S2
- 7. Gnad F, Baucom A, Mukhyala K, Manning G, Zhang Z: Assessment of computational methods for predicting the effects of missense mutations in human cancers. BMC Genomics. 2013, 14 (Suppl 3): 7.
- 11. RINalyzer Webpage. [http://www.rinalyzer.de]
- 23. Rose PW, Bi C, Bluhm WF, Christie CH, Dimitropoulos D, Dutta S, Green RK, Goodsell DS, Prlić A, Quesada M, Quinn GB, Ramos AG, Westbrook JD, Young J, Zardecki C, Berman HM, Bourne PE: The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Research. 2013, 41 (D1): 475-482.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.