SnipViz: a compact and lightweight web site widget for display and dissemination of multiple versions of gene and protein sequences
- 985 Downloads
As high throughput sequencing continues to grow more commonplace, the need to disseminate the resulting data via web applications continues to grow. Particularly, there is a need to disseminate multiple versions of related gene and protein sequences simultaneously—whether they represent alleles present in a single species, variations of the same gene among different strains, or homologs among separate species. Often this is accomplished by displaying all versions of the sequence at once in a manner that is not intuitive or space-efficient and does not facilitate human understanding of the data. Web-based applications needing to disseminate multiple versions of sequences would benefit from a drop-in module designed to effectively disseminate these data.
SnipViz is a drop-in client-side module for web sites designed to efficiently visualize and disseminate gene and protein sequences. SnipViz is open source and is freely available at https://github.com/yeastrc/snipviz.
KeywordsDynamic Interface Multiple Version Sequence Representation Cascade Style Sheets Interactive Graphical User Interface
Cascading Style Sheets
Graphical User Interface
HyperText Transfer Protocol
HyperText Markup Language
Extensible Markup Language.
However, there is often a need to simultaneously disseminate multiple versions of related sequences. Examples include displaying sequences for a given protein from many strains of a particular species of yeast, displaying multiple alleles from the same gene across a population, and displaying the results of a multiple sequence alignment of a homologous protein across many species. While the above format works well for displaying a single sequence, it is not well suited for simultaneously displaying multiple sequences in a way that is easily read by humans.
Strategies for simultaneously displaying multiple sequences on the web include displaying FASTA-formatted data ,where each version of the sequence is sequentially displayed in its entirety as plain text; or more commonly, displaying ClustalW-formatted data , where the first N positions of each aligned sequence is listed sequentially in some order, then the next N positions, then the next, until all versions of the entire aligned sequences have be displayed. On web pages, it is common to color code ClustalW-format data based on sequence variability at specific positions or based on some property of nucleotides or amino acids. However, this method rapidly becomes cumbersome, space-inefficient, and incomprehensible to human readers as more versions of the sequence are included and as the sequence gets longer.
Many programs exist for generating or viewing multiple sequence alignment information [3, 4, 5, 6, 7, 8]. Miew  is of particular note as very mature and feature-rich program designed specifically for the display of multiple sequence information. Mview supports several input formats, is highly configurable with regard to output, and supports saving the output as HTML that may be used to display the data in a web page. However, this HTML is a static display of the data that is subject to the same limitations described above. Additionally, Mview requires running an external program, either in advance or at runtime, which adds to the complexity. The JalviewLite applet  is a feature-rich Java applet that may be used by websites to disseminate multiple sequences using a dynamic interface. In this interface, the sequences may be presented using a sliding window, which eliminates the space and legibility problems created by statically displaying the entire sequence. However, JalviewLite requires that the end user have Java and Java web browser plugins installed. As a general-purpose dissemination platform, this is less than ideal as not every user running a web browser has Java installed and configured and Java applets are not compatible with many portable devices. Ideally, the data would be presented using a dynamic interface that runs entirely within the web browser and requires no external plugins.
The only required data for SnipViz are either DNA or protein sequence data in FASTA format. SnipViz must be configured with the location of the FASTA data (either a static file or the output of a dynamic program) that contains all of the sequences and their associated labels. If the sequences require alignment, the sequences must be aligned prior to being loaded.
SnipViz may optionally display a dendrogram indicating the hierarchical clustering of the sequences based on any property the implementer wishes (e.g., phylogeny of originating organisms or similarity of sequences being displayed). To display the dendrogram, the user must indicate the location of Newick-formatted data  (either static file or output of a dynamic program) that contains the desired clustering for all of the labels in the FASTA sequence file.
This HTML will result in an instance of SnipViz appearing wherever this HTML is placed on the web page, displaying the indicated data—in this case, hierarchically clustered DNA sequences from the indicated Newick and FASTA files.
Graphical user interface
Beneath the whole sequence representation is the display of the sequences, themselves. To the left are the labels supplied for the sequences from the FASTA file and, optionally, the dendrogram representing the hierarchical clustering of the sequences from a Newick file. To the right the section of the sequences corresponding to the currently-viewed window are displayed. Beneath the sequences, users may click the arrows to page left or right through the sequences.
Very long sequences
Indicators of variation
As previously mentioned, the whole-sequence representation above the sequences contains red lines that serve as indicators of locations of sequence variation among all of the sequences or among the currently-highlighted sequences. Both the shade of red and the height of the line contain information.
The height of the line indicates the number of positions represented by that line that contain variation. Because the width of the graphical whole-sequence representation may contain fewer pixels than the number of positions in the sequence, each line may represent more than one position in the sequence. A taller line indicates more relative variation at the represented position in the sequence than a shorter line.
The shade of red is meant to indicate the significance of the variation at a given position, with darker red indicating more significant variation. In the case of DNA sequences, there are two shades of red: pale and dark. Pale red indicates that all variation in that position result in the same amino acid being encoded (silent mutations). Dark red indicates that there is at least one substitution at that position among all the sequences that results in a different encoded amino acid. In the case of protein sequences, the shade of red is determined by a calculation using the BLOSUM 80  amino acid substitution matrix. For all positions with sequence variation, a score is calculated by comparing all amino acids at that position against all other amino acids at that position and summing the BLOSUM 80 substitution values. The resulting values are used to linearly scale the intensity of red in the indicator line between a pale and dark shade of red.
The main sequence viewing area also contains indicators of location of sequence variability. These appear as a shade of blue that highlights specific columns in the sequence and appears as an indicator block above the column. The shade of blue is determined using the same logic as the shade of red in the indicator lines described above.
To view simple demonstrations of implementations of SnipViz, see http://www.yeastrc.org/snipviz/. There are demonstrations of multiple configurations (DNA, protein, very long sequences, and clustering), and each demonstration includes the HTML required to implement it.
To see an example of how SnipViz may be integrated into a web application, see http://www.yeastrc.org/g2p/, a web application developed to support a study examining the phenotypic consequences of sequence variation among 22 phylogenetically diverse strains of the budding yeast S. cerevisiae. For this implementation, SnipViz is used to visualization locations of sequence variation in the same gene or protein across all of the sequences strains of yeast. For a specific example, see http://www.yeastrc.org/g2p/phenomeviewProtein.do?orfName=YCR088W&listing=ABP1+%2f+YCR088W.
Central among the future plans for SnipViz are (1) removing the requirement that the loaded sequences be either DNA or protein sequences and (2) making the highlighting system more modular so that custom logic may be easily used to override the default highlighting system. Not only should users be able to display RNA or, indeed, any conceivable type of sequence, they should be able to easily implement some highlighting that indicates their own determination of significance for variability at specific positions.
We welcome other developers to download the code, make improvements, and contribute to the project at https://github.com/yeastrc/snipviz.
Snipviz is a client-side web module for efficiently displaying multiple versions of DNA or protein sequences. It has a highly dynamic, interactive graphical user interface written using standard World Wide Web technologies and may be simply installed without any programming being necessary. It is cross-platform and compatible with all current web browsers. SnipViz is open source and freely available at https://github.com/yeastrc/snipviz.
Availability and requirements
Project name: SnipViz
Project home page: https://github.com/yeastrc/snipviz
Operating system(s): Platform independent
Other requirements: None
License: Apache 2.0
Any restrictions to use by non-academics: None
The authors would like to particularly thank Eric Muller and Maitreya Dunham for thoughtful feedback and suggestions. This work is supported by grants P41 GM103533 from the National Institute of General Medical Studies from the National Institutes of Health and the University of Washington Proteomics Resource (UWPR95794).
- 2.Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22): 4673-4680. 10.1093/nar/22.22.4673.PubMedPubMedCentralCrossRefGoogle Scholar
- 8.Di Tommaso P, Moretti S, Xenarios I, Orobitg M, Montanyola A, Chang JM, Taly JF, Notredame C, Web Server issue: T-Coffee: a web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension. Nucleic Acids Res. 2011, 39: W13-W17. 10.1093/nar/gkr245.PubMedPubMedCentralCrossRefGoogle Scholar
- 9.The Newick tree format.http://evolution.genetics.washington.edu/phylip/newicktree.html,
- 11.Skelly DA, Merrihew GE, Riffle M, Connelly CF, Kerr EO, Johansson M, Jaschob D, Graczyk B, Shulman NJ, Wakefield J, Cooper SJ, Fields S, Noble WS, Muller EG, Davis TN, Dunham MJ, Maccoss MJ, Akey JM:Integrative phenomics reveals insight into the structure of phenotypic diversity in budding yeast. Genome Res. 2013, 23 (9): 1496-1504. 10.1101/gr.155762.113.PubMedPubMedCentralCrossRefGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.