Comparative analysis and visualization of multiple collinear genomes
Genome browsers are a common tool used by biologists to visualize genomic features including genes, polymorphisms, and many others. However, existing genome browsers and visualization tools are not well-suited to perform meaningful comparative analysis among a large number of genomes. With the increasing quantity and availability of genomic data, there is an increased burden to provide useful visualization and analysis tools for comparison of multiple collinear genomes such as the large panels of model organisms which are the basis for much of the current genetic research.
We have developed a novel web-based tool for visualizing and analyzing multiple collinear genomes. Our tool illustrates genome-sequence similarity through a mosaic of intervals representing local phylogeny, subspecific origin, and haplotype identity. Comparative analysis is facilitated through reordering and clustering of tracks, which can vary throughout the genome. In addition, we provide local phylogenetic trees as an alternate visualization to assess local variations.
Unlike previous genome browsers and viewers, ours allows for simultaneous and comparative analysis. Our browser provides intuitive selection and interactive navigation about features of interest. Dynamic visualizations adjust to scale and data content making analysis at variable resolutions and of multiple data sets more informative. We demonstrate our genome browser for an extensive set of genomic data sets composed of almost 200 distinct mouse laboratory strains.
Keywords: Genome Browser, Haplotype Group, Collaborative Cross, Dynamic Visualization, Haplotype Identity
Genome browsers are one of the most common bioinformatics tools used by biologists. Browsers allow biologists to visualize genomic features such as genes, SNPs, CpG islands, transcription factor binding sites, and many others, and to place these features in their genomic context. They are also useful for adding and viewing genome annotations and feature-specific information. Generally, genome browsers support analysis of a single genome, but there is often a need to compare features between two or more genomes. Existing tools are not well suited to doing this. Many visualization methods have been developed to support comparative genomics of animals from different species. These include phylogenetic trees, alignment viewers, Circos diagrams, and dot-matrix methods. Tools which perform comparative analysis include BLAST (pairwise alignment analysis) and VISTA. Generally these methods support only comparisons between a small number of genomes. There is a need for comparative analysis and visualization tools supporting members of the same species with largely collinear genomes. Our goal was to develop a system which supports simultaneous and dynamic analysis of many (tens to hundreds) collinear genomes.
A web-based resource for investigating genomic data from multiple samples simultaneously would aid many common comparative genome analyses including disease association studies and expression analysis. Our system supports any generic genomic data set, allowing it to be an extensible framework for analysis, not simply a data resource. Like existing genome browsers and viewers, we represent different categories of genomic data as horizontal tracks covering a particular region of the genome. Unlike previous work, we use color to better indicate important regions and facilitate more intuitive comparison. In addition, we allow dynamic sorting and local reordering of tracks.
Comparison between genomes of different samples of the same species, particularly the analysis of local haplotype and phylogeny, can provide insight into gene origins and individual variations. Such comparisons also aid in understanding population structure. Understanding local genomic variations and population structure is the key to studies of individual genes and their association with disease. We need to be able to determine similarities and differences between samples not only genome-wide, but also at the level of individual loci.
There are many genome browsers and viewers that can integrate multiple data sets pertaining to a particular genome sequence, whether it is specific or a species consensus. Many of these are standalone desktop applications. There also exist several web-based genome browsers. These browsers, including the UCSC genome browser, GBrowse, Ensembl, NCBI Map Viewer, and JBrowse, display multiple tracks of data and support a variety of useful navigation techniques that allow the genome to be traversed and visualized at various resolutions. However, existing browsers are limited in their ability to support dynamic and comparative analysis between multiple genomes.
The UCSC Genome Browser is the standard and most prevalent web-based genome browser. The UCSC browser originally targeted human genome data as a part of the Human Genome Project. It has since been extended to numerous other species. The goal of the UCSC browser is to make a particular set of data broadly accessible and navigable. It does not focus on any particular analysis but is a comprehensive resource for integrating, displaying, and navigating publicly accessible genome data. The browser supports standard functions including navigation by panning and zooming. Data sets of interest can be displayed in tracks and reordered manually by the user. The UCSC browser functions as a window into very comprehensive sets of data for many different species, but does not support comparisons between either inter- or intraspecific genomes. The UCSC browser does not support dynamic interactions with the displayed data. Instead, pages must be reloaded in their entirety any time that new data is requested. Due to this limitation, data retrieval is necessarily limited to a small window or a few data types to allow quick and easy analysis.
The Generic Genome Browser (GBrowse) is another widely used web-based genome browser available for human, mouse, and other model organisms. The main difference between GBrowse and the UCSC browser is extensibility. GBrowse is designed to be extended with new and user-provided data sets, and as such it provides a flexible framework for displaying and navigating arbitrary genome information. Otherwise, GBrowse uses the same basic navigation and display structure as the UCSC browser. Data sets can be individually selected and are displayed as horizontal tracks stacked on top of one another and aligned to a common genomic scale. Unlike the UCSC browser, GBrowse supports asynchronous retrieval and navigation of data, meaning the entire page does not need to be reloaded to update the genomic regions displayed. This reduces the computational overhead on both the server and client, refreshing only those parts that need to be changed. However, GBrowse is limited in its ability to display small-scale details at high resolutions. Since the representation and visualization of data is essentially fixed, fine details such as SNPs are often omitted when viewing large regions.
The Ensembl genome database project, a joint venture between the Sanger Institute and the European Bioinformatics Institute (EBI), was initiated with the goal of providing full genome data along with various annotations as a public resource for researchers. The Ensembl genome browser serves as a publicly available web-based browser for this data. Although it initially focused on the human genome, the browser now includes many model-organism genomes with annotations including genes, DNA and RNA alignments, and many others. The browser function itself is very similar to the UCSC browser, supporting traditional navigation techniques. Ensembl also uses asynchronous data requests to retrieve data when it is needed. In addition, detailed annotations and links to more thorough information are displayed when a feature such as a gene or contig is selected.
The National Center for Biotechnology Information (NCBI) provides the NCBI Map Viewer as an online tool for browsing genomes. Unlike the others, the NCBI Map Viewer displays the genome vertically, with tracks for only the assembly, contigs, and genes. It focuses on detailed description and annotation of these features, linking to other useful NCBI tools for directly accessing related genes, SNPs, proteins, and more. Map Viewer also does not provide any dynamic navigation mechanism, so the entire page must be reloaded each time the genome window is adjusted. The browser serves best as a hub through which other resources are accessed by genomic position and is not a viable analysis tool by itself.
Existing genome browsers are well suited for generic genome annotation and are useful for analysis of the specific data sets they are tailored to, but there are many limitations. Available data is essentially static. In many cases, users have the ability to customize the browser to use different data or display only what they are interested in, but the underlying information representation remains constant. The visualization is essentially static, where the current region of interest is shipped to the viewer. Data can be viewed at multiple resolutions, but no further attempt is made to improve upon the usability of the visualization for a particular purpose. It is hard to quickly glean information and understanding from the visualization. These tools are frequently used to provide access to publicly available data sources rather than to support novel visualizations for analysis. Our browser addresses the following limitations of existing genome browsers: it supports simultaneous exploration of multiple aligned genomes, it allows for dynamic rearrangements of tracks to support comparisons, and it provides alternative visualization modes based on the current displayed scale.
Our genome browser is available as a public website (http://msub.csbio.unc.edu), allowing users to view, explore, and analyze multiple genomic data sets without requiring a standalone application. Data is stored on the web server and the client side consists of only the web browser. It has been tested and works on most modern web browsers and operating systems. Tested browsers include Chromium/Google Chrome 10.0, Firefox 3.6, Firefox 4.0, Internet Explorer 8, and Safari 5.0. It even loads on iPhones (iOS 4.2.1). Platform interoperability and constant availability make it an easy and useful tool for genetic analysis.
We provide easy access to the data underlying the visualization through the browser interface. For most data types, the displayed information can be retrieved as a delimited text file by clicking the output button below each track, which retrieves the underlying data for the currently selected sets of active genomes and within the displayed window so that no further filtering is required.
The basic data representation used by our browser is a set of possibly overlapping intervals specified by their genome coordinates (typically chromosome and position). Intervals are drawn as horizontal blocks placed along the viewing window according to their bounding positions. If intervals are smaller than the display resolution, they are presented as histograms. Overlapping intervals can also be displayed on subsequent stacked tracks. This data representation supports a wide variety of genome annotations and allows the browser to be easily extended to novel data sets.
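As a rough illustration of this interval model, the following Python sketch shows one way to decide whether an interval falls below the display resolution and how overlapping intervals might be assigned to stacked rows. It is not the browser's actual code: the function names, the closed-interval convention, and the greedy "first free row" strategy are all assumptions made for the example.

```python
def is_subpixel(start, end, window_start, window_end, width_px):
    """Return True if a closed interval [start, end] spans less than one pixel.

    Hypothetical helper: the browser's real threshold is not described.
    """
    bp_per_pixel = (window_end - window_start) / width_px
    return (end - start + 1) < bp_per_pixel


def stack_intervals(intervals):
    """Greedily place closed (start, end) intervals on stacked rows.

    Each interval goes on the lowest row whose last interval ends before
    it starts, so intervals sharing a row never overlap.
    Returns a list of (start, end, row) tuples in sorted order.
    """
    row_ends = []   # end coordinate of the last interval placed on each row
    placed = []
    for start, end in sorted(intervals):
        for row, last_end in enumerate(row_ends):
            if start > last_end:          # fits after the last interval on this row
                row_ends[row] = end
                placed.append((start, end, row))
                break
        else:                             # overlaps every existing row: open a new one
            row_ends.append(end)
            placed.append((start, end, len(row_ends) - 1))
    return placed


layout = stack_intervals([(1, 10), (5, 15), (12, 20), (16, 18)])
```

Here the four example intervals fit on two rows. Intervals for which `is_subpixel` returns True would instead be aggregated into a histogram rather than drawn individually.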
There are many critical design and resource allocation decisions which arise when handling very large sets of data. In traditional genome browsers, a relatively small amount of data needs to be handled at any one time. Existing browsers only need to handle a single sequence. In order to visualize multiple sequences simultaneously (tens to hundreds), as in our implementation, it is important to consider different methods for efficient data transfer and visualization. In addition to handling multiple sequences, we also support dynamic visualizations that vary based on the scale and local context. Existing browsers, such as the UCSC Genome Browser, do not support large-scale visualization of fine-scale features, like SNPs.
To support faster and more interactive visualization while dealing with remote data, we addressed issues of data transfer and efficiency and how to best allocate the rendering tasks. Our implementation loads data as it is needed into the page using asynchronous requests to the server. To reduce data transfer costs in memory and speed, the page is loaded only once at the beginning of a session and, subsequently, only data is loaded. In addition, visualization and display are handled in the browser by dynamic scripts on the page so that complete images do not have to be transferred from the server. Data rescaling, panning, and drawing are all handled by the client. Requests are made asynchronously so that the tool is available to the user even while new data is transferred.
We have deployed an instance of our visualization tool to aid analysis and interpretation of a recently published Nature Genetics paper. This browser analyzes a set of 100 classical laboratory and 62 wild-derived mouse strains along with 36 wild-caught mice. This study answers open questions regarding the subspecific origin of the laboratory mouse and provides the first detailed view of the haplotype diversity in most common laboratory mouse strains. We use our tool to visualize eight different data types to aid in comparative analysis of these 198 mouse samples.
Several data sets are included to aid in analysis by placing features in a genomic context. We include SNPs from the Mouse Diversity Array used in genotyping the mouse strains. When viewing small sections of the genome, SNPs are displayed individually as vertical bars along the track. In addition, alleles at each SNP for each strain are displayed at fine-scale resolutions, overlaying the subspecific origin and haplotype coloring tracks to allow for direct comparison (Figure 3). At coarse-scale resolutions, where SNPs are too dense to be displayed individually, SNPs are aggregated into a histogram representing the frequency of SNPs within uniformly sized windows. Known genes are displayed in a similar manner. As with the SNPs, genes smaller than a pixel are displayed in a histogram and larger genes are displayed as horizontal bars. Where genes overlap, the overlapping genes are displayed in additional stacked horizontal tracks.
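The coarse-scale aggregation described above can be sketched as a simple binning step. The Python example below is only an illustration under stated assumptions: the function name, the half-open window convention, and the choice of one bin per display column are hypothetical and not taken from the browser's source.

```python
def snp_histogram(positions, window_start, window_end, n_bins):
    """Count SNP positions in n_bins uniformly sized bins.

    The window is treated as half-open: [window_start, window_end).
    Positions outside the window are ignored.
    """
    counts = [0] * n_bins
    span = window_end - window_start
    for pos in positions:
        if window_start <= pos < window_end:
            # Integer arithmetic maps each position to its bin index.
            idx = (pos - window_start) * n_bins // span
            counts[idx] += 1
    return counts


# Hypothetical SNP positions within a 300 bp window split into 3 bins.
counts = snp_histogram([100, 150, 199, 200, 350, 399, 400], 100, 400, 3)
```

In a renderer, `n_bins` would typically be tied to the pixel width of the track, so that each bar of the histogram summarizes the SNPs that would otherwise collapse onto the same pixel.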
Genome mosaic representations, such as subspecific origins and heterozygous regions, are useful for revealing the evolutionary history or identifying more recent introgressions between mouse strains. Existing laboratory and wild-derived strains are a mosaic of ancestral genomes that were selected for desired traits and subsequently inbred. Our genome browser provides the first tool for exploring this genomic diversity at both a high level and at a fine scale.
For classical laboratory strains, several additional data tracks are displayed to show local variation and haplotype structure. These data include a mosaic of possibly overlapping intervals or compatible haplotype blocks that show no evidence of recombination. Within these blocks, we dynamically compute identity-by-descent between the selected set of strains. We introduce an innovative visualization of haplotype identity among classical laboratory strains based on these compatible blocks. Lastly, local phylogeny trees can be displayed for each interval.
We also support a method for exploring the extent of shared haplotypes among the selected strains. Blocks of color are used to depict haplotype similarity. Colors are assigned and reused so that transitions are minimized. This provides a relative comparison of strain similarities (Figure 4). At any position along the genome, the haplotype identity among the displayed strains can be understood visually as dividing strains into haplotype groups according to their color such that strains with substantially identical haplotypes are the same color. Over larger regions, haplotype identity is represented by a shared color pattern.
Initial haplotype colors are precomputed for all classical laboratory strains, leading to frequent color/haplotype changes at a genomic scale. When viewing only a small sample of strains, this coloring can be simplified, essentially changing colors only when there are haplotype group changes among the selected strains. An interactive aspect of this visualization is that the colors can be dynamically reassigned according to the order of the selected strains, such that colors are assigned in descending order (Figure 4). The topmost displayed strain is assigned a single color across the genome. The second strain is assigned the color of the first strain where its haplotype matches the first and a second color where it differs. This process is repeated for subsequent strains. This has the effect of, for example, highlighting all regions where the first selected sample shares a haplotype with subsequent samples by using the same color. In this way, the haplotype coloring scheme can be substantially simplified for a small sample of strains, allowing more intuitive analysis. A generic feature of the browser is that strain tracks can be dragged vertically to reorder their position within a track group, allowing the coloring order to be customized for the analysis required.
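The order-dependent recoloring can be sketched as follows. This Python example is one interpretation of the description above, not the browser's implementation: it assumes the genome has already been divided into blocks, that each strain carries a haplotype group identifier per block, and that a strain inherits the color of the topmost strain sharing its group in that block. The strain names and data layout are hypothetical.

```python
def recolor(haplotypes, order):
    """Assign color indices to strains block by block, following display order.

    haplotypes: dict mapping strain name -> list of haplotype group ids,
                one per block (all lists the same length).
    order:      strain names from top to bottom of the display.
    Returns a dict mapping strain name -> list of color indices per block.
    """
    n_blocks = len(haplotypes[order[0]])
    colors = {strain: [None] * n_blocks for strain in order}
    for block in range(n_blocks):
        first_rank = {}  # haplotype group -> rank of the topmost strain carrying it
        for rank, strain in enumerate(order):
            group = haplotypes[strain][block]
            if group not in first_rank:
                first_rank[group] = rank      # new haplotype: use this strain's own color
            colors[strain][block] = first_rank[group]
    return colors


# Hypothetical haplotype group ids for three strains over three blocks.
haplotypes = {
    "strain_a": [1, 1, 2],
    "strain_b": [1, 3, 2],
    "strain_c": [4, 3, 5],
}
top_down = recolor(haplotypes, ["strain_a", "strain_b", "strain_c"])
reversed_order = recolor(haplotypes, ["strain_c", "strain_b", "strain_a"])
```

With `strain_a` on top it receives color 0 in every block, and `strain_b` shares that color wherever the two strains carry the same haplotype. Reversing the order makes `strain_c` the uniformly colored reference instead, which mirrors the effect of dragging a track to the top of its group.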
A final interactive tool facilitates similarity analysis at a particular position by allowing sorting of tracks within all groups at a user-selected position within the displayed genomic window. Strains are sorted vertically according to the haplotype coloring at the selected position such that strains with identical haplotypes are grouped together. In addition, strains are further sorted according to their haplotypes at increasingly distant positions radiating in both directions from the selected position until either the edge of the displayed window is reached or all strains are distinct.
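A minimal sketch of this position-based sort is given below in Python. It assumes the same hypothetical block-based layout (one haplotype identifier per strain per block), visits the left neighbor before the right neighbor at each distance, and keeps extending on one side after the other side has run out of blocks. These details are assumptions; the text does not specify them.

```python
def radiating_blocks(n_blocks, center):
    """Yield block indices starting at center and moving outward in both directions."""
    yield center
    for distance in range(1, n_blocks):
        left, right = center - distance, center + distance
        if left < 0 and right >= n_blocks:
            break                      # both edges of the window reached
        if left >= 0:
            yield left
        if right < n_blocks:
            yield right


def sort_strains(haplotypes, center):
    """Sort strains by haplotype at the selected block, then by nearby blocks.

    haplotypes: dict mapping strain name -> list of haplotype ids per block.
    center:     index of the user-selected block.
    Stops extending the sort key once every strain has a distinct key.
    """
    strains = list(haplotypes)
    n_blocks = len(haplotypes[strains[0]])
    keys = {strain: () for strain in strains}
    for block in radiating_blocks(n_blocks, center):
        for strain in strains:
            keys[strain] += (haplotypes[strain][block],)
        if len(set(keys.values())) == len(strains):
            break                      # all strains are now distinguishable
    # sorted() is stable, so strains with identical keys keep their input order.
    return sorted(strains, key=lambda strain: keys[strain])


# Hypothetical haplotype ids for four strains over three blocks.
example = {
    "s1": [0, 1, 0],
    "s2": [2, 1, 0],
    "s3": [0, 3, 0],
    "s4": [0, 1, 0],
}
ordering = sort_strains(example, center=1)
```

Strains `s1`, `s2`, and `s4` share a haplotype at the selected block and are grouped together. The neighboring blocks then separate `s2` from the other two, which remain adjacent because they are identical across the whole window.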
An instance of our genome browser and its dynamic analysis methods has been deployed to display results of our recent publication at http://msub.csbio.unc.edu. It is continually used in comparative genome analyses of the mouse genomes presented. In the past twelve months of our tool's availability, we have had over 4000 users make almost 50,000 queries. The tool is used by researchers to perform comparative analysis between 198 common mouse strains. Our tool is particularly well suited for selecting and partitioning strains while simultaneously considering phenotype variation as it relates to a given gene or genomic region. A recent focus of our browser has been to explore the predictive power of our local phylogeny and haplotype assignments. Local comparative genomic analysis has been shown to be particularly effective in predicting disease susceptibility and other phenotypic states of the available set of mouse strains given the known state of a small sample. Our notion of sequence similarity has also been used to inform genotype imputation by constructing a haplotype mosaic. Work is continuing to enhance the browser's support in this area.
There are many technical as well as structural improvements that can be made in the future to make our browser more useful, general, and effective for visualization and analysis of multiple genome data. Although the browser is constructed in a modular format, separating our data from the browser itself, adding new user-specified data types or novel visualizations requires changes to the source code and recompiling. To support a larger range of users and wider adoption, a simple web-based user interface could be added for creating new tracks and visualizations within the existing framework. We present our genome browser's application to a specific data set here, but it is suitable for other organisms and other data types where comparative analysis of multiple genomes is useful. Likewise, we could provide an API for custom analysis of the existing data set. A more fundamental improvement we would like to make is to support local structural variations, such as insertions, deletions, repeats, and translocations, where the compared genomes are not strictly collinear as we assume. Even samples of the same subspecies can have small-scale copy-number variations and are not strictly collinear. Assessing these differences is an important part of local haplotype and phylogeny analysis.
Funding provided in part by grants from the National Institutes of Health (NIH P50 GM 076468, P50 HG 006582, U54 AI 081680) and NSF project grant "Visualizing and Exploring High-dimensional Data" (NSF ISS 0534580).
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 3, 2012: ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S3.
- 7. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, Clamp M: The Ensembl genome database project. Nucleic Acids Res. 2002, 30 (1): 38-41. 10.1093/nar/30.1.38.
- 8. Dombrowski SM, Maglott D: Using the map viewer to explore genomes. The NCBI Handbook. 2002, National Center for Biotechnology Information.
- 10. Yang H, Wang J, Didion JP, Buus RJ, Bell TA, Welsh CE, Bonhomme F, Yu AH, Nachman MW, Pialek J, Tucker P, Boursot P, McMillan L, Churchill GA, Villena FPM: Subspecific origin and haplotype diversity in the laboratory mouse. Nat Genet. 2011, 43: 648-655. 10.1038/ng.847.
- 13. Didion JP, Yang H, Sheppard K, Fu C, McMillan L, Villena FPM, Churchill GA: Discovery of novel variants in genotyping arrays significantly improves genotype retention and corrects ascertainment bias. BMC Genomics.
- 14. Wang J, Moore KJ, Zhang Q, Villena FPM, Wang W, McMillan L: Genome-wide compatible SNP intervals and their properties. Proceedings of ACM International Conference on Bioinformatics and Computational Biology. 2010.
- 15. Wang JR, Villena FPM, Lawson HA, Cheverud JM, Churchill GA, McMillan L: Imputation of SNPs in inbred mice using local phylogeny. Genetics. 2011.
- 16. Aylor DL, Valdar W, Foulds-Mathes W, Buus RJ, Verdugo RA, Baric RS, Ferris MT, Frelinger JA, Heise M, Frieman MB, Gralinski LE, Bell TA, Didion JD, Hua K, Nehrenberg DL, Powell CL, Steigerwalt J, Xie Y, Kelada SNP, Collins F, Yang IV, Schwartz DA, Branstetter LA, Chesler EJ, Miller DR, Spence J, Liu EY, McMillan L, Sarkar A, Wang J, Wang W, Zhang Q, Broman KW, Korstanje R, Durrant C, Mott R, Iraqi FA, Pomp D, Threadgill D, Villena FPM, Churchill GA: Genetic analysis of complex traits in the emerging Collaborative Cross. Genome Res. 2011, 21: 1213-1222. 10.1101/gr.111310.110.
- 17. Collaborative Cross Consortium: The genome architecture of the Collaborative Cross mouse genetic reference population. Genetics. 2011.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.