Integrated functional visualization of eukaryotic genomes
The increasing amount of data from large-scale whole genome analysis efforts demands convenient tools for manipulation, visualization and investigation. Whole genome plots offer an intuitive window to the analysis. We describe two applications that enable users to easily plot and explore whole genome data from their own or other researchers' experiments.
STRIPE and GFFtool (General Feature Format Tool) are software tools designed to support integration, visualization and exploration of whole genome data from eukaryotic genomes. STRIPE, in addition to providing a highly customizable and interactive data plot, provides access to numerous carefully selected databases with up-to-date information on all genes of a genome. GFFtool provides a user-friendly solution for integrating experimental data with the genomic information available in public databases. Both programs also obviate the need for users to maintain large annotation resources, as they link to well-known resources using standard gene and protein identifiers.
The programs provide the user with broad genomic overviews of data distribution, fast access to data of interest, and the ability to navigate speedily from one resource to another, giving a better understanding of the results of whole genome analysis experiments.
Keywords: Gene Length; Stanford Microarray Database; Genetic Association Database; Cytogenetic Location; General Feature Format
Abbreviations: PFAM, Protein Families Database; KEGG, Kyoto Encyclopedia of Genes and Genomes
The continuously growing availability of genomic information puts pressure on the systems used to capture it and on the users concerned with its interpretation. Analysis of large-scale genomic data is a demanding task, requiring extensive input from diverse sources of biological significance, statistical methodologies and data exchange standards. To answer interesting biological questions, biologists need accessible interfaces that enable convenient visualization of information, searching of multiple databases and flexible maneuvering within the data. When confronted with lists of significantly differentially expressed genes from microarray experiments, it is important to get a feel for the genome-wide distribution of the data and to be able to navigate quickly between diverse sources of information. Visualization on a genomic scale is also helpful in identifying and representing clusters of genes that are co-regulated and map close to each other in the genome. There are several examples of genomic regions where genes implicated in the same biological processes are clustered together, e.g. the cytokine-receptor cluster on mouse chromosome 16, and a group of cytokine-related genes associated with IL-4 on mouse chromosome 11. Chromatin remodeling events control transcription of closely mapped genes, and chromosomal clustering may point to regions where such events are actively induced.
Although a few tools with such plotting capability currently exist, none of them is sufficiently user-friendly or provides ways of extracting additional biological information about genes of interest. Users have to depend on extensive bioinformatics capabilities just to get to the point of plotting the data, and even after the data is plotted only limited interaction with it is possible. Caryoscope provides a genome-wide view of microarray data and some linking capabilities to the web. However, in spite of the provided guidelines, some prior experience of handling data from varied databases and with scripting languages such as Perl or Python is required. Similarly, although it is possible to export appropriately formatted files from the Stanford Microarray Database, this is of little help to researchers who are interested in visualizing data from their own experiments, which may or may not be from microarrays. Caryoscope offers the advantage of being fully scriptable and easy to embed in a workflow, but has the disadvantage of being less useful as a platform for data exploration. Other applications, such as SeeGH and CGHanalyzer, are designed for viewing dual channel array data derived from comparative genomic hybridization studies. ChromoViz is implemented as an R package for visualizing genomic data. Although it can plot several datasets for each chromosome at a time, produce a karyotype plot for the chromosome in question, and support data exploration by zooming, its search capability is limited and web linking to publicly available databases is absent.
STRIPE and GFFtool have been programmed in Perl/Tk. We have tested both successfully on Windows, Linux and Solaris operating systems, but caution that some online resources may work better on the Windows platform. We recommend a minimum of 512 MB RAM to run the programs.
Results and discussion
We describe the features of both programs, and provide illustrative examples of data plotting and exploration.
Asking relevant biological questions requires integrating experimental data with annotation sources. We use the well-defined GFF (General Feature Format [8]) file format for representing data. The program GFFtool was designed to collate our microarray data together with the annotations and locations of the genes, while STRIPE provides an interactive environment for exploring the data on the genomic landscape.
The GFF format is well suited to storing both numerical and textual data for genes and their features. Choosing a format is not enough, though: one must also be able to easily create, modify and query the file, and to check that the data is in the correct format. GFFtool provides a number of methods for working with GFF files. Most importantly, it allows users to create their own GFF file for the genome of their choice. Detailed instructions for creating a new GFF file for a genome are provided in the program manual. For the large number of researchers interested in the human, mouse and rat genomes, pre-prepared GFF files have already been created and may be used for browsing the genomes even without integrating them with any other data. All GFF files distributed with the programs have been created using GFFtool itself, using data distributed by NCBI and EBI. These files contain ~27,000 genes for the human, ~26,000 genes for the mouse and ~22,000 genes for the rat genome.
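To make the file layout concrete, the following minimal Python sketch reads one GFF record. It assumes the nine tab-separated fields of GFF version 2 and an attribute column written as `tag value` pairs separated by semicolons; the tag names `GeneID` and `Symbol`, the class `GffRecord` and the example values are placeholders invented for illustration, not the layout of the files distributed with the programs.

```python
from dataclasses import dataclass, field


@dataclass
class GffRecord:
    """One feature line of a GFF (version 2 style) file."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: dict = field(default_factory=dict)


def parse_attributes(text):
    """Split 'tag value ; tag "value"' pairs into a dictionary.

    Assumption: values do not themselves contain semicolons.
    """
    attrs = {}
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, _, value = chunk.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def parse_gff_line(line):
    """Turn one tab-delimited GFF line into a GffRecord."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError(f"expected at least 8 columns, found {len(cols)}")
    attrs = parse_attributes(cols[8]) if len(cols) > 8 else {}
    return GffRecord(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                     cols[5], cols[6], cols[7], attrs)


# Placeholder record; the coordinates and identifiers are made up.
example = 'chr1\tdemo\tgene\t100\t900\t.\t+\t.\tGeneID 1001 ; Symbol "GENE_A"'
record = parse_gff_line(example)
```

Keeping the attributes in a dictionary means that extra annotation fields can be added to a record without changing the eight fixed columns.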
In addition, integrating more data into an existing GFF file is a simple, one-step procedure. As both programs use standard NCBI Gene Database identifiers, once the experimental data has been linked to these identifiers, it can be added immediately to an existing GFF file to create a new GFF file containing the experimental data. The experimental data need only be in the commonly used tab-delimited form to be accepted by GFFtool. Moreover, GFFtool provides methods to re-format a few popular data files distributed by NCBI and EBI (e.g. xrefs files from EBI, gene2pubmed files from NCBI).
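The one-step integration described above amounts to a join on the gene identifier. The sketch below is one way to express that join in Python and is not GFFtool's own code; the column names `GeneID` and `log_ratio`, the attribute syntax and the example values are assumptions made for illustration.

```python
import csv
import io


def load_values(tsv_text, id_col="GeneID", value_col="log_ratio"):
    """Read a tab-delimited table with a header row into {gene_id: value}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_col]: row[value_col] for row in reader}


def gene_id_of(attributes):
    """Return the GeneID found in a 'tag value ; tag value' string, or None."""
    for chunk in attributes.split(";"):
        parts = chunk.split()
        if len(parts) >= 2 and parts[0] == "GeneID":
            return parts[1].strip('"')
    return None


def add_values(gff_lines, values, tag="log_ratio"):
    """Append 'tag value' to the attribute column of every matching gene."""
    merged = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        gene_id = gene_id_of(cols[8]) if len(cols) > 8 else None
        if gene_id in values:
            cols[8] = f"{cols[8]} ; {tag} {values[gene_id]}"
        merged.append("\t".join(cols))
    return merged


# Placeholder inputs; identifiers and values are made up.
gff = [
    'chr1\tdemo\tgene\t100\t900\t.\t+\t.\tGeneID 1001 ; Symbol "GENE_A"',
    'chr1\tdemo\tgene\t2000\t2600\t.\t-\t.\tGeneID 1002 ; Symbol "GENE_B"',
]
table = "GeneID\tlog_ratio\n1001\t1.8\n"
merged = add_values(gff, load_values(table))
```

Genes without a measurement pass through unchanged, so the result still describes the whole genome and can be browsed as before.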
GFFtool also allows one to query the GFF file and extract subsets of interest (e.g. all transcription factors in the human genome), and to create a GFF file exclusively for this subset of genes. It is also possible to subset a whole genome GFF file based on a text file containing a list of standard NCBI Gene Database identifiers (e.g. differentially expressed genes in a microarray experiment). Obtaining NCBI Gene IDs for the genes is a prerequisite for combining the user's data with the GFF files used. Such conversions may be routinely performed by several already available tools, e.g. DAVID and MatchMiner.
The data from a GFF file can also be exported to a tab-delimited or comma-delimited format for use in other applications (e.g. spreadsheets). GFFtool also includes a format checking feature that allows one to check the file thoroughly before plotting it in STRIPE.
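Subsetting by a gene list can be pictured as a filter over the feature lines. The following Python sketch assumes that each line carries a `GeneID` tag in its ninth column; the tag name, the function `subset_gff` and the example records are hypothetical and do not describe GFFtool's internals.

```python
import re

# Matches 'GeneID 1001' or 'GeneID "1001"' at the start of the attribute
# column or directly after a semicolon.
GENE_ID = re.compile(r'(?:^|;)\s*GeneID\s+"?(\w+)"?')


def subset_gff(gff_lines, wanted_ids):
    """Keep comment lines and the features whose GeneID is in wanted_ids."""
    wanted = set(wanted_ids)
    kept = []
    for line in gff_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        cols = line.rstrip("\n").split("\t")
        match = GENE_ID.search(cols[8]) if len(cols) > 8 else None
        if match and match.group(1) in wanted:
            kept.append(line)
    return kept


# Placeholder records; identifiers are made up.
genome = [
    "##gff-version 2",
    'chr1\tdemo\tgene\t100\t900\t.\t+\t.\tGeneID 1001 ; Symbol "GENE_A"',
    'chr1\tdemo\tgene\t2000\t2600\t.\t-\t.\tGeneID 1002 ; Symbol "GENE_B"',
    'chr2\tdemo\tgene\t50\t400\t.\t+\t.\tGeneID 1003 ; Symbol "GENE_C"',
]
selected = subset_gff(genome, ["1001", "1003"])
```

The gene list would normally be read from a text file with one identifier per line; it is passed as a Python list here to keep the example self-contained.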
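As an illustration of what a format check and a comma-delimited export might involve, here is a small Python sketch. The specific rules (nine columns, integer coordinates, start not greater than end, strand one of `+`, `-` or `.`) are assumptions based on the general GFF version 2 layout and are not a description of the checks GFFtool actually performs.

```python
import csv
import io

COLUMNS = ["seqname", "source", "feature", "start", "end",
           "score", "strand", "frame", "attributes"]


def check_gff_line(line, number):
    """Return a list of human-readable problems found in one feature line."""
    cols = line.rstrip("\n").split("\t")
    # This sketch insists on nine columns because the gene identifiers
    # are assumed to live in the ninth (attribute) column.
    if len(cols) != len(COLUMNS):
        return [f"line {number}: expected 9 columns, found {len(cols)}"]
    problems = []
    if not (cols[3].isdigit() and cols[4].isdigit()):
        problems.append(f"line {number}: start and end must be integers")
    elif int(cols[3]) > int(cols[4]):
        problems.append(f"line {number}: start is greater than end")
    if cols[6] not in {"+", "-", "."}:
        problems.append(f"line {number}: strand must be '+', '-' or '.'")
    return problems


def export_csv(gff_lines):
    """Write feature lines as comma-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    for line in gff_lines:
        writer.writerow(line.rstrip("\n").split("\t"))
    return buffer.getvalue()


# Placeholder records; the second one has start > end and a bad strand.
good = 'chr1\tdemo\tgene\t100\t900\t.\t+\t.\tGeneID 1001'
bad = 'chr1\tdemo\tgene\t900\t100\t.\t?\t.\tGeneID 1002'
report = check_gff_line(good, 1) + check_gff_line(bad, 2)
csv_text = export_csv([good])
```

Using the `csv` module rather than joining on commas means that attribute values containing commas or quotes are escaped correctly for spreadsheet programs.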
STRIPE allows easy searching, highlighting and, importantly, brisk navigation through several diverse annotation sources for any gene. It uses the standard NCBI Gene Database identifiers as minimal units for plotting. The pre-prepared GFF files for the human, mouse and rat genomes are ready to plot directly with STRIPE. These files contain several annotation fields, e.g. the NCBI Entrez Gene ID, Gene Name, Gene Length, Gene Symbol, Cytogenetic location, GenBank accession, and Swissprot ID, for all genes that have been mapped to a definite chromosomal location. These fields are used to link the application to a variety of publicly available databases. Additional annotation fields may also be added to a GFF file using GFFtool.
STRIPE offers up to forty different methods for plotting and coloring the data. It is possible to plot raw, zero-centered, mean-centered, log-transformed, and mean log-transformed data. One can draw a histogram plot and then overlay several line plots on it. Each plot can be colored in three different ways: using any user-defined color, a defined COLOR column in the GFF file, or a separate tag file that allows different groups of genes (e.g. genes belonging to the same biological process category as defined by Gene Ontology, or genes in the same pathway) to be appropriately labeled and colored. Upregulated and downregulated genes may also be colored differently at any time. In the absence of any user data, one can create and navigate a plot based only on the locations of the genes. This plot can also be colored as described above. It is useful when the user is mainly interested in knowing the location of the genes and exploring the genome.
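For readers unfamiliar with these transformations, the Python sketch below shows the conventional definitions of mean-centering and log-transformation. The exact formulas used by STRIPE are not given in the text, so the choice of base 2 and the reading of "mean log-transformed" as log-transformed values centered on their mean are assumptions; zero-centering is left out because its definition is not stated.

```python
import math
from statistics import mean


def mean_center(values):
    """Subtract the mean so that the transformed values average to zero."""
    centre = mean(values)
    return [v - centre for v in values]


def log2_transform(values):
    """Take the base-2 logarithm of each (strictly positive) value."""
    return [math.log2(v) for v in values]


def mean_centered_log2(values):
    """Log-transform first, then centre the logs on their mean."""
    return mean_center(log2_transform(values))


# Made-up expression ratios used only to exercise the functions.
ratios = [1.0, 2.0, 8.0]
logged = log2_transform(ratios)
centered = mean_center([1.0, 2.0, 3.0])
```

Log-transforming ratios before plotting makes a two-fold increase and a two-fold decrease appear as bars of equal height on opposite sides of the axis, which is why it is commonly applied to microarray data.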
It is easy to reset the plot to its original state, and then select another area for zooming in. The lasso window provides detailed information on genes each time a search is performed or the plot is zoomed in. "Panning" on the canvas is done by pressing and dragging the left mouse button.
STRIPE provides access to a large number of different resources for each gene. Basic information, e.g. name, gene symbol and cytogenetic location, is provided right away. Deep links to several NCBI databases are provided, e.g. PubMed, GenBank, OMIM, Gene Expression Omnibus (GEO) and HomoloGene. Three genome browsers are offered, the NCBI Map Viewer, the UCSC Genome Browser and the Ensembl Genome Browser, so that users interested in a different display can easily switch to any of these and access the additional information available there. The important databases for protein domains, PFAM, InterPro and SMART, are all available. Gene Ontology searches are available from the GO database. Transcripts with known alternative splice variants can be looked up in the Alternative Splicing Database. Pathway information can be queried from KEGG and BioCarta, while literature connections can be explored using PubGene.
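Linking out from standard identifiers typically means filling an identifier into a URL template for each resource. The Python sketch below illustrates the idea only; the template table, the function `build_link` and the URL patterns are assumptions reflecting present-day public addresses of NCBI and UniProt, not the links built into STRIPE, and the identifiers are placeholders.

```python
from urllib.parse import quote

# Assumed URL patterns, keyed by a short resource name chosen for this sketch.
URL_TEMPLATES = {
    "gene": "https://www.ncbi.nlm.nih.gov/gene/{id}",
    "genbank": "https://www.ncbi.nlm.nih.gov/nuccore/{id}",
    "uniprot": "https://www.uniprot.org/uniprotkb/{id}",
}


def build_link(resource, identifier):
    """Fill a gene or protein identifier into the template for a resource."""
    template = URL_TEMPLATES[resource]
    # Percent-encode the identifier so unusual characters cannot break the URL.
    return template.format(id=quote(str(identifier), safe=""))


# Placeholder identifier; 1001 is not meant to refer to a particular gene.
gene_url = build_link("gene", 1001)
```

Because only the identifier is stored locally, the annotation itself stays with the remote resource, which is what removes the need for users to maintain large annotation collections of their own.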
STRIPE and GFFtool are programs that make whole genome plotting easy for users, provide organized access to information resources, reducing time for manual navigation, and allow using a single plot as a gateway for exploration. Both STRIPE and GFFtool are still under development. Detailed documentation for both programs is available from the project home page.
Availability and requirements
Project Name: STRIPE and GFFtool
Project Home Page: http://www.uniklinikum-giessen.de/genome
Operating system(s): Windows, Linux, Solaris
Programming Language: Perl/Tk
Other requirements: none
License: Free for academic use
Any restrictions on use by non-academics: Contact corresponding author for a license
The work described herein was made possible by grants from the Deutsche Forschungsgemeinschaft through the Graduate College of Biochemistry of Nucleoprotein Complexes (GK370) at Justus Liebig University, Giessen, Germany and the German National Genome Network (NGFN-2) to T.C. This work forms part of the doctoral thesis of R.G. H.L. is carrying out his civilian community service at Justus Liebig University. The authors would like to thank Dr. Uday Kishore for helpful comments on the manuscript.
8. GFF Format Specification [http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.