Next generation tools for genomic data generation, distribution, and visualization
With the rapidly falling cost and growing availability of high throughput sequencing and microarray technologies, the bottleneck for effectively using genomic analysis in the laboratory and clinic is shifting to managing, analyzing, and sharing genomic data.
Here we present three open-source, platform independent, software tools for generating, analyzing, distributing, and visualizing genomic data. These include a next generation sequencing/microarray LIMS and analysis project center (GNomEx); an application for annotating and programmatically distributing genomic data using the community vetted DAS/2 data exchange protocol (GenoPub); and a standalone Java Swing application (GWrap) that makes cutting edge command line analysis tools available to those who prefer graphical user interfaces. Both GNomEx and GenoPub use the rich client Flex/Flash web browser interface to interact with Java classes and a relational database on a remote server. Both employ a public-private user-group security model enabling controlled distribution of patient and unpublished data alongside public resources. As such, they function as genomic data repositories that can be accessed manually or programmatically through DAS/2-enabled client applications such as the Integrated Genome Browser.
Keywords: Command Line, Client Application, Genomic Dataset, Distribute Annotation System, Tiling Microarray
LIMS: Laboratory Information Management System
GUI: Graphical User Interface
IGB: Integrated Genome Browser
DAS: Distributed Annotation System
MGED: Microarray Gene Expression Databases
SNP: single nucleotide polymorphism
Application: GNomEx (Genomic Experiment Data Repository and Analysis Project Center)
GNomEx was developed to track samples for experimentation in our microarray and next generation sequencing core facility, associate raw data with biological samples, and link downstream computational analysis with the generated data. It is both a genomic LIMS and analysis project center designed for use by institutional core facilities and large research laboratories. Our installation of GNomEx [2] currently hosts ~7000 experiment requests, ~30,000 raw microarray and next generation sequencing datasets, and ~130 processed genomic analyses.
Implementation and Results
1) Web browser based interface
Adobe's open framework rich client Flex interface is used to provide a front-end graphical interface in one's preferred web browser using the Flash media player.
2) Platform independent
Particular attention was paid to achieving platform independence for all aspects of the software. The components include a client-side Flex/Flash interface, the Java programming language, an open source object/relational database mapping tool (Hibernate) that supports most relational databases (e.g. MySQL, Microsoft SQL Server, Oracle), and deployment of the web-based applications on an open access J2EE application server (Orion). These choices allow other groups to install and use these applications within their existing infrastructure.
3) Sample annotation
GNomEx is built around the concept of projects in which individual experiments are grouped. Users are encouraged, through a wizard-like interface, to associate annotations with their projects and experiments. Where appropriate, MGED ontologies [3] have been used to populate these annotation categories to assist in organizing, grouping, and searching of projects and experiments.
4) Public-private access
Experiment annotations, data files, and associated data analysis files are safeguarded by a robust security manager that restricts access to authenticated users. The visibility of an experiment is set to one of three levels: public, members, or members and collaborators. Following publication, researchers are encouraged to make their raw and analyzed data publicly available by changing their visibility settings. This allows guest users to browse, search, and download published data.
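The three visibility levels can be illustrated with a minimal Java sketch. This is not the GNomEx security manager; the class, field, and method names below are hypothetical, and the rule that lab members always see their own lab's experiments is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class VisibilityCheck {
    /** The three visibility levels named in the text. */
    public enum Visibility { PUBLIC, MEMBERS, MEMBERS_AND_COLLABORATORS }

    /** Hypothetical experiment record; field names are illustrative only. */
    public static final class Experiment {
        final String owningLab;
        final Set<String> collaborators;
        final Visibility visibility;

        public Experiment(String owningLab, Set<String> collaborators, Visibility visibility) {
            this.owningLab = owningLab;
            this.collaborators = collaborators;
            this.visibility = visibility;
        }
    }

    /**
     * Returns true if the user may see the experiment.
     * A null user stands for an unauthenticated guest.
     */
    public static boolean canView(String user, String userLab, Experiment e) {
        if (e.visibility == Visibility.PUBLIC) {
            return true;                       // guests included
        }
        if (user == null) {
            return false;                      // non-public data needs a login
        }
        boolean isMember = e.owningLab.equals(userLab);
        if (e.visibility == Visibility.MEMBERS) {
            return isMember;
        }
        // MEMBERS_AND_COLLABORATORS: lab members plus named collaborators
        return isMember || e.collaborators.contains(user);
    }

    public static void main(String[] args) {
        Set<String> collaborators = new HashSet<>(Arrays.asList("carol"));
        Experiment unpublished =
            new Experiment("labA", collaborators, Visibility.MEMBERS_AND_COLLABORATORS);
        System.out.println(canView("alice", "labA", unpublished)); // true: lab member
        System.out.println(canView("carol", "labB", unpublished)); // true: collaborator
        System.out.println(canView(null, null, unpublished));      // false: guest
    }
}
```

Under this sketch, publishing a dataset amounts to switching its visibility to PUBLIC, after which the guest check succeeds.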
5) Experiment submission
7) Laboratory workflows
10) Raw data access
11) Processed/analyzed data repository
12) Browsing and Searching
Experiments are organized within project folders that can be browsed according to experiment platform, submission date, or by name of the researcher or lab. Simple text searches as well as advanced, criteria-based searches can be performed on experiments, protocols, and associated analyses. Text searching relies on the high-performance, open source Apache Lucene text search engine [4]. GNomEx keyword searching uses Lucene indexes, built nightly, that contain all text associated with experiments and downstream analysis, including free-form descriptions, structured annotations, sample names, and protocols. Post-search processing culls the results so that only view-permissible data are returned.
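The post-search culling step can be sketched in plain Java. The sketch below does not use the Lucene API and is not the GNomEx code: it assumes the search engine has already returned a ranked list of experiment identifiers, and the `cull` method and the permission predicate are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

public class SearchCulling {
    /**
     * Keeps only the hits the current user may view, preserving the
     * ranking order produced by the search engine.
     */
    public static List<Integer> cull(List<Integer> rankedHits, IntPredicate canView) {
        List<Integer> visible = new ArrayList<>();
        for (int experimentId : rankedHits) {
            if (canView.test(experimentId)) {
                visible.add(experimentId);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        // Assumed raw hits from a keyword search, best match first.
        List<Integer> rawHits = Arrays.asList(42, 7, 19, 3);
        // Assumed permission rule: experiments 7 and 3 are private to another lab.
        IntPredicate canView = id -> id != 7 && id != 3;
        System.out.println(cull(rawHits, canView)); // [42, 19]
    }
}
```

Filtering after the search, rather than maintaining one index per user, lets a single nightly index serve every user at the cost of a permission check per hit.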
Application: GWrap (Genomic analysis command line tool wrapper)
Often the best person to analyze genomic data is the person who submitted the samples to the genomics core facility. They typically have an intimate knowledge of the biology behind the project, have a list of key questions to address, and are aware of potentially confounding issues associated with the experiment. Moreover, when they perform their own genomic analysis, they become aware of the various choices made in generating the processed data that limit and bias its contents. As such, a key goal in our bioinformatics shared resource is to enable users to analyze their own data. For some genomic datasets (e.g. gene expression, SNP genotyping), users can choose from a variety of mature, open access, user friendly, GUI based applications for data processing. For other more recently emerging datasets, such as those derived from tiling microarray and next generation sequencing platforms, sophisticated, well characterized analysis tools do exist but are often challenging to use given their command line interface. This is to be expected. Analysis software evolves from minimalistic command line scripts, to integrated command line packaged tools, to web and stand alone GUI applications. When novel analysis approaches change frequently, designing and updating GUIs is often viewed as unproductive by application developers. On the other hand, many scientists avoid command line programming. To break this impasse, web based wrapper applications such as Galaxy [5] and GenePattern have proven useful. Users upload their data to a remote server, use web forms to execute command line applications, and download their analysis, all within a web browser. Although effective, this approach can be less than ideal for processing large tiling microarray and next generation sequencing datasets.
The gigabyte size of these datasets poses problems for timely data upload and download and for data storage on a central server. Processing even one such dataset, let alone multiple datasets from multiple users, requires extensive computational resources. Lastly, from a developer standpoint, creating the web forms for each command line application and keeping them up to date requires effort that is often better spent improving the underlying algorithms.
Implementation and Results
Application: GenoPub (Genomic Annotation Publisher)
Another key issue associated with effective use of genomic experiments in laboratories and clinics is the difficulty in efficiently distributing analyzed data. Too often, analyzed data are placed in a supplemental data folder on an author's or journal's web site where annotation of the analysis is non-standard and typically incomplete. Determining which methods were used in generating the data, or even the genome build, is often difficult. Submission of analyzed data to a public repository such as GEO [10] or ArrayExpress is an improvement but is rarely done except when publishing the original unprocessed data. Some bioinformatic groups such as UCSC Genome Bioinformatics [12, 13] will host external datasets provided one can convince them that the data are of interest to their users. In all cases, the data cannot be integrated in a subsequent analysis without extensive manual file downloading, filtering, and reformatting. Making a simple visual comparison between different datasets from different data sources in a genome browser requires considerable effort. Hundreds of genomic datasets are currently buried in web archives or customized databases. As such they are effectively inaccessible. Ideally, researchers would distribute their own data on the internet using a common protocol so that other groups could see the data and programmatically download portions for subsequent comparison with other datasets.
A solution to this problem exists and has been in development for more than ten years. It makes use of a Distributed Annotation System (DAS) protocol, and a DAS server [14, 15, 16, 17, 18, 19]. DAS is a communication protocol developed to exchange annotations on genomic and protein sequences between servers and client applications over the internet. Hundreds of DAS/1 servers are in use at bioinformatic data centers such as WormBase, UCSC, Ensembl, FlyBase, TIGR, and UniProt. Unfortunately, the DAS/1 protocol is not amenable to distributing large genomic datasets given its requirement that datasets be formatted using verbose text based DAS XML. DAS/2 is a recent extension of the DAS/1 protocol and is optimized for distributing large genomic datasets in both text and binary formats (e.g. bed, gff3, wig, bar, fasta, useq, dasXML, sam, bam). The difference in file size and corresponding download time between gzip compressed DAS XML and a binary format like useq is typically >100 fold (e.g. 85 MB vs 0.6 MB for the ENCODE's wgEncodeBroadChipSeqSignalGm12878Ctcf chIP-seq graph data for chr21). Any dataset that can be associated with a specific genome build and genome coordinates (e.g. gene expression, SNP, CNV, chIP-chip, chIP-seq, RNA-seq, chromosomal rearrangements) can be efficiently shared between DAS/2 servers and DAS/2-enabled clients such as IGB and GBrowse or incorporated into data objects from the Cancer Biomedical Informatics Grid (caBIG).
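The size difference quoted above can be turned into a rough download-time comparison. The short Java sketch below uses the two file sizes from the text; the 10 Mbit/s link speed is an assumption chosen only for illustration, and megabytes are taken as 10^6 bytes.

```java
public class FormatSizeComparison {
    /** Ratio of the larger file size to the smaller one. */
    public static double foldDifference(double largerMb, double smallerMb) {
        return largerMb / smallerMb;
    }

    /** Idealized transfer time in seconds, ignoring latency and protocol overhead. */
    public static double transferSeconds(double sizeMb, double linkMbitPerSec) {
        return sizeMb * 8.0 / linkMbitPerSec;
    }

    public static void main(String[] args) {
        double dasXmlMb = 85.0;   // gzip compressed DAS XML, from the text
        double useqMb = 0.6;      // useq binary archive, from the text
        double linkMbit = 10.0;   // assumed link speed, not from the text

        System.out.printf("fold difference: %.1f%n", foldDifference(dasXmlMb, useqMb));
        System.out.printf("DAS XML transfer: %.2f s%n", transferSeconds(dasXmlMb, linkMbit));
        System.out.printf("useq transfer: %.2f s%n", transferSeconds(useqMb, linkMbit));
    }
}
```

For this single chromosome the ratio is about 142, consistent with the ">100 fold" figure; the absolute times scale with whatever bandwidth is actually available.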
Implementation and Results
We have adopted DAS as our genomic data distribution model and have been working with the GenoViz open source project [1, 19, 22] to extend the functionality of the GenoViz Genometry DAS/2 server in three key areas. The first improvement was to implement a user-group public-private security model using HTTP MD5 digest authentication to restrict access to designated genomic datasets to particular users. Researchers need to be able to compare their unpublished data with public datasets. Clinicians working with patient data require controlled access in all situations. If needed, these servers can leverage other internet based security protocols, such as the Secure Sockets Layer and virtual private networks used by banks and hospitals for securing internet data exchange.
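For readers unfamiliar with HTTP digest authentication, the following Java sketch computes the client response in the original form of the scheme, where the password is never sent in the clear: response = MD5(HA1:nonce:HA2), with HA1 = MD5(username:realm:password) and HA2 = MD5(method:uri). It is not taken from the GenoViz server; the class name, the example credentials, realm, nonce, and URI are all hypothetical, and whether the server uses this form or the later variant with additional qop fields is not stated in the text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestResponse {
    /** Lowercase hexadecimal MD5 of a string (UTF-8 encoding assumed). */
    public static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Digest response without the optional qop fields. */
    public static String response(String user, String realm, String password,
                                  String method, String uri, String nonce) {
        String ha1 = md5Hex(user + ":" + realm + ":" + password);
        String ha2 = md5Hex(method + ":" + uri);
        return md5Hex(ha1 + ":" + nonce + ":" + ha2);
    }

    public static void main(String[] args) {
        // All values below are made up for illustration.
        String r = response("alice", "das2-private", "secret",
                            "GET", "/das2/genome", "abc123nonce");
        System.out.println(r);
    }
}
```

The server, which knows the same credentials, repeats the computation with the nonce it issued and grants access only if the two hashes match.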
A second improvement was to develop a compressed, pre-indexed, binary data format called useq that would support the majority of high throughput genomic text based data formats (e.g. bed, gff, gtf, wig, sgr, gr) in a manner that would not require indexing upon server start up nor loading of the data into memory. The GenoViz DAS/2 server was built using an in-memory data distribution model. This is appropriate for reference annotations and enables a rapid response to DAS/2 requests. The useq data format provides a mechanism for hosting a large number of high-density datasets limited only by disk space. Tools for generating and extracting information from useq archives are distributed with the USeq package. A detailed description of the format is included in the USeq documentation [23].
Presented here are three software applications developed to assist with generating, annotating, analyzing, organizing, distributing, and visualizing genomic data. GNomEx is the first published open source genomic LIMS that supports next generation sequencing and microarray platforms. It is an enterprise level application built for integrating multiple university core facilities and dovetails with the Bio Sample Tracking database in use at the University of Utah and Huntsman Cancer hospitals. Unlike most other LIMS, GNomEx contains an analysis project center where multiple users can upload, annotate, and associate analysis with the raw data archived in GNomEx. This is a critical feature needed to maintain chain-of-custody tracking from patients to samples to raw data to analyzed data. To efficiently distribute these processed data, we developed an easy to use web application called GenoPub. GenoPub associates and distributes metadata with each analyzed dataset through the GenoViz DAS/2 server. Analysis can be organized under multiple views (e.g. by patient, disease, or factor) and restricted to particular users, enabling the controlled distribution of patient and unpublished data alongside public datasets. To obtain analysis, users either manually download it to their local computer or access it programmatically through DAS/2-enabled client applications such as IGB.
These tools provide critical infrastructure for efficiently managing and distributing genomic data for use in the laboratory and the clinic and return the focus of genomic bioinformatics to data analysis. The development of novel analysis methods is accelerating as fast as next generation sequencing costs fall. Unfortunately, making these cutting edge analysis tools accessible to a wide spectrum of users is proving difficult. One solution presented here makes use of a stand alone GUI, GWrap, to convert 120 command line applications found in two widely used next generation sequencing and tiling microarray analysis packages, USeq and TiMAT2, into a user friendly GUI without placing a burden on developers or compromising the command line interface. GWrap can be incorporated into other analysis packages with minimal effort. In summary, we believe these next generation tools are well suited for making the best use of datasets from the post-genomic era.
Availability and requirements
Project names: GNomEx, GWrap, GenoPub
Operating systems: Platform independent
Programming languages: Java
Other requirements: Java 1.6+, a relational database (e.g. MySQL, Microsoft SQL Server), object/relational database mapping tool Hibernate 3.2+ https://www.hibernate.org, a Java servlet container (e.g. Apache Tomcat, Orion)
Licenses: GPLv3 for GNomEx, BSD for GWrap and USeq, Common Public License for GenoPub
Restrictions: For profit organizations are required to obtain a commercial license before deploying GNomEx in whole or part. No such restrictions are in place for USeq, GWrap, or GenoPub. See the licence.txt document in the individual package downloads for details.
The authors would like to acknowledge the tremendous resources provided by both the open source GenoViz project [22] (Gregg Helt, Steve Chervitz, Ed Erwin, Allen Day, Brian O'Connor, Ehsan Tabari, Hiral Vora, Ido M Tamir, Marc RJ Carlson, Nomi Harris) and Ann Loraine's BioViz group [25] (University of North Carolina at Charlotte: Ann Loraine, John Nicol, Steve Blanchard, Hiral Vora, Archana Raja; most of whom are GenoViz developers). The authors also thank the Huntsman Cancer Institute and National Institutes of Health (grant P01CA24014) for funding and releasing GNomEx, GenoPub, and GWrap to the non-profit community as free open source software.
- 2. University of Utah's GNomEx Installation [https://hci-as1.hci.utah.edu/gnomex/gnomex.html]
- 3. Whetzel PL, Parkinson H, Causton HC, Fan L, Fostel J, Fragoso G, Game L, Heiskanen M, Morrison N, Rocca-Serra P, Sansone SA, Taylor C, White J, Stoeckert CJ Jr: The MGED Ontology: a resource for semantics-based description of microarray experiments. Bioinformatics 2006, 22(7):866–73. 10.1093/bioinformatics/btl005
- 4. Apache Lucene Search Engine [http://lucene.apache.org]
- 5. Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J: Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol 2010, 1–21. Chapter 19:Unit 19.10
- 7. USeq Next Generation Analysis Package [http://useq.sourceforge.net]
- 9. TiMAT2 Tiling Microarray Analysis Tools [http://timat2.sourceforge.net]
- 10. Gene Expression Omnibus and Short Read Archive [http://www.ncbi.nlm.nih.gov/geo]
- 12. Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M, Meyer L, Hsu F, Hinrichs AS, Harte RA, Giardine B, Fujita P, Diekhans M, Dreszer T, Clawson H, Barber GP, Haussler D, Kent WJ: The UCSC Genome Browser Database: update 2009. Nucleic Acids Res 2009, (37 Database):D755–61. 10.1093/nar/gkn875
- 13. UCSC Genome Bioinformatics [http://genome.ucsc.edu]
- 16. Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L: The distributed annotation system. BMC Bioinformatics 2001, 2:7.
- 18. Jenkinson AM, Albrecht M, Birney E, Blankenburg H, Down T, Finn RD, Hermjakob H, Hubbard TJ, Jimenez RC, Jones P, Kähäri A, Kulesha E, Macías JR, Reeves GA, Prlić A: Integrating biological data--the Distributed Annotation System. BMC Bioinformatics 2008, 9(Suppl 8):S3. 10.1186/1471-2105-9-S8-S3
- 22. GenoViz Project [http://genoviz.sourceforge.net]
- 23. USeq binary file format [http://useq.sourceforge.net/useqArchiveFormat.html]
- 24. DAS Registry [http://www.dasregistry.org]
- 25. UNC BioViz Group [http://igb.bioviz.org]
- 26. Rosenbloom KR, Dreszer TR, Pheasant M, Barber GP, Meyer LR, Pohl A, Raney BJ, Wang T, Hinrichs AS, Zweig AS, Fujita PA, Learned K, Rhead B, Smith KE, Kuhn RM, Karolchik D, Haussler D, Kent WJ: ENCODE whole-genome data in the UCSC Genome Browser. Nucleic Acids Res 2010, (38 Database):D620–5. 10.1093/nar/gkp961
- 27. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell 2007, 129(4):823–37.
- 28. Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, Lee W, Mendenhall E, O'Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE: Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 2007, 448(7153):553–60. 10.1038/nature06008
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.