Background

Today, more than 1000 genomes of cellular organisms, mostly microbes, have been completely sequenced and deposited in public databases, in addition to over 2000 viral genomes, and these numbers are expected to skyrocket in the near future. While sequencing projects remain largely biased towards genomes linked to human interests [1] (e.g., domestic animals and plants, microbial pathogens, and microbes exploited in industry and agriculture), some serious initiatives are being launched for sequencing organisms that represent all branches of the tree of life [2].

Concomitant with the genomic revolution, unprecedented advances in sequencing technology have also led to the emergence of the field of metagenomics, which offers a novel approach for studying (microscopic) life in different environments. Metagenomics allows investigators to assess the biodiversity in a given ecosystem by directly sequencing DNA sampled from that ecosystem [3–5]. As so-called next-generation sequencing technologies evolve, producing tremendous amounts of data [6], the existing tools for sequence annotation are not fast enough to keep pace. Consequently, manual annotation has become almost impossible, while automated annotation tools often lead to error propagation and biologically irrelevant ontologies.

Materials and methods

Here, I demonstrate how the use of the subsystems [7] and FIGfams [8, 9] technologies, initiated by the Fellowship for Interpretation of Genomes (FIG) and the University of Chicago National Microbial Pathogen Data Resource (NMPDR) project [10], has improved the accuracy and consistency of genome and metagenome annotation [11]. Subsystems make it possible to combine careful human annotation with the rapid computational propagation of assertions made by human experts, through the RAST [8] pipeline for genome annotation, the MG-RAST server for metagenome annotation [12], and Phage-RAST for phage genome annotation (work in progress).

Results and conclusion

Although these servers offer relatively rapid annotation, annotating a large metagenomic data set can still take weeks to months, and the increasing throughput of sequencing platforms requires even faster pipelines. To address this challenge, researchers at San Diego State University, FIG, and the Argonne National Laboratory are developing a protein family signature-based technology (Robert A. Edwards, Ross Overbeek, et al., submitted) to reduce annotation time by an order of magnitude and create a real-time annotation server (URL: http://edwards.sdsu.edu/rtmg). Such a server will not only improve speed but will also allow the implementation of annotation pipelines on cell phones (Josh Hoffman et al., unpublished data) and social networks (Daniel Cuevas et al., unpublished data).