Encyclopedia of Metagenomics

Living Edition
Editors: Karen E. Nelson


  • Torsten Thomas
  • Jack Gilbert
  • Folker Meyer
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4614-6418-1_728-4

Keywords

Read length · Nucleoside triphosphate · Metagenomic data · Metagenomic dataset · Metagenomic sample

Introduction

Microbial ecology aims to comprehensively describe the diversity and function of microorganisms in the environment. Not long ago, culturing, microscopy, and chemical or biological assays were the main tools in this field. Molecular methods, such as 16S rRNA gene sequencing, were applied to environmental systems in the 1990s and began to uncover a remarkable diversity of organisms (Barns et al. 1994). Soon, knowledge of the diversity of just one or a few genes no longer satisfied the desire to describe microbial systems comprehensively, and approaches were developed to capture the total genetic diversity of a given environment (Riesenfeld et al. 2004). One such approach is metagenomics, which involves sequencing the total DNA extracted from environmental samples. Arguably, metagenomics has been the fastest-growing field of microbiology in recent years and has almost become routine practice. The learning curve in the field has been steep, and many obstacles still need to be overcome before metagenomics becomes a reliable and standard process. It is therefore timely to reflect on what has been learned from metagenome projects over the past few years and to anticipate future needs and developments.

This brief primer gives an overview of the current status, practices, and limitations of metagenomics. We present an introduction to sampling design, DNA extraction, sequencing technology, assembly, annotation, data sharing, and storage.

Sampling Design and DNA Processing

Metagenomic studies of single habitats, for example, acid mine drainage (Tyson et al. 2004), termite hindgut (Warnecke et al. 2007), cow rumen (Hess et al. 2011), and the human gastrointestinal tract (Gill et al. 2006), have provided insight into the basic diversity and ecology of these environments. Moreover, comparative studies have explored the ecological distribution of genes and the functional adaptations of different microbial communities to specific ecosystems (Tringe et al. 2005; Dinsdale et al. 2008; Delmont et al. 2011). These pioneering studies were predominantly designed to develop and prove the general metagenomic approach and were often limited by the high cost of sequencing. Hence, desirable scientific methodology, including biological replication, could not be adopted, a situation that precluded appropriate statistical analyses and comparisons (Prosser 2010).

The significant reduction, and indeed continuing fall, in sequencing costs (see below) now means that the central tenets of scientific investigation can be adhered to. Rigorous experimental design will help researchers explore the complexity of microbial interactions and will lead to improved catalogs of proteins and genetic elements. Individual ecosystems can now be studied with appropriate cross-sectional and temporal approaches designed to identify the frequency and distribution of variance in community interaction and development (Knight et al. 2012). Such studies should also pay close attention to the collection of comprehensive physical, chemical, and biological data (see below). This will enable scientists to elucidate the emergent properties of even the most complex biological systems, to identify drivers at multiple spatial, temporal, taxonomic, phylogenetic, functional, and evolutionary levels, and to define the feedback mechanisms that mediate equilibrium.

The frequency and distribution of variance within a microbial ecosystem are basic factors that must be ascertained by rigorous experimental design and analysis. For example, to analyze the microbial community structure from 1 L of seawater in a coastal pelagic ecosystem, one must ideally also define how representative this sample is of the ecosystem as a whole and what the bounds of that ecosystem are. Numerous studies of marine systems have shown how community structure can vary between water masses and over time (e.g., Gilbert et al. 2012; Fuhrman 2009; Fuhrman et al. 2006, 2008; Martiny et al. 2006), and metagenomics is currently helping to further define how community structure varies in these environments (Ottesen et al. 2011; DeLong et al. 2006; Rusch et al. 2007; Gilbert et al. 2010a). In contrast, in soil systems variance in space appears to be far larger than variance in time (Mackelprang et al. 2011; Barberan et al. 2012; Bergmann et al. 2011; Nemergut et al. 2011; Bates et al. 2011). Considerable work is still needed to determine spatial heterogeneity, for example, how representative a 0.1 mg sample of soil is of the larger environment from which it was taken.

The design of a sampling strategy is implicit in the scientific questions asked and the hypotheses tested, and standard rules outside of replication and frequency of observation are hard to define. However, the question of “depth of observation” is prudent to address because researchers now can sequence microbiomes of individual environments with exceptional depth or breadth. By enabling either deep characterization of the taxonomic, phylogenetic, and functional potential of a given ecosystem or a shallow investigation of these elements across hundreds or thousands of samples, current sequencing technology (see below) is changing the way microbial surveys are being performed (Knight et al. 2012).

DNA handling and processing play a major role in exploring microbial communities through metagenomics (see also the entries “Extraction Methods, DNA” and “Extraction Methods, Variability Encountered in” on DNA extraction methods for human studies). Specifically, it is well known that the DNA extraction method used for a sample will affect the community profile obtained (e.g., Delmont et al. 2012). Therefore, projects such as the Earth Microbiome Project, which aim to compare a large number of samples, have made efforts to standardize DNA extraction protocols across physical samples. Clearly, no single protocol will be suitable for every sample type (Gilbert et al. 2010b, 2011). For example, a particular extraction protocol might yield only very low DNA concentrations for a particular sample type, making it necessary to explore other protocols to improve efficiency. However, differences among DNA extraction protocols may limit the comparability of data. Researchers therefore need to further define, in qualitative and quantitative terms, how different DNA extraction methodologies affect the observed microbial community structure.

Sequencing Technology and Quality Control

The rapid development of sequencing technologies over the past few years has arguably been one of the driving forces in the field of metagenomics. While shotgun metagenomic studies initially relied on hardware-intensive and costly Sanger sequencing technology (Tyson et al. 2004; Venter et al. 2004), available only to large research institutes, the advent and continuous release of several next-generation sequencing (NGS) platforms have democratized the sequencing market and given individual laboratories and research teams access to affordable sequencing data. Among the available NGS options, the Roche (Margulies et al. 2005), Illumina (Bentley et al. 2008), Ion Torrent (Rothberg et al. 2011), and SOLiD (Life Technologies) platforms have been applied to metagenomic samples, with the first two used more intensively than the latter two. The features of these sequencing technologies have been extensively reviewed – see, for example, Metzker (2010) and Quail et al. (2012) – and are therefore only briefly summarized here (Table 1).
Table 1 Next-generation sequencing technologies and their throughput, errors, and application to metagenomics

Machine (manufacturer) | Throughput (per machine run) | Reported errors | Error/metagenomic example references
GS FLX Titanium (454/Roche) | ~1 M reads @ ~500 nt | 0.56 % indels; up to 0.12 % substitutions | McElroy et al. 2012; Fan et al. 2012
HiSeq 2000 (Illumina) | ~3 G reads @ 100 nt | ~0.001 % indels; up to 0.34 % substitutions | McElroy et al. 2012; Quail et al. 2012; Hess et al. 2011
Ion Torrent PGM (Life Technologies) | ~0.1–5 M reads @ ~200 nt | 1.5 % indels | Loman et al. 2012; Whiteley et al. 2012
SOLiD (Life Technologies) | ~120 M reads @ ~50 nt | Up to 3 % | Salmela 2010; Zhou et al. 2011; Iverson et al. 2012

Roche’s platform uses pyrosequencing (often referred to as 454 sequencing, after the company that initially developed the technology) as its underlying molecular principle. Pyrosequencing involves the binding of a primer to a template and the sequential addition of each of the four nucleoside triphosphates in the presence of a DNA polymerase. If the offered nucleoside triphosphate matches the next position after the primer, its incorporation releases pyrophosphate (PPi). PPi production is coupled, via an enzymatic reaction involving ATP sulfurylase and luciferase, to the production of a light signal that is detected by a charge-coupled device. The Ion Torrent sequencing platform uses a related approach; here, however, the protons released during nucleotide incorporation are detected by semiconductor technology. In both cases, the light or charge signal indicates incorporation of the sequentially offered nucleotide and can be used to deduce the sequence downstream of the primer. Homopolymer stretches produce signals whose intensity is, in principle, proportional to the number of incorporated nucleotides; however, the linearity of this relationship is limited by enzymatic and engineering factors, leading to the well-investigated insertion and deletion (indel) sequencing errors of these platforms (Prabakaran et al. 2011; McElroy et al. 2012).
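The mapping from flow signals to base calls, and the origin of homopolymer-associated indels, can be illustrated with a minimal Python sketch; the flow order and signal values below are illustrative assumptions, not vendor data or software:

    # Minimal sketch (not vendor software) of how flow-based platforms such as
    # 454 or Ion Torrent translate per-flow signal intensities into base calls.
    # The flow order and the signal values used below are illustrative.

    FLOW_ORDER = "TACG"  # nucleotides are offered cyclically in this order

    def call_bases(flow_signals):
        """Round each flow signal to an integer homopolymer length and emit
        that many copies of the flowed nucleotide. Rounding error on longer
        homopolymers is what produces insertion/deletion (indel) miscalls."""
        sequence = []
        for i, signal in enumerate(flow_signals):
            base = FLOW_ORDER[i % len(FLOW_ORDER)]
            sequence.append(base * round(signal))  # e.g. a signal of 2.8 is called as 3 bases
        return "".join(sequence)

    print(call_bases([2.0, 1.0, 0.0, 3.0]))  # noiseless signal -> TTAGGG
    print(call_bases([2.0, 1.0, 0.0, 2.4]))  # depressed G signal -> TTAGG (one G lost)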

Illumina sequencing is based on the incorporation and detection of fluorescently labeled nucleoside triphosphates to extend a primer bound to a template. The key feature of the nucleoside triphosphates is a chemically modified 3′ position that does not allow for further chain extension (“terminator”). Thus, the primer gets extended by only one position, whose identity is detected by different fluorescent colors for each of the four nucleosides. Through a chemical reaction, the fluorescent label is then removed, and the 3′ position is converted into a hydroxyl group allowing for another round of nucleoside incorporation. The use of a reversible terminator thus allows for a stepwise and detectable extension of the primer that results in the determination of the template sequence. In theory, this process could be repeated to generate very long sequences; in practice, however, misincorporation of nucleosides in the many clonal template strands results in the fluorescent signal getting out of phase, and thus reliable sequencing information is only obtained for about 200 positions (Quail et al. 2012).

SOLiD sequencing utilizes ligation, rather than polymerase-mediated chain extension, to determine the sequence of a template. Primers are extended through the ligation with fluorescently labeled oligonucleotides. The high specificity of the ligase ensures that only oligonucleotides matching the downstream sequence will be incorporated; and by encoding different oligonucleotides with different fluorophores, the sequence can be determined.

It is important to understand the features of a sequencing technology in terms of throughput, read length, and errors (see Table 1), because these will have a significant impact on downstream processing. For example, the relatively high frequency of homopolymer errors in pyrosequencing data can impact ORF identification (Rho et al. 2010) but might still allow for reliable gene annotation because of the comparatively long read length (Wommack et al. 2008). Conversely, the short read length of Illumina sequencing might reduce the rate of annotation of unassembled data, but the substantial throughput and data volume generated can facilitate the assembly of entire draft genomes from metagenomic data (Hess et al. 2011). These considerations are also particularly relevant as new sequencing technologies come online. These include single-molecule sequencing using zero-mode waveguide nanostructure arrays (Eid et al. 2009), which promises read lengths beyond 1,000 bp and has been shown to improve hybrid assemblies of genomes (Koren et al. 2012), as well as nanopore sequencing (Schneider and Dekker 2012), which also promises long read lengths.

One important practical aspect to consider when analyzing raw sequencing data is the quality value assigned to reads. For a long time, the quality assessment provided by the technology vendor was the only option available to data consumers. Recently, however, a vendor-independent approach to error detection and characterization has been described that estimates errors from reads accidentally duplicated during the PCR stages of library preparation, an artifact described for the Ion Torrent, 454, and Illumina sequencing technologies (Trimble et al. 2012). Moreover, a significant number of publicly available metagenomic datasets contain sequencing adapters, apparently because quality control is often performed on assembled sequences rather than on raw reads. Simple statistical analyses with tools such as FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) will rapidly detect most of these adapter contaminations. An important aspect of quality control is therefore that each individual dataset requires its own error profiling; relying on the general properties of the platform used is not sufficient.
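As an illustration of such read-level screening, the following minimal Python sketch flags candidate artificial duplicates and adapter-containing reads in a FASTQ file. The adapter string, prefix length, and file name are illustrative assumptions, and the sketch is not a substitute for dedicated tools such as FastQC or the duplicate-based error estimation of Trimble et al. (2012):

    # Minimal read-level QC sketch: count reads that contain a known adapter
    # sequence and reads whose leading bases duplicate another read's prefix
    # (candidate artificial duplicates from library-preparation PCR).
    # Adapter string, prefix length, and file name are illustrative assumptions.

    from collections import Counter
    from itertools import islice

    ADAPTER = "AGATCGGAAGAGC"  # example adapter fragment (assumption)
    PREFIX_LEN = 20

    def read_fastq(path):
        """Yield (header, sequence) tuples from an uncompressed FASTQ file."""
        with open(path) as handle:
            while True:
                record = list(islice(handle, 4))
                if len(record) < 4:
                    break
                yield record[0].strip(), record[1].strip()

    def qc_summary(path):
        prefixes = Counter()
        adapter_hits = 0
        total = 0
        for _, seq in read_fastq(path):
            total += 1
            prefixes[seq[:PREFIX_LEN]] += 1
            if ADAPTER in seq:
                adapter_hits += 1
        duplicates = sum(n - 1 for n in prefixes.values() if n > 1)
        return {"reads": total,
                "adapter_containing": adapter_hits,
                "prefix_duplicates": duplicates}

    # print(qc_summary("sample.fastq"))  # hypothetical input file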

Assembly

Assembly of shotgun sequencing data can in general follow two strategies: the overlap-layout-consensus (OLC) approach and the de Bruijn graph approach (see also: “A de novo metagenomic assembly program for shotgun DNA reads, Human Microbiome, Assembly and Analysis Software, Project”). These two strategies are employed by a number of different genome assemblers, and the topic has been reviewed recently (Miller et al. 2010). Briefly, OLC assembly involves the pairwise comparison of sequence reads and the ordering of matching pairs into an overlap graph; the overlapping reads are then merged into a consensus sequence. Assembly with the de Bruijn strategy involves decomposing each read into its constituent k-mers and representing these k-mers in a graph. Two k-mers are connected when a read contains them in sequential, overlapping positions. All reads of a dataset are thus represented by the connections within the de Bruijn graph, and assembled contigs are generated by traversing these connections to yield a sequence of k-mers.
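The de Bruijn construction described above can be made concrete with a toy Python sketch (short invented reads and a small k); production assemblers add error correction, graph simplification, and memory-efficient data structures:

    # Toy de Bruijn graph: each read is decomposed into k-mers, and two k-mers
    # are connected when a read contains them in consecutive, overlapping
    # positions. Real metagenome assemblers add error correction, bubble
    # resolution, and memory-efficient graph representations.

    from collections import defaultdict

    def de_bruijn(reads, k):
        graph = defaultdict(set)
        for read in reads:
            kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
            for left, right in zip(kmers, kmers[1:]):
                graph[left].add(right)  # consecutive, overlapping k-mers
        return graph

    reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]  # invented reads
    for node, successors in sorted(de_bruijn(reads, 4).items()):
        print(node, "->", ", ".join(sorted(successors)))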

OLC assembly has the advantage that the pairwise comparison can permit a defined degree of dissimilarity between reads. This can compensate for sequencing errors and allows for the assembly of reads from heterogeneous populations (Tyson et al. 2004). However, the number of pairwise comparisons, and hence the memory requirement, grows quadratically with the number of reads in the dataset, so OLC assemblers often cannot deal with very large datasets (e.g., Illumina data). Nevertheless, several OLC assemblers, including the Celera Assembler (Miller et al. 2008), Phrap (de la Bastide and McCombie 2007), and Newbler (Roche), have been used to assemble partial or complete draft genomes from metagenomic data; see, for example, Tyson et al. (2004), Liu et al. (2011), and Brown et al. (2012).
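For comparison, the overlap step of the OLC strategy can be sketched as an exact all-against-all suffix–prefix search on toy reads; real OLC assemblers tolerate mismatches and use indexing to avoid the full quadratic comparison shown here:

    # Toy overlap step of the OLC strategy: an exact all-against-all
    # suffix-prefix search. The nested loop over read pairs is what makes the
    # cost grow quadratically with read number; real OLC assemblers use
    # indexing and tolerate a defined number of mismatches.

    def longest_overlap(a, b, min_len=3):
        """Length of the longest suffix of a that equals a prefix of b."""
        for length in range(min(len(a), len(b)), min_len - 1, -1):
            if a[-length:] == b[:length]:
                return length
        return 0

    def overlap_graph(reads, min_len=3):
        edges = {}
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    length = longest_overlap(a, b, min_len)
                    if length:
                        edges[(i, j)] = length
        return edges

    reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]  # invented reads
    print(overlap_graph(reads))  # {(0, 1): 5, (0, 2): 3, (1, 2): 5}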

In contrast, the memory requirements of de Bruijn assemblers are largely determined by the k-mer size chosen to define the graph. These assemblers have therefore been used successfully with large numbers of short reads. Initially, de Bruijn assemblers designed for clonal genomes, such as Velvet (Zerbino and Birney 2008), SOAP (Li et al. 2008), and ABySS (Simpson et al. 2009), were used to assemble metagenomic data. Because of the heterogeneous nature of microbial populations, however, assemblies often ended up fragmented. One reason is that every positional difference between two reads from the same region of two closely related genomes creates a “bubble” in the graph; another is that sequencing errors in low-abundance reads cause terminating branches. Traversing such a highly branched graph yields a large number of contigs. These problems have been partially overcome by modifications of existing de Bruijn assemblers, such as MetaVelvet (Namiki et al. 2012), or by newly designed de Bruijn-based algorithms, such as Meta-IDBA (Peng et al. 2011; see also “Meta-IDBA, overview”). Conceptually, these solutions often involve identifying subgraphs that correspond to individual genomes or using k-mer abundance information to find an optimal path through the graph.

These subdividing approaches are analogous to binning metagenomic reads or contigs in order to identify groups of sequences that belong to a specific genome. These bins, or even individual sequence reads, can also be taxonomically classified by comparison with known reference sequences. Binning and classification of sequences can be based on phylogeny, similarity, or composition (or combinations thereof), and a large number of algorithms and software tools are available; for recent comparisons and benchmarking of binning and classification software, see Bazinet and Cummings (2012) and Droge and McHardy (2012). Obviously, care has to be taken with any automated process, since unrelated sequences can be combined into chimeric bins or classes. It is thus advisable that any binning or classification strategy be thoroughly tested through appropriate in vitro and in silico simulations (Mavromatis et al. 2007; Morgan et al. 2010; McElroy et al. 2012). Manual curation of contigs and iterative assembly and mapping can also produce improved genomes from metagenomic data (Dutilh et al. 2009). Through such carefully designed strategies and refined processes, nearly complete genomes can be assembled from large numbers of short reads, even for low-abundance organisms (Iverson et al. 2012).
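As a minimal illustration of composition-based binning, the following Python sketch represents contigs by tetranucleotide frequency vectors and groups them with a small k-means routine; the choice of k, and the omission of the coverage, marker-gene, and paired-read information used by real binning tools, are simplifying assumptions:

    # Minimal composition-based binning sketch: contigs are represented by
    # tetranucleotide frequency vectors and grouped with a tiny k-means
    # routine. Real binning tools additionally use read coverage, marker
    # genes, and paired-end links, and results should be validated as
    # discussed above.

    from collections import Counter
    from itertools import product
    import random

    TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

    def tetra_profile(seq):
        """Relative frequencies of all 256 tetranucleotides in a sequence."""
        counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
        total = sum(counts[t] for t in TETRAMERS) or 1
        return [counts[t] / total for t in TETRAMERS]

    def kmeans(vectors, k, iterations=25, seed=0):
        """Assign each vector to one of k clusters (Lloyd's algorithm)."""
        random.seed(seed)
        centers = random.sample(vectors, k)

        def nearest(v):
            return min(range(k),
                       key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))

        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for v in vectors:
                clusters[nearest(v)].append(v)
            centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        return [nearest(v) for v in vectors]

    # Hypothetical usage with contig sequences from an assembly:
    # profiles = [tetra_profile(seq) for seq in contig_sequences]
    # bin_labels = kmeans(profiles, k=2)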

Annotation

Initially, techniques developed for annotating clonal genomes were applied to metagenomic data, and several tools for metagenomic analysis, such as MG-RAST (Meyer et al. 2008) and IMG/M (Markowitz et al. 2008), were derived from existing software suites. For metagenomic projects, the principal challenges lie in the size of the dataset, the heterogeneity of the data, and the fact that sequences are frequently short, even if assembled prior to analysis.

The first step of the analysis (after extensive quality control; see above) is the identification of genes in the DNA sequences. Fundamentally, two approaches exist: the extrinsic approach, which relies on similarity comparison of an unknown sequence against existing databases, and the intrinsic (or de novo) approach, which applies statistical analysis of sequence properties, such as codon usage, to define likely open reading frames (ORFs). For metagenomic data, the extrinsic approach (e.g., running a similarity search with BLASTX) comes at significant computational cost (Wilkening et al. 2009), rendering it less attractive. De novo approaches based on codon or nucleotide k-mer usage are thus more promising for large datasets. De novo gene-calling software for microbial genomes is, however, typically trained on long contigs and assumes clonal genomes. For metagenomic datasets this is often unsuitable, because training data are lacking and the many different genomes present contribute multiple distinct codon usage (or k-mer) profiles.
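The intrinsic idea can be illustrated with a naive six-frame ORF scan (a start-to-stop search only); real de novo gene callers additionally score codon or k-mer usage statistically and, in the metagenomic case, model sequencing errors:

    # Naive intrinsic gene finding: scan all six reading frames for open
    # reading frames bounded by a start (ATG) and a stop codon. Real de novo
    # gene callers score codon or k-mer usage statistically; this sketch only
    # illustrates the reading-frame logic.

    COMPLEMENT = str.maketrans("ACGT", "TGCA")
    STOP_CODONS = {"TAA", "TAG", "TGA"}

    def orfs_in_frame(seq, frame, min_len=60):
        found, start = [], None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS and start is not None:
                if i + 3 - start >= min_len:
                    found.append((start, i + 3))
                start = None
        return found

    def six_frame_orfs(seq, min_len=60):
        reverse = seq.translate(COMPLEMENT)[::-1]
        orfs = []
        for frame in range(3):
            orfs += [("+", s, e) for s, e in orfs_in_frame(seq, frame, min_len)]
            orfs += [("-", s, e) for s, e in orfs_in_frame(reverse, frame, min_len)]
        return orfs

    print(six_frame_orfs("ATGAAACCCGGGTTTTAA", min_len=15))  # [('+', 0, 18)]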

However, several software packages have been designed to predict genes for short fragments or even reads (see Trimble et al. 2012 for a review). The most important finding of that review is the effect of errors on gene prediction performance, reducing the reading frame accuracy of most tools to well below 20 % at 3 % sequencing error. Only the software FragGeneScan (Rho et al. 2010; see also FragGeneScan, overview) accounted for the possibility that metagenomic sequences may contain errors, thus allowing it to clearly outperform its competitors.

Once identified, protein-coding genes require functional assignment. Here again, numerous tools and databases exist. Many researchers have found that performing BLAST analysis against the NCBI nonredundant database adds little value to their metagenomic datasets. Preferable are databases that contain high-level groupings of functions, for example, into metabolic pathways as in KEGG (Kanehisa 2002) or into subsystems as in SEED (Overbeek et al. 2005). Using such higher-level groupings allows for the generation of overviews and comparison between samples after statistical normalization.
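Such higher-level comparisons amount to aggregating per-gene assignments into pathway or subsystem counts and normalizing within each sample, as in the following illustrative Python sketch (the gene and category names are invented):

    # Illustrative aggregation of per-gene functional assignments into
    # higher-level categories (e.g. KEGG pathways or SEED subsystems),
    # followed by per-sample normalization so that samples of different
    # sequencing depth can be compared. Gene and category names are invented.

    from collections import Counter

    def category_profile(gene_to_category, annotated_genes):
        """Count annotated genes per category and return relative abundances."""
        counts = Counter(gene_to_category[g]
                         for g in annotated_genes if g in gene_to_category)
        total = sum(counts.values()) or 1
        return {category: n / total for category, n in counts.items()}

    gene_to_category = {"geneA": "Carbohydrate metabolism",
                        "geneB": "Carbohydrate metabolism",
                        "geneC": "Sulfur metabolism"}

    deep_sample = ["geneA", "geneB", "geneB", "geneC"]
    shallow_sample = ["geneA", "geneC"]

    print(category_profile(gene_to_category, deep_sample))
    print(category_profile(gene_to_category, shallow_sample))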

The time and resources required to perform functional annotations are substantial, but approaches that project multiple results derived from a single sequence analysis into multiple namespaces can minimize these computational costs (Wilke et al. 2012). Numerous tools are also available to predict, for example, short RNAs and/or other genomic features, but these tools are frequently less useful for large metagenomic datasets that exhibit both low sequence quality and short reads.

Several integrated services package annotation functionality into a single web resource. The CAMERA website (Seshadri et al. 2007), for example, provides users with the ability to run a number of pipelines on metagenomic data (see “Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis”). The Joint Genome Institute’s IMG/M web service provides analysis of assembled metagenomic data and has so far been used for over 300 metagenomic datasets. The European Bioinformatics Institute provides a service aimed at smaller, typically 454/pyrosequencing-derived metagenomes. The most widely used service is the MG-RAST system (Meyer et al. 2008), which has processed over 50,000 metagenomes comprising over 140 billion base pairs of data. The system offers comprehensive quality control, tools for comparison of datasets, and data import and export to, for example, QIIME (Caporaso et al. 2010) using standard formats such as BIOM (McDonald et al. 2012).

Metadata, Standards, Sharing, and Storage

With over 50,000 metagenomes available, the scientific community has realized that standardized metadata (“data about data”) and higher-level classification (e.g., a controlled vocabulary) will increase the usefulness of datasets for novel discoveries (see also Metagenomics, Metadata and MetaAnalysis). Through the efforts of the Genomic Standards Consortium (GSC) (Field et al. 2011), a set of minimal questionnaires has been developed and accepted by the community (Yilmaz et al. 2010) that allows effective communication of metadata for metagenomic samples of diverse types. While the “required” GSC metadata is purposefully minimal and thus provides only a rough description, several domain-specific environmental packages exist that contain more detailed information.
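The flavor of such a minimal checklist can be conveyed with an illustrative metadata record; the field names below only approximate MIxS-style core descriptors, and the authoritative field list and controlled vocabularies should be taken from the current GSC standard:

    # Illustrative sample metadata record in the spirit of the GSC minimal
    # checklists. Field names only approximate MIxS-style descriptors;
    # consult the current GSC standard for the authoritative fields and
    # controlled vocabularies. All values are invented examples.

    sample_metadata = {
        "project_name":       "Coastal seawater time series (example)",
        "investigation_type": "metagenome",
        "collection_date":    "2012-06-15",
        "lat_lon":            "12.34 N 56.78 W",
        "geo_loc_name":       "coastal site (example)",
        "env_biome":          "marine biome",
        "env_feature":        "coastal sea water",
        "env_material":       "sea water",
        "seq_meth":           "Illumina HiSeq 2000",
    }

    REQUIRED_FIELDS = ("project_name", "investigation_type", "collection_date",
                       "lat_lon", "env_biome", "env_feature", "env_material")

    def missing_required(record, required=REQUIRED_FIELDS):
        """Report which minimal fields are absent or empty."""
        return [field for field in required if not record.get(field)]

    print(missing_required(sample_metadata))  # [] when the minimal fields are present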

As the standards evolve to match the needs of the scientific community, the groups developing software and analysis services have begun to rely on the presence of GSC-compliant metadata, effectively turning them into essential data for any metagenome project. Furthermore, comparative analysis of metagenomic datasets is becoming a routine practice, and acquiring metadata for these comparisons has become a requirement for publication in several scientific journals. Since reanalysis of raw sequence reads is often computationally too costly, the sharing of analysis results is also advisable. Currently only the IMG/M and MG-RAST platforms are designed to provide cross-sample comparisons without the need to recompute analysis results. In the MG-RAST system, moreover, users can share data (after providing metadata) with other users or make data publicly available.

Metagenomic datasets continue to grow in size; indeed, the first metagenomes of several hundred gigabase pairs already exist. Storage and curation of metagenomic data have therefore become a central theme. The on-disk representation of raw data and analyses has led to massive storage issues for groups attempting meta-analyses, and there is currently no solution for accessing relevant subsets of data (e.g., only the reads and analyses pertaining to a specific phylum or species) without downloading the entire dataset. Cloud technologies may in the future provide attractive solutions to these storage and computing problems. However, specific, metadata-enabled solutions are required for cloud systems to power the community-wide (re-)analysis of the first 50,000 metagenomes.

Conclusion

Metagenomics has proven to be a valuable tool for analyzing microbial communities. Technological advances will continue to drive down the sequencing cost of metagenomic projects, and the flood of current datasets indicates that funding to obtain sequences is not a major limitation. Major bottlenecks are encountered, however, in the storage and computational processing of sequencing data. With community-wide efforts and standardized tools, the impact of these limitations might be managed in the short term. In the long term, however, large standardized databases (e.g., a “MetaGeneBank”) will be required to give the entire scientific community access to the information. Every metagenomic dataset contains many new and unexpected discoveries, and the efforts of microbiologists worldwide will be needed to ensure that nothing is missed. Data, whether raw or processed, are just data; only their biological and ecological interpretation will further our understanding of the complex and wonderful diversity of the microbial world around us.

Government License

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a US Department of Energy Office of Science Laboratory, is operated under Contract No. DE-AC02-06CH11357. The US Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.

References

  1. Barberan A, Bates ST, et al. Using network analysis to explore co-occurrence patterns in soil microbial communities. ISME J. 2012;6(2):343–51.
  2. Barns SM, Fundyga RE, et al. Remarkable archaeal diversity detected in a Yellowstone National Park hot spring environment. Proc Natl Acad Sci U S A. 1994;91(5):1609–13.
  3. Bates ST, Berg-Lyons D, et al. Examining the global distribution of dominant archaeal populations in soil. ISME J. 2011;5(5):908–17.
  4. Bazinet AL, Cummings MP. A comparative evaluation of sequence classification programs. BMC Bioinformatics. 2012;13(1):92.
  5. Bentley DR, Balasubramanian S, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456(7218):53–9.
  6. Bergmann GT, Bates ST, et al. The under-recognized dominance of Verrucomicrobia in soil bacterial communities. Soil Biol Biochem. 2011;43(7):1450–5.
  7. Brown MV, Lauro FM, et al. Global biogeography of SAR11 marine bacteria. Mol Syst Biol. 2012;8:595.
  8. Caporaso JG, Kuczynski J, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7(5):335–6.
  9. de la Bastide M, McCombie WR. Assembling genomic DNA sequences with PHRAP. Curr Protoc Bioinformatics. 2007; Chapter 11: Unit 11.4.
  10. Delmont TO, Malandain C, et al. Metagenomic mining for microbiologists. ISME J. 2011;5(12):1837–43.
  11. Delmont TO, Prestat E, et al. Structure, fluctuation and magnitude of a natural grassland soil metagenome. ISME J. 2012;6(9):1677–87.
  12. DeLong EF, Preston CM, et al. Community genomics among stratified microbial assemblages in the ocean’s interior. Science. 2006;311(5760):496–503.
  13. Dinsdale EA, Edwards RA, et al. Functional metagenomic profiling of nine biomes. Nature. 2008;452(7187):629–32.
  14. Droge J, McHardy AC. Taxonomic binning of metagenome samples generated by next-generation sequencing technologies. Brief Bioinform. 2012;13(6):646–55.
  15. Dutilh BE, Huynen MA, et al. Increasing the coverage of a metapopulation consensus genome by iterative read mapping and assembly. Bioinformatics. 2009;25(21):2878–81.
  16. Eid J, Fehr A, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323(5910):133–8.
  17. Fan L, Reynolds D, et al. Functional equivalence and evolutionary convergence in complex communities of microbial sponge symbionts. Proc Natl Acad Sci U S A. 2012;109(27):E1878–87.
  18. Field D, Amaral-Zettler L, et al. The Genomic Standards Consortium. PLoS Biol. 2011;9(6):e1001088.
  19. Fuhrman JA. Microbial community structure and its functional implications. Nature. 2009;459(7244):193–9.
  20. Fuhrman JA, Hewson I, et al. Annually reoccurring bacterial communities are predictable from ocean conditions. Proc Natl Acad Sci U S A. 2006;103(35):13104–9.
  21. Fuhrman JA, Steele JA, et al. A latitudinal diversity gradient in planktonic marine bacteria. Proc Natl Acad Sci U S A. 2008;105(22):7774–8.
  22. Gilbert JA, Field D, et al. The taxonomic and functional diversity of microbes at a temperate coastal site: a ‘multi-omic’ study of seasonal and diel temporal variation. PLoS One. 2010a;5(11):e15545.
  23. Gilbert JA, Meyer F, et al. The Earth Microbiome Project: meeting report of the “1st EMP meeting on sample selection and acquisition” at Argonne National Laboratory, October 6, 2010. Stand Genomic Sci. 2010b;3(3):249–53.
  24. Gilbert JA, Bailey M, et al. The Earth Microbiome Project: the meeting report for the 1st International Earth Microbiome Project Conference, Shenzhen, China, June 13th–15th 2010. Stand Genomic Sci. 2011;5(2):243–7.
  25. Gilbert JA, Steele JA, et al. Defining seasonal marine microbial community dynamics. ISME J. 2012;6:298–308.
  26. Gill SR, Pop M, et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312(5778):1355–9.
  27. Hess M, Sczyrba A, et al. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 2011;331(6016):463–7.
  28. Iverson V, Morris RM, et al. Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science. 2012;335(6068):587–90.
  29. Kanehisa M. The KEGG database. Novartis Found Symp. 2002;247:91–101; discussion 101–103, 119–128, 244–252.
  30. Knight R, Jansson J, et al. Designing better metagenomic surveys: the role of experimental design and metadata capture in making useful metagenomic datasets for ecology and biotechnology. Nat Biotechnol. 2012;30(6):513–20.
  31. Koren S, Schatz MC, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012;30(7):693–700.
  32. Li R, Li Y, et al. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008;24(5):713–4.
  33. Liu MY, Kjelleberg S, et al. Functional genomic analysis of an uncultured delta-proteobacterium in the sponge Cymbastela concentrica. ISME J. 2011;5(3):427–35.
  34. Loman NJ, Misra RV, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol. 2012;30(5):434–9.
  35. Mackelprang R, Waldrop MP, et al. Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature. 2011;480(7377):368–71.
  36. Margulies M, Egholm M, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437(7057):376–80.
  37. Markowitz VM, Ivanova NN, et al. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. 2008;36(Database issue):D534–8.
  38. Martiny JB, Bohannan BJ, et al. Microbial biogeography: putting microorganisms on the map. Nat Rev Microbiol. 2006;4(2):102–12.
  39. Mavromatis K, Ivanova N, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4(6):495–500.
  40. McDonald D, Clemente JC, et al. The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. Gigascience. 2012;1(1):7.
  41. McElroy KE, Luciani F, et al. GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics. 2012;13:74.
  42. Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet. 2010;11(1):31–46.
  43. Meyer F, Paarmann D, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386.
  44. Miller JR, Delcher AL, et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008;24(24):2818–24.
  45. Miller JR, Koren S, et al. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95(6):315–27.
  46. Morgan JL, Darling AE, et al. Metagenomic sequencing of an in vitro-simulated microbial community. PLoS One. 2010;5(4):e10209.
  47. Namiki T, Hachiya T, et al. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40(20):e155.
  48. Nemergut DR, Costello EK, et al. Global patterns in the biogeography of bacterial taxa. Environ Microbiol. 2011;13(1):135–44.
  49. Ottesen EA, Marin R, et al. Metatranscriptomic analysis of autonomously collected and preserved marine bacterioplankton. ISME J. 2011;5(12):1881–95.
  50. Overbeek R, Begley T, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33(17):5691–702.
  51. Peng Y, Leung HC, et al. Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics. 2011;27(13):i94–101.
  52. Prabakaran P, Streaker E, et al. 454 antibody sequencing – error characterization and correction. BMC Res Notes. 2011;4:404.
  53. Prosser JI. Replicate or lie. Environ Microbiol. 2010;12(7):1806–10.
  54. Quail M, Smith ME, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012;13(1):341.
  55. Rho M, Tang H, et al. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191.
  56. Riesenfeld CS, Schloss PD, et al. Metagenomics: genomic analysis of microbial communities. Annu Rev Genet. 2004;38:525–52.
  57. Rothberg JM, Hinz W, et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature. 2011;475(7356):348–52.
  58. Rusch DB, Halpern AL, et al. The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol. 2007;5(3):e77.
  59. Schneider GF, Dekker C. DNA sequencing with nanopores. Nat Biotechnol. 2012;30(4):326–8. doi:10.1038/nbt.2181.
  60. Salmela L. Correction of sequencing errors in a mixed set of reads. Bioinformatics. 2010;26(10):1284–90.
  61. Seshadri R, Kravitz SA, et al. CAMERA: a community resource for metagenomics. PLoS Biol. 2007;5(3):e75.
  62. Simpson JT, Wong K, et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19(6):1117–23.
  63. Trimble WL, Keegan KP, et al. Short-read reading-frame predictors are not created equal: sequence error causes loss of signal. BMC Bioinformatics. 2012;13(1):183.
  64. Tringe SG, von Mering C, et al. Comparative metagenomics of microbial communities. Science. 2005;308(5721):554–7.
  65. Tyson GW, Chapman J, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428(6978):37–43.
  66. Venter JC, Remington K, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304(5667):66–74.
  67. Warnecke F, Luginbuhl P, et al. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature. 2007;450(7169):560–5.
  68. Whiteley AS, Jenkins S, et al. Microbial 16S rRNA Ion Tag and community metagenome sequencing using the Ion Torrent (PGM) platform. J Microbiol Methods. 2012;91(1):80–8.
  69. Wilke A, Harrison T, et al. The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools. BMC Bioinformatics. 2012;13:141.
  70. Wilkening J, Wilke A, et al. Using clouds for metagenomics: a case study. In: IEEE Cluster 2009; 2009.
  71. Wommack KE, Bhavsar J, et al. Metagenomics: read length matters. Appl Environ Microbiol. 2008;74(5):1453–63.
  72. Yilmaz P, Kottmann R, et al. The “Minimum Information about an ENvironmental Sequence” (MIENS) specification. Nat Biotechnol. 2010, in press.
  73. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–9.
  74. Zhou R, Ling S, et al. Population genetics in nonmodel organisms: II. Natural selection in marginal habitats revealed by deep sequencing on dual platforms. Mol Biol Evol. 2011;28(10):2833–42.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. School of Biotechnology and Biomolecular Sciences & Centre for Marine Bio-Innovation, University of New South Wales, Sydney, Australia
  2. Department of Ecology & Evolution, University of Chicago, Chicago, USA
  3. Institute of Genomic and Systems Biology, Argonne National Laboratory, Argonne, USA