Keywords: Read Length, Nucleoside Triphosphate, Metagenomic Data, Metagenomic Dataset, Metagenomic Sample
Microbial ecology aims to comprehensively describe the diversity and function of microorganisms in the environment. Culturing, microscopy, and chemical or biological assays were not too long ago the main tools in this field. Molecular methods, such as 16S rRNA gene sequencing, were applied to environmental systems in the 1990s and started to uncover a remarkable diversity of organisms (Barns et al. 1994). Soon, the thirst for describing microbial systems was no longer satisfied by the knowledge of the diversity of just one or a few genes. Thus, approaches were developed to describe the total genetic diversity of a given environment (Riesenfeld et al. 2004). One such approach is metagenomics, which involves sequencing the total DNA extracted from environmental samples. Arguably, metagenomics has been the fastest growing field of microbiology in the last few years and has almost become a routine practice. The learning curve in the field has been steep, and many obstacles still need to be overcome to make metagenomics a reliable and standard process. It is timely to reflect on what has been learned over the past few years from metagenome projects and to predict future needs and developments.
This brief primer gives an overview of the current status, practices, and limitations of metagenomics. We present an introduction to sampling design, DNA extraction, sequencing technology, assembly, annotation, data sharing, and storage.
Sampling Design and DNA Processing
Metagenomic studies of single habitats, for example, acid mine drainage (Tyson et al. 2004), termite hindgut (Warnecke et al. 2007), cow rumen (Hess et al. 2011), and the human gastrointestinal tract (Gill et al. 2006), have provided an insight into the basic diversity and ecology of these environments. Moreover, comparative studies have explored the ecological distribution of genes and the functional adaptations of different microbial communities to specific ecosystems (Tringe et al. 2005; Dinsdale et al. 2008; Delmont et al. 2011). These pioneering studies were predominantly designed to develop and prove the general metagenomic approach and were often limited by the high cost of sequencing. Hence, desirable scientific methodology, including biological replication, could not be adopted, a situation that precluded appropriate statistical analyses and comparison (Prosser 2010).
The significant reduction, and indeed continuing fall, in sequencing costs (see below) now means that the central tenets of scientific investigation can be adhered to. Rigorous experimental design will help researchers explore the complexity of microbial interactions and will lead to improved catalogs of proteins and genetic elements. Individual ecosystems can now be studied with appropriate cross-sectional and temporal approaches designed to identify the frequency and distribution of variance in community interaction and development (Knight et al. 2012). Such studies should also pay close attention to the collection of comprehensive physical, chemical, and biological data (see below). This will enable scientists to elucidate the emergent properties of even the most complex biological systems. This capability will provide the potential to identify drivers at multiple spatial, temporal, taxonomic, phylogenetic, functional, and evolutionary levels and to define the feedback mechanisms that mediate equilibrium.
The frequency and distribution of variance within a microbial ecosystem are basic factors that must be ascertained by rigorous experimental design and analysis. For example, to analyze the microbial community structure from 1 l of seawater in a coastal pelagic ecosystem, one must ideally also define how representative this sample will be for the ecosystem as a whole and what the bounds of that ecosystem are. Numerous studies of marine systems have shown how community structure can vary between water masses and over time (e.g., Gilbert et al. 2012; Fuhrman 2009; Fuhrman et al. 2006, 2008; Martiny et al. 2006), and metagenomics currently helps further define how community structure varies in these environments (Ottesen et al. 2011; DeLong et al. 2006; Rusch et al. 2007; Gilbert et al. 2010a). In contrast, in soil systems variance in space appears to be far larger than variance in time (Mackelprang et al. 2011; Barberan et al. 2012; Bergmann et al. 2011; Nemergut et al. 2011; Bates et al. 2011). Considerable work is still needed to determine spatial heterogeneity, for example, how representative a 0.1 mg sample of soil is with respect to the larger environment from which it was taken.
The design of a sampling strategy is implicit in the scientific questions asked and the hypotheses tested, and standard rules outside of replication and frequency of observation are hard to define. However, the question of “depth of observation” is prudent to address because researchers now can sequence microbiomes of individual environments with exceptional depth or breadth. By enabling either deep characterization of the taxonomic, phylogenetic, and functional potential of a given ecosystem or a shallow investigation of these elements across hundreds or thousands of samples, current sequencing technology (see below) is changing the way microbial surveys are being performed (Knight et al. 2012).
DNA handling and processing play a major role in exploring microbial communities through metagenomics (see also DNA extraction methods for human studies, “Extraction Methods, DNA” and “Extraction Methods, Variability Encountered in”). Specifically, it is well known that the type of DNA extraction used for a sample will affect the community profile obtained (e.g., Delmont et al. 2012). Therefore, with projects like the Earth Microbiome Project that aim to compare a large number of samples, efforts have been made to standardize DNA extraction protocols for every physical sample. Clearly, no single protocol will be suitable for every sample type (Gilbert 2011, 2010b). For example, a particular extraction protocol might yield only very low DNA concentrations for a particular sample type, making it necessary to explore other protocols in order to improve efficiency. However, differences among DNA extraction protocols may limit comparability of data. Therefore, researchers need to further define in qualitative and quantitative terms how different DNA extraction methodologies affect microbial community structure.
Sequencing Technology and Quality Control
Table 1. Next-generation sequencing technologies and their throughput, errors, and application to metagenomics

| Platform | Throughput (per machine run) | Error rate |
|---|---|---|
| GS FLX Titanium (454/Roche) | ~1 M reads @ ~500 nt | 0.56 % indels; up to 0.12 % substitution |
| HiSeq 2000 (Illumina) | ~3 G reads @ 100 nt | ~0.001 % indels; up to 0.34 % substitution |
| Ion Torrent PGM (Life Technologies) | ~0.1–5 M reads @ ~200 nt | 1.5 % indels |
| SOLiD (Life Technologies) | ~120 M reads @ ~50 nt | Up to 3 % |
Roche’s platform utilizes pyrosequencing (often referred to as 454 sequencing, after the company that initially developed the platform) as its underlying molecular principle. Pyrosequencing involves the binding of a primer to a template and the sequential addition of the four nucleoside triphosphates in the presence of a DNA polymerase. If the offered nucleoside triphosphate matches the next position after the primer, its incorporation releases pyrophosphate (PPi). Through an enzymatic reaction involving an ATP sulfurylase and a luciferase, PPi production is coupled to the emission of a light signal that is detected by a charge-coupled device. The Ion Torrent sequencing platform uses a related approach; here, however, the protons released during nucleotide incorporation are detected by semiconductor technology. In both cases, the production of light or charge signals reports the incorporation of the sequentially offered nucleotide, from which the sequence downstream of the primer can be deduced. Homopolymer stretches create signals proportional to the number of identical consecutive positions; however, the linearity of this relationship is limited by enzymatic and engineering factors, leading to the well-investigated insertion and deletion (indel) sequencing errors (Prabakaran et al. 2011; McElroy et al. 2012).
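The flow-based signal model described above can be illustrated with a toy simulation. This is a sketch under simplifying assumptions: a fixed flow order, a noise-free signal, and templates restricted to A, C, G, and T; real instruments and their flow orders differ.

```python
# Toy model of flow-based sequencing (pyrosequencing / Ion Torrent style).
# Each flow offers one nucleotide; the signal is proportional to the length
# of the homopolymer run at the current template position.

FLOW_ORDER = "TACG"  # illustrative flow cycle, not a vendor specification

def flowgram(template: str) -> list[tuple[str, int]]:
    """Return (nucleotide, signal) pairs for a perfect, noise-free run.

    Assumes the template contains only the characters A, C, G, T.
    """
    signals = []
    i = 0
    while i < len(template):
        for base in FLOW_ORDER:
            run = 0
            while i < len(template) and template[i] == base:
                run += 1
                i += 1
            signals.append((base, run))
            if i >= len(template):
                break
    return signals

def decode(signals: list[tuple[str, int]]) -> str:
    """Reconstruct the template sequence from an error-free flowgram."""
    return "".join(base * n for base, n in signals)
```

In practice the signal for a long homopolymer is noisy, so the inferred run length can be off by one or more, which is exactly the indel error mode discussed above.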
Illumina sequencing is based on the incorporation and detection of fluorescently labeled nucleoside triphosphates to extend a primer bound to a template. The key feature of the nucleoside triphosphates is a chemically modified 3′ position that does not allow for further chain extension (“terminator”). Thus, the primer gets extended by only one position, whose identity is detected by different fluorescent colors for each of the four nucleosides. Through a chemical reaction, the fluorescent label is then removed, and the 3′ position is converted into a hydroxyl group allowing for another round of nucleoside incorporation. The use of a reversible terminator thus allows for a stepwise and detectable extension of the primer that results in the determination of the template sequence. In theory, this process could be repeated to generate very long sequences; in practice, however, misincorporation of nucleosides in the many clonal template strands results in the fluorescent signal getting out of phase, and thus reliable sequencing information is only obtained for about 200 positions (Quail et al. 2012).
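The dephasing argument above can be sketched numerically. The per-cycle failure rate p below is an illustrative assumption (not a vendor specification), and the simple geometric decay model ignores pre-phasing and other error sources:

```python
# Back-of-the-envelope model of signal dephasing in sequencing-by-synthesis:
# if a fraction p of the clonal strands in a cluster fails to advance each
# cycle, the fraction still in phase after n cycles decays as (1 - p) ** n.

def in_phase_fraction(p: float, n: int) -> float:
    """Fraction of strands still in phase after n cycles."""
    return (1.0 - p) ** n

def usable_cycles(p: float, threshold: float = 0.5) -> int:
    """Cycles before the in-phase fraction first drops below threshold."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    n = 0
    while in_phase_fraction(p, n + 1) >= threshold:
        n += 1
    return n
```

With p around 0.3 % per cycle, this toy model gives a usable read length on the order of a couple of hundred cycles, broadly consistent with the practical limit mentioned above.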
SOLiD sequencing utilizes ligation, rather than polymerase-mediated chain extension, to determine the sequence of a template. Primers are extended through the ligation with fluorescently labeled oligonucleotides. The high specificity of the ligase ensures that only oligonucleotides matching the downstream sequence will be incorporated; and by encoding different oligonucleotides with different fluorophores, the sequence can be determined.
It is important to understand the features of a sequencing technology in terms of throughput, read length, and errors (see Table 1), because these will have a significant impact on downstream processing. For example, the relatively high frequency of homopolymer errors in pyrosequencing can impact ORF identification (Rho et al. 2010) but might still allow for reliable gene annotation, because of the comparatively long read length (Wommack et al. 2008). Conversely, the short read length of Illumina sequencing might reduce the rate of annotation of unassembled data, but the substantial throughput and data volume generated can facilitate assembly of entire draft genomes from metagenomic data (Hess et al. 2011). These considerations are also particularly relevant with new sequencing technologies coming online. These include single-molecule sequencing using zero-mode waveguide nanostructure arrays (Eid et al. 2009), which promises read lengths beyond 1,000 bp and has been shown to improve hybrid assemblies of genomes (Koren et al. 2012), as well as nanopore sequencing (Schneider and Dekker 2012), which also promises long read lengths.
One important practical aspect to consider when analyzing raw sequencing data is the quality value assigned to reads. For a long time, the quality assessment provided by the technology vendor was the only option available to data consumers. Recently, however, a vendor-independent approach to error detection and characterization has been described that derives error estimates from reads accidentally duplicated during the PCR stages (a phenomenon documented for Ion Torrent, 454, and Illumina sequencing technologies) (Trimble et al. 2012). Moreover, a significant number of publicly available metagenomic datasets contain sequence adapters (apparently because quality control is often performed on assembled sequences rather than on raw reads). Simple statistical analyses with tools such as FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) will rapidly detect most of these adapter contaminations. An important lesson for quality control is therefore that each individual dataset requires its own error profiling; relying on the general properties of the platform used is not sufficient.
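As a minimal illustration of the adapter screening mentioned above, the sketch below counts the fraction of reads containing a fragment of a known adapter. The adapter string and the match-length threshold are illustrative choices; production tools such as FastQC use curated contaminant lists and more sophisticated matching.

```python
# Minimal adapter-contamination scan over a list of reads.
# ADAPTER is a commonly cited Illumina adapter prefix, used here purely
# for illustration; substitute the adapter appropriate for your library.

ADAPTER = "AGATCGGAAGAGC"

def adapter_fraction(reads: list[str], adapter: str = ADAPTER,
                     min_match: int = 8) -> float:
    """Fraction of reads containing at least min_match leading adapter bases."""
    if not reads:
        return 0.0
    probe = adapter[:min_match]
    hits = sum(1 for read in reads if probe in read)
    return hits / len(reads)
```

A high adapter fraction is a strong hint that trimming was skipped and that downstream assembly and annotation will suffer.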
Assembly, Binning, and Classification
Assembly of shotgun sequencing data can in general follow two strategies: the overlap-layout-consensus (OLC) approach and the de Bruijn graph approach (see also: “A de novo metagenomic assembly program for shotgun DNA reads, Human Microbiome, Assembly and Analysis Software, Project”). These two strategies are employed by a number of different genome assemblers, and the topic has been reviewed recently (Miller et al. 2010). Briefly, OLC assembly involves the pairwise comparison of sequence reads and the ordering of matching pairs into an overlap graph; the overlapping sequences are then merged into a consensus sequence. Assembly with the de Bruijn strategy involves representing each sequence read as a path in a graph of all possible k-mers. Two k-mers are connected when sequence reads contain them in sequential, overlapping positions. Thus, all reads of a dataset are represented by the connections within the de Bruijn graph, and assembled contigs are generated by traversing these connections to yield a sequence of k-mers.
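The de Bruijn strategy can be made concrete with a toy graph builder and a greedy traversal that stops at branches. This is a sketch, not a production assembler: it ignores reverse complements, coverage, and error correction.

```python
# Toy de Bruijn graph: nodes are (k-1)-mers, edges connect consecutive
# overlapping k-mers observed in the reads.

from collections import defaultdict

def build_graph(reads: list[str], k: int) -> dict:
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def extend_contig(graph: dict, start: str) -> str:
    """Greedily walk unambiguous edges; stop at a branch, dead end, or cycle."""
    contig, node = start, start
    seen = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch point ("bubble")
        node = successors.pop()
        if node in seen:
            break  # cycle detected
        seen.add(node)
        contig += node[-1]
    return contig
```

Note how a single mismatching read would add a second successor to some node and terminate the walk early, which is precisely why strain heterogeneity fragments metagenome assemblies.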
OLC assembly has the advantage that the pairwise comparison can allow for a defined degree of dissimilarity between reads. This can compensate for sequencing errors and allows for the assembly of reads from heterogeneous populations (Tyson et al. 2004). However, the number of pairwise comparisons, and hence the memory requirement, grows quadratically with the number of reads in the dataset; OLC assemblers therefore often cannot deal with very large datasets (e.g., Illumina data). Nevertheless, several OLC assemblers, including the Celera Assembler (Miller et al. 2008), Phrap (de la Bastide and McCombie 2007), and Newbler (Roche), have been used to assemble partial or complete draft genomes from metagenomic data; see, for example, Tyson et al. (2004), Liu et al. (2011), and Brown et al. (2012).
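The overlap step of OLC assembly, including the tolerance for dissimilarity described above, can be sketched as follows. The minimum overlap length and mismatch allowance are illustrative parameters, not values used by any particular assembler.

```python
# Suffix-prefix overlap detection with a bounded number of substitutions,
# the core operation of the "overlap" phase in OLC assembly.

def best_overlap(a: str, b: str, min_len: int = 4, max_mismatch: int = 1) -> int:
    """Length of the longest suffix of a that aligns to a prefix of b
    with at most max_mismatch substitutions; 0 if none reaches min_len."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= max_mismatch:
            return length
    return 0
```

Running this for every read pair is what makes OLC memory- and compute-hungry: with n reads there are on the order of n squared pairs to test, which is manageable for thousands of long reads but not for billions of short ones.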
In contrast, the memory requirements of de Bruijn assemblers are largely determined by the k-mer size chosen to define the graph. Thus, these assemblers have been used successfully with large numbers of short reads. Initially, de Bruijn assemblers designed for clonal genomes, such as Velvet (Zerbino and Birney 2008), SOAP (Li et al. 2008), and ABySS (Simpson et al. 2009), were used to assemble metagenomic data. Because of the heterogeneous nature of microbial populations, however, assemblies often ended up fragmented. One reason is that every positional difference between two reads from the same region of two closely related genomes creates a “bubble” in the graph; another is that sequencing errors in low-abundance reads cause terminating branches. Traversing such a highly branched graph yields a large number of short contigs. These problems have been partially overcome by modifications of existing de Bruijn assemblers, such as MetaVelvet (Namiki et al. 2012), or by newly designed de Bruijn-based algorithms, such as Meta-IDBA (Peng et al. 2011; see also “Meta-IDBA, overview”). Conceptually, these solutions often involve identifying subgraphs that correspond to individual genomes or using k-mer abundance information to find an optimal path through the graph.
These subdividing approaches are analogous to binning metagenomic reads or contigs in order to identify groups of sequences that define a specific genome. These bins, or even individual sequence reads, can also be taxonomically classified by comparison with known reference sequences. Binning and classification of sequences can be based on phylogeny, similarity, or composition (or combinations thereof), and a large number of algorithms and software packages are available. For recent comparisons and benchmarking of binning and classification software, see Bazinet and Cummings (2012) and Droge and McHardy (2012). Obviously, care has to be taken with any automated process, since unrelated sequences can be combined into chimeric bins or classes. It is thus advisable that any binning or classification strategy be thoroughly tested through appropriate in vitro and in silico simulations (Mavromatis et al. 2007; Morgan et al. 2010; McElroy et al. 2012). In addition, manual curation of contigs and iterative assembly and mapping can produce improved genomes from metagenomic data (Dutilh et al. 2009). Through such carefully designed strategies and refined processes, nearly complete genomes can be assembled from large numbers of short reads, even for low-abundance organisms (Iverson et al. 2012).
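Composition-based binning commonly rests on tetranucleotide frequency profiles. The sketch below computes such profiles and a simple distance between them; the Manhattan distance and any grouping threshold are illustrative choices, not those of any published binner.

```python
# Tetranucleotide frequency profiles for composition-based binning.
# Contigs from the same genome tend to have similar profiles.

from collections import Counter
from itertools import product

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def tetra_profile(seq: str) -> list[float]:
    """Normalized frequency of each of the 256 tetramers in seq."""
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = max(sum(counts[t] for t in TETRAMERS), 1)
    return [counts[t] / total for t in TETRAMERS]

def profile_distance(p: list[float], q: list[float]) -> float:
    """Manhattan distance between two frequency profiles."""
    return sum(abs(x - y) for x, y in zip(p, q))
```

In practice such profiles are only informative for sequences of several kilobases or more; for short reads, composition signals are too noisy, which is one reason binning is usually applied to contigs.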
Annotation
Initially, techniques developed for annotating clonal genomes were applied to metagenomic data, and several tools for metagenomic analysis, such as MG-RAST (Meyer et al. 2008) and IMG/M (Markowitz et al. 2008), were derived from existing software suites. For metagenomic projects, the principal challenges lie in the size of the dataset, the heterogeneity of the data, and the fact that sequences are frequently short, even if assembled prior to analysis.
The first step of the analysis (after extensive quality control; see above) is the identification of genes in a DNA sequence. Fundamentally, two approaches exist: the extrinsic approach, which relies on similarity comparison of an unknown sequence to existing databases, and the intrinsic (or de novo) approach, which applies statistical analysis of sequence properties, such as codon usage, to define likely open reading frames (ORFs). For metagenomic data, the extrinsic approach (e.g., running a similarity search with BLASTX) comes at a significant computational cost (Wilkening et al. 2009), rendering it less attractive. De novo approaches based on codon or nucleotide k-mer usage are thus more promising for large datasets. However, de novo gene-calling software for microbial genomes is typically trained on long contigs and assumes clonal genomes; this is often unsuitable for metagenomic datasets, where training data are lacking and multiple different codon usage (or k-mer) profiles are present, reflecting the many different genomes in the sample.
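A minimal intrinsic gene scan can be sketched as follows. This toy version only reports forward-strand ORFs between a start and a stop codon; real gene callers additionally score codon usage, handle the reverse strand, and (as discussed below for FragGeneScan) model sequencing errors and genes truncated at read edges.

```python
# Naive ORF scan: report (start, end) coordinates of ATG..stop stretches
# in the three forward reading frames.

STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 10) -> list[tuple[int, int]]:
    """Forward-strand ORFs with at least min_codons codons (incl. start/stop)."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```

A single-base indel (the dominant pyrosequencing error mode) shifts the frame and typically destroys or truncates such an ORF, which illustrates why error-aware gene callers outperform naive scans on metagenomic reads.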
However, several software packages have been designed to predict genes for short fragments or even reads (see Trimble et al. 2012 for a review). The most important finding of that review is the effect of errors on gene prediction performance, reducing the reading frame accuracy of most tools to well below 20 % at 3 % sequencing error. Only the software FragGeneScan (Rho et al. 2010; see also FragGeneScan, overview) accounted for the possibility that metagenomic sequences may contain errors, thus allowing it to clearly outperform its competitors.
Once identified, protein-coding genes require functional assignment. Here again, numerous tools and databases exist. Many researchers have found that performing BLAST analysis against the NCBI nonredundant database adds little value to their metagenomic datasets. Preferable are databases that contain high-level groupings of functions, for example, into metabolic pathways as in KEGG (Kanehisa 2002) or into subsystems as in SEED (Overbeek et al. 2005). Using such higher-level groupings allows for the generation of overviews and comparison between samples after statistical normalization.
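The normalization step mentioned above can be as simple as converting per-category hit counts into relative abundances so that samples of different sequencing depth become comparable. The category names below are hypothetical placeholders:

```python
# Convert raw functional-category counts for one sample into relative
# abundances (fractions summing to 1), a common pre-step before comparing
# samples of unequal sequencing depth.

def normalize(counts: dict[str, int]) -> dict[str, float]:
    """Scale a sample's category counts so they sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}
```

More rigorous comparisons would add statistical testing on these proportions (e.g., accounting for sampling variance), but simple relative abundance already enables the cross-sample overviews described above.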
The time and resources required to perform functional annotations are substantial, but approaches that project multiple results derived from a single sequence analysis into multiple namespaces can minimize these computational costs (Wilke et al. 2012). Numerous tools are also available to predict, for example, short RNAs and/or other genomic features, but these tools are frequently less useful for large metagenomic datasets that exhibit both low sequence quality and short reads.
Several integrated services package annotation functionality into a single website. The CAMERA (Seshadri et al. 2007) website, for example, provides users with the ability to run a number of pipelines on metagenomic data (see “Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis”). The Joint Genome Institute’s IMG/M web service also provides analysis for assembled metagenomic data and has been used so far for over 300 metagenomic datasets. The European Bioinformatics Institute provides a service aimed at smaller, typically 454/pyrosequencing-derived metagenomes. The most popular service is the MG-RAST system (Meyer et al. 2008), used for over 50,000 metagenomes comprising over 140 billion base pairs of data. The system offers comprehensive quality control, tools for comparison of datasets, and data import and export tools to, for example, QIIME (Caporaso et al. 2010) using standard formats such as BIOM (McDonald et al. 2012).
Metadata, Standards, Sharing, and Storage
With over 50,000 metagenomes available, the scientific community has realized that standardized metadata (“data about data”) and higher-level classification (e.g., a controlled vocabulary) will increase the usefulness of datasets for novel discoveries (see also Metagenomics, Metadata and MetaAnalysis). Through the efforts of the Genomic Standards Consortium (GSC) (Field et al. 2011), a set of minimal questionnaires has been developed and accepted by the community (Yilmaz et al. 2010) that allows effective communication of metadata for metagenomic samples of diverse types. While the “required” GSC metadata is purposefully minimal and thus provides only a rough description, several domain-specific environmental packages exist that contain more detailed information.
As the standards evolve to match the needs of the scientific community, the groups developing software and analysis services have begun to rely on the presence of GSC-compliant metadata, effectively turning them into essential data for any metagenome project. Furthermore, comparative analysis of metagenomic datasets is becoming a routine practice, and acquiring metadata for these comparisons has become a requirement for publication in several scientific journals. Since reanalysis of raw sequence reads is often computationally too costly, the sharing of analysis results is also advisable. Currently only the IMG/M and MG-RAST platforms are designed to provide cross-sample comparisons without the need to recompute analysis results. In the MG-RAST system, moreover, users can share data (after providing metadata) with other users or make data publicly available.
Metagenomic datasets continue to grow in size. Indeed, the first metagenomes of several hundred gigabase pairs already exist. Therefore, storage and curation of metagenomic data have become a central theme. The on-disk representation of raw data and analyses has led to massive storage issues for groups attempting meta-analyses. Currently there is no solution for accessing relevant subsets of data (e.g., only reads and analyses pertaining to a specific phylum or species) without downloading the entire dataset. Cloud technologies may in the future provide attractive solutions for storage and computing problems. However, specific and metadata-enabled solutions are required for cloud systems to power the community-wide (re-)analysis efforts of the first 50,000 metagenomes.
Metagenomics has truly proven a valuable tool for analyzing microbial communities. Technological advances will continue to drive down the sequencing cost for metagenomic projects and, in fact, the flood of current datasets indicates that funding to obtain sequences is not a major limitation. Major bottlenecks are encountered, however, in the storage and computational processing of sequencing data. With community-wide efforts and standardized tools, the impact of these current limitations might be managed in the short term. In the long term, however, large standardized databases will be required (e.g., a MetaGeneBank) to give the entire scientific community access to this information. Every metagenomic dataset contains many new and unexpected discoveries, and the efforts of microbiologists worldwide will be needed to ensure that nothing is being missed. As for the data, whether raw or processed, it is just data. Only its biological and ecological interpretation will further our understanding of the complex and wonderful diversity of the microbial world around us.
The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a US Department of Energy Office of Science Laboratory, is operated under Contract No. DE-AC02-06CH11357. The US Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.
- de la Bastide M, McCombie WR. Assembling genomic DNA sequences with PHRAP. Curr Protoc Bioinforma. 2007; Chapter 11: Unit 11.4.
- Schneider GF, Dekker C. DNA sequencing with nanopores. Nat Biotechnol. 2012;30(4):326–8. doi:10.1038/nbt.2181.
- Wilkening J, Wilke A, et al. Using clouds for metagenomics: a case study. IEEE Cluster 2009.
- Yilmaz P, Kottmann R, et al. The “Minimum Information about an ENvironmental Sequence” (MIENS) specification. Nat Biotechnol. 2010, in press.