Approaches in Metagenome Research: Progress and Challenges
Keywords: Microbial Community; Terminal Restriction Fragment Length Polymorphism; Metagenomic Library; Xylose Isomerase; Metagenomic Approach
Metagenomics comprises the culture-independent and DNA-based analysis of entire microbial communities and complements cultivation-based analysis of microorganisms. Metagenomic approaches allow comprehensive insights into phylogenetic and functional diversity of complex microbial consortia present in moderate as well as extreme environments on Earth. The introduction of next-generation sequencing technologies enabled cost-effective high-throughput sequencing of metagenomic DNA molecules resulting in increased resolution of microbial community analysis. In addition, screening of metagenomic libraries led to the identification of numerous novel biomolecules from various environments such as soil, seawater, or glacial ice.
The highly diverse microbial niches on Earth harbor an extraordinarily high abundance and diversity of prokaryotic and eukaryotic microorganisms. The human body is colonized by a wide variety of microbes representing all three domains of life. The entirety of these microbial cells (the human microbiome), which is often described as an additional organ, exceeds the number of human cells by at least an order of magnitude, and its genes outnumber human genes by more than 100-fold (Li et al. 2012; Weinstock 2012). Various microorganisms have also been detected in extreme environments such as hydrothermal vents, sea ice, or deep inside the Earth’s crust. For example, a phylogenetically diverse and metabolically active microbial assemblage was identified in the brine of an ice-sealed Antarctic lake (Murray et al. 2012). The microorganisms existing in this aphotic ecosystem withstand a temperature of −13 °C, anoxic conditions, and high salinity.
Currently, less than 1 % of the microorganisms on Earth are readily culturable under laboratory conditions. To investigate the large uncultured majority, different metagenomic approaches can be routinely applied. Metagenomics allows the direct study of the collective genomes present in microbial ecosystems (Handelsman 2004). This approach has significantly expanded our knowledge of microbial phylogenetic and functional diversity and enabled the discovery of numerous previously unknown biomolecules. In recent years, next-generation sequencing techniques in particular, which allow cost-effective and rapid decoding of metagenomic DNA, have been applied to analyze microbial populations. As a consequence, a number of bioinformatic tools for evaluating and comparing comprehensive high-throughput metagenomic data have been developed in the last few years.
This review gives an overview of traditional and recent metagenomic research approaches and the associated future challenges, and briefly describes related meta-omic studies.
Microbial Phylogenetic and Functional Diversity Determination
Small-subunit rRNA genes, universally distributed across prokaryotic and eukaryotic organisms, can be considered as evolutionary clocks enabling phylogenetic analysis. Most commonly, metagenome-derived 16S rRNA and 18S rRNA genes are used to phylogenetically characterize microbial communities. Furthermore, other conserved genes such as recA, rpoB, HSP70, or EF-Tu allow phylogenetic assignments (Ludwig and Klenk 2001). These genes can be investigated by applying traditional molecular approaches including fingerprinting methods such as denaturing gradient gel electrophoresis and terminal restriction fragment length polymorphism analysis or Sanger sequencing. A significant drawback of the Sanger sequencing-based analysis of microbial communities is the time-consuming and labor-intensive nature of this approach, as well as the required construction of clone libraries.
More recently, next-generation sequencing platforms were used to decode metagenomic DNA. Currently, the following next-generation sequencing technologies are available: sequencing by ligation (SOLiD, Applied Biosystems/Life Technologies), sequencing by synthesis (Solexa/Illumina), semiconductor chip sequencing (Ion Torrent/Life Technologies), pyrosequencing (454/Roche), and single-molecule sequencing (Oxford Nanopore Technologies; SMRT, Pacific Biosciences). Compared to Sanger sequencing, these cloning-independent techniques allow the generation of far more sequence data per run. Thus, microbial diversity comparisons between different environmental samples, which require replicated data and statistical analysis, as well as deep analysis of highly complex microbial community structures, are possible. Currently, tens to hundreds of thousands of partial metagenomic small-subunit rRNA gene sequences are often produced using next-generation sequencing platforms. In a recent pyrosequencing-based 16S rRNA gene survey, a total of 41,141 bacterial and 30,651 archaeal sequences were analyzed to investigate prokaryotic diversity in Yunnan and Tibetan hot springs (Song et al. 2013). To (pre-)process small-subunit rRNA gene sequence datasets, various tools, software packages, analytical web servers, and virtual instances can be used (Gonzalez and Knight 2012). The QIIME package (Caporaso et al. 2010) provides workflows to extensively analyze high-throughput amplicon-based sequence data starting with raw sequences. Nevertheless, direct sequencing of metagenomic DNA avoids the marker gene amplification bias of amplicon-based sequencing and thus allows the most exact taxonomic assessment (Simon and Daniel 2011). For further improvement of microbial diversity and abundance estimation, Kembel et al. (2012) recently introduced an approach that incorporates 16S rRNA gene copy number information.
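The idea behind using gene copy number information can be illustrated with a minimal Python sketch. It is not the method of Kembel et al. (2012), which estimates copy numbers for the taxa in a sample; it only shows the basic normalization step under the assumption that a copy number is already known for each taxon. The function name, taxon labels, read counts, and copy numbers are hypothetical.

```python
def copy_number_corrected_abundance(read_counts, copy_numbers):
    """Divide 16S read counts by gene copy number, then renormalize to 1."""
    # A taxon with many rRNA operons yields more reads per cell,
    # so its raw read count overestimates its cell abundance.
    corrected = {
        taxon: count / copy_numbers[taxon]
        for taxon, count in read_counts.items()
    }
    total = sum(corrected.values())
    return {taxon: value / total for taxon, value in corrected.items()}


# Hypothetical example: taxon_A dominates the raw reads (70 %),
# but carries seven 16S rRNA gene copies per genome.
read_counts = {"taxon_A": 700, "taxon_B": 300}
copy_numbers = {"taxon_A": 7, "taxon_B": 1}
abundances = copy_number_corrected_abundance(read_counts, copy_numbers)
```

With these made-up numbers, taxon_A accounts for 70 % of the reads but only 25 % of the corrected abundance, which shows how strongly uncorrected counts can be skewed.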
To identify the taxonomic affiliation of all sequences derived from metagenomic DNA, a process called binning can be carried out. Within binning procedures, sequences of a metagenomic dataset sharing the same taxonomic origin are “binned” (grouped). Composition-based binning is based on conserved genomic features such as dinucleotide frequencies, GC content, and synonymous codon usage, whereas similarity-based binning makes use of sequence homology. Among others, PhyloPythiaS, introduced by Patil et al. (2011), is a suitable application for composition-based binning. For similarity-based binning, searches against reference databases (e.g., National Center for Biotechnology Information databases) are typically performed using alignment tools such as BLAST+ (Camacho et al. 2009). Subsequently, BLAST results can be interpreted by applying software such as MEGAN (Huson et al. 2011).
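Two of the compositional features named above, GC content and dinucleotide frequencies, can be computed as in the following Python sketch. It only illustrates how such features are derived from a sequence; it does not reproduce PhyloPythiaS or any other binning tool, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def dinucleotide_frequencies(seq):
    """Relative frequency of each of the 16 dinucleotides (overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    keys = ["".join(pair) for pair in product("ACGT", repeat=2)]
    # Windows containing ambiguous bases (e.g., N) are ignored.
    total = sum(counts[key] for key in keys)
    return {key: (counts[key] / total if total else 0.0) for key in keys}


# Hypothetical short read; real feature vectors are computed on much longer sequences.
example_gc = gc_content("ATGC")
example_dinuc = dinucleotide_frequencies("ACGT")
```

A composition-based binner would feed feature vectors of this kind into a classifier or clustering step; that part is outside the scope of this sketch.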
Due to the often very high diversity of microbial communities, assembly of metagenome-derived sequences is challenging. In a recent metagenomic survey of honey bee gut microbiota, de novo assembly of 81,343,096 Illumina paired-end reads resulted in 54,700 scaffolds (total length, 76.6 Mb) (Engel et al. 2012). As in the study by Engel et al. (2012), single-genome assemblers with modified settings have been used for metagenome assembly. Recently, a single-genome assembler (Velvet) has been extended to enable the assembly of short metagenomic reads (Namiki et al. 2012). For simulated datasets, this new de novo assembler (MetaVelvet) generated significantly higher N50 values, a measure of contiguity commonly used to evaluate assembly quality, than the single-genome assemblers it was compared with.
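For readers unfamiliar with N50, the following Python sketch shows the common definition: the length of the shortest contig (or scaffold) in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name and the example lengths are hypothetical and not taken from the cited studies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    sorted_lengths = sorted(lengths, reverse=True)
    half_total = sum(sorted_lengths) / 2
    running = 0
    # Add contigs from longest to shortest until half the assembly is covered.
    for length in sorted_lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (total 54); 10 + 9 + 8 = 27 reaches half.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Note that N50 describes contiguity only. A higher value does not by itself show that contigs are correct, which matters for metagenomes where chimeric contigs from related organisms can occur.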
Based on assemblies or individual metagenomic sequence reads, gene prediction, annotation, and reconstruction of pathways can be carried out to assess the functional potential encoded by metagenomes. Consecutive processing of these steps is provided by a number of web-based tools like MG-RAST (Meyer et al. 2008). These tools utilize resources of reference databases such as SEED (Overbeek et al. 2005) and KEGG (Kanehisa et al. 2008) to link biological information to predicted genes. In a recent survey including metagenomic methods, the functional potential of Arctic Thaumarchaeota was investigated (Alonso Sáez et al. 2012). By analyzing a metagenome derived from a Southeast Beaufort Sea sample collected during Arctic winter, Alonso Sáez et al. (2012) identified thaumarchaeal pathways for ammonia oxidation. A number of other Thaumarchaeota are also capable of ammonia oxidation, but unexpectedly these Arctic thaumarchaeal organisms harbored a high abundance of genes involved in urea transport and degradation.
Metagenomic Biomolecule Discovery
To access the large pool of unexplored biomolecules, microbial community DNA has been extracted and metagenomic libraries have been constructed. Small-insert and large-insert metagenomic libraries can be screened to identify novel biomolecules. For the construction of small-insert libraries containing metagenomic DNA ≤ 15 kb, plasmids are appropriate vectors, whereas cosmids, fosmids, and bacterial artificial chromosomes (BACs) can be used for cloning of large metagenomic DNA molecules (cosmids and fosmids, ≤ 40 kb; BACs, 100–200 kb). Metagenomic libraries from different microbial habitats such as glacier ice, digestive tracts of animals, soil, hot springs, or seawater have already been constructed and successfully screened for novel biomolecules (see, e.g., Nacke et al. 2012). Some of these biomolecules exhibit valuable characteristics for industrial applications such as thermal stability, halotolerance, and activity under acidic or alkaline conditions. In a recent metagenomic approach, Sulaiman et al. (2012) isolated a gene encoding a novel cutinase homolog, designated LC-cutinase, with polyethylene terephthalate-degrading activity from a leaf-branch compost fosmid library. The enzyme showed higher specific polyethylene terephthalate-degrading activity than previously reported bacterial and fungal cutinases. Thus, LC-cutinase is a promising candidate for industrial applications, e.g., in the textile industry. In general, two different metagenomic screening approaches for the identification of novel biomolecules can be distinguished: function-based screening and sequence-based screening.
Principle and Variations of Function-Driven Screens
To perform function-driven screening, the construction of small-insert or large-insert metagenomic libraries is required. A broad array of different function-based screening approaches can be applied using these libraries. Phenotypic insert detection (PID) is the most frequently applied screening strategy. Metagenomic library-containing clones expressing target genes are identified based on phenotypic characteristics. This method has been applied to identify novel lipolytic genes and gene families from German forest and grassland soil samples using tributyrin as a screening substrate (Nacke et al. 2011). A total of 37 lipolytic clones, encoding novel lipases and esterases that could be assigned to five known families and two putatively new families of lipolytic enzymes, were identified by halo formation on indicator agar plates. The potential to identify entirely novel target genes is an important advantage of function-driven screening approaches. Modulated detection (MD) represents another commonly applied strategy to perform function-based screening. Here, a metagenomic library-containing host strain can grow under selective conditions only if it expresses a certain gene product. Recently, novel acid resistance genes were derived from planktonic and rhizosphere microbial communities of the Tinto River (Spain) using this strategy (Guazzaroni et al. 2013). Fifteen genes, mainly encoding putative proteins of unknown function, conferred acid resistance to the host strain Escherichia coli. Moreover, substrate-induced gene expression (SIGEX), product-induced gene expression (PIGEX), and metabolite-regulated expression (METREX) screening strategies allow the identification of target genes from metagenomic libraries (Simon and Daniel 2009). Recently, Wang et al. (2012) suggested biosensor-based genetic transducer (BGT) systems as an alternative and sensitive approach to screen for gene clusters whose expression produces small molecules that activate the employed biosensors. Nevertheless, all of these function-based screening approaches share one significant disadvantage: the dependence of target gene expression on the expression machinery of the metagenomic library host.
Principle and Variants of Sequence-Based Screening
Conserved regions of genes or proteins enable sequence-driven screening approaches. Based on these regions, degenerate primers can be designed and fragments of target genes amplified. For example, novel biphenyl dioxygenase DNA segments encoding active site residues were obtained from polychlorobiphenyl-contaminated soils using this strategy (Standfuß-Gabisch et al. 2012). After sequencing of an amplified partial target gene, the complete gene can be decoded using primer walking with extracted environmental DNA or a metagenomic library as a template. In this way, an entire xylose isomerase gene (xym1) has been derived from a soil metagenomic library (Parachin and Gorwa-Grauslund 2011). The gene product of xym1 consisted of 443 amino acids and was most similar (83 % identity) to a xylose isomerase from Sorangium cellulosum. Additionally, novel complex polyketide and nonribosomal peptide biosynthesis gene clusters, which often exceed the average insert sizes of large-insert metagenomic libraries, can be discovered by using degenerate primers and subsequent chromosome walking (Piel 2011). The potential to identify genes of interest even if they are not expressed in a metagenomic library host represents a major advantage of sequence-based screening, but only novel variants of already-known gene or protein families can be detected by this method.
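What "degenerate" means for a primer can be made concrete with a short Python sketch that expands a primer written in standard IUPAC nucleotide codes into the set of concrete oligonucleotides it represents. The function names and the example primers are hypothetical and unrelated to the primers used in the cited studies.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate_primer(primer):
    """List every concrete sequence encoded by a degenerate primer."""
    pools = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*pools)]


def degeneracy(primer):
    """Number of concrete sequences, without enumerating them."""
    result = 1
    for base in primer.upper():
        result *= len(IUPAC[base])
    return result


# Hypothetical primers: Y stands for C or T, N for any base.
variants = expand_degenerate_primer("AYG")
fold = degeneracy("GGNTAY")
```

Degeneracy grows multiplicatively with each ambiguous position, so primers are usually placed on the most conserved stretches to keep the number of variants manageable.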
Future Challenges in Metagenomic Research and Related Meta-omic Approaches
One of the major requirements for combining and comparing metagenomic studies conducted by research groups worldwide is the definition and acceptance of minimum standards in experimental design. The same applies to metatranscriptomics, metaproteomics, and metabolomics. With such standards in place, results obtained from the different meta-omic approaches can be compared and combined. Metatranscriptomics, metaproteomics, and metabolomics comprise the study of the collective gene transcripts, expressed proteins, and metabolites, respectively, generated by the microorganisms within an ecosystem (Nacke et al. 2014; Hettich et al. 2012; Patti et al. 2012). The consistent application and combination of appropriate meta-omic approaches will greatly extend our knowledge of the gene structure, diversity, activity, and responses of microbial communities at the ecosystem level. Furthermore, the rapid growth of meta-omic technologies will continuously demand progress in the field of bioinformatics. Thus, further development and linkage of meta-omic analysis tools will be important in the future. In addition, the application and improvement of culture-based methods will remain valuable for extending the number of available reference genomes to which metagenomic data can be mapped. In this context, the young discipline of single-cell genomics has the potential to play a complementary role by continuously contributing novel reference genomes.
The introduction of metagenomics allowed culture-independent analysis of microbial populations in complex ecosystems. Subsequently, other culture-independent meta-omic disciplines including metatranscriptomics, metaproteomics, and metabolomics were established. Metagenomics has provided insights into the enormous phylogenetic and functional diversity of microbial communities within various environments on Earth. The increasing number of next-generation sequencing technologies has led to a more comprehensive and cost-effective assessment of the information encoded by metagenomic DNA. Metagenomic approaches comprising the construction and screening of metagenomic libraries have resulted in the identification of previously unknown biomolecules, including biomolecules with industrially relevant characteristics.
- Guazzaroni ME, Morgante V, Mirete S, et al. Novel acid resistance genes from the metagenome of the Tinto River, an extremely acidic environment. Environ Microbiol. 2013;15:1088–1102.
- Hettich RL, Sharma R, Chourey K, et al. Microbial metaproteomics: identifying the repertoire of proteins that microorganisms use to compete and cooperate in complex environmental communities. Curr Opin Microbiol. 2012;15:373–80.
- Nacke H, Fischer C, Thürmer A, et al. Land use type significantly affects microbial gene transcription in soil. Microb Ecol. 2014;67:919–30.
- Patti GJ, Yanes O, Siuzdak G. Innovation: Metabolomics: the apogee of the omics trilogy. Nat Rev Mol Cell Biol. 2012;13:263–69.
- Song ZQ, Wang FP, Zhi XY, et al. Bacterial and archaeal diversities in Yunnan and Tibetan hot springs, China. Environ Microbiol. 2013;15:1160–75.