Encyclopedia of Metagenomics

Living Edition
Editors: Karen E. Nelson

AbundanceBin, Metagenomic Sequencing

  • Yuzhen Ye
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4614-6418-1_29-4

Keywords

Synonymous Codon Usage; Metagenomic Sequence; Lower Common Ancestor; Metagenomic Dataset
These keywords were added by machine and not by the authors.

Definition

Binning is the unsupervised clustering of metagenomic sequences into groups (bins) that correspond to an unknown set of species.

AbundanceBin is a binning tool utilizing the different abundances of the species in a community.

Introduction

Binning is one of the challenging problems in metagenomics. It has two main applications. One is studying the structure of microbial communities. The other is improving the downstream analysis of metagenomic sequences, including metagenome assembly, which has been shown to be extremely difficult: assembling reads one bin at a time significantly reduces the complexity of the metagenome assembly problem.

Composition-based methods have been the main approaches to unsupervised classification of reads. These approaches rest on the observation that genome composition features (G + C content, dinucleotide frequencies, and synonymous codon usage) vary among organisms and are generally characteristic of evolutionary lineages. Tools in this category include TETRA (Teeling et al. 2004), TACOA (Diaz et al. 2009), and MetaCluster (Leung et al. 2011). Because sequence properties vary substantially along a genome, the main limitation of composition-based approaches is that they require relatively long reads (at least 800 bp), although MetaCluster (Leung et al. 2011) has been shown to bin reads of 300 bp by employing a different distance metric (Spearman footrule distance) that reduces the effect of local variation in 4-mer frequencies.
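As an illustration of the composition signal described above, the following minimal Python sketch computes the 4-mer frequency vector of a sequence and compares two such vectors with a Spearman footrule distance (the sum of absolute rank differences). It is not the implementation used by MetaCluster or any other tool; the function names, the tie-breaking rule for ranks, and the handling of non-ACGT characters are assumptions made for this example.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k=4):
    """Relative frequency of every possible k-mer, in lexicographic order.

    Windows containing characters other than A, C, G, T are skipped
    (an assumption of this sketch).
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def ranks(values):
    """Rank of each entry (0 = largest value); ties are broken by index."""
    order = sorted(range(len(values)), key=lambda i: (-values[i], i))
    rank = [0] * len(values)
    for position, index in enumerate(order):
        rank[index] = position
    return rank


def spearman_footrule(freq_a, freq_b):
    """Sum of absolute differences between the rank vectors of two profiles."""
    return sum(abs(x - y) for x, y in zip(ranks(freq_a), ranks(freq_b)))
```

Because the distance depends only on the rank order of the k-mers, moderate fluctuations in the raw frequencies of a short read change it less than they would change a distance computed on the frequencies themselves.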

Note that a large collection of methods has also been developed to classify sequencing reads in a supervised manner. MEGAN (Huson and Mitra 2012) is a representative approach of this kind. These methods either use composition information (as in NBC, a naïve Bayes classifier for metagenomic sequence classification (Rosen et al. 2011)) or employ similarity searches of metagenomic sequences against a database of known genes/proteins (as in MEGAN) and assign metagenomic sequences to taxa accordingly, with or without using phylogeny. They also differ in the algorithms used for classification: MEGAN pioneered the lowest common ancestor (LCA) algorithm (Huson et al. 2007), MTR (Gori et al. 2011) improves on the LCA algorithm by considering multiple taxonomic ranks, and MetaPhyler (Liu et al. 2011) achieves better classification results by tuning the taxonomic classifier to each matching length, reference gene, and taxonomic level. Some tools in this category can classify only a subset of the metagenomic sequences rather than all of them. MLTreeMap (Stark et al. 2010) uses phylogenetic analysis of 31 marker genes for taxonomic distribution estimation. CARMA (Krause et al. 2008) searches for conserved Pfam domains and protein families in raw metagenomic sequences and classifies them into a higher-order taxonomy. The RDP classifier, a naïve Bayes classifier, was designed for classification of 16S rRNA genes and later extended to classification of 18S rRNA genes (Cole et al. 2009).
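The idea behind LCA-based assignment can be sketched in a few lines of Python: given the taxonomic lineages of all database hits of a read, the read is placed at the deepest taxon shared by every lineage. This is a simplified illustration, not MEGAN's code; MEGAN applies additional hit filtering that is not modeled here, and the function name and the hand-written lineages are assumptions for the example.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages.

    Each lineage is a list of taxon names ordered from root to leaf.
    Returns None if no lineages are given or they share no taxon.
    """
    if not lineages:
        return None
    lca = None
    # Walk down the ranks in parallel until the lineages disagree.
    for level in zip(*lineages):
        if all(taxon == level[0] for taxon in level):
            lca = level[0]
        else:
            break
    return lca


# Hypothetical read with hits to two different genera of the same family.
ecoli = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
         "Enterobacterales", "Enterobacteriaceae",
         "Escherichia", "Escherichia coli"]
salmonella = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
              "Enterobacterales", "Enterobacteriaceae",
              "Salmonella", "Salmonella enterica"]

print(lowest_common_ancestor([ecoli, salmonella]))  # Enterobacteriaceae
```

A read whose hits span several genera is therefore assigned conservatively to a higher rank, which is why such methods may leave some reads classified only at the family level or above.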

AbundanceBin

AbundanceBin (Wu and Ye 2011) is the first unsupervised clustering algorithm that uses the abundance of the species in a microbial community to group reads into bins. Its fundamental assumption is that reads are sampled from genomes following a Poisson process, so that the sequencing reads can be modeled as a mixture of Poisson distributions.
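Written out, a mixture of Poisson distributions has the general form below. The notation is chosen here for illustration and is not taken from the entry: x is an observed count (for example, how often a short subsequence of a read occurs in the dataset, which is an assumption of this sketch), S is the number of bins, λ_i is the Poisson mean of bin i (its abundance level), and π_i is the mixing proportion of bin i.

```latex
P(x) \;=\; \sum_{i=1}^{S} \pi_i \, \frac{\lambda_i^{x}\, e^{-\lambda_i}}{x!},
\qquad \pi_i \ge 0, \quad \sum_{i=1}^{S} \pi_i = 1 .
```

Species with clearly different abundances give components with well-separated means λ_i, which is what makes the counts informative for binning.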

An expectation–maximization (EM) algorithm is used in AbundanceBin to find the parameters of the Poisson distributions (i.e., the means), which reflect the relative abundance levels of the source species. AbundanceBin then assigns reads to bins based on the fitted Poisson distributions. For each bin, AbundanceBin estimates the genome size (or the concatenated genome size of species of the same or very similar abundances) and the coverage (which reflects the abundances of species) in an unsupervised manner, without requiring prior knowledge of the structure of the microbial community. The EM algorithm needs an important parameter, the number of bins, which is unknown for most metagenomic projects. AbundanceBin solves this problem with a recursive binning approach that determines the total number of bins automatically. The approach first separates a dataset into two bins and then proceeds by further splitting each bin. The recursion continues only if (1) the predicted abundance values of the two bins differ significantly; (2) the predicted genome sizes are larger than a certain threshold; and (3) the number of reads associated with each bin is larger than a certain threshold proportion of the total number of reads classified in the parent bin.
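The fitting and splitting steps described above can be sketched as follows. This is a generic EM for a Poisson mixture on a list of counts plus a toy version of the three stopping criteria, not the AbundanceBin source code. The function names, the use of an abundance ratio to decide whether two bins "differ significantly", and all threshold values (`min_ratio`, `min_size`, `min_fraction`) are placeholders assumed for this example.

```python
import math


def poisson_log_pmf(x, lam):
    """Log probability of count x under a Poisson distribution with mean lam."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def e_step(counts, weights, lambdas):
    """Posterior probability of each component for each count."""
    resp = []
    for x in counts:
        logs = [math.log(w) + poisson_log_pmf(x, lam)
                for w, lam in zip(weights, lambdas)]
        top = max(logs)  # subtract the maximum for numerical stability
        probs = [math.exp(v - top) for v in logs]
        total = sum(probs)
        resp.append([p / total for p in probs])
    return resp


def fit_poisson_mixture(counts, initial_lambdas, iterations=100):
    """Fit mixture weights and Poisson means by EM."""
    lambdas = list(initial_lambdas)
    weights = [1.0 / len(lambdas)] * len(lambdas)
    tiny = 1e-12  # keeps weights and means strictly positive
    for _ in range(iterations):
        resp = e_step(counts, weights, lambdas)
        # M-step: re-estimate each weight and mean from the soft assignments.
        for j in range(len(lambdas)):
            mass = max(sum(r[j] for r in resp), tiny)
            weights[j] = max(mass / len(counts), tiny)
            weighted_sum = sum(r[j] * x for r, x in zip(resp, counts))
            lambdas[j] = max(weighted_sum / mass, tiny)
    return weights, lambdas, e_step(counts, weights, lambdas)


def assign_bins(resp):
    """Assign each count to its most probable component."""
    return [max(range(len(r)), key=lambda j: r[j]) for r in resp]


def should_split(lam_a, lam_b, size_a, size_b, reads_a, reads_b,
                 min_ratio=2.0, min_size=400_000, min_fraction=0.05):
    """Toy version of the three recursion criteria (thresholds are placeholders)."""
    high, low = max(lam_a, lam_b), min(lam_a, lam_b)
    abundances_differ = high >= min_ratio * low          # criterion (1)
    sizes_ok = min(size_a, size_b) >= min_size           # criterion (2)
    total = reads_a + reads_b
    fractions_ok = total > 0 and min(reads_a, reads_b) / total >= min_fraction  # (3)
    return abundances_differ and sizes_ok and fractions_ok
```

In a recursive scheme, `fit_poisson_mixture` would be called with two components on the counts of a parent bin, and the two child bins would be kept and split further only when `should_split` accepts them; otherwise the parent bin is reported as is.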

AbundanceBin achieves accurate classification of even very short sequences sampled from species with different abundance levels, as tested on simulated and real metagenomic datasets. The software is available for download at http://omics.informatics.indiana.edu/AbundanceBin.

Integrated Binning Methods

MetaCluster 3.0 is an integrated binning method based on an unsupervised top–down separation and bottom–up merging strategy, which can bin metagenomic fragments of species whose abundance ratios range from very balanced to very different (Leung et al. 2011). MetaCluster 4.0 further improves the binning algorithm and can handle datasets with a large number of species (e.g., 100 species) (Wang et al. 2012). MetaCluster is available for download at http://i.cs.hku.hk/~alse/MetaCluster/.

Joint Analysis of Multiple Metagenomic Samples

Baran and Halperin proposed an abundance-based (also termed coverage-based) binning algorithm, MultBin, that operates on multiple samples of the same environment simultaneously, assuming that the different samples contain the same microbial species, possibly in different proportions (Baran and Halperin 2012). MultBin employs a k-medoids clustering algorithm to cluster reads according to their coverage across the samples. Tests of MultBin on simulated metagenomic datasets show that integrating information across multiple samples yields more precise binning on each of the samples.
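To make the clustering step concrete, here is a minimal k-medoids sketch in Python that groups items by their coverage profile across samples (one coverage value per sample). It is a generic alternating k-medoids, not MultBin's implementation; the Manhattan distance, the function names, and the requirement that the caller supply the initial medoids are assumptions of this example.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two coverage profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign(points, medoids):
    """Label each point with the position of its nearest medoid."""
    return [min(range(len(medoids)),
                key=lambda c: manhattan(p, points[medoids[c]]))
            for p in points]


def k_medoids(points, initial_medoids, max_iter=50):
    """Cluster points around medoids; returns (medoid_indices, labels)."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(len(medoids)):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the old medoid if a cluster lost all its members.
                new_medoids.append(medoids[c])
                continue
            # The new medoid is the member closest to all other members.
            best = min(members,
                       key=lambda i: sum(manhattan(points[i], points[j])
                                         for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)


# Hypothetical coverage of six reads in two samples (sample 1, sample 2).
coverage = [(10, 1), (11, 2), (9, 1), (1, 10), (2, 12), (1, 9)]
medoids, labels = k_medoids(coverage, initial_medoids=[0, 3])
```

In this toy input, the first three reads are abundant in sample 1 and rare in sample 2, while the last three show the opposite pattern, so their coverage profiles separate them even though coverage in a single sample alone would be less decisive.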

Summary

Abundance-based (or coverage-based) binning approaches achieve accurate binning even for extremely short reads, provided that the species differ in abundance. Composition-based approaches cannot match this ability, because the composition of short reads varies too much. Approaches that integrate abundance and composition information and approaches that utilize multiple samples have shown promising binning results.

References

  1. Baran Y, Halperin E. Joint analysis of multiple metagenomic samples. PLoS Comput Biol. 2012;8(2):e1002373.
  2. Cole JR, Wang Q, Cardenas E, et al. The ribosomal database project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009;37(Database issue):D141–5.
  3. Diaz NN, Krause L, Goesmann A, et al. TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics. 2009;10:56.
  4. Gori F, Folino G, Jetten MS, et al. MTR: taxonomic annotation of short metagenomic reads using clustering at multiple taxonomic ranks. Bioinformatics. 2011;27(2):196–203.
  5. Huson DH, Mitra S. Introduction to the analysis of environmental sequences: metagenomics with MEGAN. Methods Mol Biol. 2012;856:415–29.
  6. Huson DH, Auch AF, Qi J, et al. MEGAN analysis of metagenomic data. Genome Res. 2007;17(3):377–86.
  7. Krause L, Diaz NN, Goesmann A, et al. Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res. 2008;36(7):2230–9.
  8. Leung HC, Yiu SM, Yang B, et al. A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics. 2011;27(11):1489–95.
  9. Liu B, Gibbons T, Ghodsi M, et al. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics. 2011;12 Suppl 2:S4.
  10. Rosen GL, Reichenberger ER, Rosenfeld AM. NBC: the naive bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics. 2011;27(1):127–9.
  11. Stark M, Berger SA, Stamatakis A, et al. MLTreeMap - accurate maximum likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies. BMC Genomics. 2010;11:461.
  12. Teeling H, Waldmann J, Lombardot T, et al. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics. 2004;5:163.
  13. Wang Y, Leung HC, Yiu SM, et al. MetaCluster 4.0: a novel binning algorithm for NGS reads and huge number of species. J Comput Biol. 2012;19(2):241–9.
  14. Wu YW, Ye Y. A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J Comput Biol. 2011;18(3):523–34.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Indiana University, School of Informatics and Computing, Bloomington, USA