CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads

  • Sourav Chatterji
  • Ichitaro Yamazaki
  • Zhaojun Bai
  • Jonathan A. Eisen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4955)


A major hindrance to studies of microbial diversity has been that the vast majority of microbes cannot be cultured in the laboratory and thus are not amenable to traditional methods of characterization. Environmental shotgun sequencing (ESS) overcomes this hurdle by sequencing the DNA from the organisms present in a microbial community. The interpretation of this metagenomic data can be greatly facilitated by associating every sequence read with its source organism. We report the development of CompostBin, a DNA composition-based algorithm for analyzing metagenomic sequence reads and distributing them into taxon-specific bins. Unlike previous methods that seek to bin assembled contigs and often require training on known reference genomes, CompostBin has the ability to accurately bin raw sequence reads without need for assembly or training. CompostBin uses a novel weighted PCA algorithm to project the high dimensional DNA composition data into an informative lower-dimensional space, and then uses the normalized cut clustering algorithm on this filtered data set to classify sequences into taxon-specific bins. We demonstrate the algorithm’s accuracy on a variety of low to medium complexity data sets.
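The pipeline summarized above (composition features, a PCA-style projection, then normalized cut clustering) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not give the k-mer length, the PCA weighting scheme, or the affinity function, so the tetranucleotide features, the caller-supplied per-read weights, the Gaussian affinity, and all function names below are assumptions made for illustration only. The clustering step follows the standard two-way normalized cut of Shi and Malik, split at zero.

```python
import itertools
import numpy as np

K = 4  # assumed k-mer length; the abstract does not state one
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(read):
    """Return the normalized k-mer frequency vector of one read."""
    counts = np.zeros(len(KMERS))
    for i in range(len(read) - K + 1):
        j = INDEX.get(read[i:i + K])  # k-mers with ambiguous bases are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def weighted_pca(X, weights, n_components=3):
    """Project rows of X onto the top eigenvectors of a weighted covariance.

    The weights are a hypothetical per-read weighting supplied by the caller;
    uniform weights reduce this to ordinary PCA.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    centered = X - w @ X                       # subtract the weighted mean
    cov = (centered * w[:, None]).T @ centered  # weighted covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return centered @ top


def normalized_cut_bipartition(Y, sigma=None):
    """Split points into two groups with a two-way normalized cut."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    if sigma is None:
        sigma = np.sqrt(np.median(d2[d2 > 0]))  # heuristic bandwidth (assumption)
    W = np.exp(-d2 / (2 * sigma ** 2))          # Gaussian affinity (assumption)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Symmetric normalized Laplacian; its eigenvectors, rescaled by D^(-1/2),
    # solve the generalized problem (D - W) y = lambda * D * y.
    L_sym = np.eye(len(Y)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)
    y = d_inv_sqrt * vecs[:, 1]                 # second-smallest eigenvector
    return (y > 0).astype(int)


def random_read(rng, length, gc):
    """Generate a synthetic read with a given GC fraction (for testing only)."""
    p = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # order: A, C, G, T
    return "".join(rng.choice(list("ACGT"), size=length, p=p))
```

To try the sketch, build a matrix with one `kmer_frequencies` row per read, pass it to `weighted_pca` with a weight per read, and call `normalized_cut_bipartition` on the projection. Applying the bipartition recursively to each half is one way to obtain more than two bins.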


Keywords: Metagenomics; Binning; Feature Extraction; Normalized Cut; Weighted PCA; DNA Composition Metrics; Genome Signatures





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Sourav Chatterji (1)
  • Ichitaro Yamazaki (2)
  • Zhaojun Bai (2)
  • Jonathan A. Eisen (1, 3)

  1. Genome Center, UC Davis, Davis, USA
  2. Computer Science Department, UC Davis, Davis, USA
  3. The Joint Genome Institute, Walnut Creek, USA
