GMeta: A Novel Algorithm to Utilize Highly Connected Components for Metagenomic Binning

  • Hong Thanh Pham
  • Le Van Vinh
  • Tran Van Lang
  • Van Hoai Tran
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11814)


Metagenomic binning is the task of clustering metagenomic sequences or contigs, or of assigning them to taxa. Because metagenomic samples contain a very large number of organisms, the number of nucleotide sequences grows rapidly, which increases the complexity of binning algorithms. Unsupervised classification has gained attention in recent years, with several state-of-the-art tools released, because the reference databases required by reference-based methods are often lacking. Many unsupervised methods achieve high accuracy by exploiting the overlap information between reads. These methods rest on the observation that the average proportion of l-mers shared between genomes of different species is very small when l is sufficiently large. This paper introduces a novel algorithm that bins metagenomic sequences without requiring reference databases by utilizing highly connected components inside a weighted overlapping graph of reads. Experimental results show that precision improves over other well-known binning tools for both short and long sequences.
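The idea of grouping reads through shared l-mers can be illustrated with a minimal sketch. It is not the GMeta algorithm itself, whose details the abstract does not give. As assumptions for illustration only, the edge weight between two reads is taken to be the number of distinct l-mers they share, reverse complements are ignored, and plain connected components over edges at or above a hypothetical threshold `min_weight` stand in for the paper's highly connected components. The function names `lmers`, `build_overlap_graph`, and `components` are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lmers(read, l):
    """Return the set of all length-l substrings of a read."""
    return {read[i:i + l] for i in range(len(read) - l + 1)}


def build_overlap_graph(reads, l):
    """Map each pair of read indices (i, j), i < j, to its shared l-mer count.

    Assumption: the edge weight is the number of distinct shared l-mers.
    """
    index = defaultdict(set)  # l-mer -> indices of reads that contain it
    for i, read in enumerate(reads):
        for mer in lmers(read, l):
            index[mer].add(i)
    weights = defaultdict(int)
    for owners in index.values():
        for i, j in combinations(sorted(owners), 2):
            weights[(i, j)] += 1
    return dict(weights)


def components(n, weights, min_weight):
    """Group read indices linked by edges of weight >= min_weight.

    Simplified stand-in: ordinary connected components via union-find,
    not a true highly-connected-subgraph computation.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), w in weights.items():
        if w >= min_weight:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Toy reads: the first two overlap each other, as do the last two.
reads = ["ACGTACGGA", "GTACGGATT", "TTTTCCCCA", "TCCCCAGGG"]
weights = build_overlap_graph(reads, 4)
bins = components(len(reads), weights, min_weight=3)
```

On these toy reads with l = 4, the first pair shares four l-mers and the second pair shares three, so a threshold of 3 yields two groups, while a threshold of 4 leaves the last two reads as singletons. Real data would call for a much larger l, in line with the observation that reads from different species rarely share long l-mers.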


Metagenomic binning · Highly connected components · Weighted overlapping graph



This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2019-20-06.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Vietnam National University Ho Chi Minh City, Ho Chi Minh City, Vietnam
  2. Information Technology Office, Hoa Sen University, Ho Chi Minh City, Vietnam
  3. Faculty of Information Technology, Ho Chi Minh City University of Technology and Education, Ho Chi Minh City, Vietnam
  4. Institute of Applied Mechanics and Informatics, Vietnam Academy of Science and Technology (VAST), Hanoi, Vietnam
