GMeta: A Novel Algorithm to Utilize Highly Connected Components for Metagenomic Binning
Metagenomic binning is the task of clustering metagenomic sequences or contigs, or of assigning taxonomy to them. Because metagenomic samples contain a vast number of organisms, the number of nucleotide sequences grows rapidly, which increases the complexity of binning algorithms. Unsupervised classification has gained attention in recent years, with various state-of-the-art tools released, because the reference databases required by reference-based methods are often lacking. Many unsupervised methods achieve high accuracy by exploiting the overlap information between reads. These methods rest on the observation that the average proportion of l-mers shared between genomes of different species is very small when l is sufficiently large. This paper introduces a novel algorithm that bins metagenomic sequences without reference databases by using highly connected components inside a weighted overlapping graph of reads. Experimental results show that precision improves over other well-known binning tools for both short and long sequences.
Keywords: Metagenomic binning, Highly connected components, Weighted overlapping graph
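To make the idea in the abstract concrete, here is a minimal Python sketch of a weighted overlapping graph of reads, where the weight of an edge is the number of l-mers two reads share. Everything in it is an assumption for illustration: the function names (`lmers`, `build_overlap_graph`, `bin_reads`), the `min_shared` threshold, and the grouping step are hypothetical and are not taken from GMeta. In particular, the grouping uses plain connected components of the thresholded graph as a simple stand-in for highly connected components, and reverse complements are ignored.

```python
from collections import defaultdict
from itertools import combinations


def lmers(read, l):
    """Return the set of distinct substrings of length l in a read."""
    return {read[i:i + l] for i in range(len(read) - l + 1)}


def build_overlap_graph(reads, l, min_shared=1):
    """Build a weighted graph: edge (a, b) -> number of shared l-mers.

    Assumption: edges with fewer than min_shared common l-mers are dropped.
    Reverse complements are not considered in this sketch.
    """
    # Inverted index from each l-mer to the reads that contain it.
    index = defaultdict(set)
    for rid, read in enumerate(reads):
        for mer in lmers(read, l):
            index[mer].add(rid)

    # Each l-mer contributes 1 to every pair of reads that share it.
    weights = defaultdict(int)
    for rids in index.values():
        for a, b in combinations(sorted(rids), 2):
            weights[(a, b)] += 1

    return {edge: w for edge, w in weights.items() if w >= min_shared}


def bin_reads(reads, l, min_shared=1):
    """Group reads by connected components of the thresholded graph.

    This is a simplification: plain connected components (via union-find),
    not the highly connected components named in the abstract.
    """
    edges = build_overlap_graph(reads, l, min_shared)
    parent = list(range(len(reads)))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for rid in range(len(reads)):
        groups[find(rid)].append(rid)
    return sorted(groups.values())
```

Calling `bin_reads(reads, l, min_shared)` on a list of read strings returns lists of read indices, one list per bin. Raising `min_shared` or `l` makes the graph sparser, which reflects the observation that reads from different species rarely share long l-mers.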
This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2019-20-06.