Theoretical and Practical Analyses in Metagenomic Sequence Classification

  • Hend Amraoui
  • Mourad Elloumi
  • Francesco Marcelloni
  • Faouzi Mhamdi
  • Davide Verzotto (corresponding author)
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1062)


Metagenomics is the study of genomic sequences in a heterogeneous microbial sample taken, e.g., from soil, water, or the human microbiome and skin. One of the primary objectives of metagenomic studies is to assign a taxonomic identity to each read sequenced from a sample and then to estimate the abundance of the known clades. With ever-larger metagenomic datasets now readily available from high-throughput sequencing technologies, several fast and accurate methods have been developed that work with reasonable computing requirements. Here we provide an overview of the state-of-the-art methods for the classification of metagenomic sequences, highlighting in particular the theoretical factors that seem to correlate well with practical factors and could therefore be useful when choosing or developing a new method in experimental contexts. In particular, we emphasize that the information derived from the known genomes and ultimately used in the learning and classification processes may create several experimental issues, mostly related to the amount of information used in these processes and to its uniqueness, significance, and redundancy. Some of these issues are intrinsic to both current alignment-based approaches and compositional ones. This entails the need to develop efficient alignment-free methods that overcome such problems by combining the learning and classification processes in a single framework.
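As an illustration only, the two objectives named above (assigning each read to a clade, then estimating clade abundance) and the role of uniqueness in the reference information can be sketched with a toy alignment-free classifier. This is a minimal sketch under stated assumptions, not the method of this paper or of any existing tool: the function names, the toy reference sequences, the reads, and the choice k = 4 are all hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_unique_index(references, k):
    """Learning step: keep only k-mers that occur in exactly one clade."""
    owners = defaultdict(set)
    for clade, genome in references.items():
        for km in kmers(genome, k):
            owners[km].add(clade)
    # k-mers shared by several clades are redundant for discrimination
    return {km: next(iter(c)) for km, c in owners.items() if len(c) == 1}


def classify(read, index, k):
    """Classification step: majority vote over clade-unique k-mers."""
    votes = Counter(index[km] for km in kmers(read, k) if km in index)
    if not votes:
        return None  # no unique k-mer matched: read stays unclassified
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between clades: read stays unclassified
    return ranked[0][0]


def abundance(reads, index, k):
    """Count classified reads per clade (unclassified reads are skipped)."""
    labels = (classify(r, index, k) for r in reads)
    return Counter(label for label in labels if label is not None)


# Hypothetical toy data, chosen only to exercise the three outcomes.
K = 4
REFERENCES = {"cladeA": "ACGTACGTTTGACC", "cladeB": "TTGACCGGATCCAA"}
READS = ["ACGTACG", "GGATCC", "TTGACC"]

INDEX = build_unique_index(REFERENCES, K)
COUNTS = abundance(READS, INDEX, K)
```

In this toy setting the third read consists only of k-mers shared by both references, so it carries no discriminative information and is left unclassified. Real classifiers differ widely in how they handle such shared or redundant information.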


Keywords: Metagenomic sequence classification; Alignment-free algorithms; Genome analysis; Combinatorics; Pattern discovery; Strings



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Hend Amraoui (1, 2, 3)
  • Mourad Elloumi (3)
  • Francesco Marcelloni (1)
  • Faouzi Mhamdi (3)
  • Davide Verzotto (1, 4, 5), corresponding author

  1. Department of Information Engineering, University of Pisa, Pisa, Italy
  2. University of Tunis El Manar, Tunis, Tunisia
  3. Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE), National Higher School of Engineers of Tunis (ENSIT), University of Tunis, Tunis, Tunisia
  4. Institute for Informatics and Telematics, CNR, Pisa, Italy
  5. Euro-Mediterranean Biomedical Scientific Institute (ISBEM), Pisa and Mesagne, Italy
