Fast Approximation of Frequent k-mers and Applications to Metagenomics

  • Leonardo Pellegrina
  • Cinzia Pizzi
  • Fabio Vandin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11467)


Estimating the abundances of all k-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. While several methods have been designed for the exact or approximate solution of this problem, they all require processing the entire dataset, which can be extremely expensive for high-throughput sequencing datasets. In some applications it is crucial to estimate all k-mers and their abundances; in others, reporting only the frequent k-mers, those that appear with relatively high frequency in a dataset, may suffice. This is the case, for example, when computing k-mer abundance-based distances among datasets of reads, which are commonly used in metagenomic analyses.
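To make the quantities above concrete, the following is a minimal sketch of exact k-mer counting, selection of frequent k-mers, and one abundance-based distance. It assumes that the frequency of a k-mer is its count divided by the total number of k-mer positions in the dataset, and it uses the Bray-Curtis dissimilarity purely as an illustrative choice of distance; the function names and the threshold parameter `theta` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def kmer_frequencies(reads, k):
    """Exact frequencies: occurrences of each k-mer / total k-mer positions."""
    counts = Counter()
    for read in reads:
        # Slide a window of length k over the read (no positions if len(read) < k).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def frequent_kmers(freqs, theta):
    """Keep only k-mers whose frequency is at least theta (assumed threshold)."""
    return {kmer: f for kmer, f in freqs.items() if f >= theta}


def bray_curtis(freqs_a, freqs_b):
    """Bray-Curtis dissimilarity between two k-mer frequency profiles."""
    keys = set(freqs_a) | set(freqs_b)
    shared = sum(min(freqs_a.get(x, 0.0), freqs_b.get(x, 0.0)) for x in keys)
    total = sum(freqs_a.values()) + sum(freqs_b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0
```

Note that this exact approach scans every position of every read, which is the cost that motivates restricting attention to frequent k-mers.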

In this work, we develop, analyze, and test a sampling-based approach, called SAKEIMA, to approximate the frequent k-mers and their frequencies in a high-throughput sequencing dataset while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme, and we show how characterizing the VC dimension, a core concept from statistical learning theory, of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA rigorously approximates the frequent k-mers by processing only a fraction of a dataset, and that the frequencies it estimates lead to accurate estimates of k-mer-based distances between high-throughput sequencing datasets. Overall, SAKEIMA is an efficient and rigorous tool for estimating k-mer abundances that provides significant speed-ups in the analysis of large sequencing datasets.
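The general idea of estimating frequent k-mers from a sample can be sketched as follows. This is not SAKEIMA's sampling scheme, which the abstract describes only as "advanced"; it is a plain uniform sample of k-mer positions, given here under stated assumptions. The sample size `m` is a free parameter (the sketch does not compute the VC-dimension-based bound, so it carries no guarantee by itself), and the lowered reporting threshold `theta - eps / 2` is a common device in sampling-based frequent-pattern mining that is assumed here rather than taken from the paper. All function and parameter names are hypothetical.

```python
import random
from collections import Counter


def sample_kmer_frequencies(reads, k, m, seed=0):
    """Estimate k-mer frequencies from m uniformly sampled k-mer positions."""
    rng = random.Random(seed)
    # Number of k-mer positions in each read (0 if the read is shorter than k).
    weights = [max(len(read) - k + 1, 0) for read in reads]
    if sum(weights) == 0:
        return {}
    counts = Counter()
    # Picking a read with probability proportional to its number of positions,
    # then a uniform position inside it, is uniform over all positions.
    for read in rng.choices(reads, weights=weights, k=m):
        i = rng.randrange(len(read) - k + 1)
        counts[read[i:i + k]] += 1
    return {kmer: c / m for kmer, c in counts.items()}


def approx_frequent_kmers(reads, k, theta, eps, m, seed=0):
    """Report k-mers whose sampled frequency is at least theta - eps / 2."""
    estimates = sample_kmer_frequencies(reads, k, m, seed)
    return {kmer: f for kmer, f in estimates.items() if f >= theta - eps / 2}
```

Only `m` positions are inspected, regardless of dataset size, which is where the speed-up of a sampling approach comes from; the quality of the estimates depends on choosing `m` large enough.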


Keywords: k-mer analysis, Sampling algorithm, VC dimension, Metagenomics



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Leonardo Pellegrina (1)
  • Cinzia Pizzi (1)
  • Fabio Vandin (1)
  1. Department of Information Engineering, University of Padova, Padova, Italy
