Skip to main content

Iterative Spaced Seed Hashing: Closing the Gap Between Spaced Seed Hashing and k-mer Hashing

  • Conference paper
  • First Online:
Bioinformatics Research and Applications (ISBRA 2019)

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 11490))

Included in the following conference series:

Abstract

Alignment-free classification of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Much work has been done to speed-up the indexing of k-mers through hash-table and other data structures. These efforts have led to very fast indexes, but because they are k-mer based, they often lack sensitivity due to sequencing errors or polymorphisms. Spaced seeds are a special type of pattern that accounts for errors or mutations. They allow to improve the sensitivity and they are now routinely used instead of k-mers in many applications. The major drawback of spaced seeds is that they cannot be efficiently hashed and thus their usage increases substantially the computational time.

In this paper we address the problem of efficient spaced seed hashing. We propose an iterative algorithm that combines multiple spaced seed hashes by exploiting the similarity of adjacent hash values in order to efficiently compute the next hash. We report a series of experiments on HTS reads hashing, with several spaced seeds. Our algorithm can compute the hashing values of spaced seeds with a speedup of 6.2x, outperforming previous methods. Software and Datasets are available at ISSH

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Apostolico, A., Guerra, C., Landau, G.M., Pizzi, C.: Sequence similarity measures based on bounded hamming distance. Theor. Comput. Sci. 638, 76–90 (2016)

    Article  MathSciNet  Google Scholar 

  2. Břinda, K., Sykulski, M., Kucherov, G.: Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics 31(22), 3584 (2015)

    Article  Google Scholar 

  3. Comin, M., Verzotto, D.: Beyond fixed-resolution alignment-free measures for mammalian enhancers sequence comparison. IEEE/ACM Trans. Comput. Biol. Bioinform. 11(4), 628–637 (2014)

    Article  Google Scholar 

  4. Comin, M., Leoni, A., Schimd, M.: Clustering of reads with alignment-free measures and quality values. Algorithms Mol. Biol. 10(1), 4 (2015)

    Article  Google Scholar 

  5. Darling, A.E., Treangen, T.J., Zhang, L., Kuiken, C., Messeguer, X., Perna, N.T.: Procrastination leads to efficient filtration for local multiple alignment. In: Bücher, P., Moret, B.M.E. (eds.) WABI 2006. LNCS, vol. 4175, pp. 126–137. Springer, Heidelberg (2006). https://doi.org/10.1007/11851561_12

    Chapter  Google Scholar 

  6. Girotto, S., Comin, M., Pizzi, C.: Fast spaced seed hashing. In: Proceedings of the 17th Workshop on Algorithms in Bioinformatics (WABI). Leibniz International Proceedings in Informatics, vol. 88, pp. 7:1–7:14 (2017)

    Google Scholar 

  7. Girotto, S., Comin, M., Pizzi, C.: Higher recall in metagenomic sequence classification exploiting overlapping reads. BMC Genomics 18(10), 917 (2017)

    Article  Google Scholar 

  8. Girotto, S., Comin, M., Pizzi, C.: Metagenomic reads binning with spaced seeds. Theor. Comput. Sci. 698, 88–99 (2017)

    Article  MathSciNet  Google Scholar 

  9. Girotto, S., Comin, M., Pizzi, C.: Efficient computation of spaced seed hashing with block indexing. BMC Bioinform. 19(15), 441 (2018)

    Article  Google Scholar 

  10. Girotto, S., Comin, M., Pizzi, C.: FSH: fast spaced seed hashing exploiting adjacent hashes. Algorithms Mol. Biol. 13(1), 8 (2018)

    Article  Google Scholar 

  11. Girotto, S., Pizzi, C., Comin, M.: MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures. Bioinformatics 32(17), i567–i575 (2016)

    Article  Google Scholar 

  12. Hahn, L., Leimeister, C.A., Ounit, R., Lonardi, S., Morgenstern, B.: rasbhari: Optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison. PLoS Comput. Biol. 12(10), 1–18 (2016)

    Article  Google Scholar 

  13. Harris, R.S.: improved pairwise alignment of genomic DNA. Ph.D. thesis, University Park, PA, USA (2007)

    Google Scholar 

  14. Keich, U., Li, M., Ma, B., Tromp, J.: On spaced seeds for similarity search. Discret. Appl. Math. 138(3), 253–263 (2004)

    Article  MathSciNet  Google Scholar 

  15. Kucherov, G., Noé, L., Roytberg, M.A.: A unifying framework for seed sensitivity and its application to subset seeds. J. Bioinform. Comput. Biol. 4(2), 553–569 (2006)

    Article  Google Scholar 

  16. Leimeister, C.A., Boden, M., Horwege, S., Lindner, S., Morgenstern, B.: Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics 30(14), 1991 (2014)

    Article  Google Scholar 

  17. Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology search. Bioinformatics 18(3), 440 (2002)

    Article  Google Scholar 

  18. Marchiori, D., Comin, M.: SKraken: fast and sensitive classification of short metagenomic reads based on filtering uninformative k-mers. In: Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017), pp. 59–67. INSTICC, SciTePress (2017)

    Google Scholar 

  19. Noé, L., Martin, D.E.K.: A coverage criterion for spaced seeds and its applications to support vector machine string kernels and k-mer distances. J. Comput. Biol. 21(12), 947–963 (2014)

    Article  Google Scholar 

  20. Onodera, T., Shibuya, T.: The gapped spectrum kernel for support vector machines. In: Perner, P. (ed.) MLDM 2013. LNCS (LNAI), vol. 7988, pp. 1–15. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39712-7_1

    Chapter  Google Scholar 

  21. Ounit, R., Lonardi, S.: Higher classification sensitivity of short metagenomic reads with CLARK-S. Bioinformatics 32(24), 3823 (2016)

    Article  Google Scholar 

  22. Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A., Brudno, M.: SHRiMP: accurate mapping of short color-space reads. PLOS Comput. Biol. 5(5), 1–11 (2009)

    Article  Google Scholar 

  23. Wood, D.E., Salzberg, S.L.: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Cinzia Pizzi or Matteo Comin .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Petrucci, E., Noé, L., Pizzi, C., Comin, M. (2019). Iterative Spaced Seed Hashing: Closing the Gap Between Spaced Seed Hashing and k-mer Hashing. In: Cai, Z., Skums, P., Li, M. (eds) Bioinformatics Research and Applications. ISBRA 2019. Lecture Notes in Computer Science(), vol 11490. Springer, Cham. https://doi.org/10.1007/978-3-030-20242-2_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-20242-2_18

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20241-5

  • Online ISBN: 978-3-030-20242-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics