Iterative Spaced Seed Hashing: Closing the Gap Between Spaced Seed Hashing and k-mer Hashing

Petrucci, Enrico; Noé, Laurent; Pizzi, Cinzia; Comin, Matteo

doi:10.1007/978-3-030-20242-2_18

Enrico Petrucci¹⁷,
Laurent Noé¹⁸,
Cinzia Pizzi¹⁷ &
…
Matteo Comin¹⁷

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 11490))

Included in the following conference series:

International Symposium on Bioinformatics Research and Applications

722 Accesses
1 Citations
1 Altmetric

Abstract

Alignment-free classification of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Much work has been done to speed-up the indexing of k-mers through hash-table and other data structures. These efforts have led to very fast indexes, but because they are k-mer based, they often lack sensitivity due to sequencing errors or polymorphisms. Spaced seeds are a special type of pattern that accounts for errors or mutations. They allow to improve the sensitivity and they are now routinely used instead of k-mers in many applications. The major drawback of spaced seeds is that they cannot be efficiently hashed and thus their usage increases substantially the computational time.

In this paper we address the problem of efficient spaced seed hashing. We propose an iterative algorithm that combines multiple spaced seed hashes by exploiting the similarity of adjacent hash values in order to efficiently compute the next hash. We report a series of experiments on HTS reads hashing, with several spaced seeds. Our algorithm can compute the hashing values of spaced seeds with a speedup of 6.2x, outperforming previous methods. Software and Datasets are available at ISSH

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Apostolico, A., Guerra, C., Landau, G.M., Pizzi, C.: Sequence similarity measures based on bounded hamming distance. Theor. Comput. Sci. 638, 76–90 (2016)
Article MathSciNet Google Scholar
Břinda, K., Sykulski, M., Kucherov, G.: Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics 31(22), 3584 (2015)
Article Google Scholar
Comin, M., Verzotto, D.: Beyond fixed-resolution alignment-free measures for mammalian enhancers sequence comparison. IEEE/ACM Trans. Comput. Biol. Bioinform. 11(4), 628–637 (2014)
Article Google Scholar
Comin, M., Leoni, A., Schimd, M.: Clustering of reads with alignment-free measures and quality values. Algorithms Mol. Biol. 10(1), 4 (2015)
Article Google Scholar
Darling, A.E., Treangen, T.J., Zhang, L., Kuiken, C., Messeguer, X., Perna, N.T.: Procrastination leads to efficient filtration for local multiple alignment. In: Bücher, P., Moret, B.M.E. (eds.) WABI 2006. LNCS, vol. 4175, pp. 126–137. Springer, Heidelberg (2006). https://doi.org/10.1007/11851561_12
Chapter Google Scholar
Girotto, S., Comin, M., Pizzi, C.: Fast spaced seed hashing. In: Proceedings of the 17th Workshop on Algorithms in Bioinformatics (WABI). Leibniz International Proceedings in Informatics, vol. 88, pp. 7:1–7:14 (2017)
Google Scholar
Girotto, S., Comin, M., Pizzi, C.: Higher recall in metagenomic sequence classification exploiting overlapping reads. BMC Genomics 18(10), 917 (2017)
Article Google Scholar
Girotto, S., Comin, M., Pizzi, C.: Metagenomic reads binning with spaced seeds. Theor. Comput. Sci. 698, 88–99 (2017)
Article MathSciNet Google Scholar
Girotto, S., Comin, M., Pizzi, C.: Efficient computation of spaced seed hashing with block indexing. BMC Bioinform. 19(15), 441 (2018)
Article Google Scholar
Girotto, S., Comin, M., Pizzi, C.: FSH: fast spaced seed hashing exploiting adjacent hashes. Algorithms Mol. Biol. 13(1), 8 (2018)
Article Google Scholar
Girotto, S., Pizzi, C., Comin, M.: MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures. Bioinformatics 32(17), i567–i575 (2016)
Article Google Scholar
Hahn, L., Leimeister, C.A., Ounit, R., Lonardi, S., Morgenstern, B.: rasbhari: Optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison. PLoS Comput. Biol. 12(10), 1–18 (2016)
Article Google Scholar
Harris, R.S.: improved pairwise alignment of genomic DNA. Ph.D. thesis, University Park, PA, USA (2007)
Google Scholar
Keich, U., Li, M., Ma, B., Tromp, J.: On spaced seeds for similarity search. Discret. Appl. Math. 138(3), 253–263 (2004)
Article MathSciNet Google Scholar
Kucherov, G., Noé, L., Roytberg, M.A.: A unifying framework for seed sensitivity and its application to subset seeds. J. Bioinform. Comput. Biol. 4(2), 553–569 (2006)
Article Google Scholar
Leimeister, C.A., Boden, M., Horwege, S., Lindner, S., Morgenstern, B.: Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics 30(14), 1991 (2014)
Article Google Scholar
Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology search. Bioinformatics 18(3), 440 (2002)
Article Google Scholar
Marchiori, D., Comin, M.: SKraken: fast and sensitive classification of short metagenomic reads based on filtering uninformative k-mers. In: Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017), pp. 59–67. INSTICC, SciTePress (2017)
Google Scholar
Noé, L., Martin, D.E.K.: A coverage criterion for spaced seeds and its applications to support vector machine string kernels and k-mer distances. J. Comput. Biol. 21(12), 947–963 (2014)
Article Google Scholar
Onodera, T., Shibuya, T.: The gapped spectrum kernel for support vector machines. In: Perner, P. (ed.) MLDM 2013. LNCS (LNAI), vol. 7988, pp. 1–15. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39712-7_1
Chapter Google Scholar
Ounit, R., Lonardi, S.: Higher classification sensitivity of short metagenomic reads with CLARK-S. Bioinformatics 32(24), 3823 (2016)
Article Google Scholar
Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A., Brudno, M.: SHRiMP: accurate mapping of short color-space reads. PLOS Comput. Biol. 5(5), 1–11 (2009)
Article Google Scholar
Wood, D.E., Salzberg, S.L.: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Information Engineering, University of Padova, Padova, Italy
Enrico Petrucci, Cinzia Pizzi & Matteo Comin
CRIStAL UMR9189, Université de Lille, Lille, France
Laurent Noé

Authors

Enrico Petrucci
View author publications
You can also search for this author in PubMed Google Scholar
Laurent Noé
View author publications
You can also search for this author in PubMed Google Scholar
Cinzia Pizzi
View author publications
You can also search for this author in PubMed Google Scholar
Matteo Comin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Cinzia Pizzi or Matteo Comin .

Editor information

Editors and Affiliations

Georgia State University, Atlanta, GA, USA
Zhipeng Cai
Georgia State University, Atlanta, GA, USA
Pavel Skums
Central South University, Changsha, China
Min Li

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Petrucci, E., Noé, L., Pizzi, C., Comin, M. (2019). Iterative Spaced Seed Hashing: Closing the Gap Between Spaced Seed Hashing and k-mer Hashing. In: Cai, Z., Skums, P., Li, M. (eds) Bioinformatics Research and Applications. ISBRA 2019. Lecture Notes in Computer Science(), vol 11490. Springer, Cham. https://doi.org/10.1007/978-3-030-20242-2_18

Download citation

DOI: https://doi.org/10.1007/978-3-030-20242-2_18
Published: 09 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20241-5
Online ISBN: 978-3-030-20242-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics