Fast Indexes for Gapped Pattern Matching

  • Conference paper
SOFSEM 2020: Theory and Practice of Computer Science (SOFSEM 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12011)

Abstract

We describe indexes for searching large data sets for variable-length-gapped (VLG) patterns. VLG patterns are composed of two or more subpatterns, between each adjacent pair of which is a gap-constraint specifying upper and lower bounds on the distance allowed between subpatterns. VLG patterns have numerous applications in computational biology (motif search), information retrieval (e.g., for language models, snippet generation, machine translation) and capture a useful subclass of the regular expressions commonly used in practice for searching source code. Our best approach provides search speeds several times faster than prior art across a broad range of patterns and texts.
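
As an illustration of the gap-constraint semantics described above, the following minimal Python sketch reports where a VLG pattern matches a text. It is not the index from the paper: it finds subpattern occurrences by plain scanning rather than with an index, and the names `occurrences` and `vlg_match_starts`, as well as the convention that a gap counts the characters between the end of one subpattern and the start of the next, are assumptions made for this example.

```python
from bisect import bisect_left


def occurrences(text, sub):
    """Return the sorted start positions of sub in text (overlaps included)."""
    positions = []
    i = text.find(sub)
    while i != -1:
        positions.append(i)
        i = text.find(sub, i + 1)
    return positions


def vlg_match_starts(text, subpatterns, gaps):
    """Return start positions of subpatterns[0] that extend to a full match.

    gaps[k] = (lo, hi) bounds the number of characters allowed between the
    end of subpatterns[k] and the start of subpatterns[k + 1] (an assumed
    convention for this sketch).
    """
    if len(subpatterns) < 2 or len(gaps) != len(subpatterns) - 1:
        raise ValueError("need at least two subpatterns and one gap per adjacent pair")

    # Work right to left: `valid` holds the sorted start positions of the
    # current subpattern from which the rest of the pattern can be matched.
    valid = occurrences(text, subpatterns[-1])
    for k in range(len(subpatterns) - 2, -1, -1):
        lo, hi = gaps[k]
        length = len(subpatterns[k])
        kept = []
        for p in occurrences(text, subpatterns[k]):
            first = p + length + lo  # earliest allowed start of the next subpattern
            last = p + length + hi   # latest allowed start of the next subpattern
            j = bisect_left(valid, first)
            if j < len(valid) and valid[j] <= last:
                kept.append(p)
        valid = kept
    return valid


if __name__ == "__main__":
    # "ab", then between 0 and 2 arbitrary characters, then "cd".
    print(vlg_match_starts("abxxcdxabcd", ["ab", "cd"], [(0, 2)]))
```

Processing the subpatterns from right to left keeps every intermediate list sorted, so each gap check reduces to one binary search over the positions that survived the previous step.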

This research is supported by Academy of Finland through grant 319454.

Notes

  1. It is possible that sorting alone could yield further improvements, using a more heavily engineered sort function than our hand-rolled LSD radix sort. Our point here is that sorting is an important dimension along which SA-scan can be optimized.
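
For readers unfamiliar with the technique named in this note, a least-significant-digit (LSD) radix sort over non-negative integers, such as text positions, can be sketched as follows. This is an illustrative Python version, not the authors' implementation; the function name `lsd_radix_sort` and the `bits_per_pass` parameter are assumptions made for this example.

```python
def lsd_radix_sort(values, bits_per_pass=8):
    """Sort non-negative integers in ascending order with LSD radix sort.

    Each pass is a stable counting sort on one digit of `bits_per_pass` bits,
    starting from the least significant digit.
    """
    out = list(values)
    if not out:
        return out
    if min(out) < 0:
        raise ValueError("this sketch only handles non-negative integers")

    radix = 1 << bits_per_pass
    mask = radix - 1
    max_value = max(out)
    shift = 0
    while (max_value >> shift) > 0:
        # Count how many keys fall into each digit bucket.
        counts = [0] * radix
        for v in out:
            counts[(v >> shift) & mask] += 1
        # Exclusive prefix sums turn counts into bucket start offsets.
        total = 0
        for d in range(radix):
            counts[d], total = total, total + counts[d]
        # Stable scatter into the output buffer.
        buf = [0] * len(out)
        for v in out:
            d = (v >> shift) & mask
            buf[counts[d]] = v
            counts[d] += 1
        out = buf
        shift += bits_per_pass
    return out


if __name__ == "__main__":
    print(lsd_radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
```

The number of passes depends only on the largest key and the digit width, which is why this kind of sort is attractive for large arrays of bounded integers.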

Acknowledgments

Our thanks go to Tania Starikovskaya for suggesting to us the problem of indexing for regular-expression matching. We also thank Matthias Petri and Simon Gog for prompt answers to questions about their article and code, and the anonymous reviewers for helpful comments. This work was funded by the Academy of Finland via grant 319454 and by the EU’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No. 690941 (BIRDS).

Author information

Corresponding author

Correspondence to Bella Zhukova.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Cáceres, M., Puglisi, S.J., Zhukova, B. (2020). Fast Indexes for Gapped Pattern Matching. In: Chatzigeorgiou, A., et al. SOFSEM 2020: Theory and Practice of Computer Science. SOFSEM 2020. Lecture Notes in Computer Science, vol 12011. Springer, Cham. https://doi.org/10.1007/978-3-030-38919-2_40

  • DOI: https://doi.org/10.1007/978-3-030-38919-2_40

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-38918-5

  • Online ISBN: 978-3-030-38919-2

  • eBook Packages: Computer Science, Computer Science (R0)
