A Boyer-Moore Type Algorithm for Compressed Pattern Matching

  • Conference paper
Combinatorial Pattern Matching (CPM 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1848)

Abstract

We apply the Boyer-Moore technique to compressed pattern matching for text strings described in terms of collage systems, a formal framework that captures various dictionary-based compression methods. For the subclass of collage systems that contain no truncation, our new algorithm runs in $O(\|D\| + n \cdot m + m^2 + r)$ time using $O(\|D\| + m^2)$ space, where $\|D\|$ is the size of the dictionary $D$, $n$ is the compressed text length, $m$ is the pattern length, and $r$ is the number of pattern occurrences. For a general collage system, the time complexity is $O(\mathrm{height}(D) \cdot (\|D\| + n) + n \cdot m + m^2 + r)$, where $\mathrm{height}(D)$ is the maximum dependency of tokens in $D$. We show that the algorithm specialized for byte pair encoding (BPE) is very fast in practice: it runs about 1.2 to 3.0 times faster than the exact-match routine of the software package agrep, which is known as the fastest pattern matching tool.
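
The objects named in the abstract can be illustrated with a small sketch. It is not the algorithm of the paper: it assumes a hypothetical dictionary format in which every token is either a literal string or the concatenation of two other tokens (the truncation-free, BPE-like case), it takes the height of a literal token to be 0 (an assumed convention), and it decompresses the text before searching, which is exactly the step the paper's algorithm avoids. The search routine is the Horspool simplification of Boyer-Moore, used here only to show the shift idea on plain text. All function names are illustrative.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical dictionary format (an assumption, not the paper's notation):
# a token maps to a literal string or to a pair of tokens to concatenate.
Rule = Union[str, Tuple[int, int]]


def expand(D: Dict[int, Rule], tok: int) -> str:
    """Return the plain string that a token stands for."""
    rule = D[tok]
    if isinstance(rule, str):
        return rule
    left, right = rule
    return expand(D, left) + expand(D, right)


def height(D: Dict[int, Rule]) -> int:
    """Longest chain of token dependencies; literals count as 0 (assumed)."""
    memo: Dict[int, int] = {}

    def h(tok: int) -> int:
        if tok not in memo:
            rule = D[tok]
            if isinstance(rule, str):
                memo[tok] = 0
            else:
                memo[tok] = 1 + max(h(rule[0]), h(rule[1]))
        return memo[tok]

    return max(h(t) for t in D)


def horspool(text: str, pattern: str) -> List[int]:
    """Horspool variant of Boyer-Moore on uncompressed text."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    # Shift for a character = distance from its last occurrence in
    # pattern[:-1] to the end of the pattern; absent characters shift by m.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits: List[int] = []
    pos = 0
    while pos + m <= len(text):
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits


# Toy dictionary: 2 -> "ab", 3 -> "abab"; compressed text has n = 3 tokens.
D: Dict[int, Rule] = {0: "a", 1: "b", 2: (0, 1), 3: (2, 2)}
compressed = [3, 0, 2]
plain = "".join(expand(D, t) for t in compressed)
occurrences = horspool(plain, "aba")
```

In this toy instance the three compressed tokens expand to a seven-character text, and the pattern occurrences are reported as positions in the expanded text. The point of the paper's approach is to obtain the same occurrences while shifting over the token sequence itself.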

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Shibata, Y., Matsumoto, T., Takeda, M., Shinohara, A., Arikawa, S. (2000). A Boyer-Moore Type Algorithm for Compressed Pattern Matching. In: Giancarlo, R., Sankoff, D. (eds) Combinatorial Pattern Matching. CPM 2000. Lecture Notes in Computer Science, vol 1848. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45123-4_17

  • DOI: https://doi.org/10.1007/3-540-45123-4_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67633-1

  • Online ISBN: 978-3-540-45123-5

  • eBook Packages: Springer Book Archive
