A Boyer—Moore Type Algorithm for Compressed Pattern Matching

Shibata, Yusuke; Matsumoto, Tetsuya; Takeda, Masayuki; Shinohara, Ayumi; Arikawa, Setsuo

doi:10.1007/3-540-45123-4_17

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1848)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

We apply the Boyer-Moore technique to compressed pattern matching for text strings described in terms of a collage system, a formal framework that captures various dictionary-based compression methods. For the subclass of collage systems that contain no truncation, our new algorithm runs in O(‖D‖ + n·m + m² + r) time using O(‖D‖ + m²) space, where ‖D‖ is the size of the dictionary D, n is the compressed text length, m is the pattern length, and r is the number of pattern occurrences. For a general collage system, the time complexity is O(height(D)·(‖D‖ + n) + n·m + m² + r), where height(D) is the maximum dependency of tokens in D. We show that the algorithm specialized for so-called byte pair encoding (BPE) is very fast in practice: it runs about 1.2 to 3.0 times faster than the exact-match routine of the software package agrep, known as the fastest pattern matching tool.
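
To make the quantities in the abstract concrete, the following is a minimal Python sketch of a toy byte pair encoding, not the authors' algorithm or implementation. It assumes a simplified scheme in which new tokens are numbered from 256 upward (practical BPE reuses unused byte values), and the names `bpe_compress`, `expand`, and `nesting_depth` are hypothetical. Each rule defines a token as the concatenation of two earlier tokens, so the dictionary involves no truncation; the number of rules is a rough stand-in for ‖D‖, the length of the token sequence for n, and the deepest rule nesting is a rough analogue of height(D).

```python
from collections import Counter


def bpe_compress(text: bytes, max_rules: int = 10):
    """Toy BPE: repeatedly replace the most frequent adjacent pair with a fresh token."""
    seq = list(text)      # tokens 0..255 stand for literal bytes
    rules = {}            # new token -> (left token, right token)
    next_token = 256      # assumption: fresh tokens start above the byte range
    for _ in range(max_rules):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:     # no pair repeats, so another rule would not help
            break
        rules[next_token] = pair
        out, i = [], 0
        while i < len(seq):
            # Greedy left-to-right replacement of non-overlapping occurrences.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_token)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_token += 1
    return rules, seq


def expand(token: int, rules: dict) -> bytes:
    """Recover the original bytes that a single token stands for."""
    if token < 256:
        return bytes([token])
    left, right = rules[token]
    return expand(left, rules) + expand(right, rules)


def nesting_depth(token: int, rules: dict) -> int:
    """Depth of rule nesting below a token (literal bytes have depth 0)."""
    if token < 256:
        return 0
    left, right = rules[token]
    return 1 + max(nesting_depth(left, rules), nesting_depth(right, rules))


rules, seq = bpe_compress(b"abababab")
restored = b"".join(expand(t, rules) for t in seq)
assert restored == b"abababab"   # compression is lossless
```

A compressed pattern matcher in the spirit of the abstract would scan `seq` directly instead of calling `expand` on every token; the sketch only illustrates the input such a matcher works on.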

Author information

Authors and Affiliations

Department of Informatics, Kyushu University 33, Fukuoka, 812-8581, Japan
Yusuke Shibata, Tetsuya Matsumoto, Masayuki Takeda, Ayumi Shinohara & Setsuo Arikawa

Editor information

Editors and Affiliations

Dipartimento di Matematica ed Applicazioni, Università di Palermo, Via Archirafi 34, 90123, Palermo, Italy
Raffaele Giancarlo
Centre de recherches mathématiques, Université de Montréal, CP 6128, succursale Centre-Ville, Montréal, Québec, Canada, H3C 3J7
David Sankoff

About this paper

Cite this paper

Shibata, Y., Matsumoto, T., Takeda, M., Shinohara, A., Arikawa, S. (2000). A Boyer—Moore Type Algorithm for Compressed Pattern Matching. In: Giancarlo, R., Sankoff, D. (eds) Combinatorial Pattern Matching. CPM 2000. Lecture Notes in Computer Science, vol 1848. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45123-4_17

DOI: https://doi.org/10.1007/3-540-45123-4_17
Published: 07 November 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67633-1
Online ISBN: 978-3-540-45123-5
eBook Packages: Springer Book Archive
