Abstract
We apply the Boyer-Moore technique to compressed pattern matching for text strings described in terms of collage systems, a formal framework that captures various dictionary-based compression methods. For the subclass of collage systems that contain no truncation, our new algorithm runs in O(‖D‖ + n·m + m² + r) time using O(‖D‖ + m²) space, where ‖D‖ is the size of the dictionary D, n is the compressed text length, m is the pattern length, and r is the number of pattern occurrences. For a general collage system, the time complexity is O(height(D)·(‖D‖ + n) + n·m + m² + r), where height(D) is the maximum dependency of tokens in D. We show that the algorithm specialized for the so-called byte pair encoding (BPE) is very fast in practice. In fact, it runs about 1.2 to 3.0 times faster than the exact match routine of the software package agrep, known as the fastest pattern matching tool.
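To make the terms ‖D‖, n, and height(D) concrete, the following is a minimal sketch of byte pair encoding viewed as a dictionary built by concatenation only (no truncation). It is an illustration under stated assumptions, not the algorithm of the paper: the names bpe_compress, expand, and height are hypothetical, tokens are represented as Python tuples rather than unused byte values, and height is computed here as the nesting depth of a token's definition, which is one plausible reading of "maximum dependency of tokens in D".

```python
from collections import Counter


def bpe_compress(text, max_rules=10):
    """Toy BPE: repeatedly replace the most frequent adjacent pair by a new token.

    Returns (rules, seq), where rules maps each new token to the pair it
    abbreviates (the dictionary) and seq is the compressed token sequence.
    """
    seq = list(text)   # token sequence, initially single characters
    rules = {}         # new token -> (left token, right token)
    for next_id in range(max_rules):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats, so nothing is gained by a new rule
            break
        token = ("T", next_id)
        rules[token] = pair
        out, i = [], 0
        while i < len(seq):  # greedy left-to-right replacement of the pair
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(token)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return rules, seq


def expand(token, rules):
    """Recover the original string that a token stands for."""
    if token not in rules:
        return token
    left, right = rules[token]
    return expand(left, rules) + expand(right, rules)


def height(token, rules):
    """Nesting depth of a token's definition (assumed reading of height)."""
    if token not in rules:
        return 0
    left, right = rules[token]
    return 1 + max(height(left, rules), height(right, rules))


rules, seq = bpe_compress("abababab")
restored = "".join(expand(t, rules) for t in seq)
dict_height = max(height(t, rules) for t in rules)
```

In this sketch the number of rules plays the role of the dictionary size, len(seq) plays the role of the compressed text length n, and dict_height is the depth to which dictionary tokens refer to one another.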
© 2000 Springer-Verlag Berlin Heidelberg
Shibata, Y., Matsumoto, T., Takeda, M., Shinohara, A., Arikawa, S. (2000). A Boyer-Moore Type Algorithm for Compressed Pattern Matching. In: Giancarlo, R., Sankoff, D. (eds) Combinatorial Pattern Matching. CPM 2000. Lecture Notes in Computer Science, vol 1848. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45123-4_17
DOI: https://doi.org/10.1007/3-540-45123-4_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67633-1
Online ISBN: 978-3-540-45123-5
eBook Packages: Springer Book Archive