A Boyer-Moore Type Algorithm for Compressed Pattern Matching

  • Yusuke Shibata
  • Tetsuya Matsumoto
  • Masayuki Takeda
  • Ayumi Shinohara
  • Setsuo Arikawa
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1848)

Abstract

We apply the Boyer-Moore technique to compressed pattern matching for text strings described in terms of a collage system, a formal framework that captures various dictionary-based compression methods. For the subclass of collage systems that contain no truncation, our new algorithm runs in $O(\|D\| + n \cdot m + m^2 + r)$ time using $O(\|D\| + m^2)$ space, where $\|D\|$ is the size of the dictionary $D$, $n$ is the compressed text length, $m$ is the pattern length, and $r$ is the number of pattern occurrences. For a general collage system, the time complexity is $O(\mathrm{height}(D) \cdot (\|D\| + n) + n \cdot m + m^2 + r)$, where $\mathrm{height}(D)$ is the maximum dependency of tokens in $D$. We showed that the algorithm specialized for the so-called byte pair encoding (BPE) is very fast in practice: it runs about 1.2 to 3.0 times faster than the exact match routine of the software package agrep, known as the fastest pattern matching tool.
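To make the underlying technique concrete, the following is a minimal sketch of the Horspool simplification of Boyer-Moore matching on plain, uncompressed text. It is not the compressed-text algorithm of the paper: it does not process collage-system tokens or a dictionary D, and the function name `horspool_search` is a hypothetical choice for illustration. It only shows the core idea that the abstract builds on, namely comparing the pattern right to left and skipping ahead by a precomputed shift.

```python
def horspool_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text.

    Illustrative Horspool variant of Boyer-Moore on plain text only;
    this is NOT the compressed-matching algorithm from the paper.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Shift table: for each character in pattern[:-1], the distance from
    # its last occurrence to the end of the pattern. Characters that do
    # not appear there allow a full shift of m.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    hits = []
    pos = 0
    while pos <= n - m:
        # Compare the window against the pattern from right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Shift according to the text character aligned with the
        # last pattern position.
        pos += shift.get(text[pos + m - 1], m)
    return hits


if __name__ == "__main__":
    print(horspool_search("abracadabra", "abra"))
```

On long patterns over a large alphabet most windows are rejected after a single comparison and skipped by up to m positions, which is the behaviour the paper carries over to compressed text.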

Keywords

Compression Ratio; Pattern Match; Original Text; Compression Method; Collage System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Yusuke Shibata (1)
  • Tetsuya Matsumoto (1)
  • Masayuki Takeda (1)
  • Ayumi Shinohara (1)
  • Setsuo Arikawa (1)

  1. Department of Informatics, Kyushu University 33, Fukuoka, Japan
