Boyer-Moore String Matching over Ziv-Lempel Compressed Text

  • Gonzalo Navarro
  • Jorma Tarhio
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1848)

Abstract

We present a Boyer-Moore approach to string matching over LZ78 and LZW compressed text. The key idea is that, although we cannot choose exactly which text characters to inspect, we can still use the characters explicitly represented in those formats to shift the pattern in the text. We present a basic approach and more advanced ones. Although the theoretical average complexity does not improve, because all the symbols in the compressed text still have to be scanned, we show experimentally that speedups of up to 30% over the fastest previous approaches are obtained. Moreover, we show that, using an encoding method that sacrifices some compression ratio, our method is twice as fast as decompressing plus searching with the best available algorithms.
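
The key idea can be illustrated with a small sketch. It is not the algorithm of the paper, only a simplified illustration under stated assumptions: the text is parsed into LZ78 blocks of the form (reference, explicit character), only the explicit character closing each block is used to shift the pattern, and candidate windows are verified by following block references. All function names (`lz78_compress`, `block_ends`, `char_at`, `shift`, `search`) are hypothetical and chosen for this example.

```python
from bisect import bisect_left, bisect_right

def lz78_compress(text):
    """Parse text into LZ78 blocks (ref, char); ref 0 denotes the empty phrase."""
    trie, blocks, ref = {}, [], 0
    for ch in text:
        nxt = trie.get((ref, ch))
        if nxt is not None:
            ref = nxt                     # current phrase is still in the dictionary
        else:
            blocks.append((ref, ch))      # new block: known phrase + explicit char
            trie[(ref, ch)] = len(blocks)
            ref = 0
    if ref:                               # text ended inside a known phrase:
        blocks.append(blocks[ref - 1])    # re-emit that phrase as a final block
    return blocks

def block_ends(blocks):
    """Text position of the explicit (final) character of every block."""
    lengths, ends, pos = [0], [], 0
    for ref, _ in blocks:
        lengths.append(lengths[ref] + 1)  # a block is one char longer than its reference
        pos += lengths[-1]
        ends.append(pos - 1)
    return ends

def char_at(blocks, ends, p):
    """Recover the text character at position p by walking block references."""
    idx = bisect_left(ends, p)            # block that contains position p
    ref, ch = blocks[idx]
    for _ in range(ends[idx] - p):        # step back one character per reference
        ref, ch = blocks[ref - 1]
    return ch

def shift(pattern, i, c):
    """Smallest shift that aligns c with an earlier pattern position, or moves past it."""
    d = 1
    while i - d >= 0 and pattern[i - d] != c:
        d += 1
    return d

def search(blocks, pattern):
    """Report all occurrences, shifting on the last explicit character in each window."""
    m, ends, out, s = len(pattern), block_ends(blocks), [], 0
    n = ends[-1] + 1 if ends else 0
    while s + m <= n:
        idx = bisect_right(ends, s + m - 1) - 1   # last block end inside the window
        if idx >= 0 and ends[idx] >= s:
            i, c = ends[idx] - s, blocks[idx][1]
            if pattern[i] != c:                   # explicit char rules this window out
                s += shift(pattern, i, c)
                continue
        if all(char_at(blocks, ends, s + k) == pattern[k] for k in range(m)):
            out.append(s)
        s += 1
    return out

blocks = lz78_compress("abracadabra abracadabra")
print(search(blocks, "abra"))
```

The shift is safe because every skipped alignment would place a pattern character different from `c` on the explicit character. The sketch makes no attempt to reproduce the efficiency of the methods described in the abstract: it still processes every block to compute lengths, and verification walks references character by character.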

Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Gonzalo Navarro (1)
  • Jorma Tarhio (2)
  1. Dept. of Computer Science, University of Chile, Chile
  2. Dept. of Computer Science, University of Joensuu, Finland
