Abstract
We present a Boyer-Moore approach to string matching over LZ78 and LZW compressed text. The key idea is that, although we cannot choose exactly which text characters to inspect, we can still use the characters explicitly represented in those formats to shift the pattern in the text. We present a basic approach and more advanced ones. Although the theoretical average complexity does not improve, because all the symbols in the compressed text still have to be scanned, we show experimentally that speedups of up to 30% over the fastest previous approaches are obtained. Moreover, we show that, using an encoding method that sacrifices some compression ratio, our method is twice as fast as decompressing plus searching using the best available algorithms.
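To make "the characters explicitly represented in those formats" concrete, here is a minimal Python sketch of a textbook LZ78 parse. It is an illustration under stated assumptions, not the algorithm of the paper: the function names (`lz78_compress`, `lz78_decompress`, `explicit_characters`) and the `(reference, char)` block layout are hypothetical choices made for this example. Each block refers to an earlier phrase and stores one new character, so the final character of every block can be read, together with its position in the uncompressed text, without decompressing anything. Those are the characters a Boyer-Moore style search could inspect to decide on shifts.

```python
def lz78_compress(text):
    """Parse text into (reference, char) blocks; reference 0 is the empty phrase."""
    dictionary = {"": 0}
    blocks = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the longest known phrase
        else:
            blocks.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Leftover phrase is already known; emit it as its prefix plus last char.
        blocks.append((dictionary[phrase[:-1]], phrase[-1]))
    return blocks


def lz78_decompress(blocks):
    """Rebuild the text from (reference, char) blocks."""
    phrases = [""]
    out = []
    for ref, ch in blocks:
        phrase = phrases[ref] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


def explicit_characters(blocks):
    """Return (text_position, char) for the last character of each block.

    Only block lengths are tracked, so no phrase is ever expanded.
    """
    lengths = [0]  # length of the phrase behind each reference
    pos = 0        # number of text characters covered so far
    result = []
    for ref, ch in blocks:
        length = lengths[ref] + 1
        lengths.append(length)
        pos += length
        result.append((pos - 1, ch))
    return result


if __name__ == "__main__":
    sample = "abababab"
    blocks = lz78_compress(sample)
    print(blocks)
    print(explicit_characters(blocks))
```

The sketch only shows which characters are available for free in the compressed stream; how to turn them into safe pattern shifts is the subject of the paper itself.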
Work developed during a postdoctoral stay at the University of Helsinki, partially supported by the Academy of Finland and Fundación Andes. Also supported by Fondecyt grant 1-990627.
Supported in part by the Academy of Finland.
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Navarro, G., Tarhio, J. (2000). Boyer-Moore String Matching over Ziv-Lempel Compressed Text. In: Giancarlo, R., Sankoff, D. (eds) Combinatorial Pattern Matching. CPM 2000. Lecture Notes in Computer Science, vol 1848. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45123-4_16
DOI: https://doi.org/10.1007/3-540-45123-4_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67633-1
Online ISBN: 978-3-540-45123-5
eBook Packages: Springer Book Archive