Abstract
This paper presents a new text compression scheme whose main purpose is to speed up string matching by searching the compressed file directly. The scheme requires no modification of the string-matching algorithm, which is used as a black box; any string-matching procedure can be used. Instead, the pattern is modified, and only the outcome of matching the modified pattern against the compressed file is decompressed. Since the compressed file is smaller than the original file, the search is faster than a search in the original file in terms of both I/O time and processing time. For typical text files, we achieve about a 30% reduction in space and a slightly smaller reduction in search time. A 30% space saving is not competitive with good text compression schemes, so the scheme should not be used where space is the predominant concern. The intended applications are files that are searched often, such as catalogs, bibliographic files, and address books. Such files are typically not compressed, but with this scheme they can remain compressed indefinitely, saving space while allowing faster search at the same time. A particular application to an information retrieval system that we developed is also discussed.
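The idea of modifying the pattern rather than the matcher can be illustrated with a small sketch. The abstract does not specify the encoding, so the code below assumes a simple pair-substitution codec: a few common character pairs are replaced by single unused byte values, and the pairs are chosen so that no character is both the first character of one pair and the second character of another, which means encoded pairs can never overlap. The pair table, the function names, and the restriction to ASCII input are all hypothetical choices made for illustration, not details taken from the paper.

```python
# Hypothetical pair table (an assumption, not the paper's table). First and
# second characters form disjoint sets, so encoded pairs never overlap.
PAIRS = [b"e ", b"th", b"in", b"an", b"er", b"t "]
FIRST = {p[0] for p in PAIRS}
SECOND = {p[1] for p in PAIRS}
assert not FIRST & SECOND

# Assumes ASCII input, so byte values 128 and above are free to use as codes.
ENCODE = {p: 128 + i for i, p in enumerate(PAIRS)}
DECODE = {code: p for p, code in ENCODE.items()}


def compress(data: bytes) -> bytes:
    """Replace every occurrence of a table pair by its one-byte code."""
    out = bytearray()
    i = 0
    while i < len(data):
        code = ENCODE.get(data[i:i + 2])
        if code is not None:
            out.append(code)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


def decompress(data: bytes) -> bytes:
    """Expand pair codes back into two characters; copy other bytes."""
    return b"".join(DECODE.get(b, bytes([b])) for b in data)


def core_bounds(pattern: bytes):
    """Drop a leading or trailing character that the text may have merged
    into a pair with a neighbour lying outside the pattern."""
    lo = 1 if pattern[0] in SECOND else 0
    hi = len(pattern) - 1 if pattern[-1] in FIRST else len(pattern)
    if hi <= lo:
        raise ValueError("pattern too short for this sketch")
    return lo, hi


def search(compressed: bytes, pattern: bytes, find=bytes.find) -> list:
    """Return offsets in `compressed` where the modified pattern matches
    and the dropped boundary characters are confirmed.

    `find` is the black-box matcher; only one token on each side of a
    candidate hit is ever decompressed.
    """
    lo, hi = core_bounds(pattern)
    core = compress(pattern[lo:hi])
    hits = []
    pos = find(compressed, core)
    while pos != -1:
        end = pos + len(core)
        before = decompress(compressed[max(pos - 1, 0):pos])
        after = decompress(compressed[end:end + 1])
        if before.endswith(pattern[:lo]) and after.startswith(pattern[hi:]):
            hits.append(pos)
        pos = find(compressed, core, pos + 1)
    return hits
```

In this sketch the matcher (`bytes.find` by default) sees only the compressed bytes and the compressed pattern; any other exact string-matching routine with the same calling convention could be passed as `find`. The reported offsets refer to the compressed file, and mapping them back to positions in the original text is left out for brevity.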
Supported in part by NSF grants CCR-9002351 and CCR-9301129, and by the Advanced Research Projects Agency under contract number DABT63-93-C-0052. Part of this work was done while the author was visiting the University of Washington.
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Manber, U. (1994). A text compression scheme that allows fast searching directly in the compressed file. In: Crochemore, M., Gusfield, D. (eds) Combinatorial Pattern Matching. CPM 1994. Lecture Notes in Computer Science, vol 807. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58094-8_10
DOI: https://doi.org/10.1007/3-540-58094-8_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58094-2
Online ISBN: 978-3-540-48450-9
eBook Packages: Springer Book Archive