A text compression scheme that allows fast searching directly in the compressed file

Manber, Udi

doi:10.1007/3-540-58094-8_10

Conference paper
First Online: 01 January 2005

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 807)

Abstract

A new text compression scheme is presented in this paper. The main purpose of this scheme is to speed up string matching by searching the compressed file directly. The scheme requires no modification of the string-matching algorithm, which is used as a black box; any string-matching procedure can be used. Instead, the pattern is modified, and only the outcome of matching the modified pattern against the compressed file is decompressed. Since the compressed file is smaller than the original file, the search is faster than a search in the original file in terms of both I/O time and processing time. For typical text files, we achieve about a 30% reduction in space and a slightly smaller reduction in search time. A 30% space saving is not competitive with good text compression schemes, so this scheme should not be used where space is the predominant concern. The intended applications of this scheme are files that are searched often, such as catalogs, bibliographic files, and address books. Such files are typically not compressed, but with this scheme they can remain compressed indefinitely, saving space while allowing faster search at the same time. A particular application to an information retrieval system that we developed is also discussed.
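
The abstract does not spell out the encoding, so the following is only a minimal sketch of the general idea (modify the pattern, search the compressed bytes with an unmodified matcher, decompress only around the hits). It assumes a toy scheme in which a few common character pairs are replaced by single bytes above 127, and in which no character is both the first character of one pair and the second character of another. The pair table, the function names, and the boundary handling are all illustrative assumptions, not the paper's actual algorithm or parameters.

```python
# Toy pair table (an assumption for illustration only). The first characters
# {t, i, e} and the second characters {h, n, r} are disjoint, so a pair is
# always encoded the same way regardless of the surrounding text.
PAIRS = {b"th": 128, b"in": 129, b"er": 130}
CODES = {code: pair for pair, code in PAIRS.items()}
FIRSTS = {pair[0] for pair in PAIRS}
SECONDS = {pair[1] for pair in PAIRS}
assert FIRSTS.isdisjoint(SECONDS)


def compress(data: bytes) -> bytes:
    """Replace each listed pair by its one-byte code (input must be ASCII)."""
    out = bytearray()
    i = 0
    while i < len(data):
        code = PAIRS.get(data[i:i + 2])
        if code is not None:
            out.append(code)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


def decompress(data: bytes) -> bytes:
    """Expand pair codes back to two characters; copy other bytes as-is."""
    return b"".join(CODES.get(b, bytes([b])) for b in data)


def modify_pattern(pattern: bytes) -> bytes:
    """Compress the part of the pattern whose encoding cannot depend on context.

    A leading character that may be the second half of a pair, or a trailing
    character that may be the first half of one, could be merged with text
    outside the match, so it is dropped here and checked after the search.
    """
    if not pattern:
        raise ValueError("empty pattern")
    start = 1 if pattern[0] in SECONDS else 0
    end = len(pattern) - 1 if pattern[-1] in FIRSTS else len(pattern)
    core = pattern[start:end]
    if not core:
        raise ValueError("pattern too short for this sketch")
    return compress(core)


def search(compressed: bytes, pattern: bytes) -> list:
    """Return offsets in the compressed data where the pattern is confirmed."""
    needle = modify_pattern(pattern)
    hits = []
    pos = compressed.find(needle)  # any string matcher could be used here
    while pos != -1:
        # Decompress only the hit plus one token on each side to verify the
        # boundary characters that were dropped from the pattern.
        lo = max(pos - 1, 0)
        window = decompress(compressed[lo:pos + len(needle) + 1])
        if pattern in window:
            hits.append(pos)
        pos = compressed.find(needle, pos + 1)
    return hits


text = b"the winter in the north"
packed = compress(text)
print(len(text), len(packed))
print(search(packed, b"winter"), search(packed, b"nte"), search(packed, b"the"))
```

In this sketch `bytes.find` plays the role of the black-box matcher; it never needs to know that the data is compressed. Searching for `nte` shows the boundary case: both end characters are dropped, the single remaining character is searched for, and the full pattern is confirmed by decompressing a three-token window.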

Supported in part by NSF grants CCR-9002351 and CCR-9301129, and by the Advanced Research Projects Agency under contract number DABT63-93-C-0052. Part of this work was done while the author was visiting the University of Washington.

Author information

Authors and Affiliations

Department of Computer Science, University of Arizona, 85721, Tucson, AZ
Udi Manber

Editor information

Maxime Crochemore, Dan Gusfield

Cite this paper

Manber, U. (1994). A text compression scheme that allows fast searching directly in the compressed file. In: Crochemore, M., Gusfield, D. (eds) Combinatorial Pattern Matching. CPM 1994. Lecture Notes in Computer Science, vol 807. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58094-8_10

DOI: https://doi.org/10.1007/3-540-58094-8_10
Published: 07 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58094-2
Online ISBN: 978-3-540-48450-9
eBook Packages: Springer Book Archive
