A text compression scheme that allows fast searching directly in the compressed file

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 807)

Abstract

A new text compression scheme is presented in this paper. Its main purpose is to speed up string matching by searching the compressed file directly. The scheme requires no modification of the string-matching algorithm, which is used as a black box, so any string-matching procedure can be used. Instead, the pattern is modified, and only the outcome of matching the modified pattern against the compressed file is decompressed. Since the compressed file is smaller than the original file, the search is faster than a search in the original file in terms of both I/O time and processing time. For typical text files, we achieve about a 30% reduction in space and a slightly smaller reduction in search time. A 30% space saving is not competitive with good text compression schemes, so the scheme should not be used where space is the predominant concern. The intended applications are files that are searched often, such as catalogs, bibliographic files, and address books. Such files are typically not compressed, but with this scheme they can remain compressed indefinitely, saving space while allowing faster search. A particular application to an information retrieval system that we developed is also discussed.
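The abstract does not spell out the encoding, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's actual scheme. It assumes 7-bit ASCII text and a hypothetical pair-substitution code in which a pair's first character comes from `FIRST` and its second from `SECOND`, two disjoint sets, so that a pair is encoded the same way regardless of context. All names (`FIRST`, `SECOND`, `compress`, `search_compressed`) are illustrative, and Python's built-in `bytes.find` stands in for the black-box string matcher.

```python
# Hypothetical pair table (an assumption, not taken from the paper).
# FIRST and SECOND are disjoint, so no character can both end one pair
# and begin another; every pair occurrence is therefore always encoded.
FIRST = b"aeo "
SECOND = b"nrst"
PAIRS = {bytes([f, s]): 128 + i
         for i, (f, s) in enumerate((f, s) for f in FIRST for s in SECOND)}
DECODE = {code: pair for pair, code in PAIRS.items()}


def compress(data: bytes) -> bytes:
    """Replace each known pair with one byte >= 128 (input must be ASCII)."""
    assert all(b < 128 for b in data), "sketch assumes 7-bit ASCII input"
    out = bytearray()
    i = 0
    while i < len(data):
        code = PAIRS.get(data[i:i + 2])
        if code is not None:
            out.append(code)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


def decompress(data: bytes) -> bytes:
    """Expand every pair code back into its two characters."""
    return b"".join(DECODE.get(b, bytes([b])) for b in data)


def search_compressed(compressed: bytes, pattern: bytes) -> list:
    """Return offsets in `compressed` where the pattern's core matches
    and the neighbouring bytes confirm a full match of `pattern`."""
    # A leading SECOND character may have been merged with the text before
    # the match, and a trailing FIRST character with the text after it, so
    # both are trimmed from the pattern and checked separately.
    head = pattern[:1] if pattern and pattern[0] in SECOND else b""
    tail = pattern[-1:] if pattern and pattern[-1] in FIRST else b""
    core = pattern[len(head):len(pattern) - len(tail)]
    if not core:
        raise ValueError("pattern too short for this sketch")
    needle = compress(core)  # the modified pattern
    hits = []
    pos = compressed.find(needle)  # unmodified, black-box matcher
    while pos != -1:
        end = pos + len(needle)
        # Decompress only the bytes adjacent to the candidate match.
        before = decompress(compressed[max(pos - 1, 0):pos])
        after = decompress(compressed[end:end + 1])
        if before.endswith(head) and after.startswith(tail):
            hits.append(pos)
        pos = compressed.find(needle, pos + 1)
    return hits
```

In this sketch the matcher never sees uncompressed text: only the pattern is rewritten, and only one byte on each side of a candidate match is decompressed to confirm the trimmed boundary characters. The returned offsets refer to positions in the compressed data, not in the original file.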

Supported in part by NSF grants CCR-9002351 and CCR-9301129, and by the Advanced Research Projects Agency under contract number DABT63-93-C-0052. Part of this work was done while the author was visiting the University of Washington.

Editor information

Maxime Crochemore, Dan Gusfield

Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Manber, U. (1994). A text compression scheme that allows fast searching directly in the compressed file. In: Crochemore, M., Gusfield, D. (eds) Combinatorial Pattern Matching. CPM 1994. Lecture Notes in Computer Science, vol 807. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58094-8_10

  • DOI: https://doi.org/10.1007/3-540-58094-8_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-58094-2

  • Online ISBN: 978-3-540-48450-9

  • eBook Packages: Springer Book Archive
