(S,C)-Dense Coding: An Optimized Compression Code for Natural Language Text Databases

  • Conference paper
String Processing and Information Retrieval (SPIRE 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2857)

Abstract

This work presents (s,c)-Dense Code, a new method for compressing natural language texts. The technique generalizes an earlier one, End-Tagged Dense Code, which achieves a better compression ratio and a simpler, faster encoding than Tagged Huffman. At the same time, (s,c)-Dense Code is a prefix code that keeps the most useful features of Tagged Huffman Code for direct search on the compressed text. It retains the efficiency and simplicity of Tagged Huffman while improving on its compression ratios.
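The abstract does not spell out how codewords are built, so the following is only a minimal sketch under stated assumptions: words are ranked by decreasing frequency (rank 0 is the most frequent), each codeword is a run of zero or more "continuer" digits followed by exactly one "stopper" digit, stoppers take the values 0..s-1 and continuers the values s..s+c-1. The function names `sc_encode` and `sc_decode` are hypothetical and chosen for illustration.

```python
def sc_encode(rank: int, s: int, c: int) -> list[int]:
    """Map a 0-based frequency rank to digits: continuers first, one stopper last.

    Assumed convention: stoppers are 0..s-1, continuers are s..s+c-1.
    """
    if rank < 0 or s < 1 or c < 1:
        raise ValueError("rank must be >= 0 and s, c must be >= 1")
    digits = [rank % s]            # the final stopper digit
    x = rank // s
    while x > 0:
        x -= 1                     # dense numbering: no codeword is left unused
        digits.append(s + x % c)   # a continuer digit
        x //= c
    digits.reverse()               # most significant continuer comes first
    return digits


def sc_decode(digits: list[int], s: int, c: int) -> int:
    """Invert sc_encode for a single complete codeword."""
    *continuers, stopper = digits
    if stopper >= s or any(d < s for d in continuers):
        raise ValueError("codeword must be continuers followed by one stopper")
    x = 0
    for d in continuers:
        x = x * c + (d - s) + 1
    return x * s + stopper


# Tiny alphabet of 4 digit values: stoppers {0, 1}, continuers {2, 3}.
for r in range(8):
    print(r, sc_encode(r, 2, 2))
```

Because every codeword ends at its first stopper digit, no codeword can be a prefix of another, which is the prefix-code property the abstract mentions. Under the same assumptions, choosing s = c = 128 over byte values corresponds to the End-Tagged Dense Code special case.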

We formally describe the (s,c)-Dense Code and show how to compute the parameters s and c that optimize the compression for a specific corpus. Our empirical results show that (s,c)-Dense Code improves on End-Tagged Dense Code and Tagged Huffman Code, and incurs only 0.5% overhead over plain Huffman Code.
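As an illustration of what choosing s and c for a corpus can look like, the sketch below tries every split of a digit alphabet of size `base` (with s + c = `base`) and keeps the one giving the smallest total size. This exhaustive search is an assumption made for clarity, not necessarily the procedure used in the paper; the names `codeword_length` and `best_s` are hypothetical, and codeword lengths assume that s codewords have one digit, s·c have two, s·c² have three, and so on.

```python
def codeword_length(rank: int, s: int, c: int) -> int:
    """Digits used by the codeword of a 0-based rank under the assumed scheme."""
    length, first, block = 1, 0, s
    # `block` codewords share the current length: s, then s*c, then s*c^2, ...
    while rank >= first + block:
        first += block
        block *= c
        length += 1
    return length


def best_s(freqs: list[int], base: int = 256) -> tuple[int, int]:
    """Return (s, total_digits) minimizing compressed size, with c = base - s."""
    ranked = sorted(freqs, reverse=True)   # most frequent word gets rank 0
    best = None
    for s in range(1, base):               # keep both s >= 1 and c >= 1
        c = base - s
        size = sum(f * codeword_length(i, s, c) for i, f in enumerate(ranked))
        if best is None or size < best[1]:
            best = (s, size)
    return best


# Toy corpus: six distinct words with these occurrence counts, 4 digit values.
print(best_s([10, 9, 8, 1, 1, 1], base=4))   # (3, 33)
```

In the toy example three words dominate, so spending three of the four digit values on stoppers gives them all one-digit codewords. With byte-oriented codes `base` would be 256 and the result is measured in bytes.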

This work is partially supported by CICYT Grant (#TIC2002-04413-C04-04), CYTED VII.19 RIBIDI Project, and (for the third author) Fondecyt Grant 1-020831.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Brisaboa, N.R., Fariña, A., Navarro, G., Esteller, M.F. (2003). (S,C)-Dense Coding: An Optimized Compression Code for Natural Language Text Databases. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_10

  • DOI: https://doi.org/10.1007/978-3-540-39984-1_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20177-9

  • Online ISBN: 978-3-540-39984-1

  • eBook Packages: Springer Book Archive
