(S,C)-Dense Coding: An Optimized Compression Code for Natural Language Text Databases

  • Nieves R. Brisaboa
  • Antonio Fariña
  • Gonzalo Navarro
  • María F. Esteller
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2857)

Abstract

This work presents (s,c)-Dense Code, a new method for compressing natural language texts. The technique generalizes a previous compression technique, End-Tagged Dense Code, which obtains a better compression ratio and a simpler, faster encoding than Tagged Huffman. At the same time, (s,c)-Dense Code is a prefix code that maintains the most interesting features of Tagged Huffman Code with respect to direct search on the compressed text. (s,c)-Dense Coding retains all the efficiency and simplicity of Tagged Huffman while improving on its compression ratios.

We formally describe (s,c)-Dense Code and show how to compute the parameters s and c that optimize compression for a specific corpus. Our empirical results show that (s,c)-Dense Code improves on both End-Tagged Dense Code and Tagged Huffman Code, and incurs only 0.5% overhead over plain Huffman Code.
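
The abstract does not spell out the encoding itself, so the following is only a minimal sketch of how a byte-oriented (s,c)-dense code is commonly described: words are ranked by decreasing frequency, byte values below s act as "stoppers" that end a codeword, and the remaining c = 256 - s values act as "continuers". The function names, the byte-value layout (stoppers first), and the brute-force search for s are assumptions made for illustration; the paper describes its own procedure for computing the optimal parameters.

```python
def encode(rank, s, c):
    """Codeword for a 0-based frequency rank: zero or more continuers, then one stopper."""
    out = [rank % s]              # last byte: a stopper in [0, s)
    x = rank // s
    while x > 0:
        x -= 1
        out.append(s + x % c)     # earlier bytes: continuers in [s, s + c)
        x //= c
    out.reverse()
    return bytes(out)


def decode(code, s, c):
    """Inverse of encode: recover the rank from a single codeword."""
    x = 0
    for b in code[:-1]:           # all bytes except the last are continuers
        x = x * c + (b - s) + 1
    return x * s + code[-1]


def compressed_size(ranked_freqs, s, c):
    """Total bytes needed when the i-th most frequent word occurs ranked_freqs[i] times."""
    return sum(f * len(encode(i, s, c)) for i, f in enumerate(ranked_freqs))


def best_s(freqs, base=256):
    """Hypothetical helper: exhaustive search for the s (with c = base - s) giving the smallest output."""
    ranked = sorted(freqs, reverse=True)
    return min(range(1, base),
               key=lambda s: compressed_size(ranked, s, base - s))
```

With s = c = 128 this sketch assigns the same codeword lengths as End-Tagged Dense Code, which is the sense in which the scheme is a generalization. Because only the last byte of a codeword is a stopper, no codeword is a prefix of another, and codeword boundaries can be found by scanning for stopper bytes.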

Keywords

Compression Ratio; Lower Frequency Word; Information Retrieval System; Compression Scheme; Dense Code
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Nieves R. Brisaboa (1)
  • Antonio Fariña (1)
  • Gonzalo Navarro (2)
  • María F. Esteller (1)
  1. Database Lab., Univ. da Coruña, Facultade de Informática, A Coruña, Spain
  2. Dept. of Computer Science, Univ. de Chile, Blanco Encalada, Santiago, Chile
