(S,C)-Dense Coding: An Optimized Compression Code for Natural Language Text Databases

Brisaboa, Nieves R.; Fariña, Antonio; Navarro, Gonzalo; Esteller, María F.

doi:10.1007/978-3-540-39984-1_10

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2857)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

This work presents (s,c)-Dense Code, a new method for compressing natural language texts. The technique generalizes an earlier method, End-Tagged Dense Code, which achieves a better compression ratio than Tagged Huffman as well as simpler and faster encoding. At the same time, (s,c)-Dense Code is a prefix code that retains the most useful features of Tagged Huffman Code for direct search on the compressed text. (s,c)-Dense Coding thus keeps the efficiency and simplicity of Tagged Huffman while improving its compression ratio.
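The chapter body is not reproduced on this page, so the following is only a minimal sketch of how a code of this kind is usually described: each codeword is a run of zero or more "continuer" bytes followed by exactly one "stopper" byte, and words are numbered by decreasing frequency. The function names, the 0-based ranks, and the choice of byte values 0..s-1 as stoppers and s..s+c-1 as continuers are assumptions made for illustration, not details taken from this abstract.

```python
def encode(rank, s, c):
    """Return an (s,c)-dense codeword (list of byte values) for a 0-based rank.

    Assumed layout: stoppers are 0..s-1, continuers are s..s+c-1.
    """
    if rank < 0 or s < 1 or c < 1:
        raise ValueError("rank must be >= 0 and s, c must be >= 1")
    out = [rank % s]              # last byte: a stopper
    x = rank // s
    while x > 0:                  # remaining bytes: continuers, built right to left
        x -= 1
        out.append(s + x % c)
        x //= c
    out.reverse()
    return out


def decode(codeword, s, c):
    """Invert encode(): map a single codeword back to its rank."""
    acc = 0
    for b in codeword[:-1]:       # continuers
        acc = acc * c + (b - s) + 1
    return acc * s + codeword[-1]


def split_codewords(data, s):
    """Split a byte sequence into codewords; every byte below s ends one."""
    words, current = [], []
    for b in data:
        current.append(b)
        if b < s:
            words.append(current)
            current = []
    return words


if __name__ == "__main__":
    s, c = 190, 66                # any split with s + c = 256 fits in one byte
    stream = encode(5, s, c) + encode(1000, s, c) + encode(50000, s, c)
    print([decode(w, s, c) for w in split_codewords(stream, s)])
    # prints [5, 1000, 50000]
```

Because a stopper can only appear as the final byte of a codeword, codeword boundaries can be found without decoding from the start of the text, which is the property that makes direct search on the compressed text practical. Under these assumptions, s = c = 128 gives the End-Tagged Dense Code case, up to the choice of which byte values act as stoppers.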

We formally describe the (s,c)-Dense Code and show how to compute the parameters s and c that optimize the compression for a specific corpus. Our empirical results show that (s,c)-Dense Code improves on both End-Tagged Dense Code and Tagged Huffman Code, and incurs only 0.5% overhead over plain Huffman Code.
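To make the parameter choice concrete, here is a small sketch that picks s (and hence c = 256 - s) for a given list of word frequencies by trying every value and keeping the one with the smallest total size. It assumes byte-oriented codewords in which s·c^(k-1) words receive k-byte codes; the exhaustive scan and the names `code_length`, `compressed_size`, and `best_s` are illustrative assumptions and are not claimed to be the procedure given in the chapter.

```python
def code_length(rank, s, c):
    """Number of bytes in the codeword of a 0-based rank.

    Assumes s words get 1 byte, s*c words get 2 bytes, s*c*c get 3, and so on.
    """
    k, block = 1, s
    while rank >= block:
        rank -= block
        block *= c
        k += 1
    return k


def compressed_size(freqs, s, base=256):
    """Total bytes needed to code a text with the given word frequencies."""
    c = base - s
    ordered = sorted(freqs, reverse=True)   # most frequent word gets rank 0
    return sum(f * code_length(i, s, c) for i, f in enumerate(ordered))


def best_s(freqs, base=256):
    """Exhaustively pick the s in 1..base-1 that minimizes the compressed size."""
    return min(range(1, base), key=lambda s: compressed_size(freqs, s, base))


if __name__ == "__main__":
    # Hypothetical Zipf-like frequencies for a 5000-word vocabulary.
    freqs = [1_000_000 // (i + 1) for i in range(5000)]
    s = best_s(freqs)
    print(s, 256 - s, compressed_size(freqs, s))
```

A corpus with a small vocabulary favors a large s (more one-byte codewords), while a large vocabulary favors a larger c so that two-byte codewords cover more words; the scan above simply measures that trade-off for the frequencies it is given.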

This work is partially supported by CICYT Grant (#TIC2002-04413-C04-04), CYTED VII.19 RIBIDI Project, and (for the third author) Fondecyt Grant 1-020831.

Author information

Authors and Affiliations

Database Lab., Univ. da Coruña, Facultade de Informática, Campus de Elviña s/n, 15071, A Coruña, Spain
Nieves R. Brisaboa, Antonio Fariña & María F. Esteller
Dept. of Computer Science, Univ. de Chile, Blanco Encalada, 2120, Santiago, Chile
Gonzalo Navarro

Editor information

Editors and Affiliations

Department of Computing Science, University of Alberta, Canada
Mario A. Nascimento
Universidade Federal do Amazonas, Manaus, AM, Brasil
Edleno S. de Moura
INESC-ID/IST, R. Alves Redol 9, 1000, Lisboa, Portugal
Arlindo L. Oliveira

Cite this paper

Brisaboa, N.R., Fariña, A., Navarro, G., Esteller, M.F. (2003). (S,C)-Dense Coding: An Optimized Compression Code for Natural Language Text Databases. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_10

DOI: https://doi.org/10.1007/978-3-540-39984-1_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20177-9
Online ISBN: 978-3-540-39984-1
eBook Packages: Springer Book Archive
