Abstract
This work presents (s,c)-Dense Code, a new method for compressing natural language texts. The technique generalizes an earlier compression technique, End-Tagged Dense Code, which obtains a better compression ratio as well as simpler and faster encoding than Tagged Huffman. At the same time, (s,c)-Dense Code is a prefix code that maintains the most interesting features of Tagged Huffman Code with respect to direct search on the compressed text. (s,c)-Dense Code retains all the efficiency and simplicity of Tagged Huffman and improves on its compression ratios.
We formally describe (s,c)-Dense Code and show how to compute the parameters s and c that optimize compression for a specific corpus. Our empirical results show that (s,c)-Dense Code improves on both End-Tagged Dense Code and Tagged Huffman Code, and reaches only 0.5% overhead over plain Huffman Code.
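The abstract does not spell out the code itself, so the following is only a minimal sketch under stated assumptions. It assumes the usual byte-oriented formulation: words are ranked by decreasing frequency, s byte values act as stoppers (they end a codeword), and the remaining c = 256 - s values act as continuers. The function names `encode`, `decode`, `compressed_size` and `best_s` are hypothetical, and the exhaustive scan over s is a simple stand-in for illustration, not the search procedure of the paper.

```python
from typing import Sequence, Tuple


def encode(rank: int, s: int, c: int) -> bytes:
    """Codeword for the word with 0-based frequency rank `rank`.

    Assumption: byte values 0..s-1 are stoppers, s..s+c-1 are continuers.
    A codeword is zero or more continuers followed by exactly one stopper.
    """
    out = [rank % s]              # last byte: a stopper
    x = rank // s
    while x > 0:
        x -= 1
        out.append(s + x % c)     # earlier bytes: continuers
        x //= c
    return bytes(reversed(out))


def decode(code: bytes, s: int, c: int) -> int:
    """Inverse of encode: recover the rank from a single codeword."""
    x = 0
    for b in code[:-1]:           # every byte except the last is a continuer
        x = x * c + (b - s) + 1
    return x * s + code[-1]


def compressed_size(freqs: Sequence[int], s: int, c: int) -> int:
    """Total bytes needed; `freqs` must be sorted by decreasing frequency."""
    total, rank, length, block = 0, 0, 1, s
    n = len(freqs)
    while rank < n:
        end = min(n, rank + block)     # `block` codewords have this length
        total += length * sum(freqs[rank:end])
        rank, length, block = end, length + 1, block * c
    return total


def best_s(freqs: Sequence[int], base: int = 256) -> Tuple[int, int, int]:
    """Try every s in 1..base-1 (with c = base - s); return (s, c, size)."""
    ordered = sorted(freqs, reverse=True)
    size, s = min((compressed_size(ordered, s, base - s), s)
                  for s in range(1, base))
    return s, base - s, size
```

With s = c = 128 this sketch produces codewords of the same lengths as End-Tagged Dense Code, which is the sense in which (s,c)-Dense Code generalizes it. Because a stopper can only appear as the final byte, no codeword is a proper prefix of another.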
This work is partially supported by CICYT Grant (#TIC2002-04413-C04-04), CYTED VII.19 RIBIDI Project, and (for the third author) Fondecyt Grant 1-020831.
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brisaboa, N.R., Fariña, A., Navarro, G., Esteller, M.F. (2003). (S,C)-Dense Coding: An Optimized Compression Code for Natural Language Text Databases. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_10
DOI: https://doi.org/10.1007/978-3-540-39984-1_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20177-9
Online ISBN: 978-3-540-39984-1
eBook Packages: Springer Book Archive