Abstract
The performance of data compression on a large static text may be improved if certain variable-length strings are included in the character set for which a code is generated. A new method for extending the alphabet is presented, based on a reduction to a graph-theoretic problem. A related optimization problem is shown to be NP-complete, a fast heuristic is suggested, and experimental results are presented.
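The premise of the first sentence can be illustrated with a small sketch. It is not the graph-based method of the chapter, whose details the abstract does not give; it only compares the Huffman-coded size of a text over its plain character set with the size obtained when one extra string is added as a symbol. The function names (`huffman_cost`, `tokenize`, `encoded_bits`) and the greedy left-to-right parse are assumptions made for illustration, and the cost of storing the extended alphabet and the code table is ignored.

```python
import heapq
from collections import Counter


def huffman_cost(freqs):
    """Total length in bits of the text under an optimal Huffman code.

    The sum of the weights of all merged nodes equals the sum of
    frequency * codeword length over the symbols.
    """
    if len(freqs) == 1:
        # Assumption: a lone symbol still costs one bit per occurrence.
        return sum(freqs.values())
    heap = list(freqs.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total


def tokenize(text, extra=()):
    """Greedy left-to-right parse (an illustrative assumption).

    At each position the longest matching extra string is taken,
    otherwise a single character.
    """
    candidates = sorted(extra, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for s in candidates:
            if s and text.startswith(s, i):
                tokens.append(s)
                i += len(s)
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def encoded_bits(text, extra=()):
    """Huffman-coded size of text over the characters plus the extra strings."""
    return huffman_cost(Counter(tokenize(text, extra)))


if __name__ == "__main__":
    sample = "ababababc"
    print(encoded_bits(sample))          # plain character alphabet
    print(encoded_bits(sample, ["ab"]))  # alphabet extended with "ab"
```

On the toy input `"ababababc"`, the plain alphabet {a, b, c} needs 14 bits, while the alphabet extended with the string `"ab"` needs 5. Which strings to add on a real text, where candidates overlap and compete, is the selection problem the chapter addresses.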
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Klein, S.T. (2014). A New Approach to Alphabet Extension for Improving Static Compression Schemes. In: Dershowitz, N., Nissan, E. (eds) Language, Culture, Computation. Computing - Theory and Technology. Lecture Notes in Computer Science, vol 8001. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45321-2_10
DOI: https://doi.org/10.1007/978-3-642-45321-2_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45320-5
Online ISBN: 978-3-642-45321-2
eBook Packages: Computer Science (R0)