Abstract
Modern users demand faster encoding and decoding from compression techniques. Bit-level compression techniques that use a binary code spend time encoding or decoding every single bit. In this research, we develop a dictionary-based compression technique that uses a quaternary tree instead of a binary tree to construct Huffman codes. First, we explore the properties of the quaternary tree structure mathematically for the construction of Huffman codes; we study the terminology of the new tree structure thoroughly and prove the results. Second, after a statistical analysis of the English language, we design a variable-length dictionary based on quaternary codes. Third, we develop the encoding and decoding algorithms for the proposed technique. We compare the performance of the proposed technique with existing popular techniques. The proposed technique outperforms the existing techniques in decompression speed, while the space requirement increases insignificantly.
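To make the quaternary-tree idea concrete, the following is a minimal sketch of Huffman code construction over a four-way tree. It is an illustration only and is not the authors' algorithm or dictionary design: the function names (`quaternary_huffman`, `encode`, `decode`), the zero-weight padding strategy, and the use of character-level symbols are all assumptions. Each code is a string over the digits 0 to 3, so one digit would occupy two bits when packed.

```python
import heapq
from collections import Counter
from itertools import count


def quaternary_huffman(freqs):
    """Return {symbol: code}; each code is a string over the digits 0-3.

    Hypothetical helper: a generic 4-ary Huffman construction, not the
    paper's method.
    """
    tie = count()  # unique tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    # Pad with zero-weight dummy leaves (symbol None) until (leaves - 1) is a
    # multiple of 3, so every merge step combines exactly four nodes.
    while len(heap) < 4 or (len(heap) - 1) % 3 != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)

    # Repeatedly merge the four lightest nodes into one internal node.
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(4)]
        total = sum(c[0] for c in children)
        heapq.heappush(heap, (total, next(tie), [c[2] for c in children]))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, list):          # internal node: four children
            for digit, child in enumerate(node):
                walk(child, prefix + str(digit))
        elif node is not None:              # real leaf (dummies are skipped)
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def encode(text, codes):
    """Concatenate the quaternary code of every symbol in text."""
    return "".join(codes[ch] for ch in text)


def decode(digits, codes):
    """Decode a digit string; relies on the codes being prefix-free."""
    inverse = {code: sym for sym, code in codes.items()}
    out, prefix = [], ""
    for d in digits:
        prefix += d
        if prefix in inverse:
            out.append(inverse[prefix])
            prefix = ""
    return "".join(out)


if __name__ == "__main__":
    sample = "abracadabra"
    table = quaternary_huffman(Counter(sample))
    packed = encode(sample, table)
    print(table, packed, decode(packed, table) == sample)
```

Because each internal node has four children, a tree over the same symbols is shallower than its binary counterpart, so a decoder follows fewer branches per symbol. The padding leaves can lengthen some codes slightly, which is one way the speed-for-space trade-off described above can arise.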
Availability of data
The datasets and the source code of the compression algorithms supporting this article are available online at the following links. The Brown Corpus [Online]. Available: http://www.nltk.org/nltk_data/. Accessed 30 May 2018. The Canterbury Corpus [Online]. Available: http://corpus.canterbury.ac.nz/resources/cantrbry.zip. Accessed 30 May 2018. The Enwik8 Corpus [Online]. Available: http://mattmahoney.net/dc/text.html, http://mattmahoney.net/dc/enwik8.zip. Accessed 30 May 2018. The SUPara Corpus [Online]. Available: http://dx.doi.org/10.21227/gz0b-5p24. Zopfli Source Code [Online]. Available: https://github.com/google/zopfli/commit/89cf773beef75d7f4d6d378debdf299378c3314e. Accessed 30 May 2018. LZHAM Source [Online]. Available: https://github.com/richgel999/lzham_codec. Accessed 30 May 2018. bzip2 Source: bzip2 1.0.6 6-Sept-2010 [Online]. Available: https://github.com/enthought/bzip2-1.0.6. LZMA Source: LZMA implementation in 7zip 9.20.1 [Online]. Available: LZMA SDK: https://www.7-zip.org/sdk.html.
Acknowledgements
The authors are grateful to the Information and Communication Technology Division, Government of the People’s Republic of Bangladesh, for the grant that supported this research.
Funding
All funding was provided by the Ministry of Posts, Telecommunications and Information Technology, People’s Republic of Bangladesh [Order No: 56.00.0000.028.33.025.14-153, date: 08.06.2017]. The funding supported the design of the study and the conduct of the experiments.
Author information
Contributions
The authors discussed the problem and the proposed solutions together. All authors participated in drafting and revising the final manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
About this article
Cite this article
Habib, A., Islam, M.J. & Rahman, M.S. A dictionary-based text compression technique using quaternary code. Iran J Comput Sci 3, 127–136 (2020). https://doi.org/10.1007/s42044-019-00047-w
DOI: https://doi.org/10.1007/s42044-019-00047-w