A dictionary-based text compression technique using quaternary code

  • Original Article
Iran Journal of Computer Science

Abstract

Faster encoding and decoding is in great demand among modern users of compression techniques. In bit-level compression techniques that use a binary code, encoding or decoding every single bit takes more time. In this research, we develop a dictionary-based compression technique that uses a quaternary tree instead of a binary tree to construct Huffman codes. First, we explore the properties of the quaternary tree structure mathematically for the construction of Huffman codes. We study the terminology of the new tree structure thoroughly and prove the results. Second, after a statistical analysis of the English language, we design a variable-length dictionary based on quaternary codes. Third, we develop the encoding and decoding algorithms for the proposed technique. We compare the performance of the proposed technique with existing popular techniques. The proposed technique outperforms the existing techniques in decompression speed, while the space requirement increases insignificantly.
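
The abstract names the core idea, Huffman codes built on a quaternary tree, without giving the construction. As a minimal sketch of that general idea, the Python below builds a textbook 4-ary Huffman code over the digits 0 to 3, where each digit would occupy two bits in the output. The function names `quaternary_huffman`, `encode`, and `decode` are illustrative assumptions, not the authors' algorithms, and the statistical dictionary design described in the abstract is not reproduced here.

```python
import heapq
from collections import Counter
from itertools import count


def quaternary_huffman(freqs):
    """Map each symbol to a prefix-free codeword over the digits '0'..'3'.

    Textbook k-ary Huffman construction with k = 4 (an assumption; the
    article's own construction may differ). Symbols are assumed to be strings.
    """
    if not freqs:
        return {}
    tie = count()  # unique tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    # Pad with zero-weight dummy leaves so every merge takes exactly four
    # nodes: a full 4-ary tree needs (leaves - 1) to be divisible by 3.
    while len(heap) < 4 or (len(heap) - 1) % 3 != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(4)]
        weight = sum(g[0] for g in group)
        # An internal node is stored as a list of its four children.
        heapq.heappush(heap, (weight, next(tie), [g[2] for g in group]))
    codes = {}
    stack = [(heap[0][2], "")]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, list):
            for digit, child in enumerate(node):
                stack.append((child, prefix + str(digit)))
        elif node is not None:  # skip dummy leaves
            codes[node] = prefix
    return codes


def encode(text, codes):
    """Concatenate the quaternary codewords of each character."""
    return "".join(codes[ch] for ch in text)


def decode(digits, codes):
    """Decode one quaternary digit (two bits) at a time."""
    inverse = {cw: sym for sym, cw in codes.items()}
    out, current = [], ""
    for d in digits:
        current += d
        if current in inverse:  # prefix-free, so the first match is final
            out.append(inverse[current])
            current = ""
    return "".join(out)


sample = "a dictionary based text compression technique"
codes = quaternary_huffman(Counter(sample))
restored = decode(encode(sample, codes), codes)
```

Because each tree level consumes two bits rather than one, a quaternary tree is shallower than the binary tree for the same symbol set, so a decoder takes fewer steps per symbol. Codewords are rounded up to a whole number of two-bit digits, which is consistent with the abstract's note that the space requirement increases slightly.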

Availability of data

The datasets and the source code of the compression algorithms supporting this article are available online at the following links.

  • The Brown Corpus [Online]. Available: http://www.nltk.org/nltk_data/. Accessed 30 May 2018.
  • The Canterbury Corpus [Online]. Available: http://corpus.canterbury.ac.nz/resources/cantrbry.zip. Accessed 30 May 2018.
  • The Enwik8 Corpus [Online]. Available: http://mattmahoney.net/dc/text.html, http://mattmahoney.net/dc/enwik8.zip. Accessed 30 May 2018.
  • The SUPara Corpus [Online]. Available: http://dx.doi.org/10.21227/gz0b-5p24.
  • Zopfli Source Code [Online]. Available: https://github.com/google/zopfli/commit/89cf773beef75d7f4d6d378debdf299378c3314e. Accessed 30 May 2018.
  • LZHAM Source [Online]. Available: https://github.com/richgel999/lzham_codec. Accessed 30 May 2018.
  • bzip2 Source: bzip2 1.0.6 6-Sept-2010 [Online]. Available: https://github.com/enthought/bzip2-1.0.6.
  • LZMA Source: LZMA implementation in 7zip 9.20.1 [Online]. Available: LZMA SDK: https://www.7-zip.org/sdk.html.

Acknowledgements

The authors are grateful to the Information and Communication Technology Division, Government of the People’s Republic of Bangladesh, for the grant that supported this research work.

Funding

All funding was provided by the Ministry of Posts, Telecommunications and Information Technology, People’s Republic of Bangladesh [Order No: 56.00.0000.028.33.025.14-153, date: 08.06.2017]. This funding supported the design of the study and the conduct of the experiments.

Author information

Contributions

The authors jointly discussed the problem and the proposed solutions. All authors participated in drafting and revising the final manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ahsan Habib.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Habib, A., Islam, M.J. & Rahman, M.S. A dictionary-based text compression technique using quaternary code. Iran J Comput Sci 3, 127–136 (2020). https://doi.org/10.1007/s42044-019-00047-w

  • DOI: https://doi.org/10.1007/s42044-019-00047-w
