A dictionary-based text compression technique using quaternary code

Habib, Ahsan; Islam, M. Jahirul; Rahman, Mohammad Shahidur

doi:10.1007/s42044-019-00047-w

Original Article
Published: 07 September 2019

Volume 3, pages 127–136, (2020)

Iran Journal of Computer Science

Ahsan Habib ORCID: orcid.org/0000-0001-9320-4456¹,
M. Jahirul Islam¹ &
Mohammad Shahidur Rahman¹

Abstract

Faster encoding and decoding is in great demand among users of compression tools. In bit-level compression with a binary code, time is spent encoding or decoding every single bit. In this research, we develop a dictionary-based compression technique that uses a quaternary tree instead of a binary tree to construct Huffman codes. First, we explore the properties of the quaternary tree structure mathematically for the construction of Huffman codes; we study the terminology of the new tree structure thoroughly and prove the results. Second, after a statistical analysis of the English language, we design a variable-length dictionary based on quaternary codes. Third, we develop the encoding and decoding algorithms for the proposed technique. We compare the performance of the proposed technique with that of existing popular techniques. The proposed technique outperforms the existing techniques in decompression speed, while its space requirement increases only insignificantly.
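
To make the idea of a quaternary Huffman code concrete, the following is a minimal sketch of the generic 4-ary Huffman construction, not the authors' dictionary or algorithms, which are not reproduced on this page. Each codeword is a string over the digits 0 to 3, so one digit corresponds to two bits and a decoder can consume the input two bits at a time. The function names (quaternary_huffman, encode, decode) and the character-level alphabet are assumptions made for illustration only.

```python
import heapq
from collections import Counter
from itertools import count


def quaternary_huffman(freqs):
    """Build a 4-ary Huffman code. Illustrative sketch, not the paper's method.

    freqs maps string symbols to weights; the result maps each symbol to a
    codeword written with the digits 0-3.
    """
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare payloads
    heap = [(w, next(tie), sym) for sym, w in freqs.items()]
    # Pad with zero-weight dummy leaves until (n - 1) is a multiple of 3,
    # so that every merge step can combine exactly four nodes.
    while (len(heap) - 1) % 3 != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(4)]
        weight = sum(w for w, _, _ in group)
        children = [node for _, _, node in group]  # list marks an internal node
        heapq.heappush(heap, (weight, next(tie), children))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, list):
            for digit, child in enumerate(node):
                walk(child, prefix + str(digit))
        elif node is not None:  # skip dummy leaves
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def encode(text, codes):
    """Concatenate the quaternary codewords of the characters in text."""
    return "".join(codes[ch] for ch in text)


def decode(digits, codes):
    """Decode greedily; this works because Huffman codes are prefix-free."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for d in digits:
        current += d
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


text = "abracadabra"
codes = quaternary_huffman(Counter(text))
digits = encode(text, codes)
restored = decode(digits, codes)
```

A quaternary tree over the same alphabet is shallower than a binary one, so the decoder follows fewer branches per symbol. The cost is that codeword lengths are restricted to multiples of two bits, which can lengthen the encoded output slightly; this is the trade-off the abstract describes.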

Availability of data

The datasets and the source code of the compression algorithms supporting this article are available online at the following links.
The Brown Corpus [Online]. Available: http://www.nltk.org/nltk_data/. Accessed 30 May 2018.
The Canterbury Corpus [Online]. Available: http://corpus.canterbury.ac.nz/resources/cantrbry.zip. Accessed 30 May 2018.
The Enwik8 Corpus [Online]. Available: http://mattmahoney.net/dc/text.html, http://mattmahoney.net/dc/enwik8.zip. Accessed 30 May 2018.
The SUPara Corpus [Online]. Available: http://dx.doi.org/10.21227/gz0b-5p24.
Zopfli Source Code [Online]. Available: https://github.com/google/zopfli/commit/89cf773beef75d7f4d6d378debdf299378c3314e. Accessed 30 May 2018.
LZHAM Source [Online]. Available: https://github.com/richgel999/lzham_codec. Accessed 30 May 2018.
bzip2 Source: bzip2 1.0.6 6-Sept-2010 [Online]. Available: https://github.com/enthought/bzip2-1.0.6.
LZMA Source: LZMA implementation in 7zip 9.20.1 [Online]. Available: LZMA SDK: https://www.7-zip.org/sdk.html.

Acknowledgements

The authors are grateful to the Information and Communication Technology Division, Government of the People’s Republic of Bangladesh, for the grant that supported this research work.

Funding

All funding was provided by the Ministry of Posts, Telecommunications and Information Technology, People’s Republic of Bangladesh [Order No: 56.00.0000.028.33.025.14-153, date: 08.06.2017]. This funding gave financial support for designing the study and conducting the experiments.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh
Ahsan Habib, M. Jahirul Islam & Mohammad Shahidur Rahman

Contributions

The authors discussed the problem and the proposed solutions together. All authors participated in drafting and revising the final manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ahsan Habib.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Habib, A., Islam, M.J. & Rahman, M.S. A dictionary-based text compression technique using quaternary code. Iran J Comput Sci 3, 127–136 (2020). https://doi.org/10.1007/s42044-019-00047-w

Received: 24 March 2019
Accepted: 28 August 2019
Published: 07 September 2019
Issue Date: September 2020
DOI: https://doi.org/10.1007/s42044-019-00047-w
