Hierarchical Clustering Approach to Text Compression

Oswald, C.; Akshay Vyas, V.; Arun Kumar, K.; Vijay Sri, L.; Sivaselvan, B.

doi:10.1007/978-981-10-3373-5_35

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 518)

Abstract

This paper explores a novel perspective on data compression, focusing on a new text compression algorithm based on clustering, a technique from Data Mining. Huffman encoding is enhanced for lossless text compression through clustering, a non-trivial phase in the field of Data Mining. The seminal hierarchical clustering technique is modified so that an optimal number of words (patterns, which are sequences of characters with a space as suffix) is obtained. These patterns are used in the encoding process of our algorithm in place of the single-character code assignment of conventional Huffman encoding. Our approach is built on an efficient cosine similarity measure, which maximizes the compression ratio. Simulation of the proposed technique over a benchmark corpus shows a gain in compression ratio and time relative to conventional Huffman encoding.
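
To illustrate the idea of assigning Huffman codes to word patterns rather than to single characters, here is a minimal Python sketch. It is not the authors' algorithm: it omits the hierarchical clustering and cosine similarity steps, treats every space-terminated word as a pattern, and ignores the cost of storing the code table. The function names `huffman_code_lengths`, `split_patterns`, and `encoded_bits`, and the sample text, are hypothetical and chosen only for illustration.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a frequency table."""
    if len(freqs) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # unique counter so the heap never compares the dicts
    heap = [(w, next(tiebreak), {sym: 0}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them gets one bit deeper.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in a.items()}
        merged.update({s: d + 1 for s, d in b.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


def split_patterns(text):
    """Split text into words that keep their trailing space (assumed pattern form)."""
    patterns, current = [], ""
    for ch in text:
        current += ch
        if ch == " ":
            patterns.append(current)
            current = ""
    if current:  # trailing text without a closing space
        patterns.append(current)
    return patterns


def encoded_bits(symbols):
    """Total payload size in bits when the symbols are Huffman coded."""
    freqs = Counter(symbols)
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)


text = "the cat and the dog and the bird "  # hypothetical sample input
char_bits = encoded_bits(list(text))           # conventional character-level coding
word_bits = encoded_bits(split_patterns(text))  # word-pattern coding
print(char_bits, word_bits)
```

On this small input the word-level payload is shorter than the character-level one, but a fair comparison would also have to count the dictionary that maps codes back to patterns, which is larger for words than for characters.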

Author information

Authors and Affiliations

Department of Computer Engineering, Indian Institute of Information Technology, Design and Manufacturing Kancheepuram, Chennai, Tamil Nadu, India
C. Oswald, L. Vijay Sri & B. Sivaselvan
Department of Computer Science and Engineering, Department of Information Technology, Sona College of Technology, Salem, Tamil Nadu, India
V. Akshay Vyas & K. Arun Kumar

Corresponding author

Correspondence to C. Oswald.

Editor information

Editors and Affiliations

Dept. of Computer Science & Engineering, National Institute of Technology, Rourkela, Odisha, India
Pankaj Kumar Sa
Dept. of Computer Science & Engineering, National Institute of Technology, Rourkela, Odisha, India
Manmath Narayan Sahoo
School of Mechatronics Engineering, Universiti Malaysia Perlis (UniMAP), Arau, Perlis, Malaysia
M. Murugappan
Lecturer, The University of Exeter, Exeter, Devon, United Kingdom
Yulei Wu
Dept. of Computer Science & Engineering, National Institute of Technology, Rourkela, Odisha, India
Banshidhar Majhi

About this paper

Cite this paper

Oswald, C., Akshay Vyas, V., Arun Kumar, K., Vijay Sri, L., Sivaselvan, B. (2018). Hierarchical Clustering Approach to Text Compression. In: Sa, P., Sahoo, M., Murugappan, M., Wu, Y., Majhi, B. (eds) Progress in Intelligent Computing Techniques: Theory, Practice, and Applications. Advances in Intelligent Systems and Computing, vol 518. Springer, Singapore. https://doi.org/10.1007/978-981-10-3373-5_35

DOI: https://doi.org/10.1007/978-981-10-3373-5_35
Published: 13 July 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3372-8
Online ISBN: 978-981-10-3373-5
eBook Packages: Engineering, Engineering (R0)