Abstract
This paper explores data compression from a Data Mining perspective and presents a new lossless text compression algorithm based on clustering. Huffman encoding is enhanced through clustering, a non-trivial phase in Data Mining. The seminal hierarchical clustering technique is modified so that an optimal number of words (patterns, that is, sequences of characters with a space as suffix) is obtained. Our algorithm assigns codes to these patterns instead of to single characters, as conventional Huffman encoding does. The approach is built on an efficient cosine similarity measure, which maximizes the compression ratio. Simulation over a benchmark corpus shows that the proposed technique improves on conventional Huffman encoding in both compression ratio and time.
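The difference between character-based and pattern-based code assignment can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hierarchical clustering and cosine similarity steps that select the patterns are not reproduced, and every word with its trailing space is simply treated as one pattern. The names `split_patterns`, `huffman_code_lengths`, and `encoded_bits` are hypothetical, and the bit counts cover the encoded payload only, ignoring the cost of storing the code table.

```python
import heapq
import re
from collections import Counter
from itertools import count


def split_patterns(text):
    """Split text into words that keep their trailing space as a suffix.

    Assumption for illustration: every such word is a pattern. The paper
    selects patterns via clustering, which is not modeled here.
    """
    return re.findall(r"[^ ]* |[^ ]+", text)


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} of a Huffman code for freqs."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def encoded_bits(symbols):
    """Total payload size in bits when symbols are Huffman-coded."""
    freqs = Counter(symbols)
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird " * 20
    print("character symbols:", encoded_bits(sample), "bits")
    print("pattern symbols:  ", encoded_bits(split_patterns(sample)), "bits")
```

On repetitive text, the pattern-based payload is smaller because one code word replaces several character codes. A fair comparison would also have to account for the larger code table that word-level symbols require.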
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Oswald, C., Akshay Vyas, V., Arun Kumar, K., Vijay Sri, L., Sivaselvan, B. (2018). Hierarchical Clustering Approach to Text Compression. In: Sa, P., Sahoo, M., Murugappan, M., Wu, Y., Majhi, B. (eds) Progress in Intelligent Computing Techniques: Theory, Practice, and Applications. Advances in Intelligent Systems and Computing, vol 518. Springer, Singapore. https://doi.org/10.1007/978-981-10-3373-5_35
DOI: https://doi.org/10.1007/978-981-10-3373-5_35
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3372-8
Online ISBN: 978-981-10-3373-5
eBook Packages: Engineering, Engineering (R0)