Hierarchical Clustering Approach to Text Compression

  • C. Oswald
  • V. Akshay Vyas
  • K. Arun Kumar
  • L. Vijay Sri
  • B. Sivaselvan
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 518)


This paper explores a data compression perspective grounded in Data Mining and presents a new lossless text compression algorithm based on clustering. Conventional Huffman encoding is enhanced through clustering, a non-trivial phase of Data Mining. The seminal hierarchical clustering technique is modified so that an optimal number of words (patterns that are sequences of characters with a space as suffix) is obtained. These patterns, rather than the single characters used by conventional Huffman encoding, are assigned codes in the encoding process. The approach is built on an efficient cosine similarity measure, which maximizes the compression ratio. Simulation of the proposed technique over a benchmark corpus shows a gain in both compression ratio and time relative to conventional Huffman encoding.
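To make the contrast between character-based and word-based code assignment concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes that a "word" is any run of characters ending in a space, builds an ordinary Huffman code over either characters or such words, and compares the resulting payload sizes. The hierarchical clustering and cosine similarity steps that select which patterns to keep are deliberately omitted, and the function names (huffman_code_lengths, split_into_words, encoded_bits) and the sample text are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # unique counter so equal weights never compare dicts
    heap = [(w, next(tiebreak), {sym: 0}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside gets one bit deeper.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in d1.items()}
        merged.update({s: depth + 1 for s, depth in d2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


def split_into_words(text):
    """Split text into patterns that end with a space (the last one may not)."""
    words, current = [], []
    for ch in text:
        current.append(ch)
        if ch == " ":
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def encoded_bits(tokens):
    """Total payload bits when tokens are Huffman-coded (code table excluded)."""
    freqs = Counter(tokens)
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[t] * lengths[t] for t in freqs)


# Hypothetical sample text, chosen only for illustration.
text = "the cat and the dog and the bird "
char_bits = encoded_bits(list(text))               # one code per character
word_bits = encoded_bits(split_into_words(text))   # one code per word pattern
print(char_bits, word_bits)
```

On this toy input the word-based payload is smaller than the character-based one. The count ignores the cost of storing the code table, which grows with the number of distinct patterns, so the comparison is only indicative of why the choice of how many patterns to keep matters.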


Hierarchical clustering; Compression ratio; Cosine similarity measure; Huffman encoding; Lossless compression



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • C. Oswald (1)
  • V. Akshay Vyas (2)
  • K. Arun Kumar (2)
  • L. Vijay Sri (1)
  • B. Sivaselvan (1)
  1. Department of Computer Engineering, Indian Institute of Information Technology, Design and Manufacturing Kancheepuram, Chennai, India
  2. Department of Computer Science and Engineering, Department of Information Technology, Sona College of Technology, Salem, India
