Hierarchical Clustering Approach to Text Compression

  • Conference paper
Progress in Intelligent Computing Techniques: Theory, Practice, and Applications

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 518))

Abstract

This paper explores a data compression perspective drawn from Data Mining and presents a new lossless text compression algorithm based on clustering. Clustering, a non-trivial phase of Data Mining, is used to enhance Huffman encoding. The seminal hierarchical clustering technique is modified so that an optimal number of words is obtained, where a word is a pattern consisting of a sequence of characters with a space as suffix. These patterns are used in the encoding process of our algorithm in place of the single-character code assignment of conventional Huffman encoding. Our approach is built on an efficient cosine similarity measure, which maximizes the compression ratio. Simulation of the proposed technique over a benchmark corpus shows a gain in both compression ratio and time relative to conventional Huffman encoding.
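
The contrast between character-based and pattern-based code assignment can be illustrated with a minimal Python sketch. It is not the algorithm of the paper: the hierarchical clustering and cosine similarity steps are omitted because the abstract does not specify them, and the function names (`word_patterns`, `huffman_code_lengths`, `encoded_bits`) and the sample text are hypothetical. The sketch only counts payload bits and ignores the cost of storing the code table, which a real compressor would have to include.

```python
import heapq
import re
from collections import Counter
from itertools import count


def word_patterns(text):
    """Split text into patterns: a run of non-space characters with at
    most one space as suffix, or a lone space. Joining them restores text."""
    return re.findall(r"[^ ]+ ?| ", text)


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        # A single symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    tie = count()  # unique tie-breaker so dicts are never compared
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level down.
        merged = {s: depth + 1 for s, depth in d1.items()}
        merged.update({s: depth + 1 for s, depth in d2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def encoded_bits(symbols):
    """Total payload bits when symbols are Huffman coded (table excluded)."""
    freqs = Counter(symbols)
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)


# Hypothetical sample text with heavy word repetition.
sample = "the cat and the dog and the bird " * 20
char_bits = encoded_bits(list(sample))            # conventional, per character
word_bits = encoded_bits(word_patterns(sample))   # per space-suffixed pattern
print(char_bits, word_bits)
```

On repetitive text such as this sample, coding whole patterns needs fewer payload bits than coding single characters, because each frequent word is replaced by one short code. How much of that gain survives once the larger code table is stored depends on the input, which is presumably why the paper selects the set of patterns rather than using every word.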

Author information

Corresponding author

Correspondence to C. Oswald.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Oswald, C., Akshay Vyas, V., Arun Kumar, K., Vijay Sri, L., Sivaselvan, B. (2018). Hierarchical Clustering Approach to Text Compression. In: Sa, P., Sahoo, M., Murugappan, M., Wu, Y., Majhi, B. (eds) Progress in Intelligent Computing Techniques: Theory, Practice, and Applications. Advances in Intelligent Systems and Computing, vol 518. Springer, Singapore. https://doi.org/10.1007/978-981-10-3373-5_35

  • DOI: https://doi.org/10.1007/978-981-10-3373-5_35

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-3372-8

  • Online ISBN: 978-981-10-3373-5

  • eBook Packages: Engineering, Engineering (R0)
