Text Clustering

Text Data Mining

Abstract

As the saying goes, birds of a feather flock together. In data mining, clustering analysis divides data points into subsets according to their intrinsic regularities and distribution characteristics, so that data points in the same subset are similar to each other while those in different subsets are dissimilar. Each subset is called a cluster. As an unsupervised machine learning method, clustering differs from classification in two main ways. First, it requires no labeled data for supervision; it depends only on a similarity computation between data points, which makes it highly flexible. Second, in classification the categories are predefined, whereas in clustering the number of categories is not known in advance: the clustering system determines the number of categories, and the data points assigned to each, according to certain criteria. Clustering is a fundamental problem in machine learning and is widely used in natural language processing and text data mining.
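To make the idea of a similarity computation between data points concrete, here is a minimal sketch in Python. It is not taken from the chapter: it assumes a whitespace-tokenized bag-of-words representation and cosine similarity, and the helper names `bag_of_words` and `cosine_similarity` and the sample sentences are illustrative only.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count term occurrences."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Counter returns 0 for terms missing from b, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Illustrative documents (assumed for this sketch, not from the source).
docs = [
    "stock markets fell sharply today",
    "stock markets rose sharply today",
    "the team won the football match",
]
vectors = [bag_of_words(d) for d in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])  # two finance sentences
sim_02 = cosine_similarity(vectors[0], vectors[2])  # finance vs. sports
```

The two finance sentences share four of their five terms and score higher than the finance and sports pair, which shares none. A clustering algorithm relies on exactly this kind of signal and needs no labels.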

In text clustering, the text data must first be represented in a machine-computable form, so text representation is the basis of text clustering. Text representation methods are described in detail in Chap. 3; this chapter focuses on clustering algorithms. Typical clustering algorithms include partition-based, hierarchy-based, and density-based methods.
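As an illustration of the partition-based family, the following sketch runs a small k-means loop over term-count vectors in plain Python. It rests on several assumptions that are not from the chapter: the number of clusters is fixed at k = 2 for the toy data (in practice it is usually unknown), documents are whitespace-tokenized term counts, distance is squared Euclidean, and seeding uses a simple deterministic farthest-point rule. All function names and sample sentences are hypothetical.

```python
from collections import Counter

def vectorize(docs):
    """Map each document to a dense term-count vector over a shared vocabulary."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[float(c[t]) for t in vocab] for c in counts]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(vectors, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first vector, then repeatedly add the vector farthest from the seeds."""
    seeds = [vectors[0]]
    while len(seeds) < k:
        seeds.append(max(vectors, key=lambda v: min(sq_dist(v, s) for s in seeds)))
    return [list(s) for s in seeds]

def kmeans(vectors, k, max_iter=20):
    """Alternate assignment and centroid update until the labels stop changing."""
    centroids = farthest_point_init(vectors, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Illustrative documents (assumed for this sketch, not from the source).
docs = [
    "stock markets fell sharply today",
    "stock markets rose sharply today",
    "the team won the football match",
    "the football team lost the match",
]
labels = kmeans(vectorize(docs), k=2)
```

On this toy input the two finance sentences end up in one cluster and the two sports sentences in the other. Hierarchy-based and density-based methods group the same vectors by different criteria and do not require k to be fixed up front.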


Notes

  1. For simplicity of description, we use “document” to refer to a piece of text at any level (e.g., a sentence or a full document).

Copyright information

© 2021 Tsinghua University Press

About this chapter

Cite this chapter

Zong, C., Xia, R., Zhang, J. (2021). Text Clustering. In: Text Data Mining. Springer, Singapore. https://doi.org/10.1007/978-981-16-0100-2_6

  • DOI: https://doi.org/10.1007/978-981-16-0100-2_6

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-0099-9

  • Online ISBN: 978-981-16-0100-2

  • eBook Packages: Computer Science, Computer Science (R0)
