Abstract
As the saying goes, birds of a feather flock together. Clustering analysis in data mining is the process of dividing data points into subsets according to their intrinsic regularities and distribution characteristics, so that data points in the same subset are similar to each other while those in different subsets are distinct. Each subset of data points is called a cluster. As an unsupervised machine learning method, clustering differs from classification in two main ways. First, it does not require labeled data for supervision; it depends only on a similarity computation between data points, which makes it highly flexible. Second, in a classification problem the categories must be predefined, while in clustering the number of categories is unknown in advance: the clustering system determines the number of categories and the data points contained in each category according to certain criteria. Clustering is a fundamental problem in machine learning and has been widely used in natural language processing and text data mining.
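To make the idea of a similarity computation between data points concrete, the following is a minimal sketch, not the chapter's own method. It assumes a simple bag-of-words representation and cosine similarity; the helper names `bag_of_words` and `cosine_similarity` and the three toy documents are illustrative assumptions.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as term counts (assumed representation: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy documents: the first two share most of their words.
docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
]
vectors = [bag_of_words(d) for d in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01, sim_02)
```

No labels are involved: the first two documents score as more similar to each other than to the third purely from their word overlap, which is the kind of signal a clustering algorithm would group on.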
In text clustering, the text data must first be represented in a machine-computable form, so text representation is the basis of text clustering. Text representation methods are described in detail in Chap. 3. In this chapter, we focus on the clustering algorithms themselves. Typical clustering algorithms include partition-based methods, hierarchy-based methods, and density-based methods.
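As an illustration of one of these families, the sketch below shows k-means, a well-known partition-based method, on small numeric vectors standing in for document representations. It is a simplified example under stated assumptions, not the chapter's implementation: the number of clusters `k` is supplied by the caller, the initial centroids are simply the first `k` points (a deterministic choice made here for reproducibility), and the data are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def kmeans(points, k, max_iter=100):
    """Partition points into k clusters; returns (assignments, centroids)."""
    # Assumption of this sketch: initialize centroids with the first k points.
    centroids = [points[i] for i in range(k)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean(members)
    return assignments, centroids


# Hypothetical 2-D "document vectors" forming two loose groups.
points = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9), (1.0, 0.2), (0.2, 1.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Hierarchy-based and density-based methods differ mainly in how clusters are formed (by successively merging or splitting groups, or by connecting dense regions) and do not necessarily require `k` to be fixed in the same way.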
Notes
1. For simplicity of description, we use “document” to refer to a piece of text at any level (e.g., a sentence or a full document).
Copyright information
© 2021 Tsinghua University Press
About this chapter
Cite this chapter
Zong, C., Xia, R., Zhang, J. (2021). Text Clustering. In: Text Data Mining. Springer, Singapore. https://doi.org/10.1007/978-981-16-0100-2_6
DOI: https://doi.org/10.1007/978-981-16-0100-2_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-0099-9
Online ISBN: 978-981-16-0100-2
eBook Packages: Computer Science, Computer Science (R0)