Text Clustering

Zong, Chengqing; Xia, Rui; Zhang, Jiajun

doi:10.1007/978-981-16-0100-2_6

Abstract

As the saying goes, birds of a feather flock together. Clustering analysis in data mining is the process of dividing data points into subsets based on their intrinsic rules and distribution characteristics, so that data points in the same subset are similar to each other while those in different subsets are distinct. Each subset of data points is called a cluster. As an unsupervised machine learning method, clustering differs from classification in several ways. First, clustering requires no labeled data for supervision; it depends only on a similarity computation between data points, which makes it highly flexible. Second, in a classification problem the categories must be predefined, while in clustering the number of categories is unknown in advance: the clustering system determines the number of categories and the data points contained in each category according to certain criteria. Clustering is a fundamental problem in machine learning and has been widely used in natural language processing and text data mining.
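The two points above (reliance on a similarity computation, and a number of clusters that is not fixed in advance) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the chapter: the whitespace bag-of-words representation, cosine similarity as the similarity measure, the single-pass assignment strategy, the `threshold` value, and the toy `docs`.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as lowercase term counts (a deliberately crude representation)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass_cluster(texts, threshold=0.3):
    """Assign each text to the most similar existing cluster, or open a new one.

    The number of clusters is not given in advance; it follows from the
    similarity threshold (an illustrative value, not one from the chapter).
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for index, text in enumerate(texts):
        vector = bag_of_words(text)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine_similarity(vector, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vector), "members": [index]})
        else:
            # Summed counts point in the same direction as the mean vector,
            # so they serve as the centroid for cosine similarity.
            best["centroid"].update(vector)
            best["members"].append(index)
    return [cluster["members"] for cluster in clusters]


# Hypothetical toy documents.
docs = [
    "stock markets fell sharply today",
    "stock markets rallied today",
    "the team won the football match",
    "the football team lost the match",
]
clusters = single_pass_cluster(docs)
print(clusters)
```

No labels and no cluster count are supplied here; the grouping emerges from pairwise similarity and the chosen criterion alone. A different threshold would generally yield a different number of clusters.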

In text clustering, the text data must first be represented in a machine-computable form, so text representation is the basis of text clustering. Text representation methods are described in detail in Chap. 3; this chapter focuses on clustering algorithms. Typical clustering algorithms include partition-based, hierarchy-based, and density-based methods.
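As one concrete instance of a partition-based method, the sketch below runs k-means over term-count vectors. The abstract does not name specific algorithms, so the choice of k-means, the whitespace tokenization, the squared Euclidean distance, the hand-picked initial centroids, and the toy `docs` are all assumptions made for illustration. Note that k-means takes the number of clusters as an input (here implied by the number of initial centroids).

```python
def vectorize(texts):
    """Turn texts into dense term-count vectors over a shared vocabulary."""
    vocab = sorted({term for text in texts for term in text.lower().split()})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for text in texts:
        vector = [0.0] * len(vocab)
        for term in text.lower().split():
            vector[index[term]] += 1.0
        vectors.append(vector)
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until the labels stop changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy documents; documents 0 and 2 seed the two clusters.
docs = [
    "stock markets fell sharply today",
    "stock markets rallied today",
    "the team won the football match",
    "the football team lost the match",
]
vectors = vectorize(docs)
labels = k_means(vectors, [vectors[0], vectors[2]])
print(labels)
```

The initial centroids are fixed by hand to keep the run deterministic; in practice they are usually chosen randomly or by a seeding heuristic, and the result can depend on that choice.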

Notes

1. For simplicity, we use “document” to refer to a piece of text at any level (e.g., a sentence or a full document).

Author information

Authors and Affiliations

Institute of Automation, Chinese Academy of Sciences, Beijing, Beijing, China
Chengqing Zong & Jiajun Zhang
School of Computer Science & Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
Rui Xia

About this chapter

Cite this chapter

Zong, C., Xia, R., Zhang, J. (2021). Text Clustering. In: Text Data Mining. Springer, Singapore. https://doi.org/10.1007/978-981-16-0100-2_6

DOI: https://doi.org/10.1007/978-981-16-0100-2_6
Published: 21 January 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-0099-9
Online ISBN: 978-981-16-0100-2
eBook Packages: Computer Science (R0)