Encyclopedia of Database Systems

2018 Edition | Editors: Ling Liu, M. Tamer Özsu

Text Clustering

Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_415

Definition

Text clustering is the automatic grouping of textual documents (for example, plain-text documents, web pages, and emails) into clusters based on the similarity of their content. The problem can be defined as follows. Given a set of n documents, denoted DS, and a predefined number of clusters K (usually set by the user), DS is clustered into K document clusters DS1, DS2, …, DSK (i.e., DS1 ∪ DS2 ∪ … ∪ DSK = DS) so that documents in the same cluster are similar to one another, while documents from different clusters are dissimilar [14].
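The definition above can be made concrete with a small sketch. It is illustrative only and is not taken from the entry: the helper names (bag_of_words, cosine, is_partition, mean_similarities), the toy documents, the bag-of-words representation, and the use of cosine similarity are all assumptions. It also assumes a hard clustering, in which clusters are disjoint. Given a candidate grouping into K = 2 clusters, it checks that the clusters together cover DS and compares the average similarity within clusters to the average similarity across clusters.

```python
import math
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Map lowercased whitespace tokens to their counts (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_partition(ds, clusters):
    """True if the clusters are disjoint (an assumption) and together cover ds."""
    members = [d for c in clusters for d in c]
    return len(members) == len(set(members)) and set(members) == set(ds)


def mean_similarities(vectors, clusters):
    """Average pairwise similarity within clusters and across clusters."""
    label = {d: k for k, c in enumerate(clusters) for d in c}
    intra, inter = [], []
    for i, j in combinations(sorted(label), 2):
        bucket = intra if label[i] == label[j] else inter
        bucket.append(cosine(vectors[i], vectors[j]))

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(intra), mean(inter)


# Hypothetical toy corpus: documents are identified by their index in DS.
docs = [
    "stock market prices fall",
    "market prices rise on stock news",
    "team wins football match",
    "football team loses match",
]
DS = list(range(len(docs)))
clusters = [{0, 1}, {2, 3}]  # a candidate clustering with K = 2
vectors = [bag_of_words(d) for d in docs]
intra, inter = mean_similarities(vectors, clusters)

print(is_partition(DS, clusters))
print(intra > inter)
```

A grouping that satisfies the definition well should cover every document and show a clearly higher within-cluster average than across-cluster average, as this toy grouping does.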

Historical Background

Text clustering was initially developed to improve the performance of search engines by pre-clustering the entire corpus [2]. It was later also investigated as a post-retrieval document browsing technique [1, 2, 7].

Foundations

Text clustering consists of several important components, including document representation, text clustering algorithms, and performance measurements...
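As a rough illustration of how the three components named above fit together, the following sketch pairs one possible choice for each. None of these choices is stated in the entry: TF-IDF as the document representation, a cosine-similarity k-means variant as the clustering algorithm, and purity as the performance measurement are all assumptions, as are the function names, the toy corpus, the ground-truth labels, and the fixed seed documents (used so that the run is deterministic).

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse vector to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """Document representation: unit-length TF-IDF vectors with smoothed IDF."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        vectors.append(normalize(vec))
    return vectors


def dot(a, b):
    """Dot product of two sparse vectors (cosine similarity for unit vectors)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, seeds, iterations=10):
    """Clustering algorithm: k-means with cosine similarity.

    seeds lists the indices of the documents used as initial centroids.
    """
    k = len(seeds)
    centroids = [dict(vectors[s]) for s in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda c: dot(v, centroids[c])) for v in vectors]
        # Recompute each centroid as the normalized sum of its members.
        for c in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    total.update(v)
            if total:
                centroids[c] = normalize(total)
    return labels


def purity(labels, truth):
    """Performance measurement: share of documents in their cluster's majority class."""
    matched = 0
    for c in set(labels):
        classes = Counter(t for label, t in zip(labels, truth) if label == c)
        matched += classes.most_common(1)[0][1]
    return matched / len(labels)


# Hypothetical toy corpus with hand-assigned ground-truth classes.
docs = [
    "stock market prices fall",
    "market prices rise on stock news",
    "team wins football match",
    "football team loses match",
]
truth = ["finance", "finance", "sports", "sports"]

labels = kmeans(tfidf_vectors(docs), seeds=[0, 2])
print(labels, purity(labels, truth))
```

Each stage can be swapped independently, for example a different term weighting, a hierarchical algorithm, or another evaluation measure, without changing the other two.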


Recommended Reading

  1. Croft WB. Organizing and searching large files of documents. Ph.D. thesis, University of Cambridge; 1978.
  2. Cutting DR, Karger DR, Pedersen JO, Tukey JW. Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1992. p. 318–29.
  3. Day WH, Edelsbrunner H. Efficient algorithms for agglomerative hierarchical clustering methods. J Classif. 1984;1(2):1–24.
  4. Dhillon IS. Co-clustering documents and words using bipartite spectral graph partitioning, UT CS Technical report #TR. Department of Computer Sciences, University of Texas, Austin; 2001.
  5. Dumais S, Platt J, Heckerman D, Sahami M. Inductive learning algorithms and representations for text categorization. In: Proceedings of the 7th International Conference on Information and Knowledge Management; 1998. p. 148–55.
  6. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv. 1999;31(3):264–323.
  7. Leouski AV, Croft WB. An evaluation of techniques for clustering search results. Technical report IR-76. Department of Computer Science, University of Massachusetts, Amherst; 1996.
  8. Lewis DD. Representation quality in text classification: an introduction and experiment. In: Proceedings of the Workshop on Speech and Natural Language; 1990. p. 288–295.
  9. MacQueen JB. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability; 1967. p. 281–97.
  10. Nagy G. State of the art in pattern recognition. Proc IEEE. 1968;56(5):836–62.
  11. Sebastiani F. Machine learning in automated text categorization. ACM Comput Surv. 2002;34(1):1–47.
  12. Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. Technical report, University of Minnesota – Computer Science and Engineering; 2000.
  13. van Rijsbergen CJ. Information retrieval. 2nd ed. London: Butterworths; 1979.
  14. Yoo I, Hu XH. A comprehensive comparison study of document clustering for a biomedical digital library Medline. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries; 2006. p. 220–9.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Microsoft Research Asia, Beijing, China

Section editors and affiliations

  • Zheng Chen (1)
  1. Microsoft Research Asia, Microsoft Corporation, Beijing, China