Encyclopedia of Database Systems

2018 Edition | Editors: Ling Liu, M. Tamer Özsu

Document Clustering

  • Ying Zhao
  • George Karypis
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_1479


Synonyms

High-dimensional clustering; Text clustering; Unsupervised learning on document datasets

Definition

At a high level, the problem of document clustering is defined as follows. Given a set S of n documents, we would like to partition them into a predetermined number k of subsets S1, S2, …, Sk, such that the documents assigned to each subset are more similar to each other than to the documents assigned to different subsets. Document clustering is an essential part of text mining and has many applications in information retrieval and knowledge management. It faces two main challenges: the dimensionality of the feature space tends to be high (a document collection often contains thousands or tens of thousands of unique words), and the size of a document collection tends to be large.
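The partitioning problem stated above can be illustrated with a small sketch. The preview text does not name a specific representation, similarity measure, or algorithm, so everything here is an assumption made for illustration: documents are turned into unit-length term-frequency vectors, similarity is the cosine between vectors, and the partition is refined with a k-means style loop (often called spherical k-means). The function names, the `seeds` parameter, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Sparse unit-length term-frequency vector {word: weight}.

    Assumes the document contains at least one word.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}


def cosine(u, v):
    """Dot product of two sparse vectors (the cosine when both have unit length)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(x * v.get(w, 0.0) for w, x in u.items())


def centroid(vectors):
    """Sum of the member vectors, rescaled to unit length."""
    total = Counter()
    for vec in vectors:
        for w, x in vec.items():
            total[w] += x
    norm = math.sqrt(sum(x * x for x in total.values()))
    return {w: x / norm for w, x in total.items()}


def cluster_documents(docs, k, seeds, max_iter=20):
    """Partition docs into k subsets; returns one cluster label per document.

    seeds lists the indices of the k documents used as initial centroids
    (an illustrative, deterministic initialization).
    """
    vectors = [tf_vector(d) for d in docs]
    centroids = [vectors[i] for i in seeds]
    labels = None
    for _ in range(max_iter):
        # Assign each document to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each centroid from its current members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = centroid(members)
    return labels


docs = [
    "database query index storage",
    "query optimizer database index",
    "soccer goal match league",
    "league match soccer player",
]
print(cluster_documents(docs, k=2, seeds=[0, 2]))  # -> [0, 0, 1, 1]
```

The vectors are stored sparsely, as dictionaries keyed by word, because each document uses only a small fraction of the vocabulary; this is one common way to cope with the high dimensionality mentioned above.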

Historical Background

Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms as well as in facilitating...


Recommended Reading

  1. Boley D. Principal direction divisive partitioning. Data Mining Knowl Discov. 1998;2(4):325–44.
  2. Cutting DR, Pedersen JO, Karger DR, Tukey JW. Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1992. p. 318–29.
  3. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc. 1977;39(1):1–38.
  4. Dhillon IS. Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2001. p. 269–74.
  5. Ding C, He X, Zha H, Gu M, Simon H. Spectral min-max cut for graph partitioning and data clustering. Technical Report TR-2001-XX, Lawrence Berkeley National Laboratory, University of California, Berkeley, 2001.
  6. Duda RO, Hart PE, Stork DG. Pattern classification. New York: Wiley; 2001.
  7. Fisher D. Iterative optimization and simplification of hierarchical clusterings. J Artif Intell Res. 1996;4(1):147–80.
  8. Jain AK, Dubes RC. Algorithms for clustering data. New York: Prentice Hall; 1988.
  9. Karypis G. Cluto: a clustering toolkit. Technical Report 02-017, Department of Computer Science, University of Minnesota, 2002.
  10. King B. Step-wise clustering procedures. J Am Stat Assoc. 1967;62(317):86–101.
  11. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Symposium on Mathematical Statistics and Probability; 1967. p. 281–97.
  12. Salton G. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Reading: Addison-Wesley; 1989.
  13. Sneath PH, Sokal RR. Numerical taxonomy. London: Freeman; 1973.
  14. Zahn CT. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans Comput. 1971;C-20(1):68–86.
  15. Zha H, He X, Ding C, Simon H, Gu M. Bipartite graph partitioning and data clustering. In: Proceedings of the International Conference on Information and Knowledge Management; 2001.
  16. Zhao Y, Karypis G. Criterion functions for document clustering: experiments and analysis. Mach Learn. 2004;55:311–31.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Ying Zhao
    • 1
  • George Karypis
    • 2
  1. Tsinghua University, Beijing, China
  2. University of Minnesota, Minneapolis, USA

Section editors and affiliations

  • Dimitrios Gunopulos
    • 1
  1. Department of Computer Science and Engineering, The University of California at Riverside, Bourns College of Engineering, Riverside, USA