Abstract
Document collections are frequently encountered without labels. Labels may be determined by clustering the documents into disparate groups and implicitly finding common themes among the document clusters. This chapter describes methods for clustering documents. A key theme for document clustering is computing measures of similarity. We review the major clustering methods: k-means clustering, hierarchical clustering and the EM algorithm. Strategies for assigning meaning to algorithmically generated clusters and labels are considered. Performance evaluation helps determine the empirical characteristics of desirable clusters.
Copyright information
© 2010 Springer-Verlag London Limited
About this chapter
Cite this chapter
Weiss, S.M., Indurkhya, N., Zhang, T. (2010). Finding Structure in a Document Collection. In: Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer, London. https://doi.org/10.1007/978-1-84996-226-1_5
DOI: https://doi.org/10.1007/978-1-84996-226-1_5
Publisher Name: Springer, London
Print ISBN: 978-1-84996-225-4
Online ISBN: 978-1-84996-226-1
eBook Packages: Computer Science, Computer Science (R0)