Abstract
Document collections are frequently encountered without labels. Labels may be determined by clustering the documents into disparate groups and implicitly finding common themes among the document clusters. This chapter describes methods for clustering documents. A key theme for document clustering is computing measures of similarity. We review the major clustering methods: k-means clustering, hierarchical clustering, and the expectation-maximization (EM) algorithm. Strategies for assigning meaning to algorithmically generated clusters and labels are considered. Performance evaluation helps determine the empirical characteristics of desirable clusters.
Copyright information
© 2015 Springer-Verlag London
About this chapter
Cite this chapter
Weiss, S.M., Indurkhya, N., Zhang, T. (2015). Finding Structure in a Document Collection. In: Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer, London. https://doi.org/10.1007/978-1-4471-6750-1_5
DOI: https://doi.org/10.1007/978-1-4471-6750-1_5
Publisher Name: Springer, London
Print ISBN: 978-1-4471-6749-5
Online ISBN: 978-1-4471-6750-1
eBook Packages: Computer Science, Computer Science (R0)