Finding Structure in a Document Collection

Weiss, Sholom M.; Indurkhya, Nitin; Zhang, Tong

doi:10.1007/978-1-84996-226-1_5

Part of the book series: Texts in Computer Science (TCS)

Abstract

Document collections are frequently encountered without labels. Labels may be determined by clustering the documents into disparate groups and implicitly finding common themes among the document clusters. This chapter describes methods for clustering documents. A key theme for document clustering is computing measures of similarity. We review the major clustering methods: k-means clustering, hierarchical clustering and the EM algorithm. Strategies for assigning meaning to algorithmically generated clusters and labels are considered. Performance evaluation helps determine the empirical characteristics of desirable clusters.
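
As a rough illustration of the two ideas named above, a similarity measure and k-means clustering, the following minimal sketch clusters four toy documents using cosine similarity over raw term counts. It is not taken from the chapter: the helper names (`term_counts`, `cosine`, `mean_vector`, `kmeans`), the whitespace tokenizer, the hand-picked seed documents, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_counts(text):
    """Bag-of-words vector: lowercase, whitespace-tokenized term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dict-like)."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Centroid of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # adds counts term by term
    return {term: count / len(vectors) for term, count in total.items()}


def kmeans(vectors, seeds, max_iter=20):
    """k-means with cosine similarity; `seeds` are indices of the initial centroids."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the procedure has converged
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = mean_vector(members)
    return labels


# Toy corpus (hypothetical): two finance documents and two sports documents.
docs = [
    "stock market shares fall",
    "shares rise as stock market rallies",
    "team wins football match",
    "football team loses final match",
]
vectors = [term_counts(d) for d in docs]
labels = kmeans(vectors, seeds=[0, 2])
print(labels)
```

With these seeds the two finance documents fall into one cluster and the two sports documents into the other. In practice the initial centroids are usually chosen at random and the term counts are weighted (for example by tf-idf), so results depend on both choices.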

Author information

Authors and Affiliations

Sholom M. Weiss: T.J. Watson Research Center, IBM Corporation, Kitchawan Road 1101, Yorktown Heights, 10598, NY, USA
Nitin Indurkhya: School of Computer Science & Engg., University of New South Wales, Sydney, 2052, NSW, Australia
Tong Zhang: Dept. Statistics, Hill Center, Rutgers University, Piscataway, 08854-8019, NJ, USA

Corresponding author

Correspondence to Sholom M. Weiss.

About this chapter

Cite this chapter

Weiss, S.M., Indurkhya, N., Zhang, T. (2010). Finding Structure in a Document Collection. In: Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer, London. https://doi.org/10.1007/978-1-84996-226-1_5

DOI: https://doi.org/10.1007/978-1-84996-226-1_5
Publisher Name: Springer, London
Print ISBN: 978-1-84996-225-4
Online ISBN: 978-1-84996-226-1
eBook Packages: Computer Science, Computer Science (R0)
