Finding Structure in a Document Collection

  • Chapter
Fundamentals of Predictive Text Mining

Part of the book series: Texts in Computer Science (TCS)

Abstract

Document collections are frequently encountered without labels. Labels may be determined by clustering the documents into disparate groups and implicitly finding common themes among the document clusters. This chapter describes methods for clustering documents. A key theme for document clustering is computing measures of similarity. We review the major clustering methods: k-means clustering, hierarchical clustering and the EM algorithm. Strategies for assigning meaning to algorithmically generated clusters and labels are considered. Performance evaluation helps determine the empirical characteristics of desirable clusters.
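As a rough illustration of the ideas named in the abstract, the sketch below clusters a handful of unlabeled documents with k-means, using cosine similarity over bag-of-words counts as the similarity measure. It is not taken from the chapter: the function names (`vectorize`, `cosine`, `mean_vector`, `kmeans`), the toy documents, and the choice of seeding the centroids from two hand-picked documents are all assumptions made for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a document into a sparse bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Centroid of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def kmeans(vectors, seeds, iterations=10):
    """Cluster vectors; `seeds` are indices of the initial centroid documents.

    Each pass assigns every document to its most similar centroid, then
    recomputes each centroid as the mean of its members. The loop stops
    early once the assignment no longer changes.
    """
    centroids = [Counter(vectors[i]) for i in seeds]
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = mean_vector(members)
    return assignment


# Hypothetical toy collection: two finance documents, two sports documents.
docs = [
    "stock market shares fall",
    "market shares rally stock",
    "team wins football match",
    "football team loses match",
]
vectors = [vectorize(d) for d in docs]
labels = kmeans(vectors, seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

The cluster numbers carry no meaning by themselves; assigning a label to each group (for example, from the highest-weighted terms in its centroid) is a separate step.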

Author information

Corresponding author

Correspondence to Sholom M. Weiss.

Copyright information

© 2010 Springer-Verlag London Limited

About this chapter

Cite this chapter

Weiss, S.M., Indurkhya, N., Zhang, T. (2010). Finding Structure in a Document Collection. In: Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer, London. https://doi.org/10.1007/978-1-84996-226-1_5

  • DOI: https://doi.org/10.1007/978-1-84996-226-1_5

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84996-225-4

  • Online ISBN: 978-1-84996-226-1

  • eBook Packages: Computer Science, Computer Science (R0)
