Efficiently Clustering Documents with Committees

  • Conference paper
PRICAI 2002: Trends in Artificial Intelligence (PRICAI 2002)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2417)

Abstract

The general goal of clustering is to group data elements such that intra-group similarities are high and inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher-quality clusters in document clustering tasks than several well-known clustering algorithms. It first discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is only a subset of all elements. The algorithm then proceeds by assigning each element to its most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.
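
The final assignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure or how a committee is represented, so the sketch assumes cosine similarity over plain feature vectors and represents each committee by the centroid of its members. The function names (`cosine`, `centroid`, `assign_to_committees`) and the toy data are hypothetical, and the committees are taken as given rather than discovered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assign_to_committees(elements, committees):
    """Assign every element to the committee whose centroid is most similar.

    elements:   dict mapping element name -> feature vector
    committees: dict mapping committee label -> list of member element names
    """
    centroids = {
        label: centroid([elements[name] for name in members])
        for label, members in committees.items()
    }
    return {
        name: max(centroids, key=lambda label: cosine(vector, centroids[label]))
        for name, vector in elements.items()
    }


# Toy data (hypothetical): two tight, well-separated committees and two
# elements that belong to no committee.
elements = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.1],
    "d3": [0.0, 1.0],
    "d4": [0.1, 0.9],
    "d5": [0.8, 0.3],
    "d6": [0.2, 0.7],
}
committees = {"A": ["d1", "d2"], "B": ["d3", "d4"]}

assignment = assign_to_committees(elements, committees)
print(assignment)
```

In this toy run the committees cover only four of the six elements, mirroring the statement that the union of the committees is a subset of all elements. The remaining elements `d5` and `d6` are still placed by comparing them with the committee centroids.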

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pantel, P., Lin, D. (2002). Efficiently Clustering Documents with Committees. In: Ishizuka, M., Sattar, A. (eds) PRICAI 2002: Trends in Artificial Intelligence. PRICAI 2002. Lecture Notes in Computer Science, vol 2417. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45683-X_46

  • DOI: https://doi.org/10.1007/3-540-45683-X_46

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44038-3

  • Online ISBN: 978-3-540-45683-4

  • eBook Packages: Springer Book Archive