Efficiently Clustering Documents with Committees

Pantel, Patrick; Lin, Dekang

doi:10.1007/3-540-45683-X_46

Patrick Pantel³ &
Dekang Lin³

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 2417))

Included in the following conference series:

Pacific Rim International Conference on Artificial Intelligence

863 Accesses
6 Citations

Abstract

The general goal of clustering is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher quality clusters in document clustering tasks as compared to several well known clustering algorithms. It initially discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is but a subset of all elements. The algorithm proceeds by assigning elements to their most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Buckley, C. and Lewit, A. F. 1985. Optimization of inverted vector searches. In Proceedings of SIGIR-85. pp. 97–110.
Google Scholar
Church, K. and Hanks, P. 1989. Word association norms, mutual information, and lexicography. In Proceedings ofACL-89. pp. 76–83. Vancouver, Canada.
Google Scholar
Cutting, D. R.; Karger, D.; Pedersen, J.; and Tukey, J. W. 1992. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of SIGIR-92. pp.318–329. Copenhagen, Denmark.
Google Scholar
Guha, S.; Rastogi, R.; and Kyuseok, S. 1999. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings ofICDE’99. pp. 512–521. Sydney, Australia.
Google Scholar
Hearst, M. A. and Pedersen, J. O. 1996. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of SIGIR-96. pp. 76–84. Zurich, Switzerland.
Google Scholar
Jain, A.K.; Murty, M.N.; and Flynn, P.J. 1999. Data Clustering: A Review. ACM Computing Surveys 31(3):264–323.
Article Google Scholar
Jardine, N. and van Rijsbergen, C. J. 1971. The use of hierarchical clustering in information retrieval. Information Storage and Retreival, 7:217–240.
Article Google Scholar
Karypis, G.; Han, E.-H.; and Kumar, V. 1999. Chameleon: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer: Special Issue on Data Analysis and Mining 32(8): 68–75.
Google Scholar
Kaufmann, L. and Rousseeuw, P. J. 1987. Clustering by means of medoids. In Dodge, Y. (Ed.) Statistical Data Analysis based on the L1 Norm. pp. 405–416. Elsevier/North Holland, Amsterdam.
Google Scholar
Koller, D. and Sahami, M. 1997. Hierarchically classifying documents using very few words. In Proceedings of ICML-97. pp. 170–176. Nashville, TN.
Google Scholar
McQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of 5 ^th Berkeley Symposium on Mathematics, Statistics and Probability, 1:281–298.
Google Scholar
Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw Hill.
Google Scholar
Steinbach, M.; Karypis, G.; and Kumar, V. 2000. A comparison of document clustering techniques. Technical Report #00-034. Department of Computer Science and Engineering, University of Minnesota.
Google Scholar
van Rijsbergen, C. J. 1979. Information Retrieval, second edition. London: Buttersworth. Available at: http://www.dcs.gla.ac.uk/Keith/Preface.html
Google Scholar
Wagstaff, K. and Cardie, C. 2000. Clustering with instance-level constraints. In Proceedings of ICML-2000. pp. 1103–1110. Palo Alto, CA.
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computing Science, University of Alberta, Edmonton, Alberta, T6H 2E1, Canada
Patrick Pantel & Dekang Lin

Authors

Patrick Pantel
View author publications
You can also search for this author in PubMed Google Scholar
Dekang Lin
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

School of Information Science and Technology Department of Information and Communication Engineering, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan
Mitsuru Ishizuka
School of Information Technology Knowledge Representation and Reasoning Unit (KRRU) Faculty of Engineering and Information Technology, Griffith University, PMB 50 Gold Coast Mail Centre, Queensland, 9726, Australia
Abdul Sattar

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Pantel, P., Lin, D. (2002). Efficiently Clustering Documents with Committees. In: Ishizuka, M., Sattar, A. (eds) PRICAI 2002: Trends in Artificial Intelligence. PRICAI 2002. Lecture Notes in Computer Science(), vol 2417. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45683-X_46

Download citation

DOI: https://doi.org/10.1007/3-540-45683-X_46
Published: 21 August 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44038-3
Online ISBN: 978-3-540-45683-4
eBook Packages: Springer Book Archive

Publish with us

Policies and ethics