Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach

  • Levent Ertöz
  • Michael Steinbach
  • Vipin Kumar
Part of the Network Theory and Applications book series (NETA, volume 11)

Abstract

Given a set of documents, clustering is often used to group the documents, in the hope that each group will represent documents with a common theme or topic. Initially, hierarchical clustering was used to cluster documents [5]. This approach has the advantage of producing a set of nested document clusters, which can be interpreted as a topic hierarchy or tree, from general to more specific topics. In practice, while the clusters at different levels of the hierarchy sometimes represent documents with consistent topics, it is common for many clusters to be a mixture of topics, even at lower, more refined levels of the hierarchy. More recently, as document collections have grown larger, K-means clustering has emerged as a more efficient approach for producing clusters of documents [4, 9, 16]. K-means clustering produces a set of un-nested clusters, and the top (most frequent or highest “weight”) terms of the cluster are used to characterize the topic of the cluster. Once again, it is not unusual for some clusters to be mixtures of topics.
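The K-means step described in the abstract, in which clusters are characterized by their top terms, can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the toy documents, the fixed seed documents, the plain term-frequency weighting, and the helper names (tf_vectors, kmeans, top_terms) are all hypothetical choices made for the example.

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Represent each document as a sparse term-frequency vector (a dict)."""
    return [dict(Counter(text.lower().split())) for text in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(members):
    """Average the term weights of all vectors assigned to one cluster."""
    total = Counter()
    for vec in members:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(members) for term, weight in total.items()}

def kmeans(vectors, seeds, iterations=10):
    """K-means with cosine similarity; seeds are indices of starting documents."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = [None] * len(vectors)
    for _ in range(iterations):
        # Assign every document to its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(vec, centroids[k]))
            for vec in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so stop early
        labels = new_labels
        # Recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = centroid(members)
    return labels, centroids

def top_terms(cluster_centroid, n=3):
    """Highest-weight terms of a centroid, used to describe the cluster topic."""
    ranked = sorted(cluster_centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

# Hypothetical toy collection: two finance and two sports documents.
docs = [
    "stock market shares stock trading",
    "market shares investors trading stock",
    "football match goal football team",
    "team goal match coach football",
]
labels, centroids = kmeans(tf_vectors(docs), seeds=(0, 2))
for k, c in enumerate(centroids):
    print(k, top_terms(c))
```

On a collection this small and cleanly separated each cluster is a single topic. The abstract's point is that on real collections the top terms of a cluster often mix several topics.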

References

  [1] Charu C. Aggarwal, Stephen C. Gates, and Philip S. Yu, “On the merits of building categorization systems by supervised clustering,” Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 352–356, 1999.
  [2] Douglas R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey, “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections,” ACM SIGIR '92, pages 318–329, 1992.
  [3] Richard C. Dubes and Anil K. Jain, Algorithms for Clustering Data, Prentice Hall, 1988.
  [4] Inderjit S. Dhillon and Dharmendra S. Modha, “Concept Decompositions for Large Sparse Text Data using Clustering,” to appear in Machine Learning, 2000 (also appears as IBM Research Report RJ 10147 (95022), July 8, 1999).
  [5] A. El-Hamdouchi and P. Willett, “Comparison of Hierarchic Agglomerative Clustering Methods for Document Retrieval,” The Computer Journal, Vol. 32, No. 3, 1989.
  [6] K. C. Gowda and G. Krishna, “Agglomerative Clustering Using the Concept of Mutual Nearest Neighborhood,” Pattern Recognition, Vol. 10, pp. 105–112, 1978.
  [7] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim, “ROCK: A Robust Clustering Algorithm for Categorical Attributes,” Proceedings of the 15th International Conference on Data Engineering, 1999.
  [8] R. A. Jarvis and E. A. Patrick, “Clustering Using a Similarity Measure Based on Shared Nearest Neighbors,” IEEE Transactions on Computers, Vol. C-22, No. 11, November 1973.
  [9] George Karypis and Eui-Hong (Sam) Han, “Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval & Categorization,” CIKM 2000.
  [10] George Karypis, Eui-Hong Han, and Vipin Kumar, “CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling,” IEEE Computer, Vol. 32, No. 8, pp. 68–75, August 1999.
  [11] Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer Academic Publishers, 1997.
  [12] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, 1990.
  [13] Daphne Koller and Mehran Sahami, “Hierarchically classifying documents using very few words,” Proceedings of the 14th International Conference on Machine Learning (ML), Nashville, Tennessee, July 1997, pages 170–178.
  [14] Bjornar Larsen and Chinatsu Aone, “Fast and Effective Text Mining Using Linear-time Document Clustering,” KDD-99, San Diego, California, 1999.
  [15] C. J. van Rijsbergen, Information Retrieval, Butterworths, London, second edition, 1989.
  [16] Michael Steinbach, George Karypis, and Vipin Kumar, “A Comparison of Document Clustering Algorithms,” KDD-2000 Text Mining Workshop, 2000.
  [17] TREC: Text REtrieval Conference. http://trec.nist.gov
  [18] Oren Zamir, Oren Etzioni, Omid Madani, and Richard M. Karp, “Fast and Intuitive Clustering of Web Documents,” KDD '97, pages 287–290, 1997.
  [19] Ying Zhao and George Karypis, “Evaluation of Hierarchical Clustering Algorithms for Document Datasets,” CIKM 2002.

Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Levent Ertöz (1)
  • Michael Steinbach (1)
  • Vipin Kumar (1)
  1. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
