Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents

  • Hichem Frigui
  • Olfa Nasraoui

Abstract

In this chapter, we propose a new approach to unsupervised text document categorization based on a coupled process of clustering and cluster-dependent keyword weighting. The proposed algorithm is based on the K-Means clustering algorithm, so it is simple both computationally and to implement. Moreover, it learns a different set of keyword weights for each cluster. As a by-product of the clustering process, each document cluster is therefore characterized by a possibly different set of keywords. The cluster-dependent keyword weights have two advantages: they help partition the document collection into more meaningful categories, and they can be used to automatically generate a compact description of each cluster in terms of not only the attribute values, but also their relevance. In particular, for text data, this approach can be used to annotate the documents automatically. We also extend the proposed approach to handle the inherent fuzziness of text documents by automatically generating fuzzy, or soft, labels instead of hard, all-or-nothing categorizations, so that a text document can belong to several categories with different degrees of membership. The proposed approach also handles noise documents by automatically designating one or two "noise magnet" clusters that draw most outliers away from the other clusters. The performance of the proposed algorithm is illustrated by clustering real text document collections.
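To make the idea of coupling clustering with cluster-dependent keyword weights concrete, the following is a minimal sketch and not the chapter's actual algorithm. It assumes a weighted squared Euclidean distance, a farthest-first choice of initial centers, and an exponential weighting rule in which terms with low within-cluster dispersion receive higher weight. The function names and the smoothing parameter `gamma` are hypothetical; the chapter's own dissimilarity measure and weight update equations are not given in this abstract and may differ.

```python
import math


def sq_dist(a, b, w=None):
    """Squared Euclidean distance, optionally weighted per term."""
    if w is None:
        w = [1.0] * len(a)
    return sum(wt * (x - y) ** 2 for wt, x, y in zip(w, a, b))


def weighted_kmeans(docs, k, gamma=1.0, n_iter=20):
    """K-Means-style loop that keeps one keyword-weight vector per cluster.

    docs  : list of equal-length term-frequency vectors
    k     : number of clusters
    gamma : assumed smoothing parameter; larger values give flatter weights
    """
    n_terms = len(docs[0])

    # Farthest-first initialization (an assumption, chosen for determinism).
    centers = [list(docs[0])]
    while len(centers) < k:
        far = max(docs, key=lambda d: min(sq_dist(d, c) for c in centers))
        centers.append(list(far))

    # Start with uniform keyword weights in every cluster.
    weights = [[1.0 / n_terms] * n_terms for _ in range(k)]
    labels = [0] * len(docs)

    for _ in range(n_iter):
        # Assignment: nearest center under that cluster's own weights.
        for j, d in enumerate(docs):
            dists = [sq_dist(d, centers[i], weights[i]) for i in range(k)]
            labels[j] = dists.index(min(dists))

        for i in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == i]
            if not members:
                continue  # keep previous center and weights if cluster is empty
            # Center update: per-term mean of the member documents.
            centers[i] = [sum(col) / len(members) for col in zip(*members)]
            # Weight update: low within-cluster dispersion means high relevance.
            disp = [
                sum((d[t] - centers[i][t]) ** 2 for d in members) / len(members)
                for t in range(n_terms)
            ]
            lo = min(disp)
            scores = [math.exp(-(v - lo) / gamma) for v in disp]
            total = sum(scores)
            weights[i] = [s / total for s in scores]

    return labels, centers, weights


# Toy corpus: terms 0 and 1 separate the two groups, terms 2 and 3 are noisy.
docs = [
    [5, 0, 1, 0], [5, 0, 0, 1], [5, 0, 1, 1],
    [0, 5, 1, 0], [0, 5, 0, 1], [0, 5, 0, 0],
]
labels, centers, weights = weighted_kmeans(docs, k=2)
print(labels)
print([round(w, 3) for w in weights[0]])
```

In this sketch each cluster's weights sum to one, so they can be read as a relevance profile over the vocabulary. Listing the highest-weighted terms of a cluster gives the kind of compact cluster description the abstract refers to.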

Keywords

Text Document; Document Cluster; Fuzzy Partition; Inverse Document Frequency; Latent Semantic Indexing
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Science+Business Media New York 2004
