Improving Rocchio with Weakly Supervised Clustering

  • Romain Vinot
  • François Yvon
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2837)


This paper presents a novel approach for adapting the complexity of a text categorization system to the difficulty of the task. We adapt a simple text classifier (Rocchio) using weakly supervised clustering techniques. The idea is to identify sub-topics of the original classes that can help improve the categorization process. To this end, we propose several clustering algorithms and report the results of evaluations on standard benchmark corpora such as the Newsgroups corpus.
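To make the sub-topic idea concrete, the following is a minimal sketch and not the algorithms proposed in the paper, which the abstract does not specify. It assumes a centroid-only simplification of Rocchio (one normalized prototype per class, no negative-example term) and swaps in a small spherical k-means, run separately inside each labelled class, as a stand-in for weakly supervised clustering. All function names, the parameter `k`, and the toy documents are hypothetical.

```python
import math
from collections import defaultdict

def normalize(vec):
    """Scale a sparse term-weight dict to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def cosine(a, b):
    """Dot product of two sparse vectors (cosine if both are unit length)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def centroid(vecs):
    """Normalized sum of a group of vectors: a class or sub-topic prototype."""
    total = defaultdict(float)
    for v in vecs:
        for t, w in v.items():
            total[t] += w
    return normalize(total)

def split_into_subtopics(vecs, k, iterations=10):
    """Assumed stand-in clustering: spherical k-means, farthest-first seeds."""
    k = min(k, len(vecs))
    seeds = [vecs[0]]
    while len(seeds) < k:
        # Next seed: the document least similar to its closest current seed.
        seeds.append(min(vecs, key=lambda v: max(cosine(v, s) for s in seeds)))
    centroids = seeds
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in vecs:
            best = max(range(len(centroids)),
                       key=lambda i: cosine(v, centroids[i]))
            groups[best].append(v)
        # Empty groups are dropped, so a class may end up with fewer than k.
        centroids = [centroid(g) for g in groups if g]
    return centroids

def train(docs, labels, k=1):
    """Return (label, prototype) pairs; k=1 gives one centroid per class."""
    by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_class[label].append(normalize(doc))
    return [(label, c)
            for label, vecs in by_class.items()
            for c in split_into_subtopics(vecs, k)]

def classify(model, doc):
    """Assign the class of the most similar prototype."""
    query = normalize(doc)
    return max(model, key=lambda pair: cosine(query, pair[1]))[0]

# Toy data (hypothetical): "sport" mixes two sub-topics, football and tennis.
docs = [
    {"football": 1, "goal": 1}, {"football": 1, "goal": 1},
    {"tennis": 1, "serve": 1}, {"tennis": 1, "serve": 1},
    {"goal": 1, "budget": 1}, {"goal": 1, "budget": 1},
]
labels = ["sport", "sport", "sport", "sport", "business", "business"]
query = {"goal": 3, "football": 1}

print(classify(train(docs, labels, k=1), query))  # business
print(classify(train(docs, labels, k=2), query))  # sport
```

On this toy data the single "sport" centroid is diluted by the tennis documents, so the football-like query is drawn to "business"; with two prototypes per class the football sub-topic wins. The example only illustrates why sub-topic prototypes can help and says nothing about the results reported in the paper.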


Keywords (machine-generated, not supplied by the authors): Hide Layer; Cluster Algorithm; Information Retrieval; Text Categorization; Unsupervised Cluster



Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Romain Vinot (1)
  • François Yvon (1)
  1. GET/ENST, Paris Cedex, France
