SIGIR ’94 pp 23-30

Towards Language Independent Automated Learning of Text Categorization Models

  • Chidanand Apté
  • Fred Damerau
  • Sholom M. Weiss
Conference paper

Abstract

We describe the results of extensive machine learning experiments on large collections of Reuters’ English and German newswires. The goal of these experiments was to automatically discover classification patterns that can be used for assignment of topics to the individual newswires. Our results with the English newswire collection show a very large gain in performance as compared to published benchmarks, while our initial results with the German newswires appear very promising. We present our methodology, which seems to be insensitive to the language of the document collections, and discuss issues related to the differences in results that we have obtained for the two collections.

Copyright information

© Springer-Verlag London Limited 1994

Authors and Affiliations

  • Chidanand Apté (1)
  • Fred Damerau (1)
  • Sholom M. Weiss (2)
  1. IBM Research Division, T.J. Watson Research Center, Yorktown Heights, USA
  2. Dept. of Computer Science, Rutgers University, New Brunswick, USA
