Towards Language Independent Automated Learning of Text Categorization Models
We describe the results of extensive machine learning experiments on large collections of Reuters’ English and German newswires. The goal of these experiments was to automatically discover classification patterns that can be used to assign topics to individual newswires. Our results with the English newswire collection show a very large gain in performance compared with published benchmarks, while our initial results with the German newswires appear very promising. We present our methodology, which seems to be insensitive to the language of the document collections, and discuss issues related to the differences in the results obtained for the two collections.
Keywords: News Story, Rule Induction, Training Case, Breakeven Point, Document Classification