Learning-Free Text Categorization

  • Patrick Ruch
  • Robert Baud
  • Antoine Geissbühler
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2780)


In this paper, we report on the fusion of simple retrieval strategies with thesaural resources to perform large-scale text categorization tasks. Unlike most related systems, which rely on training data to infer text-to-concept relationships, our approach can be applied with any controlled vocabulary and does not use any training data. The first classification module uses a traditional vector-space retrieval engine, which has been fine-tuned for the task, while the second classifier is based on regular variations of the concept list. For evaluation purposes, the system uses a sample of MedLine and the Medical Subject Headings (MeSH) terminology as the collection of concepts. Preliminary results show that the performance of the hybrid system is significantly improved compared to each single system. For top returned concepts, the system reaches performance comparable to machine learning systems, while genericity and scalability issues are clearly in favor of the learning-free approach. We draw conclusions on the importance of hybrid strategies combining data-poor classifiers and knowledge-based terminological resources for general text mapping tasks.
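
To make the two-module design concrete, the following is a minimal Python sketch of a learning-free hybrid categorizer. It is an illustration under stated assumptions, not the system evaluated in the paper: the three-term vocabulary, the tf-idf weighting, the plural-tolerant patterns standing in for "regular variations of the concept list", and the linear fusion with a `weight` parameter are all hypothetical choices, since the abstract does not specify them.

```python
import math
import re
from collections import Counter

# Hypothetical toy vocabulary; the paper uses the MeSH terminology.
CONCEPTS = ["Cystic Fibrosis", "Lung Diseases", "Gene Expression"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vector_space_scores(text, concepts):
    """Cosine similarity between tf-idf vectors of the text and of each concept term."""
    docs = [tokenize(c) for c in concepts]
    n = len(docs)
    df = Counter(tok for d in docs for tok in set(d))
    idf = {t: math.log(1 + n / df[t]) for t in df}

    def vec(tokens):
        # Only tokens that occur in the vocabulary contribute to the vector.
        tf = Counter(t for t in tokens if t in idf)
        return {t: tf[t] * idf[t] for t in tf}

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    query = vec(tokenize(text))
    return {c: cosine(query, vec(d)) for c, d in zip(concepts, docs)}


def regex_scores(text, concepts):
    """Score 1.0 when a simple variation of the concept term occurs in the text.

    Assumed variations: optional plural suffix on each word, and whitespace
    or hyphens between words.
    """
    lowered = text.lower()
    scores = {}
    for c in concepts:
        words = []
        for w in tokenize(c):
            stem = w[:-1] if w.endswith("s") else w  # naive singular form
            words.append(re.escape(stem) + r"(?:e?s)?")
        pattern = r"\b" + r"[\s-]+".join(words) + r"\b"
        scores[c] = 1.0 if re.search(pattern, lowered) else 0.0
    return scores


def hybrid_rank(text, concepts, weight=0.5):
    """Rank concepts by a linear combination of the two module scores (assumed fusion)."""
    vs = vector_space_scores(text, concepts)
    rx = regex_scores(text, concepts)
    combined = {c: weight * vs[c] + (1 - weight) * rx[c] for c in concepts}
    return sorted(concepts, key=lambda c: combined[c], reverse=True)


if __name__ == "__main__":
    sample = "Expression of genes in cystic fibrosis patients with chronic lung disease."
    print(hybrid_rank(sample, CONCEPTS))
```

Neither module requires labeled documents: both operate directly on the concept list, which is what allows the approach to be applied to any controlled vocabulary. In this sketch the pattern matcher rewards near-literal mentions of a term, while the vector-space scores still rank concepts whose words appear scattered in the text.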


Keywords: Concept Mapping; Regular Expression; Text Categorization; MeSH Term; Retrieval Engine
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Patrick Ruch
    • 1
  • Robert Baud
    • 1
  • Antoine Geissbühler
    • 1
  1. Medical Informatics Division, University Hospital of Geneva, Geneva, Switzerland
