The Role of Word Sense Disambiguation in Automated Text Categorization

  • José María Gómez Hidalgo
  • Manuel de Buenaga Rodríguez
  • José Carlos Cortizo Pérez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3513)

Abstract

Automated Text Categorization has reached the levels of accuracy of human experts. Provided that enough training data is available, it is possible to learn accurate automatic classifiers by using Information Retrieval and Machine Learning techniques. However, the performance of this approach suffers from problems caused by language variation (especially polysemy and synonymy). We investigate how Word Sense Disambiguation can be used to alleviate these problems by using two traditional methods for thesaurus usage in Information Retrieval, namely Query Expansion and Concept Indexing. These methods are evaluated on the problem of using the lexical database WordNet for text categorization, focusing on the Word Sense Disambiguation step involved. Our experiments demonstrate that rather simple dictionary methods and baseline statistical approaches can be used to disambiguate words and improve text representation and learning in both the Query Expansion and Concept Indexing approaches.
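
As a rough illustration of the two thesaurus-based methods named in the abstract, the sketch below disambiguates each word with a simplified gloss-overlap heuristic (a dictionary method) and falls back to the first-listed sense as a baseline. It is not the authors' implementation. The sense inventory `TOY_SENSES`, its identifiers, glosses and synonyms are hypothetical toy data rather than real WordNet content, and the function names `disambiguate`, `concept_index` and `expand` are assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy sense inventory (NOT real WordNet data).
# Senses are listed most frequent first, so index 0 is the baseline sense.
TOY_SENSES = {
    "bank": [
        {"id": "bank.n.01",
         "gloss": {"financial", "institution", "money", "deposit"},
         "synonyms": ["depository"]},
        {"id": "bank.n.02",
         "gloss": {"sloping", "land", "river", "water"},
         "synonyms": ["riverside"]},
    ],
}


def disambiguate(word, context):
    """Pick the sense whose gloss shares the most words with the context.

    Returns None for words missing from the inventory. Ties (including
    zero overlap) resolve to the first-listed sense, because max() returns
    the first maximal element.
    """
    senses = TOY_SENSES.get(word)
    if not senses:
        return None
    context_words = set(context) - {word}
    return max(senses, key=lambda s: len(s["gloss"] & context_words))


def concept_index(tokens):
    """Concept Indexing: count sense identifiers instead of surface words."""
    features = Counter()
    for tok in tokens:
        sense = disambiguate(tok, tokens)
        features[sense["id"] if sense else tok] += 1
    return features


def expand(tokens):
    """Query Expansion: append synonyms of the chosen sense to the terms."""
    expanded = list(tokens)
    for tok in tokens:
        sense = disambiguate(tok, tokens)
        if sense:
            expanded.extend(sense["synonyms"])
    return expanded


if __name__ == "__main__":
    doc = ["the", "river", "bank", "was", "muddy"]
    print(concept_index(doc))
    print(expand(doc))
```

In this toy setting, "bank" next to "river" is indexed under the second sense, while "bank" with no informative context falls back to the first-listed sense. A real system would draw senses, glosses and synonyms from WordNet itself.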

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • José María Gómez Hidalgo (1)
  • Manuel de Buenaga Rodríguez (1)
  • José Carlos Cortizo Pérez (2)
  1. Universidad Europea de Madrid, Villaviciosa de Odón, Spain
  2. AINet Solutions, Fuenlabrada, Spain