Abstract
Automated Text Categorization has reached accuracy levels comparable to those of human experts. Provided that enough training data is available, accurate automatic classifiers can be learned using Information Retrieval and Machine Learning techniques. However, the performance of this approach suffers from problems caused by language variation, especially polysemy and synonymy. We investigate how Word Sense Disambiguation can alleviate these problems through two traditional methods for thesaurus usage in Information Retrieval, namely Query Expansion and Concept Indexing. We evaluate these methods on the task of using the lexical database WordNet for text categorization, focusing on the Word Sense Disambiguation step involved. Our experiments demonstrate that rather simple dictionary methods and baseline statistical approaches can be used to disambiguate words and improve text representation and learning in both the Query Expansion and Concept Indexing approaches.
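To make the two thesaurus-based strategies concrete, the following is a minimal sketch and not the authors' implementation. It assumes a tiny hand-written sense inventory (`SENSES`) in place of WordNet, and a simplified gloss-overlap heuristic as one possible "simple dictionary method"; the function names, concept identifiers, and glosses are all hypothetical. Concept Indexing replaces each disambiguated word with a concept identifier, while Query Expansion adds the synonyms of the selected sense to the original terms.

```python
from collections import Counter

# Hypothetical toy sense inventory standing in for WordNet (an assumption):
# word -> list of (concept_id, synonyms, gloss), most frequent sense first.
SENSES = {
    "bank": [
        ("bank.n.01", ["bank", "depository"],
         "financial institution that accepts deposits and lends money"),
        ("bank.n.02", ["bank", "riverside"],
         "sloping land beside a river or lake"),
    ],
    "loan": [
        ("loan.n.01", ["loan", "credit"],
         "money lent by a bank at interest"),
    ],
}


def disambiguate(word, context):
    """Return the sense whose gloss shares the most words with the context.

    This is a simplified gloss-overlap heuristic. Ties (including zero
    overlap) fall back to the first listed sense, which acts as a
    most-frequent-sense baseline. Returns None for unknown words.
    """
    senses = SENSES.get(word)
    if not senses:
        return None
    context_words = set(context) - {word}
    return max(senses, key=lambda s: len(context_words & set(s[2].split())))


def concept_index(tokens):
    """Concept Indexing: count concept ids instead of surface words."""
    features = Counter()
    for token in tokens:
        sense = disambiguate(token, tokens)
        features[sense[0] if sense else token] += 1
    return features


def expand_query(terms, context):
    """Query Expansion: append synonyms of each term's selected sense."""
    expanded = list(terms)
    for term in terms:
        sense = disambiguate(term, context)
        if sense:
            expanded.extend(s for s in sense[1] if s not in expanded)
    return expanded


finance_doc = "the bank approved the loan and lent money".split()
river_doc = "we walked along the river bank".split()

print(concept_index(finance_doc))
print(concept_index(river_doc))
print(expand_query(["bank"], river_doc))
```

In this toy setting the two documents map "bank" to different concept identifiers, so a classifier trained on the resulting features would no longer conflate the two senses. With a real lexical database, the quality of both representations depends on how often the disambiguation step picks the correct sense.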
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gómez Hidalgo, J.M., de Buenaga Rodríguez, M., Cortizo Pérez, J.C. (2005). The Role of Word Sense Disambiguation in Automated Text Categorization. In: Montoyo, A., Muñoz, R., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2005. Lecture Notes in Computer Science, vol 3513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11428817_27
DOI: https://doi.org/10.1007/11428817_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26031-8
Online ISBN: 978-3-540-32110-1
eBook Packages: Computer Science (R0)