Abstract
Most text categorization algorithms in the literature represent documents as collections of words. An alternative that has not been sufficiently explored is the use of word meanings, also known as senses. In this paper, we use several classification algorithms to compare the categorization accuracy of word-based classifiers with that of sense-based classifiers. The comparison is carried out on a subset of the annotated Brown Corpus semantic concordance. A series of experiments indicates that the use of senses does not result in any significant improvement in categorization accuracy.
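To make the distinction between the two representations concrete, the following is a minimal sketch, not the authors' implementation. It assumes each token arrives already annotated with a sense label, as in a semantic concordance, and trains the same multinomial naive Bayes classifier once on words and once on senses. The toy documents, the category names, the sense identifiers such as `bank%finance`, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import math

# Hypothetical toy corpus: each document is a list of (word, sense) pairs
# plus a category label. The sense identifiers are made up for illustration
# and are not real WordNet sense keys.
DOCS = [
    ([("bank", "bank%finance"), ("loan", "loan%finance"),
      ("interest", "interest%finance")], "economy"),
    ([("bank", "bank%river"), ("water", "water%liquid"),
      ("fish", "fish%animal")], "nature"),
    ([("interest", "interest%finance"), ("rate", "rate%finance"),
      ("bank", "bank%finance")], "economy"),
    ([("river", "river%stream"), ("bank", "bank%river"),
      ("fish", "fish%animal")], "nature"),
]


def features(tokens, use_senses):
    """Map annotated tokens to features: sense labels or surface words."""
    return [sense if use_senses else word for word, sense in tokens]


def train_nb(docs, use_senses):
    """Collect the counts needed by a multinomial naive Bayes classifier."""
    class_counts = Counter()
    feat_counts = {}
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        counts = feat_counts.setdefault(label, Counter())
        for f in features(tokens, use_senses):
            counts[f] += 1
            vocab.add(f)
    return class_counts, feat_counts, vocab


def classify(model, tokens, use_senses):
    """Return the category with the highest log posterior (add-one smoothing)."""
    class_counts, feat_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        counts = feat_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(class_counts[label] / total_docs)
        for f in features(tokens, use_senses):
            if f in vocab:  # features unseen in training are ignored
                score += math.log((counts[f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    test_doc = [("bank", "bank%river"), ("fish", "fish%animal")]
    for use_senses in (False, True):
        model = train_nb(DOCS, use_senses)
        name = "senses" if use_senses else "words"
        print(name, "->", classify(model, test_doc, use_senses))
```

The only difference between the two runs is the `use_senses` flag passed to `features`. Under the word-based representation the ambiguous token "bank" contributes the same feature to both categories, whereas under the sense-based representation it splits into two distinct features.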
Cite this article
Kehagias, A., Petridis, V., Kaburlasos, V.G. et al. A Comparison of Word- and Sense-Based Text Categorization Using Several Classification Algorithms. Journal of Intelligent Information Systems 21, 227–247 (2003). https://doi.org/10.1023/A:1025554732352
DOI: https://doi.org/10.1023/A:1025554732352