Extract Semantic Information from WordNet to Improve Text Classification Performance
Over the past decade, text categorization has become an active field of research in the machine learning community. Most approaches are based on term occurrence frequency. The performance of such surface-based methods can degrade when the texts are complex, i.e., ambiguous. One alternative is to use semantic-based approaches, which process textual documents according to their meaning. In this paper, we propose a Concept-based Vector Space Model that represents a text by a more abstract version of its semantic information, in place of the standard Vector Space Model. The model adjusts the weights of the vector space by importing the hypernymy-hyponymy relation between synonym sets (synsets) and the Concept Chain in WordNet. Experimental results on several data sets show that the proposed approach, with concepts built from WordNet, achieves significant improvements over the baseline algorithm.
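The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the weighting formula, so the decay factor, the function names `concept_chain` and `concept_vector`, and the toy hypernym map `TOY_HYPERNYMS` (a hand-written stand-in for WordNet's hypernymy relation) are all assumptions made for illustration.

```python
from collections import Counter

# Toy hypernym map (child -> parent). This is a hypothetical stand-in for
# WordNet's hypernymy relation between synsets, not real WordNet data.
TOY_HYPERNYMS = {
    "beagle": "dog",
    "poodle": "dog",
    "dog": "animal",
    "cat": "animal",
}


def concept_chain(term, hypernyms):
    """Return the chain of increasingly abstract concepts above `term`."""
    chain = []
    seen = {term}
    while term in hypernyms:
        term = hypernyms[term]
        if term in seen:  # guard against cycles in a malformed map
            break
        seen.add(term)
        chain.append(term)
    return chain


def concept_vector(tokens, hypernyms, decay=0.5):
    """Term frequencies plus decayed weight propagated up each concept chain.

    The geometric `decay` per level is an assumed weighting scheme.
    """
    weights = Counter()
    for term, freq in Counter(tokens).items():
        weights[term] += freq
        for depth, concept in enumerate(concept_chain(term, hypernyms), start=1):
            weights[concept] += freq * decay ** depth
    return dict(weights)


doc_a = concept_vector(["beagle", "barks"], TOY_HYPERNYMS)
doc_b = concept_vector(["poodle", "barks"], TOY_HYPERNYMS)
```

In this sketch the two documents share no noun at the surface level ("beagle" versus "poodle"), yet both vectors carry weight on the more abstract concepts "dog" and "animal", which is the kind of overlap a purely frequency-based representation would miss.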
Keywords: text classification, document representation, WordNet, conception-based VSM