Exploiting semantic resources for large scale text categorization

Journal of Intelligent Information Systems

Abstract

A traditional supervised classifier for Text Categorization (TC) is learned from a set of hand-labeled documents. However, manual labeling is labor-intensive and time-consuming, especially for a complex TC task with hundreds or thousands of categories. To address this issue, many semi-supervised methods have been proposed that use both labeled and unlabeled documents for TC, but they still need a small set of labeled data for each category. In this paper, we propose a Fully Automatic Categorization approach for Text (FACT), which requires no manual labeling effort. In FACT, lexical databases serve as semantic resources for understanding category names. FACT combines semantic analysis of the category names with statistical analysis of the unlabeled document set to construct training data fully automatically. With the support of lexical databases, we first use the category name to generate a set of features that serves as a representative profile for the corresponding category. Then, a set of documents is labeled according to the representative profile. To reduce the possible bias originating from the category name and the representative profile, document clustering is used to refine the initial labeling. The training data are subsequently constructed and used to train a discriminative classifier. Experiments show that one variant of FACT significantly outperforms the state-of-the-art unsupervised TC approach and achieves more than 90% of the F1 performance of the baseline SVM methods, which demonstrates the effectiveness of the proposed approach.
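The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicon `TOY_LEXICON` is a hypothetical stand-in for a lexical database such as WordNet, the overlap scoring and the `min_hits` threshold are assumptions, and a single nearest-centroid pass is swapped in for the document clustering step. The final step, training a discriminative classifier on the refined labels, is not shown.

```python
from collections import Counter

# Hypothetical stand-in for a lexical database such as WordNet:
# maps a category name to related terms. Invented for illustration.
TOY_LEXICON = {
    "sports": ["game", "team", "player", "score"],
    "finance": ["bank", "stock", "market", "loan"],
}

def build_profile(category):
    """Representative profile: the category name plus its related terms."""
    return {category, *TOY_LEXICON.get(category, [])}

def tokenize(text):
    return text.lower().split()

def initial_labels(docs, profiles, min_hits=1):
    """Label each document with the profile it overlaps most; None if too weak."""
    labels = []
    for doc in docs:
        tokens = tokenize(doc)
        hits = {c: sum(t in p for t in tokens) for c, p in profiles.items()}
        best = max(hits, key=hits.get)
        labels.append(best if hits[best] >= min_hits else None)
    return labels

def centroids(docs, labels):
    """Term-count centroid (a Counter) for every category that has labels."""
    cents = {}
    for doc, label in zip(docs, labels):
        if label is not None:
            cents.setdefault(label, Counter()).update(tokenize(doc))
    return cents

def refine(docs, labels):
    """One nearest-centroid pass: a simple stand-in for the clustering step."""
    cents = centroids(docs, labels)
    refined = []
    for doc in docs:
        tokens = tokenize(doc)
        # Normalize by centroid size so large categories do not dominate.
        scores = {c: sum(cnt[t] for t in tokens) / sum(cnt.values())
                  for c, cnt in cents.items()}
        refined.append(max(scores, key=scores.get))
    return refined

docs = [
    "the team won the game with a late score",
    "the bank raised its loan rates",
    "the striker and the team celebrated",
    "shares fell as rates and losses rose",
]
profiles = {c: build_profile(c) for c in TOY_LEXICON}
seed = initial_labels(docs, profiles)   # labels from category-name profiles
final = refine(docs, seed)              # labels after the refinement pass
print(seed)
print(final)
```

In this toy run the last document matches no profile term and is left unlabeled at first; the refinement pass then assigns it to the category whose already-labeled documents share the most vocabulary with it. The refined labels would serve as automatically constructed training data for a classifier such as an SVM.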

References

  • Smeaton, A. F. (1999). Using NLP or NLP resources for information retrieval tasks. In Natural language information retrieval. Dordrecht, NL: Kluwer Academic Publishers.

  • Bai, R., Wang, X., & Liao, J. (2010). Extract semantic information from WordNet to improve text classification performance. In Proceedings of the international conference on Advances in computer science and information technology, June 23–25, 2010, LNCS 6059 (pp. 409–420).

  • Banerjee, S., & Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (pp. 136–145).

  • Basili, R., Cammisa, M., & Moschitti, A. (2005). Effective use of WordNet semantics via kernel-based learning. In Proceedings of the 9th conference on computational natural language learning (CoNLL 2005). Ann Arbor, MI, USA.

  • Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proc. of the workshop on computational learning theory (pp. 92–100).

  • Bradford, R. (2008). An empirical study of required dimensionality for large-scale latent semantic indexing applications. In Proceedings of the 17th ACM conference on information and knowledge management (pp. 153–162). California, USA: Napa Valley.

  • CoreNet (2012). http://korterm.kaist.ac.kr.

  • de Buenaga Rodriguez, M., Gomez-Hidalgo, J., & Diaz-Agudo, B. (1997). Using WordNet to complement training information in text categorization. In Proceedings of the 2nd International Conference on Recent Advances in Natural Language Processing (RANLP’97) (pp. 150–157).

  • Deerwester, S., Dumais, S., Furnas, G., Landauer, T., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

  • Voorhees, E. M. (1993). Using WordNet to disambiguate word senses for text retrieval. In Proceedings SIGIR’93. PA, USA: Pittsburgh.

  • EuroWordNet (2012). http://www.illc.uva.nl/EuroWordNet.

  • Ferrández, S., Toral, A., Ferrández, O., Ferrández, A., & Muñoz, R. (2009). Exploiting Wikipedia and EuroWordNet to solve cross-lingual question answering. Information Sciences, 179(20), 3473–3488.

  • Gabrilovich, E., & Markovitch, S. (2005). Feature generation for text categorization using world knowledge. In International joint conference on artificial intelligence. Scotland: Edinburgh.

  • Gabrilovich, E., & Markovitch, S. (2006). Overcoming the brittleness bottleneck using wikipedia: Enhancing text categorization with encyclopedic knowledge. In National conference on artificial intelligence (AAAI). Massachusetts: Boston.

  • Gliozzo, A. M., & Strapparava, C. (2005). Domain kernels for text categorization. In Proceedings of the ninth conference on computational natural language learning (CoNLL-2005) (pp. 56–63). Michigan: Ann Arbor.

  • Gliozzo, A. M., Strapparava, C., & Dagan, I. (2005). Investigating unsupervised learning for text categorization bootstrapping. In Proceedings of the joint Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP) (pp. 129–136).

  • Hotho, A., Staab, S., & Stumme, G. (2003). Wordnet improves text document clustering. In Proc. of the semantic web workshop at SIGIR (pp. 541–544).

  • Hownet (2012). http://www.keenage.com.

  • Ide, N., & Véronis, J. (1998). Word sense disambiguation: the state of the art. Computational Linguistics, 24(1), 1–40.

  • Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In European conference on machine learning.

  • Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proc. 16th international conf. on machine learning (pp. 200–209).

  • Kehagias, A., Petridis, V., Kaburlasos, V., & Fragkou, P. (2003). A comparison of word- and sense-based text classification using several classification algorithms. Journal of Intelligent Information Systems, 21(3), 227–247.

  • Ko, Y., & Seo, J. (2000). Automatic text categorization by unsupervised learning. In Proceedings of the 18th International Conference on Computational Linguistics (COLING) (pp. 453–459).

  • Li, J. Q., Zhao, Y., & Liu, B. (2009). Fully automatic text categorization by exploiting WordNet. In Proceedings of the Asia information retrieval societies conference, LNCS 5839 (pp. 1–12). New York/Heidelberg: Springer.

  • Li, C. H., Yang, J. C., & Park, S. C. (2012). Text categorization algorithms using semantic approaches, corpus-based thesaurus and WordNet. Expert Systems with Applications, 39(1), 765–772.

  • Liu, B., Li, X., Lee, W. S., & Yu, P. S. (2004). Text classification by labeling words. In Proc. 19th nat’l conf. artificial intelligence (pp. 425–430).

  • Liu, T., Yang, Y., Wan, H., Zhou, Q., Gao, B., Zeng, H. J., et al. (2005). An experimental study on large-scale web categorization. In Posters Proceedings of the 14th International World Wide Web Conference (pp. 1106–1107).

  • Luo, Q., Chen, E., & Xiong, H. (2011). A semantic term weighting scheme for text categorization. Expert Systems with Applications, 38(10), 12708–12716.

  • Mansuy, T. N., & Hilderman, R. J. (2006). A characterization of wordnet features in boolean models for text classification. In AusDM (pp. 103–109).

  • McCallum, A., & Nigam, K. (1998). A comparison of event models for naive bayes text classification. In AAAI/ICML-98 workshop on learning for text categorization (pp. 41–48).

  • Mohammed, M., & Mohammed, B. (2011). On the merging of domain-specific heterogeneous ontologies using WordNet and web pattern-based queries. Journal of Information and Knowledge Management, 10(1), 23–36.

  • Moldovan, D. I., & Mihalcea, R. (2000). Using WordNet and lexical operators to improve internet searches. IEEE Internet Computing, 4(1), 34–43.

  • Navigli, R., Faralli, S., Soroa, A., Lacalle, O. L., & Agirre, E. (2011). Two birds with one stone: Learning semantic models for text categorization and word sense disambiguation. In Proc. of the 20th ACM Conference on Information and Knowledge Management (CIKM 2011), Glasgow, UK, October 24-28th (pp. 2317–2320).

  • Nigam, K., Lafferty, J., & McCallum, A. (1999). Using maximum entropy for text classification. In IJCAI-99 workshop on machine learning for information filtering (pp. 61–67).

  • Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103–134.

  • Peng, X., & Choi, B. (2005). Document classifications based on word semantic hierarchies. In Proc. of the international conf. on artificial intelligence and application (AIA’05) (pp. 362–367).

  • Salton, G. (1991). Developments in automatic text retrieval. Science, 253, 974–979.

  • Scott, S., & Matwin, S. (1998). Text classification using WordNet hypernyms. In Proc. Coling-ACL’98 (pp. 45–52).

  • Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.

  • Silverman, B. W. (1986). Density estimation for statistics and data analysis. New York: Chapman and Hall.

  • Siolas, G., & d’Alché-Buc, F. (2000). Support vector machines based on a semantic kernel for text categorization. In Proceedings of the IEEE-INNS-ENNS international joint conference on neural networks (IJCNN’00) (Vol. 5, p. 5205). Washington, DC: IEEE Computer Society.

  • Sogou Labs (2012). http://www.sogou.com/labs/resources.html.

  • SVM-light (2012). http://svmlight.joachims.org/.

  • Vapnik, V. (1995). The nature of statistical learning theory. NY, USA: Springer-Verlag.

  • Wang, P., & Domeniconi, C. (2008). Building semantic kernels for text classification using wikipedia. In The 14th ACM SIGKDD (pp. 713–721). New York: ACM Press.

  • Weka (2012). http://www.cs.waikato.ac.nz/ml/weka/.

  • Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd Edn.). San Francisco: Morgan Kaufmann.

  • WordNet (2012). http://wordnet.princeton.edu/.

  • Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99) (pp. 42–49).

  • Zeng, H. J., et al. (2003). CBC: Clustering based text classification requiring minimal labeled data. In ICDM (pp. 443–450).

  • Zhang, Y., Gong, L., & Wang, Y. (2005). Chinese word sense disambiguation using HowNet. Lecture Notes in Computer Science, 3610/2005, 925–932.

  • Zhu, X. J. (2007). Semi-supervised learning literature survey. http://pages.cs.wisc.edu/~jerryzhu/research/ssl/semireview.html.

Author information

Corresponding author

Correspondence to Jian Qiang Li.

Additional information

A short version of this paper appeared in (Li et al. 2009). This version includes a more complete description of the algorithms, cross-language validation experiments, and an extended discussion of the experiments and results.

About this article

Cite this article

Li, J.Q., Zhao, Y. & Liu, B. Exploiting semantic resources for large scale text categorization. J Intell Inf Syst 39, 763–788 (2012). https://doi.org/10.1007/s10844-012-0211-x

  • DOI: https://doi.org/10.1007/s10844-012-0211-x
