Exploiting semantic resources for large scale text categorization

Li, Jian Qiang; Zhao, Yu; Liu, Bo

doi:10.1007/s10844-012-0211-x

Published: 09 June 2012

Volume 39, pages 763–788, (2012)

Journal of Intelligent Information Systems

Abstract

The traditional supervised classifier for Text Categorization (TC) is learned from a set of hand-labeled documents. However, manual data labeling is labor intensive and time consuming, especially for a complex TC task with hundreds or thousands of categories. To address this issue, many semi-supervised methods have been reported that use both labeled and unlabeled documents for TC, but they still need a small set of labeled data for each category. In this paper, we propose a Fully Automatic Categorization approach for Text (FACT), in which no manual labeling effort is required. In FACT, lexical databases serve as semantic resources for understanding category names. FACT combines semantic analysis of category names with statistical analysis of the unlabeled document set to construct training data fully automatically. With the support of lexical databases, we first use the category name to generate a set of features as a representative profile for the corresponding category. Then, a set of documents is labeled according to the representative profile. To reduce the possible bias originating from the category name and the representative profile, document clustering is used to refine the initial labeling. The training data are subsequently constructed to train a discriminative classifier. Empirical experiments show that one variant of our FACT approach significantly outperforms the state-of-the-art unsupervised TC approach. It achieves more than 90% of the F1 performance of the baseline SVM methods, which demonstrates the effectiveness of the proposed approach.
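The pipeline outlined in the abstract (category name, representative profile, initial labeling, clustering-based refinement, classifier training) can be illustrated with a minimal sketch. This is not the authors' implementation. The `TOY_LEXICON` dictionary is a hypothetical stand-in for a lexical database such as WordNet, the refinement step is a single nearest-centroid reassignment rather than the clustering used in the paper, and a nearest-centroid rule stands in for the discriminative (SVM) classifier. All names, stop words, and documents below are invented for illustration.

```python
from collections import Counter

# Hypothetical stand-in for a lexical database: category name -> related terms.
TOY_LEXICON = {
    "sports": {"game", "team", "match", "player"},
    "finance": {"bank", "stock", "market", "loan"},
}
STOPWORDS = {"the", "a", "for", "its", "as", "from"}

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def build_profile(category):
    """Representative profile: the category name plus its related terms."""
    return {category} | TOY_LEXICON.get(category, set())

def initial_labels(docs, categories, min_hits=1):
    """Label each document with the category whose profile it overlaps most.
    Documents with fewer than min_hits profile terms stay unlabeled (None)."""
    profiles = {c: build_profile(c) for c in categories}
    labels = []
    for doc in docs:
        tokens = tokenize(doc)
        hits = {c: sum(t in p for t in tokens) for c, p in profiles.items()}
        best = max(categories, key=lambda c: hits[c])
        labels.append(best if hits[best] >= min_hits else None)
    return labels

def centroids(docs, labels):
    """Per-category term-frequency centroid built from the labeled documents."""
    cents = {}
    for doc, label in zip(docs, labels):
        if label is not None:
            cents.setdefault(label, Counter()).update(tokenize(doc))
    return cents

def nearest(doc, cents):
    """Category whose centroid shares the largest fraction of term mass with doc."""
    tokens = tokenize(doc)
    def score(c):
        return sum(cents[c][t] for t in tokens) / sum(cents[c].values())
    return max(cents, key=score)

def refine(docs, labels):
    """One reassignment pass: relabel every document by its nearest centroid
    (a simplified stand-in for the clustering-based refinement)."""
    cents = centroids(docs, labels)
    return [nearest(d, cents) for d in docs]

docs = [
    "the team won the match",
    "the bank raised its loan rates",
    "the striker scored for the team",
    "shares fell as the market closed",
    "the striker scored twice",  # contains no profile term
]
categories = ["sports", "finance"]

seed = initial_labels(docs, categories)   # labels from the profiles alone
refined = refine(docs, seed)              # also labels the last document
model = centroids(docs, refined)          # "training" on the constructed data
prediction = nearest("a loan from the bank", model)
```

In this toy run the last document receives no initial label because it contains no profile term, and the refinement pass assigns it to "sports" through the words it shares with an already labeled document. That mirrors the role the abstract gives to clustering: compensating for the limited coverage of a profile derived from the category name alone.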

References

Smeaton, A. F. (1999). Using NLP or NLP resources for information retrieval tasks. In Natural language information retrieval. Dordrecht, NL: Kluwer Academic Publishers.
Bai, R., Wang, X., & Liao, J. (2010). Extract semantic information from WordNet to improve text classification performance. In Proceedings of the international conference on Advances in computer science and information technology, June 23–25, 2010, LNCS 6059 (pp. 409–420).
Banerjee, S., & Pedersen, T. (2002). An adapted lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (pp. 136–145).
Basili, R., Cammisa, M., & Moschitti, A. (2005). Effective use of Wordnet semantics via kernel-based learning. In Proceedings of the 9th conference on computational natural language learning (CoNLL 2005). USA, Ann Arbor (MI).
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proc. of the workshop on computational learning theory (pp. 92–100).
Bradford, R. (2008). An empirical study of required dimensionality for large-scale latent semantic indexing applications. In Proceedings of the 17th ACM conference on information and knowledge management (pp. 153–162). California, USA: Napa Valley.
CoreNet (2012). http://korterm.kaist.ac.kr.
de Buenaga Rodriguez, M., Gomez-Hidalgo, J., & Diaz-Agudo, B. (1997). Using WordNet to complement training information in text categorization. In Proceedings of the 2nd International Conference on Recent Advances in Natural Language Processing (RANLP’97) (pp. 150–157).
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
Voorhees, E. M. (1993). Using WordNet to disambiguate word senses for text retrieval. In Proceedings SIGIR’93. PA, USA: Pittsburgh.
EuroWordNet (2012). http://www.illc.uva.nl/EuroWordNet.
Ferrández, S., Toral, A., Ferrández, O., Ferrández, A., & Muñoz, R. (2009). Exploiting wikipedia and EuroWordNet to solve cross–lingual question answering. Information Sciences, 179(20), 3473–3488.
Gabrilovich, E., & Markovitch, S. (2005). Feature generation for text categorization using world knowledge. In International joint conference on artificial intelligence. Scotland: Edinburgh.
Gabrilovich, E., & Markovitch, S. (2006). Overcoming the brittleness bottleneck using wikipedia: Enhancing text categorization with encyclopedic knowledge. In National conference on artificial intelligence (AAAI). Massachusetts: Boston.
Gliozzo, A. M., & Strapparava, C. (2005). Domain kernels for text categorization. In Proceedings of the ninth conference on computational natural language learning (CoNLL-2005) (pp. 56–63). Michigan: Ann Arbor.
Gliozzo, A. M., Strapparava, C., & Dagan, I. (2005). Investigating unsupervised learning for text categorization bootstrapping. In Proceedings of the joint Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP) (pp. 129–136).
Hotho, A., Staab, S., & Stumme, G. (2003). Wordnet improves text document clustering. In Proc. of the semantic web workshop at SIGIR (pp. 541–544).
Hownet (2012). http://www.keenage.com.
Ide, N., & Véronis, J. (1998). Word sense disambiguation: the state of the art. Computational Linguistics, 24(1), 1–40.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In European conference on machine learning.
Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proc. 16th international conf. on machine learning (pp. 200–209).
Kehagias, A., Petridis, V., Kaburlasos, V., & Fragkou, P. (2003). A comparison of word- and sense-based text classification using several classification algorithms. Journal of Intelligent Information Systems, 21(3), 227–247.
Ko, Y., & Seo, J. (2000). Automatic text categorization by unsupervised learning. In Proceedings of the 18th International Conference on Computational Linguistics (COLING) (pp. 453–459).
Li, J. Q., Zhao, Y., & Liu, B. (2009). Fully automatic text categorization by exploiting WordNet. In Proceedings of Asia information retrieval societies conference, LNCS 5839 (pp. 1–12). Springer: New York/Heidelberg.
Li, C. H., Yang, J. C., & Park, S. C. (2012). Text categorization algorithms using semantic approaches, corpus-based thesaurus and WordNet. Expert Systems with Applications, 39(1), 765–772.
Liu, B., Li, X., Lee, W. S., & Yu, P. S. (2004). Text classification by labeling words. In Proc. 19th nat’l conf. artificial intelligence (pp. 425–430).
Liu, T., Yang, Y., Wan, H., Zhou, Q., Gao, B., Zeng, H. J., et al. (2005). An experimental study on large-scale web categorization. In Posters Proceedings of the 14th International World Wide Web Conference (pp. 1106–1107).
Luo, Q., Chen, E., & Xiong, H. (2011). A semantic term weighting scheme for text categorization. Expert Systems with Applications, 38(10), 12708–12716.
Mansuy, T. N., & Hilderman, R. J. (2006). A characterization of wordnet features in boolean models for text classification. In AusDM (pp. 103–109).
McCallum, A., & Nigam, K. (1998). A comparison of event models for naive bayes text classification. In AAAI/ICML-98 workshop on learning for text categorization (pp. 41–48).
Mohammed, M., & Mohammed, B. (2011). On the merging of domain-specific heterogeneous ontologies using WordNet and web pattern-based queries. Journal of Information and Knowledge Management, 10(1), 23–36.
Moldovan, D. I., & Mihalcea, R. (2000). Using WordNet and lexical operators to improve internet searches. IEEE Internet Computing, 4(1), 34–43.
Navigli, R., Faralli, S., Soroa, A., Lacalle, O. L., & Agirre, E. (2011). Two birds with one stone: Learning semantic models for text categorization and word sense disambiguation. In Proc. of the 20th ACM Conference on Information and Knowledge Management (CIKM 2011), Glasgow, UK, October 24-28th (pp. 2317–2320).
Nigam, K., Lafferty, J., & McCallum, A. (1999). Using maximum entropy for text classification. In IJCAI-99 workshop on machine learning for information filtering (pp. 61–67).
Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103–134.
Peng, X., & Choi, B. (2005). Document classifications based on word semantic hierarchies. In Proc. of the international conf. on artificial intelligence and application (AIA’05) (pp. 362–367).
Salton, G. (1991). Developments in automatic text retrieval. Science, 253, 974–979.
Scott, S., & Matwin, S. (1998). Text classification using wordNet hypernyms. In Proc. Coling-ACL’98 (pp. 45–52).
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. New York: Chapman and Hall.
Siolas, G., & d’Alché-Buc, F. (2000). Support vector machines based on a semantic kernel for text categorization. In Proceedings of the IEEE-INNS-ENNS international joint conference on neural networks (IJCNN’00) (Vol. 5, p. 5205). IEEE Computer Society: Washington, DC.
Sogou Labs (2012). http://www.sogou.com/labs/resources.html.
SVM-light (2012). http://svmlight.joachims.org/.
Vapnik, V. (1995). The nature of statistical learning theory. NY, USA: Springer-Verlag.
Wang, P., & Domeniconi, C. (2008). Building semantic kernels for text classification using wikipedia. In The 14th ACM SIGKDD (pp. 713–721). New York: ACM Press.
Weka (2012). http://www.cs.waikato.ac.nz/ml/weka/.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd Edn.). San Francisco: Morgan Kaufmann.
WordNet (2012). http://wordnet.princeton.edu/.
Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99) (pp. 42–49).
Zeng, H. J., et al. (2003). CBC: Clustering based text classification requiring minimal labeled data. In ICDM (pp. 443–450).
Zhang, Y., Gong, L., & Wang, Y. (2005). Chinese word sense disambiguation using HowNet. Lecture Notes in Computer Science, 3610/2005, 925–932.
Zhu, X. J. (2007). Semi-supervised learning literature survey. http://pages.cs.wisc.edu/~jerryzhu/research/ssl/semireview.html.

Author information

Authors and Affiliations

NEC Laboratories China, 11F, Bldg.A, Innovation Plaza, Tsinghua Science Park, Haidian District, Beijing, 100084, China
Jian Qiang Li, Yu Zhao & Bo Liu

Corresponding author

Correspondence to Jian Qiang Li.

Additional information

A short version of this paper appeared in (Li et al. 2009). This submission includes a more complete description of the algorithms, cross-language validation experiments, and extended discussion of the experiments and results.

About this article

Cite this article

Li, J.Q., Zhao, Y. & Liu, B. Exploiting semantic resources for large scale text categorization. J Intell Inf Syst 39, 763–788 (2012). https://doi.org/10.1007/s10844-012-0211-x

Received: 15 May 2011
Revised: 05 February 2012
Accepted: 21 May 2012
Published: 09 June 2012
Issue Date: December 2012
DOI: https://doi.org/10.1007/s10844-012-0211-x
