Improving Semi-supervised Text Classification by Using Wikipedia Knowledge

Zhang, Zhilin; Lin, Huaizhong; Li, Pengfei; Wang, Huazhong; Lu, Dongming

doi:10.1007/978-3-642-38562-9_3

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7923)

Abstract

Semi-supervised text classification uses both labeled and unlabeled data to construct classifiers, and the key issue is how to exploit the unlabeled data. Clustering-based classification outperforms other semi-supervised text classification algorithms. However, its gains remain limited because the vector space model representation largely ignores the semantic relationships between words. In this paper, we propose a new approach that addresses this problem by using Wikipedia knowledge. We enrich the document representation with Wikipedia semantic features (concepts and categories), propose a new similarity measure based on the semantic relevance between Wikipedia features, and apply this similarity measure to clustering-based classification. Experimental results on several corpora show that the proposed method can effectively improve semi-supervised text classification performance.
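
As an illustration only, the idea of scoring documents on term, concept, and category evidence together can be sketched in Python. Everything specific here is an assumption made for the sketch: the toy lookup tables standing in for a real Wikipedia mapper, the weights `alpha` and `beta`, the weighted-sum form of the score, and the nearest-seed labeling step, which is a deliberate simplification of clustering-based classification. The abstract does not specify the authors' actual similarity measure or clustering procedure.

```python
import math
from collections import Counter

# Hypothetical toy tables standing in for a real Wikipedia concept/category mapper.
TERM_TO_CONCEPT = {
    "puma": "Cougar", "cougar": "Cougar", "jaguar": "Jaguar",
    "sedan": "Sedan (automobile)", "coupe": "Coupe",
}
CONCEPT_TO_CATEGORY = {
    "Cougar": "Felines", "Jaguar": "Felines",
    "Sedan (automobile)": "Car body styles", "Coupe": "Car body styles",
}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def enrich(text):
    """Return (term, concept, category) count vectors for one document."""
    terms = Counter(text.lower().split())
    concepts = Counter()
    for term, count in terms.items():
        if term in TERM_TO_CONCEPT:
            concepts[TERM_TO_CONCEPT[term]] += count
    categories = Counter()
    for concept, count in concepts.items():
        categories[CONCEPT_TO_CATEGORY[concept]] += count
    return terms, concepts, categories


def similarity(doc_a, doc_b, alpha=0.3, beta=0.3):
    """Weighted sum of term, concept, and category cosine similarities.

    alpha and beta are assumed weights, not values from the paper.
    """
    term_weight = 1.0 - alpha - beta
    return (term_weight * cosine(doc_a[0], doc_b[0])
            + alpha * cosine(doc_a[1], doc_b[1])
            + beta * cosine(doc_a[2], doc_b[2]))


def label_unlabeled(labeled, unlabeled):
    """Give each unlabeled text the label of its most similar labeled seed."""
    seeds = [(enrich(text), label) for text, label in labeled]
    results = []
    for text in unlabeled:
        doc = enrich(text)
        _, best_label = max(seeds, key=lambda seed: similarity(doc, seed[0]))
        results.append((text, best_label))
    return results


labeled = [("the puma hunts at night", "animals"), ("a sedan needs fuel", "cars")]
predictions = label_unlabeled(labeled, ["cougar sighting reported"])
```

In this toy data, "cougar sighting reported" shares no words with either labeled document, so a purely term-based cosine would score both seeds at zero. The concept and category vectors are what link it to the "puma" document.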

Author information

Authors and Affiliations

College of Computer Science and Technology, Zhejiang University, 38 Zheda Road, Hangzhou, 310027, P.R. China
Zhilin Zhang, Huaizhong Lin, Pengfei Li, Huazhong Wang & Dongming Lu

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Jianyong Wang
Management Science and Information Systems Department, Rutgers, the State University of New Jersey, 1, Washington Park, 07102, Newark, NJ, USA
Hui Xiong
Department of Information Engineering, Nagoya University, 464-8601, Nagoya, Japan
Yoshiharu Ishikawa
Department of Computer Science, Hong Kong Baptist University, Hong Kong
Jianliang Xu
School of Information Science and Engineering, Yanshan University, Qinhuangdao, China
Junfeng Zhou

About this paper

Cite this paper

Zhang, Z., Lin, H., Li, P., Wang, H., Lu, D. (2013). Improving Semi-supervised Text Classification by Using Wikipedia Knowledge. In: Wang, J., Xiong, H., Ishikawa, Y., Xu, J., Zhou, J. (eds) Web-Age Information Management. WAIM 2013. Lecture Notes in Computer Science, vol 7923. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38562-9_3

DOI: https://doi.org/10.1007/978-3-642-38562-9_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38561-2
Online ISBN: 978-3-642-38562-9
eBook Packages: Computer Science, Computer Science (R0)
