Abstract
Semi-supervised text classification uses both labeled and unlabeled data to construct classifiers, and the key issue is how to exploit the unlabeled data. Clustering-based classification methods outperform other semi-supervised text classification algorithms. However, their gains remain limited because the vector space model representation largely ignores the semantic relationships between words. In this paper, we propose a new approach that addresses this problem by using Wikipedia knowledge. We enrich the document representation with Wikipedia semantic features (concepts and categories), propose a new similarity measure based on the semantic relevance between Wikipedia features, and apply this similarity measure to clustering-based classification. Experimental results on several corpora show that the proposed method effectively improves semi-supervised text classification performance.
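The abstract does not give the similarity formula, so the following is only a rough sketch of the general idea, not the paper's measure. It assumes each document carries three bag-of-features views (words, Wikipedia concepts, Wikipedia categories) and scores a pair of documents by a weighted sum of per-view cosine similarities. The function names, the dictionary layout, the weights, and the toy documents are all hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_similarity(doc1, doc2, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of cosine similarities over three views.

    Assumption: the weights are illustrative only; the paper's measure
    is based on semantic relevance between Wikipedia features and may
    differ substantially from this linear combination.
    """
    views = ("words", "concepts", "categories")
    return sum(w * cosine(doc1[v], doc2[v]) for w, v in zip(weights, views))


# Two toy documents with no words in common but the same Wikipedia
# concept and category (hypothetical annotations).
doc_a = {
    "words": Counter({"puma": 1, "speed": 1}),
    "concepts": Counter({"Cougar": 1}),
    "categories": Counter({"Felines": 1}),
}
doc_b = {
    "words": Counter({"cougar": 1, "habitat": 1}),
    "concepts": Counter({"Cougar": 1}),
    "categories": Counter({"Felines": 1}),
}

word_only = cosine(doc_a["words"], doc_b["words"])
enriched = combined_similarity(doc_a, doc_b)
print(word_only, enriched)
```

In this toy case the word-only cosine is zero because the two documents use different surface terms, while the enriched score is positive because the concept and category views match. That gap is the kind of semantic relationship the vector space model misses; a clustering step could then group documents using the enriched score instead of the word-only one.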
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, Z., Lin, H., Li, P., Wang, H., Lu, D. (2013). Improving Semi-supervised Text Classification by Using Wikipedia Knowledge. In: Wang, J., Xiong, H., Ishikawa, Y., Xu, J., Zhou, J. (eds) Web-Age Information Management. WAIM 2013. Lecture Notes in Computer Science, vol 7923. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38562-9_3
DOI: https://doi.org/10.1007/978-3-642-38562-9_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38561-2
Online ISBN: 978-3-642-38562-9
eBook Packages: Computer Science, Computer Science (R0)