Improving Semi-supervised Text Classification by Using Wikipedia Knowledge

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7923)

Abstract

Semi-supervised text classification uses both labeled and unlabeled data to construct classifiers. The key issue is how to exploit the unlabeled data. Clustering-based classification methods outperform other semi-supervised text classification algorithms. However, their performance is still limited because the vector space model representation largely ignores the semantic relationships between words. In this paper, we propose a new approach that addresses this problem by using Wikipedia knowledge. We enrich the document representation with Wikipedia semantic features (concepts and categories), propose a new similarity measure based on the semantic relevance between Wikipedia features, and apply this similarity measure to clustering-based classification. Experimental results on several corpora show that the proposed method can effectively improve semi-supervised text classification performance.
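The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the similarity formula, so the sketch substitutes a plain weighted sum of two cosine similarities (words and concepts), uses only concepts rather than concepts and categories, and replaces the clustering step with a simple seeded nearest-centroid loop. The `CONCEPTS` lookup table, the `alpha` weight, and all function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical term -> Wikipedia concept lookup (an assumption for this sketch;
# a real system would map terms to concepts using a Wikipedia index).
CONCEPTS = {
    "striker": "Association football",
    "goalkeeper": "Association football",
    "penalty": "Association football",
    "cpu": "Computer hardware",
    "processor": "Computer hardware",
    "motherboard": "Computer hardware",
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def represent(text):
    """Return (word counts, concept counts): a bag of words enriched with concepts."""
    words = Counter(text.lower().split())
    concepts = Counter()
    for word, count in words.items():
        if word in CONCEPTS:
            concepts[CONCEPTS[word]] += count
    return words, concepts

def similarity(d1, d2, alpha=0.5):
    """Assumed measure: (1 - alpha) * word cosine + alpha * concept cosine."""
    return (1 - alpha) * cosine(d1[0], d2[0]) + alpha * cosine(d1[1], d2[1])

def centroid(docs):
    """Sum the word and concept vectors of a group of documents."""
    words, concepts = Counter(), Counter()
    for w, c in docs:
        words.update(w)
        concepts.update(c)
    return words, concepts

def classify(labeled, unlabeled, alpha=0.5, iterations=5):
    """Seed one cluster per class from labeled documents, then repeatedly
    assign each unlabeled document to its most similar centroid."""
    seeds = {}
    for text, label in labeled:
        seeds.setdefault(label, []).append(represent(text))
    docs = [represent(text) for text in unlabeled]
    centroids = {label: centroid(members) for label, members in seeds.items()}
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            max(centroids, key=lambda lab: similarity(d, centroids[lab], alpha))
            for d in docs
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Recompute each centroid from its seeds plus its newly assigned documents.
        centroids = {
            label: centroid(
                seeds[label] + [d for d, a in zip(docs, assignment) if a == label]
            )
            for label in seeds
        }
    return assignment

labeled = [
    ("the striker scored a penalty", "sport"),
    ("the cpu sits on the motherboard", "tech"),
]
unlabeled = ["goalkeeper saved it", "new processor released"]
print(classify(labeled, unlabeled))  # prints ['sport', 'tech']
```

In this toy data the unlabeled documents share no words with the labeled ones, so the word-only cosine is zero for every class and the concept term alone decides the assignment. This is the kind of gap in the vector space model that the abstract describes.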

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, Z., Lin, H., Li, P., Wang, H., Lu, D. (2013). Improving Semi-supervised Text Classification by Using Wikipedia Knowledge. In: Wang, J., Xiong, H., Ishikawa, Y., Xu, J., Zhou, J. (eds) Web-Age Information Management. WAIM 2013. Lecture Notes in Computer Science, vol 7923. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38562-9_3

  • DOI: https://doi.org/10.1007/978-3-642-38562-9_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38561-2

  • Online ISBN: 978-3-642-38562-9

  • eBook Packages: Computer Science, Computer Science (R0)
