Knowledge-Based Short Text Categorization Using Entity and Category Embedding

  • Rima TürkerEmail author
  • Lei Zhang
  • Maria Koutraki
  • Harald Sack
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11503)


Short text categorization is an important task due to the rapid growth of online available short texts in various domains such as web search snippets, etc. Most of the traditional methods suffer from sparsity and shortness of the text. Moreover, supervised learning methods require a significant amount of training data and manually labeling such data can be very time-consuming and costly. In this study, we propose a novel probabilistic model for Knowledge-Based Short Text Categorization (KBSTC), which does not require any labeled training data to classify a short text. This is achieved by leveraging entities and categories from large knowledge bases, which are further embedded into a common vector space, for which we propose a new entity and category embedding model. Given a short text, its category (e.g. Business, Sports, etc.) can then be derived based on the entities mentioned in the text by exploiting semantic similarity between entities and categories. To validate the effectiveness of the proposed method, we conducted experiments on two real-world datasets, i.e., AG News and Google Snippets. The experimental results show that our approach significantly outperforms the classification approaches which do not require any labeled data, while it comes close to the results of the supervised approaches.


Short text classification Dataless text classification Network embeddings 


  1. 1.
    Burel, G., Saif, H., Alani, H.: Semantic wide and deep learning for detecting crisis-information categories on social media. In: d’Amato, C., et al. (eds.) ISWC 2017. LNCS, vol. 10587, pp. 138–155. Springer, Cham (2017). Scholar
  2. 2.
    Chang, M.W., Ratinov, L.A., Roth, D., Srikumar, V.: Importance of semantic representation: dataless classification. In: AAAI (2008)Google Scholar
  3. 3.
    Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. In: IJCAI (2007)Google Scholar
  4. 4.
    Li, C., Xing, J., Sun, A., Ma, Z.: Effective document labeling with very few seed words: a topic model approach. In: CIKM (2016)Google Scholar
  5. 5.
    Li, Y., Zheng, R., Tian, T., Hu, Z., Iyer, R., Sycara, K.P.: Joint embedding of hierarchical categories and entities for concept categorization and dataless classification. In: COLING (2016)Google Scholar
  6. 6.
    Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: Dbpedia spotlight: shedding light on the web of documents. In: I-SEMANTICS (2011)Google Scholar
  7. 7.
    Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013)Google Scholar
  8. 8.
    Nigam, K., McCallum, A., Thrun, S., Mitchell, T.M.: Text classification from labeled and unlabeled documents using EM. Mach. Learn. 39, 103–134 (2000)CrossRefGoogle Scholar
  9. 9.
    Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: online learning of social representations. In: KDD (2014)Google Scholar
  10. 10.
    Phan, X.H., Nguyen, L.M., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: WWW (2008)Google Scholar
  11. 11.
    Ristoski, P., Paulheim, H.: RDF2Vec: RDF graph embeddings for data mining. In: Groth, P., et al. (eds.) ISWC 2016. LNCS, vol. 9981, pp. 498–514. Springer, Cham (2016). Scholar
  12. 12.
    Röder, M., Usbeck, R., Hellmann, S., Gerber, D., Both, A.: N\({^3}\) - a collection of datasets for named entity recognition and disambiguation in the NLP interchange format. In: LREC (2014)Google Scholar
  13. 13.
    Song, G., Ye, Y., Du, X., Huang, X., Bie, S.: Short text classification: a survey. J. Multimedia 9(5), 635–644 (2014)CrossRefGoogle Scholar
  14. 14.
    Song, Y., Roth, D.: On dataless hierarchical text classification. In: AAAI (2014)Google Scholar
  15. 15.
    Tang, J., Qu, M., Mei, Q.: PTE: predictive text embedding through large-scale heterogeneous text networks. In: KDD (2015)Google Scholar
  16. 16.
    Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: LINE: large-scale information network embedding. In: WWW (2015)Google Scholar
  17. 17.
    Türker, R., Zhang, L., Koutraki, M., Sack, H.: TECNE: knowledge based text classification using network embeddings. In: EKAW (2018)Google Scholar
  18. 18.
    Türker, R., Zhang, L., Koutraki, M., Sack, H.: “The less is more” for text classification. In: SEMANTiCS (2018)Google Scholar
  19. 19.
    Usbeck, R., et al.: GERBIL: general entity annotator benchmarking framework. In: WWW (2015)Google Scholar
  20. 20.
    Wang, J., Wang, Z., Zhang, D., Yan, J.: Combining knowledge with deep convolutional neural networks for short text classification. In: IJCAI (2017)Google Scholar
  21. 21.
    Wang, P., Xu, B., Xu, J., Tian, G., Liu, C.L., Hao, H.: Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification. Neurocomputing 174, 806–814 (2016)CrossRefGoogle Scholar
  22. 22.
    Xuan, J., Jiang, H., Ren, Z., Yan, J., Luo, Z.: Automatic bug triage using semi-supervised text classification. In: SEKE (2010)Google Scholar
  23. 23.
    Zhang, X., LeCun, Y.: Text understanding from scratch. CoRR (2015)Google Scholar
  24. 24.
    Zhang, X., Wu, B.: Short text classification based on feature extension using the n-gram model. In: FSKD. IEEE (2015)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  • Rima Türker
    • 1
    • 2
    Email author
  • Lei Zhang
    • 1
  • Maria Koutraki
    • 1
    • 2
    • 3
  • Harald Sack
    • 1
    • 2
  1. 1.FIZ Karlsruhe – Leibniz Institute for Information InfrastructureKarlsruheGermany
  2. 2.Karlsruhe Institute of Technology, Institute AIFBKarlsruheGermany
  3. 3.L3S Research CenterLeibniz University of HannoverHannoverGermany

Personalised recommendations