Does Multilevel Semantic Representation Improve Text Categorization?

  • Conference paper
Database and Expert Systems Applications (Globe 2015, DEXA 2015)

Abstract

This paper presents a novel approach to text categorization that fuses “bag-of-words” (BOW) word features with multilevel semantic features (SF). By extending Online LDA (OLDA) into a multilevel topic model that learns a semantic space at different topic granularities, multilevel semantic features are extracted to represent text content. The effectiveness of our approach is evaluated on both the large-scale Wikipedia corpus and the mid-sized 20newsgroups dataset. The former experiment shows that our approach is able to perform semantic feature extraction on a large-scale dataset. It also demonstrates that the topics generated at different topic levels have different semantic scopes, which makes them more appropriate for representing text content. Our classification experiments on 20newsgroups achieved 82.19% accuracy, which illustrates the effectiveness of fusing BOW and SF features. Further investigation of word and semantic feature fusion shows that the Support Vector Machine (SVM) is more sensitive to semantic features than Naive Bayes (NB), K-Nearest Neighbor (KNN), and Decision Tree (DT). Appropriately fusing low-level word features and high-level semantic features can achieve results equal to or better than the state of the art, with reduced feature dimension and computational complexity.
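The abstract does not say how the BOW and SF features are combined, so the following is only a minimal sketch of one plausible reading: feature-level fusion by concatenation. The function names `bow_vector` and `fuse_features`, the `sf_weight` parameter, the toy vocabulary, and the topic proportions are all hypothetical and chosen for illustration; in practice the per-level topic proportions would come from topic models trained with different numbers of topics.

```python
from collections import Counter
from math import sqrt


def bow_vector(tokens, vocabulary):
    """Return an L2-normalized term-frequency vector over a fixed vocabulary."""
    counts = Counter(tokens)
    vec = [float(counts[word]) for word in vocabulary]
    norm = sqrt(sum(v * v for v in vec))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [v / norm for v in vec] if norm > 0 else vec


def fuse_features(bow, topic_levels, sf_weight=1.0):
    """Concatenate the BOW vector with one topic-proportion vector per level.

    sf_weight is a hypothetical knob that scales the semantic features
    relative to the word features; it is an assumption of this sketch.
    """
    fused = list(bow)
    for proportions in topic_levels:
        fused.extend(sf_weight * p for p in proportions)
    return fused


# Toy document and vocabulary (illustrative only).
vocabulary = ["topic", "model", "text", "svm"]
tokens = "topic model for text text".split()
bow = bow_vector(tokens, vocabulary)

# Hypothetical document-topic proportions at two granularities:
# a coarse level with 2 topics and a finer level with 4 topics.
coarse = [0.7, 0.3]
fine = [0.1, 0.6, 0.2, 0.1]

fused = fuse_features(bow, [coarse, fine], sf_weight=0.5)
print(len(fused))  # 4 word features + 2 + 4 semantic features
```

The fused vector can then be passed to any standard classifier. Concatenation keeps the word and semantic features separable, so the weight on the semantic part can be tuned per classifier.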

Notes

  1. The number of topics at each topic level can be arbitrary, but it should differ between levels so that topics with different granularities can be observed.

  2. http://qwone.com/~jason/20newsgroups/.

Author information

Corresponding author

Correspondence to Cheng Wang.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, C., Yang, H., Meinel, C. (2015). Does Multilevel Semantic Representation Improve Text Categorization? In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe 2015, DEXA 2015. Lecture Notes in Computer Science, vol 9261. Springer, Cham. https://doi.org/10.1007/978-3-319-22849-5_22

  • DOI: https://doi.org/10.1007/978-3-319-22849-5_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22848-8

  • Online ISBN: 978-3-319-22849-5

  • eBook Packages: Computer Science, Computer Science (R0)
