Does Multilevel Semantic Representation Improve Text Categorization?

Wang, Cheng; Yang, Haojin; Meinel, Christoph

doi:10.1007/978-3-319-22849-5_22

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9261)

Abstract

This paper presents a novel approach to text categorization that fuses “Bag-of-words” (BOW) word features with multilevel semantic features (SF). By extending Online LDA (OLDA) into a multilevel topic model that learns a semantic space at different topic granularities, multilevel semantic features are extracted to represent text components. The effectiveness of our approach is evaluated on both a large-scale Wikipedia corpus and the medium-sized 20newsgroups dataset. The former experiment shows that our approach is able to perform semantic feature extraction on a large-scale dataset. It also demonstrates that topics generated at different topic levels have different semantic scopes, which makes them more appropriate for representing text content. Our classification experiments on 20newsgroups achieved 82.19% accuracy, which illustrates the effectiveness of fusing BOW and SF features. Further investigation of word and semantic feature fusion shows that the Support Vector Machine (SVM) is more sensitive to semantic features than Naive Bayes (NB), K Nearest Neighbor (KNN), and Decision Tree (DT). Appropriately fusing low-level word features with high-level semantic features can achieve results equal to or better than the state of the art, with reduced feature dimension and computational complexity.
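The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes that a BOW vector and per-level topic proportions (for example, from topic models trained with different numbers of topics) are already available for a document. The function names, the level names, and the `semantic_weight` mixing parameter are hypothetical and chosen for illustration only; the abstract does not specify how the paper weights or normalizes the two feature types.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm; an all-zero vector stays all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def fuse_features(
    bow: Sequence[float],
    topic_levels: Dict[str, Sequence[float]],
    semantic_weight: float = 0.5,
) -> List[float]:
    """Concatenate a normalized BOW block with normalized topic-level blocks.

    bow             -- term weights for one document (e.g. counts or tf-idf)
    topic_levels    -- hypothetical level name -> topic proportions at that level
    semantic_weight -- assumed mixing weight in [0, 1]; not taken from the paper
    """
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must lie in [0, 1]")

    # Low-level word features, scaled by the remaining weight.
    fused = [(1.0 - semantic_weight) * x for x in l2_normalize(bow)]

    # High-level semantic features: one block per topic level, in a fixed
    # (sorted) order so every document gets the same feature layout.
    for level_name in sorted(topic_levels):
        block = l2_normalize(topic_levels[level_name])
        fused.extend(semantic_weight * x for x in block)
    return fused


# Toy document: 3 vocabulary terms, a 2-topic level and a 4-topic level.
example = fuse_features(
    bow=[3.0, 0.0, 4.0],
    topic_levels={"coarse": [1.0, 0.0], "fine": [0.0, 0.0, 1.0, 0.0]},
    semantic_weight=0.5,
)
print(len(example))
```

The fused vector has one dimension per vocabulary term plus one per topic at each level, so it can be passed to any standard classifier such as an SVM. Normalizing each block separately keeps one feature type from dominating the other purely because of scale.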

Notes

1. The number of topics at each topic level can be arbitrary, but should differ across levels so that topics with different granularities can be observed.
2. http://qwone.com/~jason/20newsgroups/.

Author information

Authors and Affiliations

Hasso Plattner Institute, University of Potsdam, Prof.-Dr.-Helmert-Str. 2-3, 14482, Potsdam, Germany
Cheng Wang, Haojin Yang & Christoph Meinel

Corresponding author

Correspondence to Cheng Wang.

Editor information

Editors and Affiliations

Hewlett-Packard Enterprise, Sunnyvale, California, USA
Qiming Chen
Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
Blaise Pascal University, Aubiere, France
Farouk Toumani
University of Linz, Linz, Austria
Roland Wagner
Universidad Politécnica de Valencia, Valencia, Spain
Hendrik Decker

About this paper

Cite this paper

Wang, C., Yang, H., Meinel, C. (2015). Does Multilevel Semantic Representation Improve Text Categorization? In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe DEXA 2015. Lecture Notes in Computer Science, vol 9261. Springer, Cham. https://doi.org/10.1007/978-3-319-22849-5_22

DOI: https://doi.org/10.1007/978-3-319-22849-5_22
Published: 11 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22848-8
Online ISBN: 978-3-319-22849-5
eBook Packages: Computer Science, Computer Science (R0)
