A Hybrid Approach for Sparse Data Classification Based on Topic Model

  • Guangjing Wang
  • Jie Zhang
  • Xiaobin Yang
  • Li Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9998)


With the growing volume of short texts, sparse text classification is becoming crucial in data mining and information retrieval. Much effort has been devoted to improving the efficiency of conventional text classification, but methods for high-dimensional, sparse data remain immature. In this paper, we present a new method that combines the Biterm Topic Model (BTM) with a Support Vector Machine (SVM). BTM significantly reduces the dimensionality of the training data while retaining rich semantic information for sparse data. We then train an SVM on the resulting topic features. Experiments on the 20 Newsgroups and Tencent microblog datasets demonstrate that our approach achieves excellent classification performance in terms of precision, recall, and F1 measure. Furthermore, the results show that the proposed method is highly efficient compared with the combination of Latent Dirichlet Allocation (LDA) and SVM. Our method improves on previous work in this field and lays a foundation for further studies.
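As a rough illustration of why a biterm-based model suits sparse short texts, the sketch below extracts biterms (unordered pairs of words that co-occur in the same short document) and counts them over the whole corpus. It assumes the documents are already tokenized, and the function names `extract_biterms` and `corpus_biterm_counts` are hypothetical. It covers only the biterm extraction step, not BTM inference or the SVM stage, and it is not the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered pair of tokens from one short document."""
    # Sorting each pair makes ("b", "a") and ("a", "b") the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Count biterms over the whole corpus rather than per document."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


# Toy corpus (made-up tokens, for illustration only).
docs = [
    ["apple", "iphone", "launch"],
    ["apple", "fruit", "juice"],
    ["iphone", "launch", "event"],
]
counts = corpus_biterm_counts(docs)
print(counts[("iphone", "launch")])
```

Each document here contributes only three word pairs, but pooling the pairs across the corpus gives the co-occurrence statistics from which a topic model can be estimated, which is the intuition behind using biterms for short texts.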



This work is supported by the Natural Science Foundation of China (No. 61170192), the National High-tech R&D Program of China (No. 2013AA013801), and the Fundamental Research Funds for the Central Universities (No. XDJK2016E064).



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Guangjing Wang (1)
  • Jie Zhang (1)
  • Xiaobin Yang (1)
  • Li Li (1), corresponding author
  1. Faculty of Computer and Information Science, Southwest University, Chongqing, China
