A Hybrid Approach for Sparse Data Classification Based on Topic Model

  • Conference paper
Web-Age Information Management (WAIM 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9998)

Abstract

With the growing volume of short texts, sparse text classification is becoming crucial in data mining and information retrieval. Much effort has been devoted to improving the efficiency of conventional text classification, but techniques for processing high-dimensional, sparse data remain immature. In this paper, we present a new method that combines the Biterm Topic Model (BTM) with a Support Vector Machine (SVM). BTM significantly reduces the dimensionality of the training data while still preserving rich semantic information for the sparse data. We then apply the SVM to the generated topics, which serve as features. Experiments on the 20 Newsgroups and Tencent microblog datasets demonstrate that our approach achieves excellent classification performance in terms of precision, recall, and F1 measure. Furthermore, the results indicate that the proposed method is highly efficient compared with the combination of Latent Dirichlet Allocation (LDA) and SVM. Our method improves on previous work in this field and lays a foundation for further studies.
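The abstract gives no implementation details, so the following is only a minimal sketch of how a fitted biterm topic model could turn a short document into a low-dimensional feature vector for an SVM. It assumes the usual BTM formulation from the literature: a biterm is an unordered word pair co-occurring in a short text, and the document-topic distribution is P(z|d) = Σ_b P(z|b) P(b|d), with P(z|b) proportional to θ_z φ_{w_i|z} φ_{w_j|z}. The function names, the toy vocabulary, and the values of `theta` and `phi` are hypothetical and are not taken from the paper. Parameter estimation and the SVM training step are omitted.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short document."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def doc_topic_features(tokens, theta, phi):
    """Infer P(z|d) for one document from already-fitted BTM parameters.

    theta[z]  : global topic proportion P(z)           (assumed given)
    phi[z][w] : topic-word probability P(w|z)          (assumed given)
    The returned list has one entry per topic and can serve as the
    feature vector passed to a downstream classifier.
    """
    biterm_counts = Counter(extract_biterms(tokens))
    total = sum(biterm_counts.values())
    num_topics = len(theta)
    features = [0.0] * num_topics
    if total == 0:
        return features  # fewer than two tokens, so no biterms exist

    for (w_i, w_j), count in biterm_counts.items():
        # Unnormalized P(z|b) for each topic z.
        scores = [
            theta[z] * phi[z].get(w_i, 0.0) * phi[z].get(w_j, 0.0)
            for z in range(num_topics)
        ]
        norm = sum(scores)
        if norm == 0.0:
            continue  # biterm contains a word unseen under every topic
        p_b_given_d = count / total  # empirical biterm frequency in d
        for z in range(num_topics):
            features[z] += (scores[z] / norm) * p_b_given_d
    return features


# Toy parameters for two topics (hypothetical values, for illustration only).
theta = [0.5, 0.5]
phi = [
    {"apple": 0.4, "banana": 0.4, "goal": 0.1, "match": 0.1},
    {"apple": 0.1, "banana": 0.1, "goal": 0.4, "match": 0.4},
]

features = doc_topic_features(["apple", "banana", "goal"], theta, phi)
print(features)
```

In this sketch the feature vector has as many dimensions as there are topics, regardless of vocabulary size, which is the dimensionality reduction the abstract refers to. Each vector, paired with its class label, would then be handed to an SVM implementation of the reader's choice.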

Acknowledgments

This work is supported by the Natural Science Foundations of China (No. 61170192), the National High-tech R&D Program of China (No. 2013AA013801), and the Fundamental Research Funds for the Central Universities (No. XDJK2016E064).

Author information

Corresponding author

Correspondence to Li Li.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Wang, G., Zhang, J., Yang, X., Li, L. (2016). A Hybrid Approach for Sparse Data Classification Based on Topic Model. In: Song, S., Tong, Y. (eds) Web-Age Information Management. WAIM 2016. Lecture Notes in Computer Science, vol 9998. Springer, Cham. https://doi.org/10.1007/978-3-319-47121-1_2

  • DOI: https://doi.org/10.1007/978-3-319-47121-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47120-4

  • Online ISBN: 978-3-319-47121-1

  • eBook Packages: Computer Science, Computer Science (R0)
