A Hybrid Approach for Sparse Data Classification Based on Topic Model

Wang, Guangjing; Zhang, Jie; Yang, Xiaobin; Li, Li

doi:10.1007/978-3-319-47121-1_2

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9998)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

With the growing volume of short texts, sparse text classification is becoming crucial in data mining and information retrieval. Much effort has been devoted to improving the efficiency of classifying normal-length text, but techniques for high-dimensional, sparse data remain immature. In this paper, we present a new method that combines the Biterm Topic Model (BTM) with a Support Vector Machine (SVM). BTM significantly reduces the dimensionality of the training data while still preserving rich semantic information for the sparse data. We then train an SVM on the generated topics, which serve as features. Experiments on the 20 Newsgroups and Tencent microblog datasets demonstrate that our approach achieves excellent classification performance in terms of precision, recall, and F1 measure. Furthermore, the results show that the proposed method is more efficient than the combination of Latent Dirichlet Allocation (LDA) and SVM. Our method improves on previous work in this field and lays a foundation for further studies.
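As a rough illustration of the BTM step summarized above, the following sketch shows how a short text could be turned into a low-dimensional topic vector suitable as SVM input. It follows the standard BTM formulation, in which a document's topic proportions are P(z|d) = Σ_b P(z|b) P(b|d) over its biterms (unordered word pairs). The function names and the toy parameters `theta` and `phi` are hypothetical and are not taken from the paper; in practice these parameters would be learned from the corpus, and the resulting vectors would be passed to an SVM classifier.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def doc_topic_features(tokens, theta, phi):
    """Infer P(z|d) = sum_b P(z|b) * P(b|d) for a single document.

    theta[z]  : topic prior P(z)          (assumed to be already learned)
    phi[z][w] : word distribution P(w|z)  (assumed to be already learned)
    """
    counts = Counter(extract_biterms(tokens))
    total = sum(counts.values())
    num_topics = len(theta)
    features = [0.0] * num_topics
    if total == 0:
        return features  # fewer than two tokens: no biterms to use
    for (w1, w2), n in counts.items():
        # Unnormalised P(z|b) is theta_z * phi_{w1|z} * phi_{w2|z}.
        scores = [
            theta[z] * phi[z].get(w1, 0.0) * phi[z].get(w2, 0.0)
            for z in range(num_topics)
        ]
        norm = sum(scores)
        if norm == 0.0:
            continue  # biterm has zero probability under every topic
        for z in range(num_topics):
            # P(b|d) is the empirical biterm frequency in the document.
            features[z] += (scores[z] / norm) * (n / total)
    return features


# Toy, hand-picked parameters for two topics (illustrative only).
theta = [0.5, 0.5]
phi = [
    {"svm": 0.4, "kernel": 0.4, "topic": 0.1, "word": 0.1},
    {"svm": 0.1, "kernel": 0.1, "topic": 0.4, "word": 0.4},
]

tokens = ["svm", "kernel", "topic"]
features = doc_topic_features(tokens, theta, phi)
print(features)  # a 2-dimensional vector that would be fed to the SVM
```

The vocabulary-sized sparse representation is replaced here by a vector with one entry per topic, which is the dimensionality reduction the abstract refers to.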

Acknowledgments

This work is supported by the Natural Science Foundation of China (No. 61170192), the National High-tech R&D Program of China (No. 2013AA013801), and the Fundamental Research Funds for the Central Universities (No. XDJK2016E064).

Author information

Authors and Affiliations

Faculty of Computer and Information Science, Southwest University, Chongqing, 400715, China
Guangjing Wang, Jie Zhang, Xiaobin Yang & Li Li

Corresponding author

Correspondence to Li Li.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Shaoxu Song
Beihang University, Beijing, China
Yongxin Tong

About this paper

Cite this paper

Wang, G., Zhang, J., Yang, X., Li, L. (2016). A Hybrid Approach for Sparse Data Classification Based on Topic Model. In: Song, S., Tong, Y. (eds) Web-Age Information Management. WAIM 2016. Lecture Notes in Computer Science, vol 9998. Springer, Cham. https://doi.org/10.1007/978-3-319-47121-1_2

DOI: https://doi.org/10.1007/978-3-319-47121-1_2
Published: 15 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47120-4
Online ISBN: 978-3-319-47121-1
eBook Packages: Computer Science, Computer Science (R0)
