
Document Labeling Using Source-LDA Combined with Correlation Matrix

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 711)

Abstract

Topic modeling is one of the most widely applied and active research areas in information retrieval, and it has become increasingly important given the large and varied amount of data produced every second. In this paper, we address two major drawbacks of latent Dirichlet allocation (LDA): its assumption of topic independence and its fully unsupervised learning. To remove the first drawback, we use Wikipedia as a knowledge source in a semi-supervised model (Source-LDA) that generates predefined topic-word distributions. The second drawback is removed using a correlation matrix containing the cosine similarity between every pair of topics. We use a semi-supervised LDA rather than a fully supervised model so as not to overfit the data for new labels. Experimental results show that the performance of Source-LDA combined with the correlation matrix is better than that of traditional LDA and Source-LDA alone.
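The topic correlation matrix described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `cosine_similarity_matrix` and the toy topic-word matrix `phi` are assumptions introduced here, with each row of `phi` standing in for one topic's word distribution.

```python
import numpy as np

def cosine_similarity_matrix(topic_word):
    """Pairwise cosine similarity between topic-word distributions.

    topic_word: (K, V) array with one row per topic over a vocabulary
    of size V. Returns a (K, K) matrix C with C[i, j] = cos(topic_i, topic_j).
    """
    norms = np.linalg.norm(topic_word, axis=1, keepdims=True)
    unit = topic_word / norms          # normalize each topic to unit length
    return unit @ unit.T               # dot products of unit vectors = cosines

# Three toy "topics" over a 4-word vocabulary (rows sum to 1).
phi = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
])
C = cosine_similarity_matrix(phi)
# Topics 0 and 2 put most of their mass on the same word,
# so C[0, 2] is larger than C[0, 1].
```

In a real run, `phi` would come from the Source-LDA topic-word distributions built from Wikipedia, and the resulting matrix would feed the labeling step.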


Notes

  1. http://snowball.tartarus.org/algorithms/porter/stemmer.html
  2. https://www.dmoz.org/
  3. http://qwone.com/~jason/20Newsgroups/
  4. http://www.daviddlewis.com/resources/testcollections/reuters21578/
  5. http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/
  6. http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/


Author information


Corresponding author

Correspondence to Rajendra Kumar Roul.



Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Roul, R.K., Sahoo, J.K. (2019). Document Labeling Using Source-LDA Combined with Correlation Matrix. In: Behera, H., Nayak, J., Naik, B., Abraham, A. (eds) Computational Intelligence in Data Mining. Advances in Intelligent Systems and Computing, vol 711. Springer, Singapore. https://doi.org/10.1007/978-981-10-8055-5_62
