Learning Focused Hierarchical Topic Models with Semi-Supervision in Microblogs

Slutsky, Anton; Hu, Xiaohua; An, Yuan

doi:10.1007/978-3-319-18032-8_47

Anton Slutsky¹⁰,
Xiaohua Hu¹⁰ &
Yuan An¹⁰

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 9078))

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

4197 Accesses

Abstract

Topic modeling approaches, such as Latent Dirichlet Allocation (LDA) and Hierarchical LDA (hLDA) have been used extensively to discover topics in various corpora. Unfortunately, these approaches do not perform well when applied to collections of social media posts. Further, these approaches do not allow users to focus topic discovery around subjectively interesting concepts. We propose the new Semi-Supervised Microblog-hLDA (SS-Micro-hLDA) model to discover topic hierarchies in short, noisy microblog documents in a way that allows users to focus topic discovery around interesting areas. We test SS-Micro-hLDA using a large, public collection of Twitter messages and Reddit social blogging site and show that our model outperforms hLDA, Constrained-hLDA, Recursive-rCRP and TSSB in terms of Pointwise Mutual Information (PMI) Score. Further, we test our model in terms of information entropy of held-out data and show that the new approach produces highly focused topic hierarchies.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Mehrotra, R., Sanner, S., Buntine, W., Xie, L.: Improving lda topic models for microblogs via tweet pooling and automatic labeling. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 889–892. ACM (2013)
Google Scholar
Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., Li, X.: Comparing twitter and traditional media using topic models. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 338–349. Springer, Heidelberg (2011)
Chapter Google Scholar
Bellaachia, A., Al-Dhelaan, M.: Ne-rank: a novel graph-based keyphrase extraction in twitter. In: Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology, vol. 1, pp. 372–379. IEEE Computer Society (2012)
Google Scholar
Jin, O., Liu, N.N., Zhao, K., Yu, Y., Yang, Q.: Transferring topical knowledge from auxiliary long texts for short text clustering. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 775–784. ACM (2011)
Google Scholar
Hu, Y., John, A., Seligmann, D.D., Wang, F.: What were the tweets about? topical associations between public events and twitter feeds. In: ICWSM (2012)
Google Scholar
Zhao, W.X., Jiang, J., He, J., Song, Y., Achananuparp, P., Lim, E.-P., Li, X.: Topical keyphrase extraction from twitter. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 379–388. Association for Computational Linguistics (2011)
Google Scholar
Conrad, C.: Cognitive economy in semantic memory (1972)
Google Scholar
Collins, A.M., Loftus, E.F.: A spreading-activation theory of semantic processing. Psychological Review 82(6), 407 (1975)
Article Google Scholar
Deerwester, S.: Improving information retrieval with latent semantic indexing (1988)
Google Scholar
Dhillon, I.S., Sra, S.: Generalized nonnegative matrix approximations with bregman divergences. In: NIPS, vol. 18 (2005)
Google Scholar
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. The Journal of Machine Learning Research 3, 993–1022 (2003)
MATH Google Scholar
Hong, L., Davison, B.D.: Empirical study of topic modeling in twitter. In: Proceedings of the First Workshop on Social Media Analytics, pp. 80–88. ACM (2010)
Google Scholar
Blei, D.M., Griffiths, T.L., Jordan, M.I., Tenenbaum, J.B.: Hierarchical topic models and the nested chinese restaurant process. In: NIPS, vol. 16 (2003)
Google Scholar
Aldous, D.: Exchangeability and related topics. In: École d’Été de Probabilités de Saint-Flour XIII1983, pp. 1–198 (1985)
Google Scholar
Andrzejewski, D., Zhu, X.: Latent dirichlet allocation with topic-in-set knowledge. In: Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, pp. 43–48. Association for Computational Linguistics (2009)
Google Scholar
Bodrunova, S., Koltsov, S., Koltsova, O., Nikolenko, S., Shimorina, A.: Interval semi-supervised LDA: classifying needles in a haystack. In: Castro, F., Gelbukh, A., González, M. (eds.) MICAI 2013, Part I. LNCS, vol. 8265, pp. 265–274. Springer, Heidelberg (2013)
Chapter Google Scholar
Mao, X.L., Ming, Z.Y., Chua, T.S., Li, S., Yan, H., Li, X.: Sshlda: a semi-supervised hierarchical topic model. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 800–809. Association for Computational Linguistics (2012)
Google Scholar
Zipf, G.K.: Selected studies of the principle of relative frequency in language. Harvard University Press, Cambridge (1932)
Book Google Scholar
Tweets2011 corpus, Online, TREC (2011). https://sites.google.com/site/microblogtrack/2011-guidelines
Newman, D., Karimi, S., Cavedon, L., Kay, J., Thomas, P., Trotman, A.: External evaluation of topic models. In: Australasian Document Computing Symposium (ADCS), pp. 1–8. School of Information Technologies, University of Sydney (2009)
Google Scholar
Newman, D., Lau, J.H., Grieser, K., Baldwin, T.: Automatic evaluation of topic coherence. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 100–108. Association for Computational Linguistics (2010)
Google Scholar
Newman, D., Noh, Y., Talley, E., Karimi, S., Baldwin, T.: Evaluating topic models for digital libraries. In: Proceedings of the 10th Annual Joint Conference on Digital libraries, pp. 215–224. ACM (2010)
Google Scholar
Shannon, C.: A mathematical theory of communication. The Bell System Technical Journal 27(3), 379–423 (1948)
Article MATH MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

College of Computing and Informatics, Drexel University, Philadelphia, USA
Anton Slutsky, Xiaohua Hu & Yuan An

Authors

Anton Slutsky
View author publications
You can also search for this author in PubMed Google Scholar
Xiaohua Hu
View author publications
You can also search for this author in PubMed Google Scholar
Yuan An
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Anton Slutsky .

Editor information

Editors and Affiliations

Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Tru Cao
Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Nanjing University, Nanjing, China
Zhi-Hua Zhou
Japan Advanced Institute of Science and Technology, Nomi City, Japan
Tu-Bao Ho
The University of Hong Kong, Hong Kong, Hong Kong SAR
David Cheung
Osaka University, Osaka, Japan
Hiroshi Motoda

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Slutsky, A., Hu, X., An, Y. (2015). Learning Focused Hierarchical Topic Models with Semi-Supervision in Microblogs. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science(), vol 9078. Springer, Cham. https://doi.org/10.1007/978-3-319-18032-8_47

Download citation

DOI: https://doi.org/10.1007/978-3-319-18032-8_47
Published: 09 May 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18031-1
Online ISBN: 978-3-319-18032-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics