A Hierarchical Topic Modelling Approach for Tweet Clustering

Wang, Bo; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob

doi:10.1007/978-3-319-67256-4_30

Bo Wang¹⁶,
Maria Liakata^16,17,
Arkaitz Zubiaga¹⁶ &
…
Rob Procter^16,17

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10540))

Included in the following conference series:

International Conference on Social Informatics

4538 Accesses
13 Citations
5 Altmetric

Abstract

While social media platforms such as Twitter can provide rich and up-to-date information for a wide range of applications, manually digesting such large volumes of data is difficult and costly. Therefore it is important to automatically infer coherent and discriminative topics from tweets. Conventional topic models and document clustering approaches fail to achieve good results due to the noisy and sparse nature of tweets. In this paper, we explore various ways of tackling this challenge and finally propose a two-stage hierarchical topic modelling system that is efficient and effective in alleviating the data sparsity problem. We present an extensive evaluation on two datasets, and report our proposed system achieving the best performance in both document clustering performance and topic coherence.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Institutional subscriptions

Notes

1.
Topic proportion: the proportion of words in document d that are assigned to topic t or the topic probabilities of a document, i.e. p(t|d).
2.
http://www.snow-workshop.org/2017/challenge/.
3.
Their data is not evaluated due to its lack of annotated tweets.
4.
https://nlp.stanford.edu/projects/glove/.
5.
We have also evaluated LCTM with number of concepts setting to 600 and 1000, however we observed little difference in the performance.
6.
The GloVe model was trained using 2 billion tweets while the Word2Vec model was trained on 5 million tweets using the skip-gram algorithm.

References

Aggarwal, C.C., Subbian, K.: Event detection in social streams. In: Proceedings of the 2012 SIAM International Conference on Data Mining, pp. 624–635. SIAM (2012)
Google Scholar
Allan, J.: Topic Detection and Tracking: Event-based Information Organization, vol. 12. Springer Science & Business Media (2012)
Google Scholar
Alvarez-Melis, D., Saveski, M.: Topic modeling in twitter: aggregating tweets by conversations. In: ICWSM, pp. 519–522 (2016)
Google Scholar
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J.M., Perona, I.: An extensive comparative study of cluster validity indices. Pattern Recogn. 46(1), 243–256 (2013)
Article Google Scholar
Becker, H., Naaman, M., Gravano, L.: Beyond trending topics: real-world event identification on twitter. In: ICWSM 2011, pp. 438–441 (2011)
Google Scholar
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
MATH Google Scholar
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J.L., Blei, D.M.: Reading tea leaves: how humans interpret topic models. In: Advances in Neural Information Processing Systems, pp. 288–296 (2009)
Google Scholar
Dhingra, B., Zhou, Z., Fitzpatrick, D., Muehl, M., Cohen, W.W.: Tweet2vec: character-based distributed representations for social media. In: The 54th Annual Meeting of the Association for Computational Linguistics, p. 269 (2016)
Google Scholar
Fang, A., Macdonald, C., Ounis, I., Habel, P.: Using word embedding to evaluate the coherence of topics from twitter data. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1057–1060. ACM (2016)
Google Scholar
Hoffman, M., Bach, F.R., Blei, D.M.: Online learning for latent dirichlet allocation. In: Advances in Neural Information Processing Systems, pp. 856–864 (2010)
Google Scholar
Hong, L., Davison, B.D.: Empirical study of topic modeling in twitter. In: Proceedings of the First Workshop on Social Media Analytics, pp. 80–88. ACM (2010)
Google Scholar
Hu, W., Tsujii, J.: A latent concept topic model for robust topic inference using word embeddings. In: The 54th Annual Meeting of the Association for Computational Linguistics, p. 380 (2016)
Google Scholar
Ifrim, G., Shi, B., Brigadir, I.: Event detection in twitter using aggressive filtering and hierarchical tweet clustering. In: Second Workshop on Social News on the Web (SNOW), Seoul, Korea, vol. 8. ACM, April 2014
Google Scholar
Jordaan, M.: Poke me, i’m a journalist: the impact of facebook and twitter on newsroom routines and cultures at two south african weeklies. Ecquid Novi: African Journalism Stud. 34(1), 21–35 (2013)
Article MathSciNet Google Scholar
Lau, J.H., Baldwin, T.: The sensitivity of topic coherence evaluation to topic cardinality. In: Proceedings of NAACL-HLT, pp. 483–487 (2016)
Google Scholar
Lau, J.H., Collier, N., Baldwin, T.: On-line trend analysis with topic models: \(\backslash \)# twitter trends detection topic model online. In: COLING, pp. 1519–1534 (2012)
Google Scholar
Lau, J.H., Newman, D., Baldwin, T.: Machine reading tea leaves: automatically evaluating topic coherence and topic model quality. In: EACL, pp. 530–539 (2014)
Google Scholar
Li, C., Wang, H., Zhang, Z., Sun, A., Ma, Z.: Topic modeling for short texts with auxiliary word embeddings. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 165–174. ACM (2016)
Google Scholar
Li, S., Chua, T.S., Zhu, J., Miao, C.: Generative topic embedding: a continuous representation of documents. In: Proceedings of The 54th Annual Meeting of the Association for Computational Linguistics (ACL) (2016)
Google Scholar
McMinn, A.J., Moshfeghi, Y., Jose, J.M.: Building a large-scale corpus for evaluating event detection on twitter. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 409–418. ACM (2013)
Google Scholar
Mehrotra, R., Sanner, S., Buntine, W., Xie, L.: Improving lda topic models for microblogs via tweet pooling and automatic labeling. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 889–892. ACM (2013)
Google Scholar
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Google Scholar
Müllner, D., et al.: fastcluster: Fast hierarchical, agglomerative clustering routines for R and python. J. Stat. Softw. 53(9), 1–18 (2013)
Article Google Scholar
Newman, D., Lau, J.H., Grieser, K., Baldwin, T.: Automatic evaluation of topic coherence. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 100–108. Association for Computational Linguistics (2010)
Google Scholar
Newman, N.: The rise of social media and its impact on mainstream journalism (2009)
Google Scholar
Nguyen, D.Q., Billingsley, R., Du, L., Johnson, M.: Improving topic models with latent feature word representations. Trans. Assoc. Computat. Linguist. 3, 299–313 (2015)
Google Scholar
Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Mach. Learn. 39(2), 103–134 (2000)
Article MATH Google Scholar
Petrović, S., Osborne, M., Lavrenko, V.: Streaming first story detection with application to twitter. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 181–189. Association for Computational Linguistics (2010)
Google Scholar
Petrović, S., Osborne, M., Lavrenko, V.: Using paraphrases for improving first story detection in news and twitter. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 338–346. Association for Computational Linguistics (2012)
Google Scholar
Phan, X.H., Nguyen, L.M., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th International Conference on World Wide Web, pp. 91–100. ACM (2008)
Google Scholar
Quan, X., Kit, C., Ge, Y., Pan, S.J.: Short and sparse text topic modeling via self-aggregation. In: IJCAI, pp. 2270–2276 (2015)
Google Scholar
Rosa, K.D., Shah, R., Lin, B., Gershman, A., Frederking, R.: Topical clustering of tweets. In: Proceedings of the ACM SIGIR: SWSM (2011)
Google Scholar
Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Article MATH Google Scholar
Sokal, R.R., Rohlf, F.J.: The comparison of dendrograms by objective methods. In: Taxon, pp. 33–40 (1962)
Google Scholar
Vakulenko, S., Nixon, L., Lupu, M.: Character-based neural embeddings for tweet clustering. In: SocialNLP 2017, p. 36 (2017)
Google Scholar
Weng, J., Lim, E.P., Jiang, J., He, Q.: Twitterrank: finding topic-sensitive influential twitterers. In: Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, pp. 261–270. ACM (2010)
Google Scholar
Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 1445–1456. ACM (2013)
Google Scholar
Yin, J., Wang, J.: A dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 233–242. ACM (2014)
Google Scholar
Yin, J.: Clustering microtext streams for event identification. In: IJCNLP, pp. 719–725 (2013)
Google Scholar
Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., Li, X.: Comparing twitter and traditional media using topic models. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 338–349. Springer, Heidelberg (2011). doi:10.1007/978-3-642-20161-5_34
Chapter Google Scholar

Download references

Acknowledgments

This work is partly supported by The Alan Turing Institute. We would also like to thank Anjie Fang, Dat Quoc Nguyen, Jey Han Lau, Svitlana Vakulenko and Weihua Hu for answering questions regarding their work, respectively.

Author information

Authors and Affiliations

Department of Computer Science, University of Warwick, Coventry, UK
Bo Wang, Maria Liakata, Arkaitz Zubiaga & Rob Procter
The Alan Turing Institute, London, UK
Maria Liakata & Rob Procter

Authors

Bo Wang
View author publications
You can also search for this author in PubMed Google Scholar
Maria Liakata
View author publications
You can also search for this author in PubMed Google Scholar
Arkaitz Zubiaga
View author publications
You can also search for this author in PubMed Google Scholar
Rob Procter
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Bo Wang .

Editor information

Editors and Affiliations

Indiana University, Bloomington, Indiana, USA
Giovanni Luca Ciampaglia
University of Washington, Seattle, Washington, USA
Afra Mashhadi
University of Oxford, Oxford, United Kingdom
Taha Yasseri

Appendix

Table 3. Example topics detected on FSD corpus

Full size table

Table 4. Example topics detected on ED corpus - day one

Full size table

We present a set of randomly selected example topics generated by GSDMM +LFLDA, on both the first story detection (FSD) corpus and the first day of the event detection (ED) corpus, as seen in Tables 3 and 4. Each detected topic is presented with its top-10 topic words, and is matched with the corresponding topic description or story from the ground truth (given by the creators of these data sets), as well as a sample tweet retrieved using the topic keywords.

As shown in Tables 3 and 4, words in obtained topics are mostly coherent and well aligned with a ground-truth topic description. We can also discover more useful information with regard to the corresponding real-world story, by simply looking at its topic words. For example, in the first topic of Table 3 we see the Twittersphere has mentioned ‘Amy Winehouse’ and ‘death’ along with the word ‘drug’. This information may have been missed if one only chooses to read a set of randomly sampled tweets mentioning ‘Amy Winehouse’.

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wang, B., Liakata, M., Zubiaga, A., Procter, R. (2017). A Hierarchical Topic Modelling Approach for Tweet Clustering. In: Ciampaglia, G., Mashhadi, A., Yasseri, T. (eds) Social Informatics. SocInfo 2017. Lecture Notes in Computer Science(), vol 10540. Springer, Cham. https://doi.org/10.1007/978-3-319-67256-4_30

Download citation

DOI: https://doi.org/10.1007/978-3-319-67256-4_30
Published: 02 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67255-7
Online ISBN: 978-3-319-67256-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

A Hierarchical Topic Modelling Approach for Tweet Clustering

Abstract

Access this chapter

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Appendix

Appendix

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation