A Hierarchical Topic Modelling Approach for Tweet Clustering

  • Conference paper
Social Informatics (SocInfo 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10540)

Abstract

While social media platforms such as Twitter can provide rich and up-to-date information for a wide range of applications, manually digesting such large volumes of data is difficult and costly. It is therefore important to automatically infer coherent and discriminative topics from tweets. Conventional topic models and document clustering approaches fail to achieve good results due to the noisy and sparse nature of tweets. In this paper, we explore various ways of tackling this challenge and propose a two-stage hierarchical topic modelling system that is efficient and effective in alleviating the data sparsity problem. We present an extensive evaluation on two datasets and report that the proposed system achieves the best results in both document clustering performance and topic coherence.

Notes

  1. Topic proportion: the proportion of words in document d that are assigned to topic t, or the topic probabilities of a document, i.e. p(t|d).

  2. http://www.snow-workshop.org/2017/challenge/.

  3. Their data is not evaluated due to its lack of annotated tweets.

  4. https://nlp.stanford.edu/projects/glove/.

  5. We have also evaluated LCTM with the number of concepts set to 600 and 1000; however, we observed little difference in performance.

  6. The GloVe model was trained using 2 billion tweets, while the Word2Vec model was trained on 5 million tweets using the skip-gram algorithm.
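
The topic proportion defined in Note 1 can be illustrated with a minimal sketch. It assumes each word of a document has already been assigned a single topic index (for example by a sampler); the function name `topic_proportions` and the toy assignments are hypothetical and do not come from the paper.

```python
from collections import Counter

def topic_proportions(assignments, num_topics):
    """Estimate p(t|d) for one document from its per-word topic assignments."""
    if not assignments:
        # Assumption: an empty document falls back to a uniform distribution.
        return [1.0 / num_topics] * num_topics
    counts = Counter(assignments)
    total = len(assignments)
    # Fraction of the document's words assigned to each topic.
    return [counts[t] / total for t in range(num_topics)]

# Hypothetical tweet of five tokens, each assigned to one of three topics.
proportions = topic_proportions([0, 2, 2, 1, 2], num_topics=3)

# The topic with the largest proportion can serve as the document's cluster label.
dominant_topic = max(range(3), key=lambda t: proportions[t])
```

Taking the dominant topic as a cluster label is one common way to turn topic proportions into a document clustering; whether the paper uses exactly this rule is not stated in this section.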

Acknowledgments

This work is partly supported by The Alan Turing Institute. We would also like to thank Anjie Fang, Dat Quoc Nguyen, Jey Han Lau, Svitlana Vakulenko and Weihua Hu for answering questions regarding their respective work.

Corresponding author

Correspondence to Bo Wang.

Appendix

Table 3. Example topics detected on FSD corpus
Table 4. Example topics detected on ED corpus - day one

We present a set of randomly selected example topics generated by GSDMM+LFLDA on both the first story detection (FSD) corpus and the first day of the event detection (ED) corpus, as seen in Tables 3 and 4. Each detected topic is presented with its top-10 topic words and is matched with the corresponding topic description or story from the ground truth (given by the creators of these datasets), as well as a sample tweet retrieved using the topic keywords.

As shown in Tables 3 and 4, the words in the obtained topics are mostly coherent and align well with the ground-truth topic descriptions. We can also discover more useful information about the corresponding real-world story simply by looking at its topic words. For example, in the first topic of Table 3 we see that the Twittersphere has mentioned ‘Amy Winehouse’ and ‘death’ along with the word ‘drug’. This information might be missed if one read only a set of randomly sampled tweets mentioning ‘Amy Winehouse’.
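
As a rough illustration of how a topic's top words and a matching sample tweet could be obtained, the sketch below ranks words by probability and picks the tweet sharing the most tokens with those keywords. The section does not describe the retrieval method actually used, so the keyword-overlap rule, the function names `top_words` and `best_matching_tweet`, and all data values are assumptions made up for this example.

```python
def top_words(word_probs, k=10):
    """Return the k highest-probability words of one topic."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]

def best_matching_tweet(tweets, keywords):
    """Return the tweet sharing the most distinct tokens with the keywords.

    Assumption: simple lowercase whitespace tokenisation and set overlap.
    """
    keyset = set(keywords)
    return max(tweets, key=lambda t: len(keyset & set(t.lower().split())))

# Made-up topic-word probabilities and tweets, for illustration only.
topic = {"winehouse": 0.30, "amy": 0.25, "death": 0.20, "drug": 0.15, "london": 0.10}
tweets = [
    "Traffic in London is terrible today",
    "Amy Winehouse death shocks fans",
]

keywords = top_words(topic, k=3)
sample = best_matching_tweet(tweets, keywords)
```

A real pipeline would more likely score tweets with a retrieval model or restrict the search to tweets assigned to the topic; plain token overlap is used here only to keep the sketch short.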

Copyright information

© 2017 Springer International Publishing AG

Cite this paper

Wang, B., Liakata, M., Zubiaga, A., Procter, R. (2017). A Hierarchical Topic Modelling Approach for Tweet Clustering. In: Ciampaglia, G., Mashhadi, A., Yasseri, T. (eds) Social Informatics. SocInfo 2017. Lecture Notes in Computer Science, vol 10540. Springer, Cham. https://doi.org/10.1007/978-3-319-67256-4_30

  • DOI: https://doi.org/10.1007/978-3-319-67256-4_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67255-7

  • Online ISBN: 978-3-319-67256-4

  • eBook Packages: Computer Science, Computer Science (R0)
