LDA-PSTR: A Topic Modeling Method for Short Text

  • Kai Zhou
  • Qun Yang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11323)

Abstract

Topic detection in short text has become an important task in content analysis applications. Topic modeling is an effective way to discover topics by finding document-level word co-occurrence patterns. Most conventional topic models are based on the bag-of-words representation, in which the context information of words is ignored. Moreover, when such models are applied directly to short text, co-occurrence patterns are lacking because unigram representations are sparse. Existing work either expands the data using external knowledge resources or simply aggregates semantically related short texts. These methods generally produce low-quality topic representations or suffer from poor semantic correlation between different data resources. In this paper, we propose a different method that is computationally efficient and effective. Our method applies frequent pattern mining to uncover statistically significant patterns that explicitly capture semantic associations and co-occurrences among corpus-level words. We use these frequent patterns as feature units to represent texts, referred to as pattern set-based text representation (PSTR). In addition, to represent text more precisely, we propose a new probabilistic topic model called LDA-PSTR, together with an improved Gibbs sampling algorithm for it. Experiments on different corpora show that this approach discovers more prominent and coherent topics and achieves significant performance improvements on several evaluation metrics.
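
As a rough illustration of the idea of representing texts by frequent patterns, the sketch below counts word sets that occur in at least a minimum number of documents and then describes each short text by the frequent patterns it contains. It is not the paper's algorithm: the function names (`mine_frequent_patterns`, `to_pattern_representation`), the `min_support` and `max_size` parameters, the brute-force enumeration, and the toy corpus are all assumptions made for this example, and the LDA-PSTR model itself is not shown.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_patterns(docs, min_support=2, max_size=2):
    """Return word sets (as sorted tuples) found in at least min_support docs.

    Hypothetical helper: brute-force enumeration of word sets up to
    max_size words, which is only practical for very short texts.
    """
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.split()))
        for size in range(1, max_size + 1):
            for pattern in combinations(words, size):
                counts[pattern] += 1  # one count per document (support)
    return {p: c for p, c in counts.items() if c >= min_support}


def to_pattern_representation(doc, patterns):
    """Represent a text by the frequent patterns whose words it all contains."""
    words = set(doc.split())
    return sorted(p for p in patterns if words.issuperset(p))


# Toy corpus, invented for illustration only.
docs = [
    "apple releases new phone",
    "new phone from apple",
    "football match tonight",
    "tonight football final",
]

patterns = mine_frequent_patterns(docs, min_support=2, max_size=2)
pstr = [to_pattern_representation(d, patterns) for d in docs]

for doc, rep in zip(docs, pstr):
    print(doc, "->", rep)
```

In this sketch a multi-word pattern such as `("football", "tonight")` becomes a single feature unit, so a topic model trained on these units would see corpus-level co-occurrence directly instead of inferring it from sparse unigrams. A practical system would likely swap the brute-force counting for a dedicated miner such as Apriori or FP-growth.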

Keywords

Topic modeling, Short text, Text representation, Frequent pattern, LDA

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China