Topic modeling for sequential documents based on hybrid inter-document topic dependency

Abstract

We propose two new topic modeling methods for sequential documents based on hybrid inter-document topic dependency. Topic modeling for sequential documents underlies many attractive applications, such as emerging topic clustering and novel topic detection. For these tasks, most existing models introduce inter-document dependencies between topic distributions. In practice, however, adjacent emerging topics are often intertwined and mixed with outliers, and these single-dependency models have difficulty handling topic evolution in such multi-topic, outlier-mixed sequential documents. To solve this problem, our first method considers three kinds of topic dependency for each document, modeling its probabilities of belonging to a fading topic, an emerging topic, or an independent topic. Our second method extends the first by considering fine-grained dependencies in a given context, so that it can handle more complex topic evolution sequences. Experiments on six standard topic modeling datasets show that our proposals outperform state-of-the-art models in the accuracy of topic modeling, the quality of topic clustering, and the effectiveness of outlier detection.

Notes

  1. The emerging topic clustering task is to group the sequential documents belonging to the same emerging topic into a set known as a cluster without knowing their category (Fiscus and Doddington 2002).

  2. Novel Topic Detection (also called First Story Detection) refers to detecting the first document that discusses a topic, which tells us when to start a new cluster (Fiscus and Doddington 2002).

  3. Citizen Journalism (also known as “We Media”) is a form of media in which members of the public collect, report, analyze, and disseminate news or information (Bowman and Willis 2015).

  4. These two equations are our definitions of the target problems, based on similar definitions in two earlier studies (Sisodia et al. 2012; Em et al. 2017).

  5. The topic k in these equations is the ground-truth topic of a document, which is determined by the topic label or topic keyword of the dataset.

  6. The “Bag-of-Words” (BoW) model assumes that a document is a multiset of words, disregarding grammar and even word order but keeping multiplicity (Sivic and Zisserman 2008).

  7. http://www.daviddlewis.com/resources/testcollections/reuters21578/

  8. https://catalog.ldc.upenn.edu/LDC2001T57

  9. thuctc.thunlp.org

  10. https://www.nlm.nih.gov/databases/download/pubmed_medline.html

  11. https://github.com/liliverpool/Dataset

  12. http://mir.dcs.gla.ac.uk/resources/

  13. https://github.com/liliverpool/SOT.git

  14. We used the Gaussian kernel function and set σ = 0.5.
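
The Bag-of-Words representation and the Gaussian kernel mentioned in the notes can be illustrated with a minimal sketch. It rests on several assumptions that the notes do not confirm: the kernel is taken in the common form k(x, y) = exp(-||x - y||^2 / (2σ^2)), it is applied to L2-normalized word-count vectors, and the three example documents and all function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Return a multiset (word -> count); word order is discarded."""
    return Counter(text.lower().split())


def to_vector(bow, vocabulary):
    """Map a bag of words to an L2-normalized count vector (assumed feature choice)."""
    counts = [bow.get(word, 0) for word in vocabulary]
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm > 0 else counts


def gaussian_kernel(x, y, sigma=0.5):
    """Assumed parametrization: exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Hypothetical toy documents, not taken from the paper's datasets.
doc_a = "stocks fall as oil prices rise"
doc_b = "oil prices rise as stocks fall"  # same words as doc_a, different order
doc_c = "team wins final match"           # shares no words with doc_a

bows = [bag_of_words(d) for d in (doc_a, doc_b, doc_c)]
vocabulary = sorted(set().union(*bows))
vec_a, vec_b, vec_c = (to_vector(b, vocabulary) for b in bows)

print(gaussian_kernel(vec_a, vec_b))  # 1.0, because the two bags are identical
print(gaussian_kernel(vec_a, vec_c))  # about 0.018, i.e. roughly exp(-4)
```

Because the vectors are normalized to unit length, two documents with no words in common are at squared distance 2, so with σ = 0.5 their similarity is close to exp(-4). Other libraries parametrize the same kernel with γ = 1 / (2σ^2), so values should be compared with that conversion in mind.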

References

  1. Amoualian, H., Clausel, M., Gaussier, E., & Amini, MR. (2016). Streaming-LDA: a copula-based approach to modeling topic dependencies in document streams. In Proceedings of the SIGKDD (pp. 695–704).

  2. Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.

  3. Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the ICML (pp. 113–120).

  4. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  5. Bowman, S., & Willis, C. (2015). We media: how audiences are shaping the future of news and information. The Media Center American Press Institute.

  6. Carlo, C. M. (2004). Markov chain Monte Carlo and Gibbs sampling. Lecture notes for EEB 581.

  7. Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: a survey. ACM Computing Surveys (CSUR), 41(3), 15.

  8. Cheng, X., Yan, X., Lan, Y., & Guo, J. (2014). BTM: topic modeling over short texts. IEEE Transactions on Knowledge & Data Engineering, (1), 1–1.

  9. Dhillon, I. S., Guan, Y., & Kulis, B. (2004). Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the SIGKDD (pp. 551–556).

  10. Em, Y., Gag, F., Lou, Y., Wang, S., Huang, T., & Duan, LY. (2017). Incorporating intra-class variance to fine-grained visual recognition. In Proceedings of the ICME (pp. 1452–1457): IEEE.

  11. Fiscus, J. G., & Doddington, G. R. (2002). Topic detection and tracking evaluation overview. In Topic detection and tracking (pp. 17–31). Springer.

  12. Gama, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4), 44.

  13. Geman, S., & Geman, D. (2009). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on PAMI, 6(6), 721–741.

  14. Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. arXiv:1605.09096.

  15. Hao, Z., Kim, G., & Xing, E.P. (2015). Dynamic topic modeling for monitoring market competition from online text and image data. In Proceedings of the SIGKDD (pp. 1425–1434).

  16. He, Q., Chang, K., Lim, E. P., & Zhang, J. (2007). Bursty feature representation for clustering text streams. In Proceedings of the ICDM (pp. 491–496).

  17. He, Y., Wang, C., & Jiang, C. (2017). Incorporating the latent link categories in relational topic modeling. In Proceedings of the CIKM (pp. 1877–1886).

  18. Huang, J., Peng, M., Wang, H., Cao, J., Gao, W., & Zhang, X. (2017). A probabilistic method for emerging topic tracking in microblog stream. World Wide Web, 20(2), 325–350.

  19. Iwata, T., Watanabe, S., Yamada, T., & Ueda, N. (2009). Topic tracking model for analyzing consumer purchase behavior. In Proceedings of the IJCAI, (Vol. 9 pp. 1427–1432).

  20. Kannan, R., Woo, H., Aggarwal, C. C., & Park, H. (2017a). Outlier detection for text data. In Proceedings of the 2017 SIAM international conference on data mining (pp. 489–497). SIAM.

  21. Kannan, R., Woo, H., Aggarwal, C. C., & Park, H. (2017b). Outlier detection for text data: an extended version. arXiv:1701.01325.

  22. Kontaki, M., Gounaris, A., Papadopoulos, A. N., Tsichlas, K., & Manolopoulos, Y. (2011). Continuous monitoring of distance-based outliers over data streams. In Proceedings of the ICDE (pp. 135–146).

  23. Kulkarni, V., Al-Rfou, R., Perozzi, B., & Skiena, S. (2015). Statistically significant detection of linguistic change. In Proceedings of the 24th international conference on World Wide Web (pp. 625–635).

  24. Lefkimmiatis, S., Maragos, P., & Papandreou, G. (2009). Bayesian inference on multiscale models for poisson intensity estimation: applications to photon-limited image denoising. IEEE Transactions on Image Processing, 18(8), 1724–1741.

  25. Li, X., Li, C., Chi, J., Ouyang, J., & Li, C. (2018). Dataless text classification: a topic modeling approach with document manifold. In Proceedings of the CIKM (pp. 973–982).

  26. Liang, S., Yilmaz, E., & Kanoulas, E. (2016). Dynamic clustering of streaming short documents. In Proceedings of the SIGKDD (pp. 995–1004).

  27. Liang, S., Ren, Z., Yilmaz, E., & Kanoulas, E. (2017). Collaborative user clustering for short text streams. In Proceedings of the AAAI (pp. 3504–3510).

  28. Liu, Y., Liu, Z., Chua, T. S., & Sun, M. (2015). Topical word embeddings. In Proceedings of the AAAI.

  29. Madsen, R. E., Kauchak, D., & Elkan, C. (2005). Modeling word burstiness using the dirichlet distribution. In Proceedings of the ICML (pp. 545–552).

  30. Manning, C., Raghavan, P., & Schütze, H. (2010). Introduction to information retrieval. Natural Language Engineering, 16(1), 100–103.

  31. Pfitzner, D., Leibbrandt, R., & Powers, D. (2009). Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems, 19(3), 361.

  32. Robinson, D. W., & Ruelle, D. (1967). Mean entropy of states in classical statistical mechanics. Communications in Mathematical Physics, 5(4), 288–300.

  33. Seidenfeld, T. (1986). Entropy and uncertainty. Philosophy of Science, 53(4), 467–491.

  34. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.

  35. Shi, B., Lam, W., Jameel, S., Schockaert, S., & Lai, K.P. (2017). Jointly learning word embeddings and latent topics. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval (pp. 375–384).

  36. Sisodia, D., Singh, L., Sisodia, S., & Saxena, K. (2012). Clustering techniques: a brief survey of different clustering algorithms. International Journal of Latest Trends in Engineering and Technology (IJLTET), 1(3), 82–87.

  37. Sivic, J., & Zisserman, A. (2008). Efficient visual search of videos cast as text retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4), 591–606.

  38. Wang, X., & McCallum, A. (2006). Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the SIGKDD (pp. 424–433).

  39. Wang, Y., Agichtein, E., & Benzi, M. (2012). TM-LDA: efficient online modeling of latent topic transitions in social media. In Proceedings of the SIGKDD (pp. 123–131).

  40. Wei, X., Sun, J., & Wang, X. (2007). Dynamic mixture models for multiple time-series. In Proceedings of the IJCAI, (Vol. 7 pp. 2909–2914).

  41. Yan, X., Guo, J., Lan, Y., Xu, J., & Cheng, X. (2015). A probabilistic model for bursty topic discovery in microblogs. In Proceedings of the AAAI (pp. 353–359).

  42. Yao, Z., Sun, Y., Ding, W., Rao, N., & Xiong, H. (2018). Dynamic word embeddings for evolving semantic discovery. In Proceedings of the eleventh ACM international conference on web search and data mining (pp. 673–681).

  43. Zhang, Y., Jatowt, A., Bhowmick, S. S., & Tanaka, K. (2016). The past is not a foreign country: detecting semantically similar terms across time. IEEE Transactions on Knowledge and Data Engineering, 28(10), 2793–2807.

  44. Zuo, Y., Zhao, J., & Xu, K. (2016). Word network topic model: a simple but general solution for short and imbalanced texts. Knowledge and Information Systems, 48(2), 379–398.

Acknowledgments

Part of this research was supported by Grant-in-Aid for Scientific Research JP18H03290 from JSPS and by State Scholarship Fund 201706680067 from the China Scholarship Council.

Author information

Corresponding author

Correspondence to Wenbo Li.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, W., Saigo, H., Tong, B. et al. Topic modeling for sequential documents based on hybrid inter-document topic dependency. J Intell Inf Syst (2021). https://doi.org/10.1007/s10844-020-00635-4

Keywords

  • Topic model
  • Sequential documents
  • Topic evolution
  • Outlier detection
  • Latent Dirichlet Allocation