
Topic modeling for sequential documents based on hybrid inter-document topic dependency

Journal of Intelligent Information Systems

Abstract

We propose two new topic modeling methods for sequential documents based on hybrid inter-document topic dependency. Topic modeling for sequential documents underlies many attractive applications such as emerging topic clustering and novel topic detection. For these tasks, most existing models introduce inter-document dependencies between topic distributions. In a real situation, however, adjacent emerging topics are often intertwined and mixed with outliers, and such single-dependency models have difficulty handling topic evolution in these multi-topic, outlier-mixed document sequences. To solve this problem, our first method considers three kinds of topic dependencies for each document, modeling its probabilities of belonging to a fading topic, an emerging topic, or an independent topic. Our second method extends the first by considering fine-grained dependencies in a given context, so as to handle more complex topic evolution sequences. Experiments on six standard topic modeling datasets show that our proposals outperform state-of-the-art models in terms of topic modeling accuracy, topic clustering quality, and outlier detection effectiveness.
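
To make the hybrid-dependency idea concrete, the snippet below sketches a toy generative process in which each document's topic distribution either follows the previous document (fading), concentrates on a previously weak topic (emerging), or ignores the history (independent/outlier). This is only an illustrative sketch under our own simplifying assumptions: the mixing weights, the concentration constants, and the function name are hypothetical, and the paper's actual models and their Gibbs-sampling inference are richer than this.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5                    # number of topics (hypothetical)
rho = [0.5, 0.3, 0.2]    # hypothetical mixing weights: fading, emerging, independent
alpha = 0.1              # symmetric Dirichlet prior

def next_doc_topic_dist(prev_theta):
    """Toy illustration: each document follows one of three dependency types."""
    dep = rng.choice(["fading", "emerging", "independent"], p=rho)
    if dep == "fading":
        # stay close to the previous document's topic distribution
        return rng.dirichlet(prev_theta * 50.0 + alpha)
    if dep == "emerging":
        # put most of the mass on a topic that was weak so far
        conc = np.full(K, alpha)
        conc[np.argmin(prev_theta)] += 10.0
        return rng.dirichlet(conc)
    # independent (outlier) document: ignore the history entirely
    return rng.dirichlet(np.full(K, alpha))

theta = rng.dirichlet(np.full(K, alpha))
for _ in range(10):
    theta = next_doc_topic_dist(theta)
    print(np.round(theta, 2))
```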




Notes

  1. The emerging topic clustering task groups sequential documents that belong to the same emerging topic into a set known as a cluster, without knowing their categories in advance (Fiscus and Doddington 2002).

  2. Novel Topic Detection (also called First Story Detection) refers to detecting the first document that discusses a topic, which tells us when to start a new cluster (Fiscus and Doddington 2002); a baseline sketch appears after these notes.

  3. Citizen Journalism (also known as “We Media”) is a form of media in which members of the public collect, report, analyze, and disseminate news or information (Bowman and Willis 2015).

  4. These two equations are our definitions of the target problems, based on similar definitions in the two cited studies (Sisodia et al. 2012; Em et al. 2017).

  5. The topic k in these equations is the ground-truth topic of a document, which is determined by the topic label or topic keyword of the dataset.

  6. The “Bag-of-Words” (BoW) model assumes that a document is a multiset of its words, disregarding grammar and even word order but keeping multiplicity (Sivic and Zisserman 2008); a toy illustration appears after these notes.

  7. http://www.daviddlewis.com/resources/testcollections/reuters21578/

  8. https://catalog.ldc.upenn.edu/LDC2001T57

  9. thuctc.thunlp.org

  10. https://www.nlm.nih.gov/databases/download/pubmed_medline.html

  11. https://github.com/liliverpool/Dataset

  12. http://mir.dcs.gla.ac.uk/resources/

  13. https://github.com/liliverpool/SOT.git

  14. We used the Gaussian kernel function and set σ = 0.5.
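
Note 2 above describes novel topic (first story) detection. The snippet below shows a classic nearest-neighbour baseline for that task, given only as an illustration: the threshold value and the function name are ours, and this similarity heuristic is not the detection mechanism proposed in the paper.

```python
import numpy as np

def is_first_story(doc_vec, history, threshold=0.5):
    """Flag a document as a 'first story' if its cosine similarity to every
    previously seen document falls below the threshold."""
    doc_vec = np.asarray(doc_vec, dtype=float)
    for prev in history:
        prev = np.asarray(prev, dtype=float)
        sim = doc_vec @ prev / (np.linalg.norm(doc_vec) * np.linalg.norm(prev) + 1e-12)
        if sim >= threshold:
            return False   # close to an existing topic
    return True            # no close neighbour: start a new cluster

history = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
print(is_first_story([0.0, 0.0, 1.0], history))  # True: looks like a new topic
print(is_first_story([1.0, 0.1, 0.0], history))  # False: matches an old topic
```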
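
Note 6 above summarizes the bag-of-words assumption. The following toy sketch (the example sentence and function name are made up) shows a document reduced to a multiset of word counts:

```python
from collections import Counter

def bag_of_words(text):
    """Toy bag-of-words: lowercase, split on whitespace, keep multiplicity."""
    return Counter(text.lower().split())

doc = "topic models treat a document as a bag of words words"
print(bag_of_words(doc))
# word order is discarded, but counts (e.g. 'a': 2, 'words': 2) are kept
```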
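
Note 14 refers to a Gaussian kernel with σ = 0.5. The commonly used form is k(x, y) = exp(−‖x − y‖² / (2σ²)); whether the paper uses this exact bandwidth parameterization is our assumption, and the function name below is ours.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.5):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

print(gaussian_kernel([1.0, 0.0], [0.0, 1.0]))  # ~0.018 with sigma = 0.5
```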

References

  • Amoualian, H., Clausel, M., Gaussier, E., & Amini, MR. (2016). Streaming-LDA: a copula-based approach to modeling topic dependencies in document streams. In Proceedings of the SIGKDD (pp. 695–704).

  • Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.


  • Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the ICML (pp. 113–120).

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.


  • Bowman, S., & Willis, C. (2015). We media: how audiences are shaping the future of news and information. The Media Center American Press Institute.

  • Walsh, B. (2004). Markov chain Monte Carlo and Gibbs sampling. Lecture notes for EEB 581.

  • Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: a survey. ACM Computing Surveys (CSUR), 41(3), 15.


  • Cheng, X., Yan, X., Lan, Y., & Guo, J. (2014). BTM: topic modeling over short texts. IEEE Transactions on Knowledge and Data Engineering, 26(12), 2928–2941.

  • Dhillon, I. S., Guan, Y., & Kulis, B. (2004). Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the SIGKDD (pp. 551–556).

  • Em, Y., Gao, F., Lou, Y., Wang, S., Huang, T., & Duan, L. Y. (2017). Incorporating intra-class variance to fine-grained visual recognition. In Proceedings of the ICME (pp. 1452–1457). IEEE.

  • Fiscus, J. G., & Doddington, G. R. (2002). Topic detection and tracking evaluation overview. In Topic detection and tracking (pp. 17–31). Springer.

  • Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4), 44.


  • Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721–741.


  • Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. arXiv:1605.09096.

  • Hao, Z., Kim, G., & Xing, E.P. (2015). Dynamic topic modeling for monitoring market competition from online text and image data. In Proceedings of the SIGKDD (pp. 1425–1434).

  • He, Q., Chang, K., Lim, E. P., & Zhang, J. (2007). Bursty feature representation for clustering text streams. In Proceedings of the ICDM (pp. 491–496).

  • He, Y., Wang, C., & Jiang, C. (2017). Incorporating the latent link categories in relational topic modeling. In Proceedings of the CIKM (pp. 1877–1886).

  • Huang, J., Peng, M., Wang, H., Cao, J., Gao, W., & Zhang, X. (2017). A probabilistic method for emerging topic tracking in microblog stream. World Wide Web, 20(2), 325–350.


  • Iwata, T., Watanabe, S., Yamada, T., & Ueda, N. (2009). Topic tracking model for analyzing consumer purchase behavior. In Proceedings of the IJCAI, (Vol. 9 pp. 1427–1432).

  • Kannan, R., Woo, H., Aggarwal, C. C., & Park, H. (2017a). Outlier detection for text data. In Proceedings of the 2017 SIAM international conference on data mining (pp. 489–497). SIAM.

  • Kannan, R., Woo, H., Aggarwal, C. C., & Park, H. (2017b). Outlier detection for text data: an extended version. arXiv:1701.01325.

  • Kontaki, M., Gounaris, A., Papadopoulos, A. N., Tsichlas, K., & Manolopoulos, Y. (2011). Continuous monitoring of distance-based outliers over data streams. In Proceedings of the ICDE (pp. 135–146).

  • Kulkarni, V., Al-Rfou, R., Perozzi, B., & Skiena, S. (2015). Statistically significant detection of linguistic change. In Proceedings of the 24th international conference on World Wide Web (pp. 625–635).

  • Lefkimmiatis, S., Maragos, P., & Papandreou, G. (2009). Bayesian inference on multiscale models for Poisson intensity estimation: applications to photon-limited image denoising. IEEE Transactions on Image Processing, 18(8), 1724–1741.


  • Li, X., Li, C., Chi, J., Ouyang, J., & Li, C. (2018). Dataless text classification: a topic modeling approach with document manifold. In Proceedings of the CIKM (pp. 973–982).

  • Liang, S., Yilmaz, E., & Kanoulas, E. (2016). Dynamic clustering of streaming short documents. In Proceedings of the SIGKDD (pp. 995–1004).

  • Liang, S., Ren, Z., Yilmaz, E., & Kanoulas, E. (2017). Collaborative user clustering for short text streams. In Proceedings of the AAAI (pp. 3504–3510).

  • Liu, Y., Liu, Z., Chua, T. S., & Sun, M. (2015). Topical word embeddings. In Proceedings of the AAAI.

  • Madsen, R. E., Kauchak, D., & Elkan, C. (2005). Modeling word burstiness using the Dirichlet distribution. In Proceedings of the ICML (pp. 545–552).

  • Manning, C., Raghavan, P., & Schütze, H. (2010). Introduction to information retrieval. Natural Language Engineering, 16(1), 100–103.


  • Pfitzner, D., Leibbrandt, R., & Powers, D. (2009). Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems, 19(3), 361.


  • Robinson, D. W., & Ruelle, D. (1967). Mean entropy of states in classical statistical mechanics. Communications in Mathematical Physics, 5(4), 288–300.


  • Seidenfeld, T. (1986). Entropy and uncertainty. Philosophy of Science, 53(4), 467–491.


  • Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.


  • Shi, B., Lam, W., Jameel, S., Schockaert, S., & Lai, K.P. (2017). Jointly learning word embeddings and latent topics. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval (pp. 375–384).

  • Sisodia, D., Singh, L., Sisodia, S., & Saxena, K. (2012). Clustering techniques: a brief survey of different clustering algorithms. International Journal of Latest Trends in Engineering and Technology (IJLTET), 1(3), 82–87.


  • Sivic, J., & Zisserman, A. (2008). Efficient visual search of videos cast as text retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4), 591–606.


  • Wang, X., & McCallum, A. (2006). Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the SIGKDD (pp. 424–433).

  • Wang, Y., Agichtein, E., & Benzi, M. (2012). TM-LDA: efficient online modeling of latent topic transitions in social media. In Proceedings of the SIGKDD (pp. 123–131).

  • Wei, X., Sun, J., & Wang, X. (2007). Dynamic mixture models for multiple time-series. In Proceedings of the IJCAI, (Vol. 7 pp. 2909–2914).

  • Yan, X., Guo, J., Lan, Y., Xu, J., & Cheng, X. (2015). A probabilistic model for bursty topic discovery in microblogs. In Proceedings of the AAAI (pp. 353–359).

  • Yao, Z., Sun, Y., Ding, W., Rao, N., & Xiong, H. (2018). Dynamic word embeddings for evolving semantic discovery. In Proceedings of the eleventh ACM international conference on web search and data mining (pp. 673–681).

  • Zhang, Y., Jatowt, A., Bhowmick, S. S., & Tanaka, K. (2016). The past is not a foreign country: detecting semantically similar terms across time. IEEE Transactions on Knowledge and Data Engineering, 28(10), 2793–2807.


  • Zuo, Y., Zhao, J., & Xu, K. (2016). Word network topic model: a simple but general solution for short and imbalanced texts. Knowledge and Information Systems, 48(2), 379–398.



Acknowledgments

Part of this research was supported by the Grant-in-Aid for Scientific Research JP18H03290 from JSPS and the State Scholarship Fund 201706680067 from the China Scholarship Council.

Author information

Corresponding author

Correspondence to Wenbo Li.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, W., Saigo, H., Tong, B. et al. Topic modeling for sequential documents based on hybrid inter-document topic dependency. J Intell Inf Syst 56, 435–458 (2021). https://doi.org/10.1007/s10844-020-00635-4

