We propose two new topic modeling methods for sequential documents based on hybrid inter-document topic dependency. Topic modeling for sequential documents underpins applications such as emerging topic clustering and novel topic detection. For these tasks, most existing models introduce inter-document dependencies between topic distributions. In practice, however, adjacent emerging topics are often intertwined and mixed with outliers, and such single-dependency models struggle to handle topic evolution in multi-topic, outlier-mixed sequential documents. To solve this problem, our first method considers three kinds of topic dependency for each document, modeling its probabilities of belonging to a fading topic, an emerging topic, or an independent topic. Our second method extends the first by considering fine-grained dependencies within a given context, which allows it to handle more complex topic evolution sequences. Experiments on six standard topic modeling datasets show that our proposals outperform state-of-the-art models in terms of the accuracy of topic modeling, the quality of topic clustering, and the effectiveness of outlier detection.
The emerging topic clustering task is to group the sequential documents belonging to the same emerging topic into a set known as a cluster without knowing their category (Fiscus and Doddington 2002).
Novel topic detection (also called first story detection) refers to detecting the first document that discusses a topic, which tells us when to start a new cluster (Fiscus and Doddington 2002).
Citizen journalism (also known as “We Media”) is a form of media in which members of the public collect, report, analyze, and disseminate news or information (Bowman and Willis 2015).
The topic k in these equations is the ground-truth topic of a document, which is determined by the topic label or topic keyword of the dataset.
The “Bag-of-Words” (BoW) assumes that a document is a multiset of words, disregarding grammar and even word order but keeping multiplicity (Sivic and Zisserman 2008).
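The bag-of-words assumption can be illustrated with a minimal Python sketch. The function name `bag_of_words` and the lowercase, whitespace-based tokenization are assumptions made for this illustration only; they are not taken from the article.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map each word to its count; word order and grammar are discarded.

    Hypothetical helper: lowercasing and whitespace splitting are
    illustrative choices, not the article's preprocessing.
    """
    return Counter(text.lower().split())


# Two sentences with different word order yield the same multiset.
doc_a = bag_of_words("the cat chased the dog")
doc_b = bag_of_words("the dog chased the cat")
```

Here `doc_a` and `doc_b` compare equal even though the sentences differ in meaning, while the repeated word "the" keeps its multiplicity of two.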
We used the Gaussian kernel function and set σ = 0.5.
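As a rough illustration, a Gaussian kernel with σ = 0.5 might be computed as in the sketch below. It assumes the common parameterization k(x, y) = exp(-||x - y||² / (2σ²)), which the note above does not spell out; the function name `gaussian_kernel` is likewise hypothetical.

```python
import math
from typing import Sequence

SIGMA = 0.5  # bandwidth reported in the note above


def gaussian_kernel(x: Sequence[float], y: Sequence[float],
                    sigma: float = SIGMA) -> float:
    """Gaussian (RBF) kernel, assuming k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Identical points have similarity 1; similarity decays with distance.
same = gaussian_kernel([0.0, 0.0], [0.0, 0.0])
near = gaussian_kernel([0.0], [0.5])
far = gaussian_kernel([0.0], [2.0])
```

Under this parameterization a small σ such as 0.5 makes the similarity fall off quickly, so only close points contribute noticeably to the kernel value.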
Amoualian, H., Clausel, M., Gaussier, E., & Amini, MR. (2016). Streaming-LDA: a copula-based approach to modeling topic dependencies in document streams. In Proceedings of the SIGKDD (pp. 695–704).
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the ICML (pp. 113–120).
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Bowman, S., & Willis, C. (2015). We media: how audiences are shaping the future of news and information. The Media Center American Press Institute.
Carlo, C. M. (2004). Markov chain Monte Carlo and Gibbs sampling. Lecture notes for EEB 581.
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: a survey. ACM Computing Surveys (CSUR), 41(3), 15.
Cheng, X., Yan, X., Lan, Y., & Guo, J. (2014). BTM: topic modeling over short texts. IEEE Transactions on Knowledge & Data Engineering, (1), 1–1.
Dhillon, I. S., Guan, Y., & Kulis, B. (2004). Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the SIGKDD (pp. 551–556).
Em, Y., Gag, F., Lou, Y., Wang, S., Huang, T., & Duan, LY. (2017). Incorporating intra-class variance to fine-grained visual recognition. In Proceedings of the ICME (pp. 1452–1457): IEEE.
Fiscus, J. G., & Doddington, G. R. (2002). Topic detection and tracking evaluation overview. In Topic detection and tracking (pp. 17–31). Springer.
Gama, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4), 44.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on PAMI, 6(6), 721–741.
Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. arXiv:1605.09096.
Hao, Z., Kim, G., & Xing, E.P. (2015). Dynamic topic modeling for monitoring market competition from online text and image data. In Proceedings of the SIGKDD (pp. 1425–1434).
He, Q., Chang, K., Lim, E. P., & Zhang, J. (2007). Bursty feature representation for clustering text streams. In Proceedings of the ICDM (pp. 491–496).
He, Y., Wang, C., & Jiang, C. (2017). Incorporating the latent link categories in relational topic modeling. In Proceedings of the CIKM (pp. 1877–1886).
Huang, J., Peng, M., Wang, H., Cao, J., Gao, W., & Zhang, X. (2017). A probabilistic method for emerging topic tracking in microblog stream. World Wide Web, 20(2), 325–350.
Iwata, T., Watanabe, S., Yamada, T., & Ueda, N. (2009). Topic tracking model for analyzing consumer purchase behavior. In Proceedings of the IJCAI, (Vol. 9 pp. 1427–1432).
Kannan, R., Woo, H., Aggarwal, C. C., & Park, H. (2017a). Outlier detection for text data. In Proceedings of the 2017 SIAM international conference on data mining (pp. 489–497). SIAM.
Kannan, R., Woo, H., Aggarwal, C. C., & Park, H. (2017b). Outlier detection for text data: an extended version. arXiv:1701.01325.
Kontaki, M., Gounaris, A., Papadopoulos, A. N., Tsichlas, K., & Manolopoulos, Y. (2011). Continuous monitoring of distance-based outliers over data streams. In Proceedings of the ICDE (pp. 135–146).
Kulkarni, V., Al-Rfou, R., Perozzi, B., & Skiena, S. (2015). Statistically significant detection of linguistic change. In Proceedings of the 24th international conference on World Wide Web (pp. 625–635).
Lefkimmiatis, S., Maragos, P., & Papandreou, G. (2009). Bayesian inference on multiscale models for poisson intensity estimation: applications to photon-limited image denoising. IEEE Transactions on Image Processing, 18(8), 1724–1741.
Li, X., Li, C., Chi, J., Ouyang, J., & Li, C. (2018). Dataless text classification: a topic modeling approach with document manifold. In Proceedings of the CIKM (pp. 973–982).
Liang, S., Yilmaz, E., & Kanoulas, E. (2016). Dynamic clustering of streaming short documents. In Proceedings of the SIGKDD (pp. 995–1004).
Liang, S., Ren, Z., Yilmaz, E., & Kanoulas, E. (2017). Collaborative user clustering for short text streams. In Proceedings of the AAAI (pp. 3504–3510).
Liu, Y., Liu, Z., Chua, T. S., & Sun, M. (2015). Topical word embeddings. In Proceedings of the AAAI.
Madsen, R. E., Kauchak, D., & Elkan, C. (2005). Modeling word burstiness using the dirichlet distribution. In Proceedings of the ICML (pp. 545–552).
Manning, C., Raghavan, P., & Schütze, H. (2010). Introduction to information retrieval. Natural Language Engineering, 16(1), 100–103.
Pfitzner, D., Leibbrandt, R., & Powers, D. (2009). Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems, 19(3), 361.
Robinson, D. W., & Ruelle, D. (1967). Mean entropy of states in classical statistical mechanics. Communications in Mathematical Physics, 5(4), 288–300.
Seidenfeld, T. (1986). Entropy and uncertainty. Philosophy of Science, 53(4), 467–491.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.
Shi, B., Lam, W., Jameel, S., Schockaert, S., & Lai, K.P. (2017). Jointly learning word embeddings and latent topics. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval (pp. 375–384).
Sisodia, D., Singh, L., Sisodia, S., & Saxena, K. (2012). Clustering techniques: a brief survey of different clustering algorithms. International Journal of Latest Trends in Engineering and Technology (IJLTET), 1(3), 82–87.
Sivic, J., & Zisserman, A. (2008). Efficient visual search of videos cast as text retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4), 591–606.
Wang, X., & McCallum, A. (2006). Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the SIGKDD (pp. 424–433).
Wang, Y., Agichtein, E., & Benzi, M. (2012). TM-LDA: efficient online modeling of latent topic transitions in social media. In Proceedings of the SIGKDD (pp. 123–131).
Wei, X., Sun, J., & Wang, X. (2007). Dynamic mixture models for multiple time-series. In Proceedings of the IJCAI, (Vol. 7 pp. 2909–2914).
Yan, X., Guo, J., Lan, Y., Xu, J., & Cheng, X. (2015). A probabilistic model for bursty topic discovery in microblogs. In Proceedings of the AAAI (pp. 353–359).
Yao, Z., Sun, Y., Ding, W., Rao, N., & Xiong, H. (2018). Dynamic word embeddings for evolving semantic discovery. In Proceedings of the eleventh ACM international conference on web search and data mining (pp. 673–681).
Zhang, Y., Jatowt, A., Bhowmick, S. S., & Tanaka, K. (2016). The past is not a foreign country: detecting semantically similar terms across time. IEEE Transactions on Knowledge and Data Engineering, 28(10), 2793–2807.
Zuo, Y., Zhao, J., & Xu, K. (2016). Word network topic model: a simple but general solution for short and imbalanced texts. Knowledge and Information Systems, 48(2), 379–398.
Part of this research was supported by Grant-in-Aid for Scientific Research JP18H03290 from JSPS and by State Scholarship Fund 201706680067 from the China Scholarship Council.
Cite this article
Li, W., Saigo, H., Tong, B. et al. Topic modeling for sequential documents based on hybrid inter-document topic dependency. J Intell Inf Syst (2021). https://doi.org/10.1007/s10844-020-00635-4
- Topic model
- Sequential documents
- Topic evolution
- Outlier detection
- Latent Dirichlet Allocation