Abstract
Clustering analysis aims to group similar data objects into the same cluster. Topic models, a family of soft clustering methods, are powerful tools for discovering the latent clusters (topics) behind large data sets. Because temporal data are dynamic, clusters often exhibit complicated patterns such as birth, branching and death. However, most existing temporal clustering models assume that clusters evolve as a linear chain, so they can neither model nor detect the branching of clusters. In this paper, we present evolving Dirichlet processes (EDP for short) to model nonlinear evolutionary traces behind temporal data, especially temporal text collections. In the EDP setting, a temporal collection is divided into epochs. To model cluster branching over time, EDP allows each cluster in an epoch to form its own Dirichlet process (DP) and uses a combination of these cluster-specific DPs as the prior for the cluster distributions in the next epoch. To model hierarchical temporal data, such as online document collections, we propose a new class of evolving hierarchical Dirichlet processes (EHDP for short), which extends the hierarchical Dirichlet process (HDP) to evolving temporal data. We design an online learning framework based on Gibbs sampling to infer the evolutionary traces of clusters over time. In experiments on both synthetic and real-world text collections, we validate that EDP and EHDP capture nonlinear evolutionary traces of clusters and achieve better results than their peers.
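As background for the DP prior mentioned above, the clustering behaviour of a single Dirichlet process can be illustrated with the standard Chinese restaurant process, in which each new item joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter. The following is a minimal sketch of that single-epoch building block only. It is not the EDP or EHDP construction of this paper, and the function name `crp_assignments` and the parameter `alpha` are illustrative assumptions.

```python
import random
from collections import Counter


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process.

    alpha is the concentration parameter (must be > 0); larger values
    tend to produce more clusters. Illustrative helper, not from the paper.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of items currently in cluster k
    labels = []  # labels[i] = cluster index assigned to item i
    for _ in range(n_items):
        # Existing cluster k has weight counts[k]; a new cluster has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # birth of a new cluster
        else:
            counts[k] += 1  # item joins an existing cluster
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = crp_assignments(n_items=200, alpha=1.0)
    print("number of clusters:", len(counts))
    print("largest clusters:", Counter(labels).most_common(3))
```

The number of clusters is not fixed in advance but grows with the data, which is the property that makes DP-based priors suitable for modeling clusters that are born over time.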
Acknowledgments
This work was supported by NSFC (61370025, 61502479), Australia ARC Discovery Project (DP140102206) and the Strategic Leading Science and Technology Projects of CAS (No. XDA06030200).
Additional information
Responsible editor: Charu Aggarwal.
About this article
Cite this article
Wang, P., Zhang, P., Zhou, C. et al. Hierarchical evolving Dirichlet processes for modeling nonlinear evolutionary traces in temporal data. Data Min Knowl Disc 31, 32–64 (2017). https://doi.org/10.1007/s10618-016-0454-1
DOI: https://doi.org/10.1007/s10618-016-0454-1