Abstract
Most textual documents contain references to real-world entities such as people, locations, and organizations. Understanding the correlations among these entities underlies many applications, including social relationship construction platforms and major search engines. This paper aims to discover entity correlations from news archives by means of the proposed hierarchical Entity Topic Model (hETM). hETM is a semantic analysis model that follows the gist of probabilistic topic models and uses a directed acyclic graph (DAG) to capture arbitrary topic correlations. Entity extraction is a preprocessing step of our model, after which we employ different generative processes for ordinary words and for entities. Entity correlations are discovered by analyzing the dependencies between entities and their associated topics, together with the correlations among topics. We evaluate the approach on a BBC news dataset, and the results demonstrate that the discovered entity correlations are of higher quality than those of existing methods.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yang, L., Li, C., Ding, Q., Li, L. (2013). Discovering Correlated Entities from News Archives. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds) Web Information Systems Engineering – WISE 2013. WISE 2013. Lecture Notes in Computer Science, vol 8181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41154-0_13
DOI: https://doi.org/10.1007/978-3-642-41154-0_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41153-3
Online ISBN: 978-3-642-41154-0
eBook Packages: Computer Science (R0)