Abstract
Cross-document entity disambiguation is the problem of determining whether mentions in different documents refer to the same entity or to distinct entities; it arises in information fusion and automated knowledge base construction. In this paper, we describe a Chinese Information Extraction (IE) and fusion system built on the Hadoop framework. The system performs document-level and corpus-level IE, taking a pipelined, multi-level modular approach to Named Entity Recognition (EDR), entity relationship extraction, and information fusion. In document-level IE, the information associated with each mention of a name is merged into a rich entity profile by our co-reference and alias modules. In corpus-level IE, entity disambiguation is performed by agglomerative hierarchical clustering implemented with MapReduce. Visualized results of the entity-centric information graph are also demonstrated.
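The corpus-level step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it simulates the map, shuffle, and reduce phases in a single Python process rather than on Hadoop, and it assumes that each mention profile is a set of context features, that profiles are compared with Jaccard similarity, and that clusters are merged by average linkage until the best similarity falls below a threshold. The function names, the `threshold` value, and the sample data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def map_phase(mentions):
    """Map: key each document-level profile by its name string, so that all
    mentions sharing a name reach the same reducer."""
    for doc_id, name, features in mentions:
        yield name, (doc_id, frozenset(features))


def shuffle(pairs):
    """Shuffle: group mapped values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(profiles, threshold=0.3):
    """Reduce: average-link agglomerative clustering of one name's profiles.
    Merging stops when the best pairwise cluster similarity is below threshold."""
    clusters = [[p] for p in profiles]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            sims = [jaccard(a[1], b[1]) for a in clusters[i] for b in clusters[j]]
            sim = sum(sims) / len(sims)
            if sim > best_sim:
                best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair  # i < j, so deleting j leaves index i valid
        clusters[i].extend(clusters[j])
        del clusters[j]
    # Each cluster is reported as the sorted document ids of one inferred entity.
    return [sorted(doc_id for doc_id, _ in c) for c in clusters]


def disambiguate(mentions, threshold=0.3):
    """Run map, shuffle, and reduce over (doc_id, name, features) triples."""
    grouped = shuffle(map_phase(mentions))
    return {name: reduce_phase(profiles, threshold)
            for name, profiles in grouped.items()}


if __name__ == "__main__":
    # Hypothetical sample data: three documents mention "Li Wei".
    sample = [
        ("d1", "Li Wei", {"Beijing", "professor", "Tsinghua"}),
        ("d2", "Li Wei", {"professor", "Tsinghua", "physics"}),
        ("d3", "Li Wei", {"Shanghai", "footballer", "Shenhua"}),
    ]
    print(disambiguate(sample))
```

Keying the map output by name string keeps each reducer's clustering problem small, at the cost of never merging mentions whose surface names differ; in the system described above that gap would be covered by the alias handling at document level.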
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Li, X., Ma, S., Zhou, X. (2014). Large-Scale Chinese Cross-Document Entity Disambiguation and Information Fusion. In: Rabl, T., Raghunath, N., Poess, M., Bhandarkar, M., Jacobsen, H.-A., Baru, C. (eds) Advancing Big Data Benchmarks. WBDB 2013. Lecture Notes in Computer Science, vol 8585. Springer, Cham. https://doi.org/10.1007/978-3-319-10596-3_9
DOI: https://doi.org/10.1007/978-3-319-10596-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10595-6
Online ISBN: 978-3-319-10596-3
eBook Packages: Computer Science, Computer Science (R0)