Large-Scale Chinese Cross-Document Entity Disambiguation and Information Fusion

Li, Xiaoge; Ma, Sugang; Zhou, Xiaohui

doi:10.1007/978-3-319-10596-3_9

Conference paper
First Online: 01 January 2014

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8585)

Abstract

Cross-document entity disambiguation is the problem of determining whether mentions in different documents refer to the same entity or to distinct entities; it arises in information fusion and automated knowledge base construction. In this paper, we describe a Chinese Information Extraction (IE) and fusion system built on the Hadoop framework. The system combines document-level IE and corpus-level IE in a pipelined, multi-level modular approach to named entity recognition (EDR), entity relationship extraction, and information fusion. In document-level IE, the information associated with each mention of a name is merged into a rich entity profile by our co-reference and alias modules. In corpus-level IE, entity disambiguation is performed by agglomerative hierarchical clustering implemented with MapReduce. Visualized results are presented as an entity-centric information graph.
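
The corpus-level step named in the abstract, agglomerative hierarchical clustering of entity profiles, can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the linkage criterion, the stopping rule, or how the work is split across MapReduce jobs. The sketch assumes that each profile is a set of context features, that similarity is Jaccard overlap, that clusters are compared by average linkage, and that merging stops at a fixed threshold. The function names and the sample profiles are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def average_link(c1, c2, profiles):
    """Mean pairwise similarity between the members of two clusters."""
    sims = [jaccard(profiles[i], profiles[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def agglomerative_cluster(profiles, threshold=0.3):
    """Bottom-up clustering: start with one cluster per profile and keep
    merging the most similar pair until no pair reaches `threshold`.
    The threshold value is an assumption, not taken from the paper."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x, y in combinations(range(len(clusters)), 2):
            sim = average_link(clusters[x], clusters[y], profiles)
            if sim > best_sim:
                best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break
        x, y = best_pair
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical document-level profiles for four mentions sharing one name.
profiles = [
    {"professor", "xi'an", "computer science"},
    {"professor", "xi'an", "university"},
    {"footballer", "beijing", "striker"},
    {"footballer", "striker", "league"},
]
print(agglomerative_cluster(profiles))  # [[0, 1], [2, 3]]
```

Each output cluster groups the profile indices judged to describe one real-world entity. The pairwise similarity search in every round is quadratic in the number of clusters, which is the part a large-scale system would need to distribute, for example by first partitioning profiles by name string so that only mentions sharing a name are compared.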

Author information

Authors and Affiliations

School of Computer Science, Xi’an University of Posts and Telecommunications, Xi’an, China
Xiaoge Li, Sugang Ma & Xiaohui Zhou

Corresponding author

Correspondence to Xiaoge Li.

Editor information

Editors and Affiliations

University of Toronto, Toronto, Ontario, Canada
Tilmann Rabl
Cisco Systems, Inc., San José, USA
Nambiar Raghunath
Oracle Corporation, Redwood Shores, USA
Meikel Poess
Pivotal Software, Inc., Palo Alto, USA
Milind Bhandarkar
University of Toronto, Toronto, Canada
Hans-Arno Jacobsen
University of California at San Diego, La Jolla, USA
Chaitanya Baru

About this paper

Cite this paper

Li, X., Ma, S., Zhou, X. (2014). Large-Scale Chinese Cross-Document Entity Disambiguation and Information Fusion. In: Rabl, T., Raghunath, N., Poess, M., Bhandarkar, M., Jacobsen, H.-A., Baru, C. (eds) Advancing Big Data Benchmarks. WBDB 2013. Lecture Notes in Computer Science, vol 8585. Springer, Cham. https://doi.org/10.1007/978-3-319-10596-3_9

DOI: https://doi.org/10.1007/978-3-319-10596-3_9
Published: 09 October 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10595-6
Online ISBN: 978-3-319-10596-3
eBook Packages: Computer Science (R0)
