Large-Scale Chinese Cross-Document Entity Disambiguation and Information Fusion

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8585)

Abstract

Cross-document entity disambiguation is the problem of determining whether mentions in different documents refer to the same entity or to distinct entities; it arises in information fusion and automated knowledge base construction. In this paper, we describe a Chinese information extraction (IE) and fusion system built on the Hadoop framework. The system performs document-level and corpus-level IE using a pipelined, multi-level modular approach to named entity recognition (EDR), entity relationship extraction, and information fusion. In document-level IE, the information associated with each mention of a name is merged into a rich entity profile by our co-reference and alias modules. In corpus-level IE, entity disambiguation is performed by agglomerative hierarchical clustering implemented with MapReduce. Visualized results in the form of an entity-centric information graph are demonstrated.
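
The corpus-level step can be illustrated with a small single-machine sketch. It is not the system described in the paper: the bag-of-words profiles, the cosine similarity, the centroid-style merging, the 0.3 threshold, and every function name below are assumptions chosen for illustration. The map step keys each document-level profile by entity name, and the reduce step clusters the profiles that share a name bottom-up, so that each resulting cluster stands for one distinct entity.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def map_phase(profiles):
    """Map step (hypothetical): key each document-level profile by its name."""
    for doc_id, name, context_terms in profiles:
        yield name, (doc_id, Counter(context_terms))


def reduce_phase(items, threshold=0.3):
    """Reduce step (hypothetical): agglomerative clustering for one name.

    Starts with one cluster per profile and repeatedly merges the most
    similar pair until no pair reaches the (assumed) threshold.
    """
    clusters = [([doc_id], vec) for doc_id, vec in items]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if sim >= best:
                    best, pair = sim, (i, j)
        if pair is None:
            break  # no remaining pair is similar enough to merge
        i, j = pair
        # Centroid-style merge: the merged cluster sums the term counts.
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(ids) for ids, _ in clusters]


def disambiguate(profiles, threshold=0.3):
    """Simulate the shuffle between map and reduce with an in-memory dict."""
    grouped = defaultdict(list)
    for name, item in map_phase(profiles):
        grouped[name].append(item)
    return {name: reduce_phase(items, threshold) for name, items in grouped.items()}


# Made-up profiles: (document id, entity name, context terms).
profiles = [
    ("d1", "Li Ming", ["professor", "university", "physics"]),
    ("d2", "Li Ming", ["physics", "professor", "research"]),
    ("d3", "Li Ming", ["football", "striker", "goal"]),
    ("d4", "Wang Fang", ["doctor", "hospital"]),
]
result = disambiguate(profiles)
```

Grouping by name before clustering keeps the pairwise comparisons local to each name, which is what makes the step straightforward to distribute: each reducer only sees the profiles for the names assigned to it.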

Corresponding author

Correspondence to Xiaoge Li.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Li, X., Ma, S., Zhou, X. (2014). Large-Scale Chinese Cross-Document Entity Disambiguation and Information Fusion. In: Rabl, T., Raghunath, N., Poess, M., Bhandarkar, M., Jacobsen, H.-A., Baru, C. (eds) Advancing Big Data Benchmarks. WBDB 2013. Lecture Notes in Computer Science, vol 8585. Springer, Cham. https://doi.org/10.1007/978-3-319-10596-3_9

  • DOI: https://doi.org/10.1007/978-3-319-10596-3_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10595-6

  • Online ISBN: 978-3-319-10596-3

  • eBook Packages: Computer Science (R0)
