Abstract
The rapidly increasing use of large-scale data on the Web has made named entity disambiguation one of the main challenges for research in Information Extraction and for the development of the Semantic Web. This paper presents a novel method for detecting proper names in a text and linking them to the right entities in Wikipedia. The method is hybrid and consists of two phases: the first uses heuristics and patterns to narrow down the candidates, and the second employs the vector space model to rank the remaining ambiguous cases and choose the right candidate. The novelty is that the disambiguation process is incremental: it runs in several rounds that filter the candidates by exploiting previously identified entities, and it extends the text with the attributes of those entities each time they are successfully resolved in a round. We test the proposed method on disambiguating names of people, locations, and organizations in news texts. The experimental results show that our approach achieves high accuracy and can be used to construct a robust named entity disambiguation system.
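The incremental ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the heuristic first phase is omitted, raw term counts stand in for whatever term weighting the paper uses, and the function names, the `threshold` and `max_rounds` parameters, and the toy candidate descriptions are all assumptions made for illustration. Each round ranks the candidates of every unresolved name by cosine similarity to the current context, accepts only a clear winner, and adds the winner's description to the context for later rounds.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(text, mentions, candidates, threshold=0.05, max_rounds=5):
    """Resolve names over several rounds, extending the context each time.

    mentions:   surface names found in the text (hypothetical input)
    candidates: dict mapping name -> {entity_id: description}
    threshold:  assumed minimum similarity for accepting a candidate
    """
    context = Counter(tokenize(text))
    resolved = {}
    for _ in range(max_rounds):
        progress = False
        for name in mentions:
            options = candidates.get(name, {})
            if name in resolved or not options:
                continue
            scored = sorted(
                ((cosine(context, Counter(tokenize(desc))), eid)
                 for eid, desc in options.items()),
                reverse=True,
            )
            best_score, best_id = scored[0]
            runner_up = scored[1][0] if len(scored) > 1 else 0.0
            # Accept only a clear winner; otherwise defer to a later round.
            if best_score >= threshold and best_score > runner_up:
                resolved[name] = best_id
                # Extend the context with the resolved entity's description.
                context.update(tokenize(options[best_id]))
                progress = True
        if not progress:
            break  # no name was resolved in this round, so stop
    return resolved


# Toy data, invented for illustration only.
text = "Columbus hosted the Buckeyes game before the team left for Athens."
mentions = ["Athens", "Columbus"]
candidates = {
    "Athens": {
        "Athens_Greece": "ancient Greek capital in Europe",
        "Athens_Ohio": "college town in southeastern Ohio",
    },
    "Columbus": {
        "Columbus_Ohio": "Ohio city, home to Buckeyes football",
        "Columbus_Georgia": "Georgia town on Chattahoochee river",
    },
}
result = disambiguate(text, mentions, candidates)
one_round = disambiguate(text, mentions, candidates, max_rounds=1)
```

In this toy run, "Athens" shares no terms with the text in the first round and is deferred. Once "Columbus" is resolved and its description is added to the context, the term "Ohio" lets the second round resolve "Athens" as well, which is the effect the incremental rounds are meant to achieve.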
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nguyen, H.T., Cao, T.H. (2008). Named Entity Disambiguation: A Hybrid Statistical and Rule-Based Incremental Approach. In: Domingue, J., Anutariya, C. (eds) The Semantic Web. ASWC 2008. Lecture Notes in Computer Science, vol 5367. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89704-0_29
DOI: https://doi.org/10.1007/978-3-540-89704-0_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89703-3
Online ISBN: 978-3-540-89704-0
eBook Packages: Computer Science, Computer Science (R0)