An Unsupervised Language Independent Method of Name Discrimination Using Second Order Co-occurrence Features

Pedersen, Ted; Kulkarni, Anagha; Angheluta, Roxana; Kozareva, Zornitsa; Solorio, Thamar

doi:10.1007/11671299_23

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3878)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Previous work by Pedersen, Purandare and Kulkarni (2005) has resulted in an unsupervised method of name discrimination that represents the context in which an ambiguous name occurs using second order co-occurrence features. These contexts are then clustered to identify which of them are associated with different underlying named entities. The method also extracts descriptive and discriminating bigrams from each of the discovered clusters to serve as identifying labels. These methods have been shown to perform well with English text, although we believe them to be language independent since they rely on lexical features and use no syntactic features or external knowledge sources. In this paper we apply this methodology in exactly the same way to Bulgarian, English, Romanian, and Spanish corpora. We find that it attains discrimination accuracy that is consistently well above that of a majority classifier, thus providing support for the hypothesis that the method is language independent.
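
The idea of a second order co-occurrence representation, and of the majority classifier used as the point of comparison, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy corpus, the function names, the use of plain co-occurrence counts within a context, and cosine similarity as the comparison measure are all assumptions made for illustration. Each context is represented by the average of the co-occurrence vectors of its words, so two contexts can be similar even when they share no words directly.

```python
from collections import Counter, defaultdict
from math import sqrt

def cooccurrence_vectors(corpus):
    """First order step: for every word, count the words it co-occurs with."""
    vectors = defaultdict(Counter)
    for tokens in corpus:
        for word in tokens:
            for other in tokens:
                if other != word:
                    vectors[word][other] += 1
    return vectors

def second_order_vector(tokens, word_vectors):
    """Second order step: average the co-occurrence vectors of the context's words."""
    total = Counter()
    for word in tokens:
        total.update(word_vectors.get(word, {}))
    n = max(len(tokens), 1)
    return {feature: count / n for feature, count in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def majority_baseline(gold_labels):
    """Accuracy of a classifier that assigns every context to the most frequent entity."""
    return Counter(gold_labels).most_common(1)[0][1] / len(gold_labels)

# Hypothetical toy corpus (assumption): text from which co-occurrences are counted.
corpus = [
    ["bulls", "basketball", "played"],
    ["basketball", "game", "scored"],
    ["river", "valley", "water"],
    ["valley", "flows", "sea"],
]
word_vectors = cooccurrence_vectors(corpus)

# Hypothetical contexts of one ambiguous name, with the name itself removed.
ctx_a = second_order_vector(["bulls", "played"], word_vectors)
ctx_b = second_order_vector(["game", "scored"], word_vectors)
ctx_c = second_order_vector(["water", "sea"], word_vectors)

# ctx_a and ctx_b share no words, but both co-occur with "basketball".
print(cosine(ctx_a, ctx_b), cosine(ctx_a, ctx_c))
print(majority_baseline(["player", "player", "river"]))
```

In this sketch a clustering step would group contexts by such similarities, and the resulting accuracy would then be compared against `majority_baseline`, mirroring the comparison described in the abstract.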

References

Bagga, A., Baldwin, B.: Entity-based cross-document co-referencing using the vector space model. In: Proceedings of the 17th International Conference on Computational Linguistics, pp. 79–85. Association for Computational Linguistics (1998)
Gaustad, T.: Statistical corpus-based word sense disambiguation: Pseudowords vs. real ambiguous words. In: Proceedings of the Student Research Workshop at ACL-2001, Toulouse, France, pp. 61–66 (2001)
Ginter, F., Boberg, J., Järvinen, J., Salakoski, T.: New techniques for disambiguation in natural language and their application to biological text. Journal of Machine Learning Research 5, 605–621 (2004)
Gooi, C.H., Allan, J.: Cross-document coreference on a large scale corpus. In: Dumais, S., Marcu, D., Roukos, S. (eds.) HLT-NAACL 2004: Main Proceedings, Boston, Massachusetts, USA, May 2 - May 7, pp. 9–16. Association for Computational Linguistics (2004)
Hatzivassiloglou, V., Duboue, P., Rzhetsky, A.: Disambiguating proteins, genes, and RNA in text: A machine learning approach. In: Proceedings of the 9th International Conference on Intelligent Systems for Molecular Biology, Tivoli Gardens, Denmark (July 2001)
Mann, G., Yarowsky, D.: Unsupervised personal name disambiguation. In: Daelemans, W., Osborne, M. (eds.) Proceedings of CoNLL-2003, Edmonton, Canada, pp. 33–40 (2003)
Nakov, P., Hearst, M.: Category-based pseudowords. In: Companion Volume to the Proceedings of HLT-NAACL 2003 - Short Papers, Edmonton, Alberta, Canada, May 27 - June 1, pp. 67–69 (2003)
Pedersen, T., Purandare, A., Kulkarni, A.: Name discrimination by clustering similar contexts. In: Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, February 2005, pp. 220–231 (2005)
Purandare, A.: Discriminating among word senses using McQuitty’s similarity analysis. In: Proceedings of the Student Research Workshop at HLT-NAACL 2003, Edmonton, Alberta, Canada, May 27 - June 1, pp. 19–24 (2003)
Purandare, A., Pedersen, T.: Word sense discrimination by clustering contexts in vector and similarity spaces. In: Proceedings of the Conference on Computational Natural Language Learning, Boston, MA, pp. 41–48 (2004)
Schütze, H.: Automatic word sense discrimination. Computational Linguistics 24(1), 97–123 (1998)


Author information

Authors and Affiliations

University of Minnesota, Duluth, USA
Ted Pedersen & Anagha Kulkarni
Katholieke Universiteit Leuven, Belgium
Roxana Angheluta
University of Alicante, Spain
Zornitsa Kozareva
University of Texas at El Paso, USA
Thamar Solorio

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

Cite this paper

Pedersen, T., Kulkarni, A., Angheluta, R., Kozareva, Z., Solorio, T. (2006). An Unsupervised Language Independent Method of Name Discrimination Using Second Order Co-occurrence Features. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2006. Lecture Notes in Computer Science, vol 3878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671299_23

DOI: https://doi.org/10.1007/11671299_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32205-4
Online ISBN: 978-3-540-32206-1
eBook Packages: Computer Science, Computer Science (R0)
