Learning to Match Names Across Languages

Mani, Inderjeet; Yeh, Alex; Condon, Sherri

doi:10.1007/978-3-642-28569-1_3

Learning to Match Names Across Languages

Inderjeet Mani⁵,
Alex Yeh⁵ &
Sherri Condon⁶

Chapter
First Online: 01 January 2012

1931 Accesses

Part of the book series: Theory and Applications of Natural Language Processing ((NLP))

Abstract

We report on research on matching names in different scripts across languages. We explore two trainable approaches based on comparing pronunciations. The first, a cross-lingual approach, uses an automatic name-matching program that exploits rules based on phonological comparisons of the two languages carried out by humans. The second, monolingual approach relies only on automatic comparison of the phonological representations of each pair. Alignments produced by each approach are fed to a machine learning algorithm. Results show that the monolingual approach results in machine-learning based comparison of person-names in English and Chinese at an accuracy of over 97.0 F-measure.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Hardcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

1.
www.informatica.com/solutions/identity_resolution_solution/Pages/index.aspx.
2.
www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T34.
3.
webdocs.cs.ualberta.ca/~kondrak/aline1.1.zip.
4.
For the MALINE row in Table 3.3, the ALINE documentation explains the notation as follows: “every phonetic symbol is represented by a single lowercase letter followed by zero or more uppercase letters. The initial lowercase letter is the base letter most similar to the sound represented by the phonetic symbol. The remaining uppercase letters stand for the feature modifiers which alter the sound defined by the base letter. By default, the output contains the alignments together with overall similarity scores. The aligned subsequences are delimited by ‘|’ signs. The ‘<’ sign signifies that the previous phonetic segment has been aligned with two segments in the other sequence, a case of compression/expansion. The ‘–’ sign denotes a “skip”, a case of insertion/deletion.”
5.
The Predictive Accuracy was computed with exactly half the test examples being positive.
6.
sourceforge.net/projects/carafe.
7.
projects.ldc.upenn.edu/LCTL/.

References

Al-Onaizan, Y., Knight, K.: Machine transliteration of names in Arabic text. In: Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Philadelphia, pp. 1–13. Association for Computational Linguistics, Stroudsburg (2002)
Google Scholar
Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, pp. 39–48. ACM, New York (2003)
Google Scholar
Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, pp. 475–480. ACM, New York (2002)
Google Scholar
Damerau, F.J.A.: Technique for computer detection and correction of spelling errors. Commun. ACM 7(3), 171176 (1964)
Google Scholar
Fellegi, I., Sunter, A.: A theory for record linkage. J. Am. Stat. Soc. 64, 1183–1210 (1969)
Google Scholar
Freeman, A., Condon, S., Ackermann, C.: Cross linguistic name matching in English and Arabic. In: Proceedings of the Human Language Technology Conference, New York, pp. 471–478. Association for Computational Linguistics, Stroudsburg (2006)
Google Scholar
Freitag, D., Khadivi, S.: A sequence alignment model based on the averaged perceptron. In: Proceedings of EMNLP-CONLL, Prague (2007)
Google Scholar
Gao, W., Wong, K., Lam, W.: Phoneme-based transliteration of foreign names for OOV problem. In: Proceedings of First International Joint Conference on Natural Language Processing (IJCNLP), Hainan Island, China, pp. 374–381 (2004)
Google Scholar
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1) (2009). www.cs.waikato.ac.nz/%ml/weka/
Google Scholar
Huang, F., Vogel, S., Waibel, A.: Improving named entity translation combining phonetic and semantic similarities. In: Proceedings of HLT-NAACL, Boston (2004)
Google Scholar
Ji, H., Grishman, R., Freitag, D., Blume, M., Wang, J., Khadivi, S., Zens R., Ney, H.: Name extraction and translation for distillation. In: Olive, J., Christianson, C., McCary, J. (eds.) Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation, Springer (2011). DOI: 10.1007/978-1-4419-7713-7_3
Jiampojamarn, S., Bhargava, A., Dou, Q., Dwyer, K., Kondrak, G.: DIRECTL: a language-independent approach to transliteration. In: Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP, Singapore, pp. 28–31 (2009)
Google Scholar
Joachims, T.: Making large-Scale SVM Learning Practical. In: Scholkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods – Support Vector Learning. MIT Press, Cambridge, MA (1999). svmlight.joachims.org/
Jung, S., Hong, S., Paek, E.: An English to Korean transliteration model of extended Markov window. In: Proceedings of the 18th Conference on Computational Linguistics (COLING), Saarbrücken, Germany, vol. 1, pp. 383–389. Association for Computational Linguistics, Stroudsburg (2000)
Google Scholar
Kondrak, G.: A new algorithm for the alignment of phonetic sequences. In: Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics, Seattle, WA, pp. 288–295. Association for Computational Linguistics, Stroudsburg (2000)
Google Scholar
Knight, K., Graehl, J.: Machine transliteration. Comput. Linguist. 27(4), 599–612 (1998)
Google Scholar
Lait, A., Randell, B.: An assessment of name matching algorithms. Technical Report, Department of Computer Science, University of Newcastle upon Tyne, UK (1996)
Google Scholar
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 10(8), 707–710 (1966)
Google Scholar
Li, H., Kumaran, A., Pervouchine, V., Zhang, M.: Report of NEWS 2009 machine transliteration shared task. In: Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP, Singapore (2009)
Google Scholar
Li, H., Zhang, M., Su, J.: A joint source-channel model for machine transliteration. In: Proceedings of Conference of the Association for Computation Linguistics, Barcelona, Spain, pp. 159–166. Association for Computational Linguistics, Stroudsburg (2004)
Google Scholar
McCallum, A., Bellare, K., Pereira, F.: A conditional random field for discriminatively-trained finite-state string edit distance. In: Proceedings of the Conference on Uncertainty in AI, Edinburgh, Scotland, pp. 388–395 (2005)
Google Scholar
Meng, H., Lo, W., Chen B., Tang, T.: Generating phonetic cognates to handle named entities in English-Chinese cross-language spoken document retrieval. In: Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, Madonna di Campiglio, Italy (2001)
Google Scholar
(NEWS-2009) 2009 named entities workshop: shared task on transliteration. In: Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP, Singapore (2009)
Google Scholar
Oh, J., Choi, K., Isahara, H.: A comparison of different machine transliteration models. J. Artif. Intell. Res. 27, 119–151 (2006)
Google Scholar
Ristad, E.S., Yianilos, P.N.: Learning string edit distance. In: IEEE Transactions on Pattern Recognition and Machine Intelligence, pp. 522–532. IEEE Computer Society, Washington, DC (1998)
Google Scholar
Safalra: www.safalra.com/science/linguistics/pinyin-pronunciation/ (2006)
Samuel, K., Rubenstein, A., Condon, S., Yeh, A.: Name matching between Chinese and Roman scripts: machine complements human. In: Proceedings of the 2009 Named Entities Workshop, Singapore, pp. 152–160. ACL-IJCNLP, Stroudsburg (2009)
Google Scholar
Sproat, R., Tao, T., Zhai, C.: Named entity transliteration with comparable corpora. In: Proceedings of the Conference of the Association for Computational Linguistics, Sydney, Australia, pp. 73–80. Association for Computational Linguistics, Stroudsburg (2006)
Google Scholar
Tao, T., Yoon, S., Fister, A., Sproat, R., Zhai, C.: Unsupervised named entity transliteration using temporal and phonetic correlation. In: Proceedings of the Empirical Methods in Natural Language Processing Conference, Sydney, Australia, pp. 250–257. Association for Computational Linguistics, Stroudsburg (2006)
Google Scholar
The CMU Pronouncing Dictionary: ftp://ftp.cs.cmu.edu/project/speech/dict/ (2008)
Ukkonnen, E.: Approximate string-matching with Q-grams and maximal matches. Theor. Comput. Sci. 92, 191–211 (1992)
Google Scholar
Virga, P., Khudanpur, S.: Transliteration of proper names in cross-lingual information retrieval. In: Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition, Sapporo, Japan. Association for Computational Linguistics, Stroudsburg (2003)
Google Scholar
Wan, S., Verspoor, C.M.: Automatic English-Chinese name transliteration for development of multilingual resources. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, Montreal, Quebec, pp. 1352–1356. Association for Computational Linguistics, Stroudsburg (1998)
Google Scholar
Wikipedia: Pinyin. en.wikipedia.org/wiki/Pinyin (2006)
Winkler, W., Thibaudeau, Y.: An application of the fellegi-sunter model of record linkage to the 1990 U.S. decennial census. Technical Report RR91/09, Energy Information Administration, Washington, DC (1991)
Google Scholar
Zobel, J., Dart, P.: Finding approximate matches in large lexicons. Softw. Pract. Exp. 25(3), 331–345 (1995)
Google Scholar

Download references

Acknowledgements

This research has been funded by the MITRE Innovation Program (Public Release Case Number 07–0752). We are also grateful to the reviewers for their insightful comments.

Author information

Authors and Affiliations

The MITRE Corporation, 202 Burlington Road, Bedford, MA, 01730, USA
Inderjeet Mani & Alex Yeh
The MITRE Corporation, 7515 Colshire Drive, McLean, VA, 22102, USA
Sherri Condon

Authors

Inderjeet Mani
View author publications
You can also search for this author in PubMed Google Scholar
Alex Yeh
View author publications
You can also search for this author in PubMed Google Scholar
Sherri Condon
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Inderjeet Mani .

Editor information

Editors and Affiliations

Universite Sorbonne Nouvelle, LATTICE-CNRS, Ecole Normale Superieure and, rue d'Ulm 45, Paris, 75005, France
Thierry Poibeau
, Information & Communication Technologies, Universitat Pompeu Fabra, C/ Tanger 122-140, Barcelona, 08018, Spain
Horacio Saggion
Institute for Computer Science, Polish Acadmey of Science, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Jakub Piskorski
Department of Computer Science, University of Helsinki, Gustaf Hällströmin katu 2, Helsinki, 00014, Finland
Roman Yangarber

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Mani, I., Yeh, A., Condon, S. (2013). Learning to Match Names Across Languages. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28569-1_3

Download citation

DOI: https://doi.org/10.1007/978-3-642-28569-1_3
Published: 12 July 2012
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28568-4
Online ISBN: 978-3-642-28569-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics