Learning to Match Names Across Languages

Abstract

We report on research into matching names written in different scripts across languages. We explore two trainable approaches based on comparing pronunciations. The first, a cross-lingual approach, uses an automatic name-matching program that exploits rules derived from phonological comparisons of the two languages carried out by humans. The second, a monolingual approach, relies only on automatic comparison of the phonological representations of the two names in each pair. Alignments produced by each approach are fed to a machine learning algorithm. Results show that, with the monolingual approach, machine-learning-based comparison of person names in English and Chinese achieves an F-measure of over 97.0.

Notes

  1. www.informatica.com/solutions/identity_resolution_solution/Pages/index.aspx.

  2. www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T34.

  3. webdocs.cs.ualberta.ca/~kondrak/aline1.1.zip.

  4. For the MALINE row in Table 3.3, the ALINE documentation explains the notation as follows: “every phonetic symbol is represented by a single lowercase letter followed by zero or more uppercase letters. The initial lowercase letter is the base letter most similar to the sound represented by the phonetic symbol. The remaining uppercase letters stand for the feature modifiers which alter the sound defined by the base letter. By default, the output contains the alignments together with overall similarity scores. The aligned subsequences are delimited by ‘|’ signs. The ‘<’ sign signifies that the previous phonetic segment has been aligned with two segments in the other sequence, a case of compression/expansion. The ‘–’ sign denotes a ‘skip’, a case of insertion/deletion.”

  5. The Predictive Accuracy was computed with exactly half the test examples being positive.

  6. sourceforge.net/projects/carafe.

  7. projects.ldc.upenn.edu/LCTL/.
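
The notation quoted in note 4 can be illustrated with a small parser. The following Python sketch is only an illustration under stated assumptions: it is not part of ALINE, the function names parse_symbol and parse_row are hypothetical, and it assumes that the items in a row are separated by whitespace, which the quoted documentation does not state. The sample row in the usage note is made up and is not taken from Table 3.3.

```python
import re

# One phonetic symbol: a lowercase base letter followed by zero or more
# uppercase feature modifiers, as described in the notation quoted in note 4.
SYMBOL = re.compile(r"[a-z][A-Z]*")


def parse_symbol(token):
    """Split one symbol such as 'dV' into (base letter, feature modifiers)."""
    if not SYMBOL.fullmatch(token):
        raise ValueError(f"not a phonetic symbol: {token!r}")
    return token[0], token[1:]


def parse_row(row):
    """Parse one row of an alignment (ASSUMPTION: whitespace-separated items).

    Returns (aligned, outside): the items between the '|' delimiters and the
    items outside them. Each item is a (base, modifiers) tuple, the string
    '<' (the previous segment is aligned with two segments in the other
    sequence), or '-' (a skip, i.e. an insertion or deletion).
    """
    aligned, outside = [], []
    inside = False
    for token in row.split():
        if token == "|":
            # A '|' sign opens or closes the aligned subsequence.
            inside = not inside
            continue
        if token in ("-", "–"):
            # Accept both the ASCII hyphen and the typeset dash as a skip.
            item = "-"
        elif token == "<":
            item = "<"
        else:
            item = parse_symbol(token)
        (aligned if inside else outside).append(item)
    return aligned, outside
```

For a made-up row such as "| dV < o – | n", parse_row returns the four aligned items (the symbol d with modifier V, a compression/expansion mark, the symbol o, and a skip) and the single unaligned symbol n.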

Acknowledgements

This research has been funded by the MITRE Innovation Program (Public Release Case Number 07–0752). We are also grateful to the reviewers for their insightful comments.

Author information

Corresponding author

Correspondence to Inderjeet Mani.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Mani, I., Yeh, A., Condon, S. (2013). Learning to Match Names Across Languages. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28569-1_3

  • DOI: https://doi.org/10.1007/978-3-642-28569-1_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28568-4

  • Online ISBN: 978-3-642-28569-1

  • eBook Packages: Computer Science (R0)