
An Evaluation of N-Gram Correspondence Models for Transliteration Detection

  • Conference paper
  • In: New Trends in Networking, Computing, E-learning, Systems Sciences, and Engineering
  • Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 312)

Abstract

Transliteration detection (TD) is a natural language processing (NLP) subtask whose goal is to find matching named entities (NEs) in parallel or comparable texts whose languages use different writing systems. The task aims at building high-quality transliteration lexicons that improve performance in cross-language applications such as machine translation (MT) and cross-language information retrieval (CLIR). Recent evaluations of TD methods, for example those from the NEWS 2010 transliteration mining shared task [4], underscore the need for methods that can further improve TD performance. This paper contributes to that need by evaluating the use of source-target language n-gram correspondences for TD. We present TD experiments that apply three different classes of n-gram correspondence models to standard transliteration datasets from the 2009 and 2010 shared tasks on transliteration generation [2, 3]. Results show notable TD performance improvements when moving from lower-order to higher-order n-gram correspondence models, and between the different classes of n-gram correspondence models. We show that our TD experimental setup is more demanding than that in related work [3] in terms of the search effort required, and that the best TD performances from the n-gram correspondence models in this paper are comparable to those of state-of-the-art methods. Our results also show that the n-gram size needed to build high-quality transliteration models varies across languages and writing systems. Our work therefore provides preliminary insight into the n-gram sizes required for high-quality TD models for different language pairs and writing systems.
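To make the idea of source-target n-gram correspondences concrete, the sketch below is a minimal, alignment-free approximation of such a model; it is an illustration only, not the paper's actual method. All function names are hypothetical, and the crude co-occurrence counting stands in for whatever estimation the paper's three model classes use. It counts how often source-side character n-grams co-occur with target-side n-grams over known transliteration pairs, then scores a candidate pair by how well its source n-grams can be matched on the target side.

```python
from collections import defaultdict
from itertools import product
import math

def char_ngrams(s, n):
    """Return all character n-grams of s (empty list if s is shorter than n)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def train_correspondences(pairs, n=2):
    """Estimate P(target n-gram | source n-gram) by counting co-occurrences
    over known transliteration pairs (a crude, alignment-free approximation)."""
    counts = defaultdict(int)
    source_totals = defaultdict(int)
    for src, tgt in pairs:
        for s_ng, t_ng in product(char_ngrams(src, n), char_ngrams(tgt, n)):
            counts[(s_ng, t_ng)] += 1
            source_totals[s_ng] += 1
    return {k: c / source_totals[k[0]] for k, c in counts.items()}

def score(src, tgt, model, n=2):
    """Average log-probability of the best target n-gram match for each
    source n-gram; higher means the pair looks more like a transliteration."""
    s_ngrams, t_ngrams = char_ngrams(src, n), char_ngrams(tgt, n)
    if not s_ngrams or not t_ngrams:
        return float("-inf")
    total = 0.0
    for s_ng in s_ngrams:
        # Unseen correspondences get a small floor probability.
        best = max(model.get((s_ng, t_ng), 1e-9) for t_ng in t_ngrams)
        total += math.log(best)
    return total / len(s_ngrams)
```

A TD decision would then threshold this score: pairs scoring above a tuned cutoff are accepted as transliterations. Varying `n` in this sketch corresponds to the paper's question of which n-gram size best suits a given language pair and writing system.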


References

  1. P. Nabende, "Applying dynamic Bayesian networks in transliteration detection and generation," Ph.D. dissertation, Faculty of Mathematics and Natural Sciences, University of Groningen, Groningen, The Netherlands, 2011.

  2. H. Li, A. Kumaran, V. Pervouchine, and M. Zhang, "Report of NEWS 2009 machine transliteration shared task," in Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration, Suntec, Singapore, 2009, pp. 1–18.

  3. H. Li, A. Kumaran, V. Pervouchine, and M. Zhang, "Report of NEWS 2010 transliteration generation shared task," in Proceedings of the 2010 Named Entities Workshop, Uppsala, Sweden, 2010, pp. 1–11.

  4. A. Kumaran, M. Khapra, and H. Li, "Report of NEWS 2010 transliteration mining shared task," in Proceedings of the 2010 Named Entities Workshop, Uppsala, Sweden, 2010, pp. 21–28.

  5. S. Jiampojamarn, K. Dwyer, S. Bergsma, A. Bhargava, Q. Dou, M.-Y. Kim, and G. Kondrak, "Transliteration generation and mining with limited training resources," in Proceedings of the 2010 Named Entities Workshop, Uppsala, Sweden, 2010, pp. 38–47.

  6. K.-J. Chen and M.-H. Bai, "Unknown word detection for Chinese by a corpus-based learning method," Computational Linguistics and Chinese Language Processing, vol. 3, no. 1, pp. 27–44, 1998.


Author information

Correspondence to Peter Nabende.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Nabende, P. (2015). An Evaluation of N-Gram Correspondence Models for Transliteration Detection. In: Elleithy, K., Sobh, T. (eds) New Trends in Networking, Computing, E-learning, Systems Sciences, and Engineering. Lecture Notes in Electrical Engineering, vol 312. Springer, Cham. https://doi.org/10.1007/978-3-319-06764-3_79


  • DOI: https://doi.org/10.1007/978-3-319-06764-3_79

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-06763-6

  • Online ISBN: 978-3-319-06764-3

  • eBook Packages: Engineering (R0)
