Information Retrieval, Volume 9, Issue 3, pp 295–310

Multilingual modeling of cross-lingual spelling variants

  • Krister Lindén
Article

Abstract

Technical term translations are important for cross-lingual information retrieval. In many languages, new technical terms share a common origin but render the underlying sounds with different spellings. Such terms are known as cross-lingual spelling variants (CLSV).

To find the best CLSV in a text database index, we contribute a formulation of the problem in a probabilistic framework, and implement it as an instance of the general edit distance using weighted finite-state transducers. Some training data is required to estimate the costs for the general edit distance. We demonstrate that after some basic training our new multilingual model is robust and requires little or no adaptation to cover additional languages, as the model takes advantage of language-independent transliteration patterns.
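As a rough illustration of the idea of a general edit distance, the following minimal sketch scores candidate index terms against a query term with per-character costs. It is not the article's implementation: it uses plain dynamic programming instead of weighted finite-state transducers, it handles single characters only, and the cost table `CHEAP_PAIRS`, the cost values, and the helper names are hypothetical stand-ins for costs that would be estimated from training data.

```python
# Hypothetical cost table: character pairs assumed to be cheap to substitute.
# In a trained model these costs would be estimated from term pairs.
CHEAP_PAIRS = {("c", "k"), ("k", "c"), ("c", "s"), ("s", "c")}


def sub_cost(a, b):
    """Substitution cost: free for identical characters, cheap for listed pairs."""
    if a == b:
        return 0.0
    if (a, b) in CHEAP_PAIRS:
        return 0.2  # assumed value, for illustration only
    return 1.0


def unit_cost(_char):
    """Insertion or deletion cost, assumed uniform here."""
    return 1.0


def general_edit_distance(source, target, sub=sub_cost, ins=unit_cost, dele=unit_cost):
    """Weighted edit distance between two strings by dynamic programming."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + dele(source[i - 1])
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins(target[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + dele(source[i - 1]),                 # delete
                dist[i][j - 1] + ins(target[j - 1]),                  # insert
                dist[i - 1][j - 1] + sub(source[i - 1], target[j - 1]),  # substitute
            )
    return dist[-1][-1]


def best_variant(query, index_terms):
    """Return the index term with the lowest weighted distance to the query."""
    return min(index_terms, key=lambda term: general_edit_distance(query, term))
```

For example, `best_variant("calcium", ["kalium", "kalsium", "calculus"])` picks `"kalsium"` under these assumed costs, because both differing characters fall in the cheap substitution table.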

We train the model with medical terms in seven languages and test it with terms from varied domains in six languages. Two of the test languages are not in the training data. Against a large text database index, we achieve 64–78% precision at the point of 100% recall. This is a relative improvement of 22% over the simple edit distance.
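The two measures quoted above can be made concrete with a short sketch. It assumes one common reading of "precision at the point of 100% recall" (precision at the shortest ranked-list cutoff that contains every relevant item) and the usual definition of relative improvement. The function names and the toy inputs are illustrative assumptions, not taken from the article.

```python
def precision_at_full_recall(ranked, relevant):
    """Precision at the shortest cutoff of `ranked` that contains all relevant items.

    Returns 0.0 if the ranking never reaches 100% recall.
    """
    relevant = set(relevant)
    if not relevant:
        raise ValueError("at least one relevant item is required")
    found = set()
    for cutoff, item in enumerate(ranked, start=1):
        if item in relevant:
            found.add(item)
        if found == relevant:
            return len(found) / cutoff
    return 0.0


def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score
```

With a toy ranking `["a", "x", "b", "y"]` and relevant set `{"a", "b"}`, full recall is reached at the third position, so the precision is 2/3.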

Keywords

Term translations, Cross-lingual information retrieval, Systematic spelling variants, General edit distance

Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. Department of General Linguistics, University of Helsinki, Finland