Evaluation of Dictionary Creating Methods for Under-Resourced Languages

  • Eszter Simon
  • Iván Mittelholcz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10415)


In this paper, we present several bilingual dictionary building methods applied to the Northern Saami–{English, Finnish, Hungarian, Russian} language pairs. Since Northern Saami is an under-resourced language and standard dictionary building methods require a large amount of pre-processed data, we had to find alternative methods. In a thorough evaluation, we compared the results of each method, which confirmed our expectation that the precision of standard lexicon building methods is quite low. The most precise method is utilizing Wikipedia title pairs extracted via inter-language links, but Wiktionary-based methods also provided useful results.


Bilingual dictionaries · Evaluation · Under-resourced languages · Dictionary building methods



The research reported in the paper was conducted with the support of the Hungarian Scientific Research Fund (OTKA) grant #107885.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary
