Development of an English-Macedonian Machine Readable Dictionary by Using Parallel Corpora

  • Conference paper
ICT Innovations 2010

Part of the book series: Communications in Computer and Information Science (CCIS, volume 83)

Abstract

Dictionaries are among the most useful lexical resources. However, most dictionaries today are not in digital form, which makes them cumbersome for humans to use and impossible to integrate into computer programs. Digitizing an existing traditional dictionary is an expensive and labor-intensive task. In this paper, we present a method for developing Machine Readable Dictionaries from already available resources. A Machine Readable Dictionary consists of simple word-to-word mappings, where a word from the source language can be mapped to several alternative words in the target language. We present a series of experiments in which, using the parallel corpora and open source Statistical Machine Translation tools at our disposal, we developed an English-Macedonian Machine Readable Dictionary containing 23,296 translation pairs (17,708 English and 18,343 Macedonian terms). A manually evaluated subset of the produced dictionary showed an accuracy of 79.8%.
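
The word-to-word mapping described in the abstract can be pictured as a one-to-many lookup table. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function names `build_dictionary` and `dictionary_stats` and the toy word pairs are hypothetical, and the extraction of pairs from parallel corpora is not shown. It illustrates how one source word may map to several target words and how the three reported counts (translation pairs, source terms, target terms) relate to each other.

```python
from collections import defaultdict


def build_dictionary(pairs):
    """Group (source, target) pairs into a source -> set-of-targets mapping.

    Duplicate pairs collapse because targets are stored in a set.
    """
    mrd = defaultdict(set)
    for source, target in pairs:
        mrd[source].add(target)
    return dict(mrd)


def dictionary_stats(mrd):
    """Return (translation pairs, distinct source terms, distinct target terms)."""
    pair_count = sum(len(targets) for targets in mrd.values())
    source_terms = len(mrd)
    target_terms = len(set().union(*mrd.values())) if mrd else 0
    return pair_count, source_terms, target_terms


# Hypothetical toy input; a real system would obtain such pairs
# from word alignment over a parallel corpus.
pairs = [
    ("house", "куќа"),
    ("house", "дом"),
    ("home", "дом"),
    ("book", "книга"),
    ("house", "куќа"),  # duplicate pair, counted once
]

mrd = build_dictionary(pairs)
print(dictionary_stats(mrd))
```

Because several source words can share a target word and one source word can have several targets, the number of translation pairs generally exceeds both term counts, which is consistent with the figures reported in the abstract.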

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Saveski, M., Trajkovski, I. (2011). Development of an English-Macedonian Machine Readable Dictionary by Using Parallel Corpora. In: Gusev, M., Mitrevski, P. (eds) ICT Innovations 2010. ICT Innovations 2010. Communications in Computer and Information Science, vol 83. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19325-5_20

  • DOI: https://doi.org/10.1007/978-3-642-19325-5_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19324-8

  • Online ISBN: 978-3-642-19325-5

  • eBook Packages: Computer Science, Computer Science (R0)