Development of an English-Macedonian Machine Readable Dictionary by Using Parallel Corpora

Saveski, Martin; Trajkovski, Igor

doi:10.1007/978-3-642-19325-5_20


Part of the book series: Communications in Computer and Information Science (CCIS, volume 83)

Included in the following conference series:

International Conference on ICT Innovations


Abstract

Dictionaries are among the most useful lexical resources. However, most dictionaries today are not in digital form, which makes them cumbersome for people to use and impossible to integrate into computer programs. Digitizing an existing traditional dictionary is an expensive and labor-intensive task. In this paper, we present a method for developing Machine Readable Dictionaries from resources that are already available. A Machine Readable Dictionary consists of simple word-to-word mappings, where a word from the source language can be mapped to several alternative words in the target language. We present a series of experiments in which, using the parallel corpora and open source Statistical Machine Translation tools at our disposal, we developed an English-Macedonian Machine Readable Dictionary containing 23,296 translation pairs (17,708 English and 18,343 Macedonian terms). A subset of the resulting dictionary was manually evaluated and showed an accuracy of 79.8%.
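
As a rough illustration of the one-to-many word mapping described in the abstract, the following minimal Python sketch stores translation pairs as a mapping from each source word to a set of target words, then counts pairs and distinct terms on each side. The function names and the four toy pairs are hypothetical and chosen only for illustration; they are not the authors' data, code, or storage format.

```python
from collections import defaultdict


def build_mrd(pairs):
    """Group (source, target) translation pairs into a one-to-many mapping.

    Hypothetical helper: the paper does not specify a storage format.
    """
    mrd = defaultdict(set)
    for source, target in pairs:
        mrd[source].add(target)
    return dict(mrd)


def summarize(mrd):
    """Return (translation pairs, distinct source terms, distinct target terms)."""
    n_pairs = sum(len(targets) for targets in mrd.values())
    n_source = len(mrd)
    n_target = len(set().union(*mrd.values())) if mrd else 0
    return n_pairs, n_source, n_target


# Toy English-Macedonian pairs, for illustration only.
toy_pairs = [
    ("book", "книга"),
    ("house", "куќа"),
    ("house", "дом"),
    ("home", "дом"),
]

toy_mrd = build_mrd(toy_pairs)
print(summarize(toy_mrd))
```

Because a word on either side can take part in several pairs, the number of pairs can exceed the number of distinct terms in each language, which is the same relationship seen in the reported figures (23,296 pairs against 17,708 English and 18,343 Macedonian terms).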



Author information

Authors and Affiliations

Faculty of Computing, Engineering and Technology, Staffordshire University, College Road, Stoke-on-Trent, Staffordshire, UK
Martin Saveski
Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University, Rugjer Boshkovik bb, P.O. Box 574, Skopje, Macedonia
Igor Trajkovski


Editor information

Editors and Affiliations

Faculty of Natural Sciences and Mathematics, Institute of Informatics, University Sts. Cyril and Methodius, Arhimedova 5, 1000, Skopje, Macedonia
Marjan Gusev
Faculty of Technical Sciences, University of St. Kliment Ohridski, Ivo Lola Ribar bb, 7000, Bitola, Macedonia
Pece Mitrevski


Cite this paper

Saveski, M., Trajkovski, I. (2011). Development of an English-Macedonian Machine Readable Dictionary by Using Parallel Corpora. In: Gusev, M., Mitrevski, P. (eds) ICT Innovations 2010. Communications in Computer and Information Science, vol 83. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19325-5_20


DOI: https://doi.org/10.1007/978-3-642-19325-5_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19324-8
Online ISBN: 978-3-642-19325-5
eBook Packages: Computer Science, Computer Science (R0)
