Experimenting Statistical Machine Translation for Ethiopic Semitic Languages: The Case of Amharic-Tigrigna
Abstract
In this research an attempt have been made to experiment on Amharic-Tigrigna machine translation for promoting information sharing. Since there is no Amharic-Tigrigna parallel text corpus, we prepared a parallel text corpus for Amharic-Tigrigna machine translation system from religious domain specifically from bible. Consequently, the data preparation involves sentence alignment, sentence splitting, tokenization, normalization of Amharic-Tigrigna parallel corpora and then splitting the dataset into training, tuning and testing data. Then, Amharic-Tigrigna translation model have been constructed using training data and further tuned for better translation. Finally, given target language model, the Amharic-Tigrigna translation system generates a target output with reference to translation model using word and morpheme as a unit. The result we found from the experiment is promising to design Amharic-Tigrigna machine translation system between resource deficient languages. We are now working on post-editing to enhance the performance of the bi-lingual Amharic-Tigrigna translator.
Keywords
Under-resourced language Amharic-Tigrigna Semitic language Machine translationNotes
Acknowledgement
We would like to thank Ethiopia Ministry of Communication and Information Technology (MCIT) for funding to collect parallel text corpus and conduct an experiement for a bilingual Amharic-Tigrigna statistical machine translation research project.
References
- 1.Nakamura, S.: Overcoming the language barrier with speech translation technology. Sci. Technol. Trends Q. Rev. 31, 35–48 (2009)Google Scholar
- 2.What is machine translation, SYSTRAN: we speak your industry’s language. http://www.systran.co.uk/systran/corporate-profile/translation-technology/what-is-machine-translation
- 3.Martínez, L.G.: Human Translation Versus Machine Translation and Full Post-editing of Raw Machine Translation Output. Dublin City University, Dublin (2003)Google Scholar
- 4.Simons, G.F., Fennig, C.D.: Ethnologue: Languages of the World, 20th edn. SIL, Dallas (2017)Google Scholar
- 5.Zekaria, S.: Summary and Statistical Report of the 2007 Population and Housing Census. Central Statistical Agency, Addis Ababa (2008)Google Scholar
- 6.Ager, S.: Omniglot, the online Encyclopedia of writing systems and languagesGoogle Scholar
- 7.Hudson, G.: The world’s major languages: Amharic. In: The World’s Major Languages, 2nd edn, pp. 594–614. Routledge, Oxon/New York (2009)Google Scholar
- 8.Abyssinica dictionary: Amharic, the official language of Ethiopia (2015)Google Scholar
- 9.Teferra, S., Menzel, W., Tafila, B.: An Amharic speech corpus for large vocabulary continuous speech recognition. In: Proceedings of the XVth International Conference of Ethiopian Studies, Hamburg, Germany (2005)Google Scholar
- 10.Woldeyohannis, M.M., Besacier, L., Meshesha, M.: A corpus for Amharic-English speech translation: the case of tourism domain. In: Mekuria, F., Nigussie, E.E., Dargie, W., Edward, M., Tegegne, T., et al. (eds.) ICT4DA 2017. LNICST, vol. 244, pp. 129–139. Springer, Cham (2018)Google Scholar
- 11.Besacier, L., Le, V.-B., Boitet, C., Berment, V.: ASR and translation for under-resourced languages, Grenoble cedex 9, FranceGoogle Scholar
- 12.Creutz, M., Lagus, K.: Unsupervised discovery of morphemes. In: Proceedings of the Workshop on Morphological and Phonological Learning of ACL-02, pp. 21–30, Philadelphia, Pennsylvania (2002)Google Scholar