Abstract
This paper presents a quantitative fine-grained manual evaluation approach to comparing the performance of different machine translation (MT) systems. We build upon the well-established multidimensional quality metrics (MQM) error taxonomy and implement a novel method that assesses whether the differences in performance for MQM error types between different MT systems are statistically significant. We conduct a case study for English-to-Croatian, a language direction that involves translating into a morphologically rich language, comparing three MT systems that belong to different paradigms: pure phrase-based, factored phrase-based and neural. First, we design an MQM-compliant error taxonomy tailored to the relevant linguistic phenomena of Slavic languages, which makes the annotation process feasible and accurate. Two annotators then annotate the errors in the MT outputs following this taxonomy. Subsequently, we carry out a statistical analysis showing that the best-performing system (neural) reduces the errors produced by the worst system (pure phrase-based) by more than half (54%). Moreover, we conduct an additional analysis of agreement errors in which we distinguish between short-distance (phrase-level) and long-distance (sentence-level) errors. We find that phrase-based MT approaches are of limited use for long-distance agreement phenomena, for which neural MT proves especially effective.
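The abstract describes testing whether per-error-type differences between two MT systems are statistically significant. As an illustration only (not necessarily the paper's exact procedure), a minimal sketch of one such comparison is a chi-squared goodness-of-fit test against an equal split of error counts between two systems annotated on the same test set; the counts below are hypothetical:

```python
# Sketch: is the difference in annotated error counts for one MQM error
# type between two MT systems statistically significant?
# Assumes both systems were annotated on the same test set, so under the
# null hypothesis each error is equally likely to come from either system.

def chi2_equal_split(count_a: int, count_b: int) -> float:
    """Chi-squared statistic (1 degree of freedom) for two counts
    against the null hypothesis of an equal 50/50 split."""
    expected = (count_a + count_b) / 2
    return sum((c - expected) ** 2 / expected for c in (count_a, count_b))

# Critical value of the chi-squared distribution, df=1, alpha=0.05.
CRITICAL_1DF_005 = 3.841

# Hypothetical counts of agreement errors flagged by the annotators.
pbmt_errors, nmt_errors = 90, 40
stat = chi2_equal_split(pbmt_errors, nmt_errors)
significant = stat > CRITICAL_1DF_005  # True: the gap is unlikely by chance
```

Equal counts give a statistic of 0 (no evidence of a difference); the larger the imbalance, the larger the statistic relative to the critical value.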
Notes
The instructions include a handy decision tree to aid in the annotation process. It can be found at the following URL: http://www.qt21.eu/downloads/annotatorsGuidelines-2014-06-11.pdf.
Unlike in SMT jargon, here a phrase refers to a grammatical unit, not just a string of contiguous words.
Acknowledgements
We would like to extend our thanks to Maja Popović, who provided invaluable advice, and Denis Kranjčić, who performed the annotation together with Filip Klubička, first author of the paper. This research was partly funded by the ADAPT Centre, which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. This research has also received funding from the European Union Seventh Framework Programme FP7/2007-2013 under Grant agreement PIAP-GA-2012-324414 (Abu-MaTran) and the Swiss National Science Foundation Grant 74Z0_160501 (ReLDI).
Cite this article
Klubička, F., Toral, A. & Sánchez-Cartagena, V.M. Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian. Machine Translation 32, 195–215 (2018). https://doi.org/10.1007/s10590-018-9214-x