The continuing growth in digital content means that there is now significantly more linguistic content to translate using more diverse workflows and tools than ever before. This growth necessitates broader requirements for Translation Quality Assessment (TQA) that include appropriate methods for the domain, text type, workflow, and end-user. With this in mind, this volume sheds light on TQA research and practice from academic, institutional, and industry settings in its unique combination of human and machine translation evaluation (MTE). The focus in this book is on the product, rather than the process, of translation. The contributions trace the convergence of post-hoc TQA methods, with cross-pollination from one translation method to another: New error typologies are being taken on for MTE; the concept of ‘fitness for purpose’ when raw or post-edited MT is considered ‘good enough’ is now also used for crowdsourced translation. The state-of-the-art evinces a pragmatic focus, calibrated to a targeted end-user group. Understanding translation technologies and the appropriate evaluation techniques is critical to the successful integration of these technologies in the language services industry of today, where the lines between human and machine have become increasingly blurred and adaptability to change has become a key asset that can ultimately mean success or failure in a competitive landscape.
The continuing exponential growth in primarily digital content means that there is now significantly more linguistic content to translate using more diverse workflows than ever before. This growth necessitates broader requirements for Translation Quality Assessment (TQA) that include appropriate methods for the domain, text type, workflow, and end-user. With this in mind, this volume sheds light on TQA research and practice from academic, institutional, and industry settings in its unique combination of human and machine translation evaluation (MTE). Understanding translation technologies and the appropriate evaluation techniques is critical to the successful integration of these technologies in the language services industry of today, where the lines between human and machine have become increasingly blurred and adaptability to change has become a key asset that can ultimately mean success or failure in a competitive landscape. This tumultuous environment affects all translation stakeholders, from students and educators to project managers and language services professionals, including in-house and freelance translators, as well as, of course, translation scholars and researchers.
At a high level, translation providers may try to ensure trust in translation quality levels by following standard approaches and workflows, such as the International Organization for Standardization (ISO) 17100 translation process standard, or the ISO 18587 process for the human post-editing of MT output. Another approach is to measure quality using traditional definitions of post-hoc quality, and it is these that we focus on to a greater extent in this volume. Both approaches are used in tandem in the world’s largest translation service, the European Commission Directorate-General for Translation (DGT), as described in detail in the chapter by Drugan et al. Not only is the DGT process the gold standard for translation quality, but it is an unusual example of moving beyond formal equivalence to a scenario in which each language version may be considered the ‘original’ text, due to legal effect, in that these texts “create rights, obligations and legitimate expectations” (ibid.).
What becomes clear when reading the following chapters is the convergence of post-hoc TQA methods, with cross-pollination from one translation method to another. New standard error typologies (see Lommel, in this volume), introduced to replace the previous TQA models, are being taken on for machine translation (MT) evaluation, as described by Popović. The concept of ‘fitness for purpose’ when raw or post-edited MT is considered ‘good enough’ for a notional translation end-user, as discussed by Way (this volume), is now also used for crowdsourced translation, as described by Jiménez-Crespo (this volume). Crowdsourced evaluation is becoming more common within the MT research community, particularly for large-scale competitive shared tasks (Graham et al. 2017).
Consideration of the (usually assumed) requirements of a translation’s end-user is not novel, and follows in the tradition of the functionalist approaches to translation, and in particular Skopos theory (Vermeer 1978). However, the sheer amount of text to translate and the number of language pairs and directions have led to a new level of pragmatism in large translation service providers, whereby a sharpened focus on a targeted end-user (as detailed in Suojanen et al. 2015) has added a new meticulously calibrated variability to translation quality requirements. These requirements may be expressed in vague, relatively undefined terms (such as the prescriptive guidelines for light or medium post-editing), as the exacting requirements of detailed error typology evaluations with an elaborate system of associated penalties, or as a combination of approaches tailored for a translation client.
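To make the idea of an error typology with associated penalties more concrete, the following is a minimal sketch in Python. The severity weights, the error categories, and the normalisation per 100 words are illustrative assumptions only; they are not values prescribed by any standard or by the contributors to this volume.

```python
from collections import Counter

# Hypothetical severity weights; real evaluation schemes define their own.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def penalty_score(errors, word_count, weights=SEVERITY_WEIGHTS):
    """Return total penalty points per 100 words of evaluated text.

    errors: iterable of (category, severity) tuples from a human annotator.
    word_count: number of words in the evaluated sample.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    total = sum(weights[severity] for _category, severity in errors)
    return 100.0 * total / word_count


def errors_by_category(errors):
    """Count annotated errors per category, ignoring severity."""
    return Counter(category for category, _severity in errors)


# Illustrative annotations for a 250-word sample (invented data).
sample_errors = [
    ("terminology", "minor"),
    ("accuracy", "major"),
    ("fluency", "minor"),
]
score = penalty_score(sample_errors, word_count=250)
```

In a scheme of this kind, the resulting score would typically be compared against a threshold agreed with the translation client, which is where the calibration to a targeted end-user enters.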
The subtitle of this volume is From Principles to Practice, and some of the principles of TQA are only of use in an academic context, or in the case of error typologies have been too unwieldy to apply at scale, tend to be difficult to explain to translation clients, and were for many years at a developmental standstill. For this reason, we hope to bring principles and practice together in this volume, with descriptions of highly complex use-cases for many text and translation process types (Drugan et al. and Way), novel empirical applications of translation technology and evaluation (O’Brien et al.; Toral and Way; Specia and Shah), and considerations of broadened future TQA applications (Doherty and Kruger; Jiménez-Crespo).
The first part of the book examines the state-of-the-art in TQA, beginning with a chapter by Castilho et al., who provide a historical background and an overview of established and developing approaches to human and machine TQA. In this opening chapter, the most popular automatic MT evaluation metrics are described in detail. This is followed by an in-depth discussion of the leading current issues in TQA, including problems with standardisation and consistency, particularly in relation to translator education and training, a topic returned to in a later chapter of the book (see Doherty et al.).
Drugan, Strandvik, and Vuorinen bring experience and expertise from both academia and translation quality in an institutional setting. Drugan (2013) has previously combined academic, theoretical and professional approaches to measuring and improving translation quality, offering a critical analysis of their effectiveness especially in industrial scenarios. This chapter focuses on the particular quality requirements of the European Commission DGT, wherein legislative texts in all official EU languages are considered equivalent and equally authentic. Defining translation quality and consistently managing quality expectations are challenging tasks when trying to balance legal compliance and maintain consistency among a huge and geographically dispersed cohort of translators, with varied translation processes and working conditions as either public servants or freelance workers. This chapter details the complex and interconnected TQA methodologies employed within the DGT. The authors also consider implications of these processes beyond translation for institutional and (increasingly) freelance translators to whom European Commission work is outsourced, with regard to power, agency, professionalism, and values.
Jiménez-Crespo describes how the relatively novel practice of crowdsourcing has impacted the notion of translation quality by expanding the fitness-for-purpose model (see also Way, in this volume), introduced with the growth in digital content and the need for fast (and, one may add, low-cost) translations. The author discusses the distribution of responsibility to different agents, that is, how the responsibility for the translation quality may shift from language providers and translators to participants from crowd platforms. Jiménez-Crespo presents an overview of workflow practices in these crowd platforms inspired both by professional context and translation automation. Translation crowdsourcing also introduces some particular TQA measures, such as crowd selection, embedded translator testing, and community-building.
In the final chapter of this part, Doherty et al. revisit some of the key issues addressed in the volume, focusing on academic applications of TQA. Firstly, teaching of contemporary evaluation methodologies provides translation graduates with skills that we can already see prove valuable, with graduates moving on to advisory roles in the language industry, using their expertise to take on such tasks as workflow design, project preparation, and MT training data selection. Secondly, familiarity with TQA measures prepares translation graduates for the standards that will be applied and the quality expectations in the translation industry. For these reasons, we advocate adding both of these applications of TQA to translation curricula.
In the second part of the book, we look at developing applications of TQA. Lommel has a long history of vital contributions to quality issues and standards in the translation and localization industry. His chapter provides a historical context for translation error typologies (also known as typologies and classifications), mostly used in the translation industry, and the background to recent influential developments in the area, namely the Multidimensional Quality Metrics (MQM) and Dynamic Quality Framework (DQF). While these two approaches began independently, they were harmonised in 2014. Lommel describes the hierarchy, dimensions, scoring, and specifications of this recent systematic standard approach to assessing translation quality.
Popović describes the state-of-the-art for automatic, human, and computer-aided annotation of MT errors according to various error typologies as a way to compare MT systems or as a diagnostic tool for MT developers. Human-annotated translations can give deep insight, but tend to suffer from low inter-annotator agreement, especially when error classes are not clearly defined. Popović explains why automatic tools struggle to accurately identify very specific error types and tend to confuse mistranslations, omissions, and additions. She also discusses the evolution of MT error typologies, and describes experiments with different analysis methods (including the MQM, described in detail in Lommel’s chapter), such as attempts to employ linguistic check-points to identify specific linguistic phenomena that cause particular problems. This chapter brings up the need to consolidate disparate MT evaluation typologies, in order to improve consistency. One particularly interesting suggestion is that widespread use of MQM for MT evaluation would allow subsets of a single unified metric to be used for both human and MT evaluation. As an alternative, Popović suggests a unified metric based on a number of typologies used in previous studies.
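Inter-annotator agreement of the kind mentioned above is commonly quantified with a chance-corrected statistic such as Cohen's kappa. The sketch below is an illustration under stated assumptions: the error labels and the two annotators' judgements are invented, and kappa is only one of several agreement measures that could be used for this purpose.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same segments."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of segments on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented error labels assigned by two annotators to six MT segments.
annotator_1 = ["mistranslation", "omission", "addition", "mistranslation", "none", "none"]
annotator_2 = ["mistranslation", "mistranslation", "addition", "omission", "none", "none"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on four of six segments, but after correcting for chance agreement kappa is 7/13 (about 0.54), which illustrates how loosely defined error classes can depress measured agreement.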
Way (2013) described use-cases appropriate for MT based on the perishability of texts, but increases in quality, coupled with economic considerations, mean that MT is being pressed into action in more workflows, and MT post-editing has gone mainstream (Lommel and DePalma 2016). Cognisant of this, Way updates his assessment of MT today in his contribution, explaining the “proper place” of MT, human and automatic evaluation metrics, and task-based MT evaluation. He addresses the weaknesses of automatic evaluation, and describes the changing nature of MT systems. Finally, he examines how MT is currently deployed, considers associated questions of MT quality expectations and perception, and predicts its continued use as a production tool alongside translation memory.
Audiovisual translation (AVT) is absent from contemporary discussions of TQA, especially in the context of translation technologies of MT and post-editing. Doherty and Kruger remedy this with a comprehensive overview of the state-of-the-art in computer-aided AVT, and the difficulties of assessing whether translated media, consumed using a wide range of media devices, translated employing a functionalist approach, meet the needs of a heterogeneous audience. The authors consider the difficulties of merging the distinct AVT quality needs, usually prescriptively imposed (see Ivarsson and Carroll 1998), with concepts in TQA (such as accuracy) when spatial and temporal constraints may require the subtitler to substantially reformulate a target segment. They believe that the fields of language technology and AVT are (and should be) convergent and that this convergence, along with insights from cognitive translation studies, will help to show a holistic way in which traditional and new metrics can be used complementarily for measuring AVT quality.
The chapters in Part III present empirical studies, employing novel applications of TQA. One suggestion from Way’s contribution on quality expectations of MT is “system-internal confidence measures”. Relatedly, in their chapter, Specia and Shah discuss the historical background, and “promising practical uses” of MT quality estimation (QE, also abbreviated MTQE). The purpose of QE is to provide an indicator of quality for individual MT outputs in use, where reference segments are not available. The authors describe several possible applications of sentence-level QE, such as to predict post-editing effort, to select a preferred translation from several produced by different MT systems, to choose effective MT training data, or to identify samples for human evaluation using an error typology (rather than sampling randomly), and provide experimental results to show to what extent QE has worked in a research environment. MTQE is very much an active research topic today, and the authors are clear that this is “far from a solved problem”. However, they envisage that successful deployment of MTQE has “immense potential to make MT more useful to end-users of various types”.
O’Brien, Simard and Goulet explore the potential of using MT and self-post-editing as a second-language academic writing aid. The authors choose an interesting range of quality assessment measures, comparing participant perceptions, temporal effort (time spent), and revisions required when participants write an academic abstract in their first language, then machine translate and self-post-edit it, and when they write the abstract in English (their L2). Results are also compared using an automatic grammar- and style-checking tool. Participants were generally impressed with the quality of MT output, but some had difficulty in finding the appropriate terminology in their native language, as they were used to using English language terms. This study demonstrates the potential for reducing the cognitive burden of authors when accessing international academic publishing via the current lingua franca of English.
A recent, novel approach to MT using neural network models is introduced by Toral and Way. Although Way advises the use of MT for highly perishable texts in his other contribution, with Toral he investigates the results when that advice is completely disregarded, translating a non-perishable and difficult content type. They apply neural MT to literary texts and perform a quality assessment comparing a portion of the output to that from statistical MT systems and published human translations, showing surprisingly promising results, especially considering the challenging text type.
At an extremely dynamic time for the translation industry, where the pressures of technology and economy have accelerated change (not only for the better) in work processes and working conditions, we consider that it is essential to continually update knowledge of technology and process, in order to maximise one’s agency as a translator, student, teacher, researcher, or process manager. As an aid for such a purpose, we, the editors, believe that this volume covers the dominant methods from various translation scenarios, providing a comprehensive collection of contributions by leading international experts in human and machine translation quality and evaluation who can situate current developments and chart future trends of significant interest to the translation community as a whole.
Abdallah (2017) suggests that an all-encompassing quality model should include not only process and product, but also social quality, as the interactions and work practices of those in a translation network are likely to affect process and product quality (see also Sect. 6.4, in Castilho et al.). Drugan et al. consider this in their contribution.
- Abdallah K (2017) Three-dimensional quality model: the focal point of workflow management in organisational ergonomics. Paper presented at Translating and the Computer 39, London
- Drugan J (2013) Quality in professional translation: assessment and improvement. Bloomsbury, London
- Ivarsson J, Carroll M (1998) Subtitling. TransEdit, Simrishamn
- Lommel A, DePalma DA (2016) Europe’s leading role in machine translation. Common Sense Advisory, Boston
- Suojanen T, Koskinen K, Tuominen T (2015) User-centred translation. Routledge, London
- Way A (2013) Traditional and emerging use-cases for machine translation. Paper presented at Translating and the Computer 35, London