Abstract
We describe a set of bilingual English-French and English-German parallel corpora in which the direction of translation is accurately and reliably annotated. The corpora are diverse, consisting of parliamentary proceedings, literary works, transcriptions of TED talks and political commentary. They will be instrumental for research on translationese and its applications to (human and machine) translation; specifically, they can be used for the task of translationese identification, a research direction that has attracted growing interest in recent years. To validate the quality and reliability of the corpora, we replicated previous results of supervised and unsupervised identification of translationese, and further extended the experiments to additional datasets and languages.
Notes
1. All corpora are available at
2. We use “EUR”, “HAN”, “LIT”, “TED” and “POL” to denote the five corpora hereafter.
3. The original Europarl is available from http://www.statmt.org/europarl/.
9. TEDx events are TED-like events not restricted to a specific language. We could not find a sufficient number of German TEDx talks translated into English.
13. The list of French and German function words (FW) was downloaded from https://code.google.com/archive/p/stop-words/.
14. Feature combinations yield similar, occasionally slightly better, results; we refrain from providing a full analysis in this paper.
15. Standard deviation in most experiments was close to 0.
Acknowledgments
This research was supported by a grant from the Israeli Ministry of Science and Technology. We are grateful to Noam Ordan for much advice and encouragement. We also thank Sergiu Nisioi for helpful suggestions. We are grateful to Philipp Koehn for making the Europarl corpus available; to Cyril Goutte, George Foster and Pierre Isabelle for providing us with an annotated version of the Hansard corpus; to François Yvon and András Farkas (http://farkastranslations.com) for contributing their literary corpora; and to the TED OTP team for sharing TED talks and their translations. We also thank Raphael Salkie for sharing his diverse English-German corpus.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Rabinovich, E., Wintner, S., Lewinsohn, O.L. (2018). A Parallel Corpus of Translationese. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_12
DOI: https://doi.org/10.1007/978-3-319-75487-1_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer Science, Computer Science (R0)