A Parallel Corpus of Translationese

Rabinovich, Ella; Wintner, Shuly; Lewinsohn, Ofek Luis

doi:10.1007/978-3-319-75487-1_12

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9624)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

We describe a set of bilingual English-French and English-German parallel corpora in which the direction of translation is accurately and reliably annotated. The corpora are diverse, consisting of parliamentary proceedings, literary works, transcriptions of TED talks and political commentary. They will be instrumental for research on translationese and its applications to (human and machine) translation; specifically, they can be used for the task of translationese identification, a research direction that has attracted growing interest in recent years. To validate the quality and reliability of the corpora, we replicated previous results of supervised and unsupervised identification of translationese, and further extended the experiments to additional datasets and languages.
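
To make the task of supervised translationese identification concrete, the following is a minimal sketch and not the setup used in the experiments reported in the chapter. It assumes that each text chunk is represented by the relative frequencies of a few function words and that a chunk is labeled as original ("O") or translated ("T") by the nearest class centroid. The word list, the function names and the toy sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math

# Hypothetical, tiny function-word list; a real experiment would use a full list.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "is", "which", "however"]


def fw_features(text):
    """Return the relative frequency of each function word in a text chunk."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train(labeled_chunks):
    """Build one centroid per label from (text, label) pairs."""
    by_label = {}
    for text, label in labeled_chunks:
        by_label.setdefault(label, []).append(fw_features(text))
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(model, text):
    """Return the label whose centroid is closest in Euclidean distance."""
    vector = fw_features(text)
    return min(model, key=lambda label: math.dist(vector, model[label]))


# Toy, made-up training data: "O" = original, "T" = translated.
toy_training_data = [
    ("the cat sat in the house and it is warm", "O"),
    ("however that which is of the that which however", "T"),
]
toy_model = train(toy_training_data)
```

Calling `predict(toy_model, chunk)` on a new chunk returns "O" or "T". With corpora annotated for translation direction, the toy sentences would be replaced by chunks of original and translated text, and the nearest-centroid rule by whichever classifier is under study.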

Notes

1. All corpora are available at http://cl.haifa.ac.il/projects/translationese/index.shtml.
2. We use “EUR”, “HAN”, “LIT”, “TED” and “POL” to denote the five corpora hereafter.
3. The original Europarl is available from http://www.statmt.org/europarl/.
4. http://europa.eu/about-eu/facts-figures/administration/index_en.htm.
5. http://www.theguardian.com/education/datablog/2014/may/21/european-parliament-english-language-official-debates-data.
6. http://www.gutenberg.org.
7. http://farkastranslations.com/.
8. http://en.wikisource.org/.
9. TEDx events are TED-like events not restricted to a specific language. We could not find a sufficient number of German TEDx talks translated into English.
10. http://developer.ted.com/.
11. http://www.project-syndicate.org/.
12. http://www.diplomatisches-magazin.de/.
13. The lists of French and German function words (FW) were downloaded from https://code.google.com/archive/p/stop-words/.
14. Feature combinations yield similar, occasionally slightly better, results; we refrain from providing a full analysis in this paper.
15. The standard deviation in most experiments was close to 0.

Acknowledgments

This research was supported by a grant from the Israeli Ministry of Science and Technology. We are grateful to Noam Ordan for much advice and encouragement. We also thank Sergiu Nisioi for helpful suggestions. We are grateful to Philipp Koehn for making the Europarl corpus available; to Cyril Goutte, George Foster and Pierre Isabelle for providing us with an annotated version of the Hansard corpus; to François Yvon and András Farkas (http://farkastranslations.com) for contributing their literary corpora; and to the TED OTP team for sharing TED talks and their translations. Finally, we thank Raphael Salkie for sharing his diverse English-German corpus.

Author information

Authors and Affiliations

Department of Computer Science, University of Haifa, Haifa, Israel
Ella Rabinovich & Shuly Wintner
Department of Computational Linguistics, Universität des Saarlandes, Saarbrücken, Germany
Ofek Luis Lewinsohn

Corresponding author

Correspondence to Ella Rabinovich.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Rabinovich, E., Wintner, S., Lewinsohn, O.L. (2018). A Parallel Corpus of Translationese. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_12

DOI: https://doi.org/10.1007/978-3-319-75487-1_12
Published: 21 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer Science, Computer Science (R0)
