A Parallel Corpus of Translationese

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9624)

Abstract

We describe a set of bilingual English-French and English-German parallel corpora in which the direction of translation is accurately and reliably annotated. The corpora are diverse, consisting of parliamentary proceedings, literary works, transcriptions of TED talks, and political commentary. They will be instrumental for research on translationese and its applications to (human and machine) translation; specifically, they can be used for the task of translationese identification, a research direction that has enjoyed growing interest in recent years. To validate the quality and reliability of the corpora, we replicated previous results of supervised and unsupervised identification of translationese, and further extended the experiments to additional datasets and languages.

Notes

  1. All corpora are available at http://cl.haifa.ac.il/projects/translationese/index.shtml.

  2. We use “EUR”, “HAN”, “LIT”, “TED” and “POL” to denote the five corpora hereafter.

  3. The original Europarl is available from http://www.statmt.org/europarl/.

  4. http://europa.eu/about-eu/facts-figures/administration/index_en.htm.

  5. http://www.theguardian.com/education/datablog/2014/may/21/european-parliament-english-language-official-debates-data.

  6. http://www.gutenberg.org.

  7. http://farkastranslations.com/.

  8. http://en.wikisource.org/.

  9. TEDx are TED-like events not restricted to a specific language. We could not find a sufficient number of German TEDx talks translated into English.

  10. http://developer.ted.com/.

  11. http://www.project-syndicate.org/.

  12. http://www.diplomatisches-magazin.de/.

  13. The list of French and German FW was downloaded from https://code.google.com/archive/p/stop-words/.

  14. Feature combinations yield similar, occasionally slightly better, results; we refrain from providing a full analysis in this paper.

  15. Standard deviation in most experiments was close to 0.

Acknowledgments

This research was supported by a grant from the Israeli Ministry of Science and Technology. We are grateful to Noam Ordan for much advice and encouragement. We also thank Sergiu Nisioi for helpful suggestions. We are grateful to Philipp Koehn for making the Europarl corpus available; to Cyril Goutte, George Foster and Pierre Isabelle for providing us with an annotated version of the Hansard corpus; to François Yvon and András Farkas (http://farkastranslations.com) for contributing their literary corpora; and to the TED OTP team for sharing TED talks and their translations. Finally, we thank Raphael Salkie for sharing his diverse English-German corpus.

Author information

Corresponding author

Correspondence to Ella Rabinovich.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Rabinovich, E., Wintner, S., Lewinsohn, O.L. (2018). A Parallel Corpus of Translationese. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_12

  • DOI: https://doi.org/10.1007/978-3-319-75487-1_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-75486-4

  • Online ISBN: 978-3-319-75487-1

  • eBook Packages: Computer Science (R0)
