Abstract
It is generally accepted that the performance of a statistical machine translation (SMT) system depends significantly on how closely the domain of the training data matches that of the test data. In recent years, several methods have been proposed to deal with out-of-domain words. Little to no attention has been paid, however, to text genre within the same domain. In this paper we demonstrate that the style of the training corpus may influence the quality of the translation output even when the domain of the training and test data remains almost unchanged but the text genre changes. We use the JRC-Acquis as training data and the Europarl corpus as test data. For comparison, we also include experiments with out-of-domain test data to show how the performance of the SMT system varies.
Notes
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9. The tag <p> from the initial HTML files.
- 10. In the Moses description, all sentences longer than forty tokens are excluded.
- 11. Status: February 2011; http://www.statmt.org/europarl/
- 12. A one-to-one comparison is not possible, as the training and test data are not the same.
- 13. Word-form = declension form, conjugation form, etc.
References
Calude, A.: Machine translation of various text genres. Presented at 7th Language and Society Conference of the New Zealand Linguistic Society. Hamilton, New Zealand, 12 p., November 2002. (unpublished) (http://www.mt-archive.info/Calude-2003.pdf)
Cristea, D.: Romanian language technology and resources go to Europe. Presentation held at the FP7 Language Technology Informative Days, January, 20–11 (2009)
Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of the Second International Conference on Human Language Technology Research, pp. 138–145. Morgan Kaufmann Publishers Inc., San Francisco (2002)
Gavrila, M.: Improving recombination in a linear EBMT system by use of constraints. Ph.D. thesis, University of Hamburg (2012)
Gavrila, M., Elita, N.: Roger - un corpus paralel aliniat. In: Resurse Lingvistice si Instrumente pentru Prelucrarea Limbii Romane Workshop Proceedings, pp. 63–67, Ed. Univ. Alexandru Ioan Cuza, December 2006. Workshop held in November 2006. ISBN: 978-973-703-208-9
Ignat, C.: Improving Statistical Alignment and Translation Using Highly Multilingual Corpora. Ph.D. thesis, INSA - LGeco- LICIA, Strasbourg, France, 16 June 2009
Koehn, P.: Europarl: A Parallel Corpus for Statistical Machine Translation. In: MT Summit (2005)
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, pp. 177–180, Prague, Czech Republic, June 2007
Koehn, P., Birch, A., Steinberger, R.: 462 Machine Translation Systems for Europe. In: MT Summit (2009)
Niehues, J., Waibel, A.: Domain adaptation in statistical machine translation using factored translation models. In: Proceedings of EAMT, Saint-Raphael (2010)
Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)
Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Session: Machine Translation and Evaluation, pp. 311–318. Association for Computational Linguistics, Morristown, Philadelphia (2002)
Rousu, J.: SMART Project: Workpackage 3 advanced language models. Report of the EU project: SMART (2008)
Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A study of translation edit rate with targeted human annotation. In: Proceedings of Association for Machine Translation in the Americas, pp. 223–231, August 2006
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., Varga, D.: The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’2006), pp. 2142–2147, May, Genoa, Italy (2006)
Stolcke, A.: SRILM - An extensible language modeling toolkit. In: Proceedings of the International Conference on Spoken Language (2002)
Acknowledgments
Part of the work presented in this paper was carried out within the EU project ATLAS, supported through the ICT-PSP Programme of the EU Commission (topic “Multilingual Web”), and part within the PhD research conducted by Monica Gavrila at the University of Hamburg (see [4]).
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Gavrila, M., Vertan, C. (2014). Text Genre – An Unexplored Parameter in Statistical Machine Translation. In: Vetulani, Z., Mariani, J. (eds) Human Language Technology Challenges for Computer Science and Linguistics. LTC 2011. Lecture Notes in Computer Science, vol 8387. Springer, Cham. https://doi.org/10.1007/978-3-319-08958-4_37
DOI: https://doi.org/10.1007/978-3-319-08958-4_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08957-7
Online ISBN: 978-3-319-08958-4
eBook Packages: Computer Science, Computer Science (R0)