Text Genre – An Unexplored Parameter in Statistical Machine Translation

Gavrila, Monica; Vertan, Cristina

doi:10.1007/978-3-319-08958-4_37


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8387)

Included in the following conference series:

Language and Technology Conference


Abstract

It is generally accepted that the performance of a statistical machine translation (SMT) system depends significantly on the match between the domain of the training data and that of the test data. In recent years, several methods have been proposed to deal with out-of-domain words. However, little to no attention has been paid to text genre within the same domain. In this paper we demonstrate that the style of the training corpus may influence the quality of the translation output even when the domain of the training and test data remains almost unchanged but the text genre changes. We use JRC-Acquis as training data and the Europarl corpus as test data. We also include experiments with out-of-domain test data, as a comparison for the variation in the performance of the SMT system.
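One simple way to quantify a mismatch between training and test data is the share of test tokens that never occur in the training corpus. The following Python sketch is an illustration only and is not taken from the paper: the function names, the lowercased whitespace tokenization, and the toy sentences are all assumptions.

```python
def vocabulary(sentences):
    """Return the set of lowercased whitespace tokens seen in the sentences."""
    return {tok for s in sentences for tok in s.lower().split()}


def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens that never occur in the training data."""
    vocab = vocabulary(train_sentences)
    test_tokens = [tok for s in test_sentences for tok in s.lower().split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Toy data (hypothetical, not drawn from JRC-Acquis or Europarl).
train = [
    "the regulation shall enter into force",
    "member states shall adopt the measures",
]
test_same = ["the member states shall adopt the regulation"]
test_other = ["i would like to thank the rapporteur"]

print(oov_rate(train, test_same))
print(oov_rate(train, test_other))
```

A rate like this captures only vocabulary overlap. Two corpora from the same domain can share most of their vocabulary and still differ in genre, which is the case the abstract is concerned with.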


Notes

1. http://ipsc.jrc.ec.europa.eu/index.php?id=198
2. http://www.statmt.org/europarl/
3. http://www.atlasproject.eu
4. http://www.systranet.com/
5. http://atlasproject.eu
6. See http://www.meta-net.eu/whitepapers
7. www.statmt.org/wmt11/baseline.html
8. www.statmt.org/moses/
9. The tag <p> from the initial HTML files.
10. In the Moses description, all sentences longer than forty tokens are excluded.
11. Status: February 2011; http://www.statmt.org/europarl/
12. A one-to-one comparison is not possible, as the training and test data are not the same.
13. Word-form = declension form, conjugation form, etc.

References

Calude, A.: Machine translation of various text genres. Presented at 7th Language and Society Conference of the New Zealand Linguistic Society. Hamilton, New Zealand, 12 p., November 2002. (unpublished) (http://www.mt-archive.info/Calude-2003.pdf)
Cristea, D.: Romanian language technology and resources go to Europe. Presentation held at the FP7 Language Technology Informative Days, January, 20–11 (2009)
Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of the Second International Conference on Human Language Technology Research, pp. 138–145. Morgan Kaufmann Publishers Inc., San Francisco (2002)
Gavrila, M.: Improving recombination in a linear EBMT system by use of constraints. Ph.D. thesis, University of Hamburg (2012)
Gavrila, M., Elita, N.: Roger - un corpus paralel aliniat. In: Resurse Lingvistice si Instrumente pentru Prelucrarea Limbii Romane Workshop Proceedings, pp. 63–67, Ed. Univ. Alexandru Ioan Cuza, December 2006. Workshop held in November 2006. ISBN: 978-973-703-208-9
Ignat, C.: Improving Statistical Alignment and Translation Using Highly Multilingual Corpora. Ph.D. thesis, INSA - LGeco- LICIA, Strasbourg, France, 16 June 2009
Koehn, P.: Europarl: A Parallel Corpus for Statistical Machine Translation, MT Summit (2005)
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, pp. 177–180, Prague, Czech Republic, June 2007
Koehn, P., Birch, A., Steinberger, R.: 462 Machine Translation Systems for Europe, MT Summit (2009)
Niehues, J., Waibel, A.: Domain adaptation in statistical machine translation using factored translation models. In: Proceedings of EAMT, Saint-Raphael (2010)
Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)
Papineni, K., Roukos, S., Ward, T., Zhu, W-J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Session: Machine Translation and Evaluation, pp. 311–318. Association for Computational Linguistics Morristown, Philadelphia (2002)
Rousu, J., SMART Project: Workpackage 3 advanced language models. Report of the EU project: SMART (2008)
Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A study of translation edit rate with targeted human annotation. In: Proceedings of Association for Machine Translation in the Americas, pp. 223–231, August 2006
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., Varga, D.: The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’2006), pp. 2142–2147, May, Genoa, Italy (2006)
Stolcke, A.: SRILM - An extensible language modeling toolkit. In: Proceedings of the International Conference on Spoken Language (2002)


Acknowledgments

Part of the work presented in this paper was carried out within the EU project ATLAS, supported through the ICT-PSP Programme of the EU Commission (Topic “Multilingual Web”), and within the PhD research conducted by Monica Gavrila at the University of Hamburg (see [4]).

Author information

Authors and Affiliations

University of Hamburg, 30 Vogt-Koelln Str, 22527, Hamburg, Germany
Monica Gavrila & Cristina Vertan

Corresponding author

Correspondence to Cristina Vertan.

Editor information

Editors and Affiliations

Adam Mickiewicz University, Poznań, Poland
Zygmunt Vetulani
IMMI-CNRS, Orsay, France
Joseph Mariani


Cite this paper

Gavrila, M., Vertan, C. (2014). Text Genre – An Unexplored Parameter in Statistical Machine Translation. In: Vetulani, Z., Mariani, J. (eds) Human Language Technology Challenges for Computer Science and Linguistics. LTC 2011. Lecture Notes in Computer Science, vol 8387. Springer, Cham. https://doi.org/10.1007/978-3-319-08958-4_37


DOI: https://doi.org/10.1007/978-3-319-08958-4_37
Published: 26 July 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08957-7
Online ISBN: 978-3-319-08958-4
eBook Packages: Computer Science, Computer Science (R0)
