A Hybrid Approach to Statistical Machine Translation Between Standard and Dialectal Varieties

Neubarth, Friedrich; Haddow, Barry; Huerta, Adolfo Hernández; Trost, Harald

doi:10.1007/978-3-319-43808-5_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9561)

Included in the following conference series: Language and Technology Conference

Abstract

Statistical machine translation (SMT) for dialectal varieties usually suffers from data sparsity, but combining word-level and character-level models can yield good results even with little training data by exploiting the relative proximity between the two varieties. In this paper, we describe a specific problem, and its solution, that arises in translation between standard Austrian German and Viennese dialect. In general, a phrase-based approach to SMT cannot deal satisfactorily with complex lexical transformations and syntactic reordering; with sparse resources, it becomes nearly impossible. These are typical cases where rule-based preprocessing of the source data is the preferable option, hence the hybrid character of the resulting system. One such case is the transformation of synthetic imperfect verb forms into perfect tense with a finite auxiliary and a past participle, which involves detecting clause boundaries and identifying the clause type. We present an approach that uses a full parse of the source sentences and discuss the problems that arise with such an approach. Within the developed SMT system, the models trained on preprocessed data unsurprisingly fare better than those trained on the original data, but unchanged sentences also gain slightly better scores. This shows that introducing a rule-based layer that deals with systematic non-local transformations increases the overall performance of the system, most probably due to higher alignment accuracy.
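
To make the imperfect-to-perfect transformation concrete, the following is a minimal sketch and not the authors' implementation. It assumes the clause has already been isolated and its type identified (the paper obtains this from a full parse), and it uses a tiny hypothetical lexicon, `PERFECT_FORMS`, in place of real morphological analysis. The function name `imperfect_to_perfect` is likewise an illustrative assumption.

```python
# Hypothetical toy lexicon (illustration only): third person singular
# imperfect form -> (finite auxiliary, past participle).
PERFECT_FORMS = {
    "kaufte": ("hat", "gekauft"),
    "ging": ("ist", "gegangen"),
    "sah": ("hat", "gesehen"),
}


def imperfect_to_perfect(tokens, clause_type):
    """Rewrite the first known imperfect verb in a single clause.

    tokens: list of word tokens belonging to one clause.
    clause_type: "main" (verb-second) or "subordinate" (verb-final);
    assumed to be supplied by an upstream parser.
    """
    for i, tok in enumerate(tokens):
        if tok in PERFECT_FORMS:
            aux, participle = PERFECT_FORMS[tok]
            if clause_type == "main":
                # The auxiliary takes over the finite verb's slot and the
                # participle moves to the end of the clause.
                return tokens[:i] + [aux] + tokens[i + 1:] + [participle]
            # Verb-final clause: the participle precedes the auxiliary.
            return tokens[:i] + tokens[i + 1:] + [participle, aux]
    # No known imperfect form: return the clause unchanged.
    return list(tokens)
```

Under these assumptions, the main clause "er kaufte ein Auto" becomes "er hat ein Auto gekauft", while the subordinate clause "dass er ein Auto kaufte" becomes "dass er ein Auto gekauft hat". The participle lands at a distance from the auxiliary that depends on clause length and type, which is the kind of non-local change a phrase-based model struggles to learn from sparse data.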

Notes

1. See [1] for details on the orthography developed for this project.
2. There are two exceptions which do have imperfect forms: the auxiliary sein ‘to be’ and the two modals sollen ‘ought to’ and wollen ‘want’.
3. A phenomenon with similar consequences for SMT is the lack of genitive case in Viennese dialect (VD). It is replaced either by dative or, in possessive constructions, by a prepositional phrase (s auto fon da schwesda – das Auto von der Schwester ‘the car of the sister’). Alternatively, with animate possessors, there is also a construction that does not exist in Standard German: the possessor in dative case followed by a resumptive possessive pronoun (da schwesda ia auto – ?der Schwester ihr Auto ‘the sister-Dat her car’). These constructions will not be discussed in this paper.

References

1. Hildenbrandt, T., Moosmüller, S., Neubarth, F.: Orthographic encoding of the Viennese dialect for machine translation. In: Vetulani, Z., Uszkoreit, H. (eds.) Human Language Technologies as a Challenge for Computer Science and Linguistics, Proceedings of the 6th Language & Technology Conference (LTC 2013), 7–9 December 2013, Poznan, Poland, pp. 399–403 (2013)
2. Schikola, H.: Schriftdeutsch und Wienerisch. Österreichischer Bundesverlag für Unterricht, Wissenschaft und Kunst, Wien (1954)
3. Hornung, M.: Wörterbuch der Wiener Mundart. ÖBV - Pädagogischer Verlag, Wien (1998)
4. Collins, M., Koehn, P., Kučerová, I.: Clause restructuring for statistical machine translation. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, June 2005, pp. 531–540 (2005)
5. Labov, W.: Principles of Linguistic Change (II): Social Factors. Blackwell, Massachusetts (2001)
6. Nakov, P., Tiedemann, J.: Combining word-level and character-level models for machine translation between closely-related languages. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju, Rep. of Korea, 8–14 July 2012, pp. 301–305 (2012)
7. Scherrer, Y., Erjavec, T.: Modernizing historical Slovene words with character-based SMT. In: Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, pp. 58–62 (2013)
8. Nakov, P., Ng, H.T.: Improving statistical machine translation for a resource-poor language using related resource-rich languages. J. Artif. Intell. Res. 44, 179–222 (2012)
9. Zbib, R., Malchiodi, E., Devlin, J., Stallard, D., Matsoukas, S., Schwartz, R., Makhoul, J., Zaidan, O., Callison-Burch, C.: Machine translation of Arabic dialects. In: Proceedings of NAACL: HLT 2012, Montreal, Canada, pp. 49–59 (2012)
10. Sawaf, H.: Arabic dialect handling in hybrid machine translation. In: Proceedings of the 9th Conference of the Association for Machine Translation in the Americas (AMTA), Denver, Colorado (2010)
11. Haddow, B., Hernández Huerta, A., Neubarth, F., Trost, H.: Corpus development for machine translation between standard and dialectal varieties. In: Proceedings of the Workshop Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants of the 9th International Conference on Recent Advances in Natural Language Processing (RANLP 2013), 13 September 2013, Hissar, Bulgaria, pp. 7–14 (2013)
12. Korpusbasierte Wortgrundformenliste DEREWO, v-ww-bll-320000g-2012-12-31-1.0, mit Benutzerdokumentation, Institut für Deutsche Sprache, Programmbereich Korpuslinguistik, Mannheim, Deutschland (2013)
13. den Besten, H.: On the interaction of root transformations and lexical deletive rules. In: Abraham, W. (ed.) On the Formal Syntax of the Westgermania. Papers from the 3rd Groningen Grammar Talks, pp. 47–131. John Benjamins, Amsterdam (1983)
14. Haider, H.: The case of German. In: Toman, J. (ed.) Studies in German Grammar, pp. 65–101. Foris, Dordrecht (1985)
15. Diedrichsen, E.: Zu einer semantischen Klassifikation der intransitiven haben- und sein-Verben im Deutschen. In: Katz, G., et al. (eds.) Sinn & Bedeutung VI, Proceedings of the 6th Annual Meeting of the Gesellschaft für Semantik, University of Osnabrück (2002)
16. Schmid, H.: Efficient parsing of highly ambiguous context-free grammars with bit vectors. In: Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), vol. 1, Geneva, Switzerland, pp. 162–168 (2004)
17. Brants, S., Dipper, S., Eisenberg, P., Hansen, S., König, E., Lezius, W., Rohrer, C., Smith, G., Uszkoreit, H.: TIGER: linguistic interpretation of a German corpus. J. Lang. Comput. 2004(2), 597–620 (2004)
18. Björkelund, A., Bohnet, B., Hafdell, L., Nugues, P.: A high-performance syntactic and semantic dependency parser. In: Coling 2010: Demonstration Volume, Beijing, 23–27 August 2010, pp. 33–36 (2010)
19. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, pp. 177–180 (2007)
20. Vilar, D., Peter, J.-T., Ney, H.: Can we translate letters? In: Proceedings of the 2nd Workshop on Statistical Machine Translation, Prague, Czech Republic, ACL, pp. 33–39 (2007)
21. Tiedemann, J.: Character-based PSMT for closely related languages. In: Màrquez, L., Somers, H. (eds.) Proceedings of 13th Annual Conference of the European Association for Machine Translation (EAMT 2009), Barcelona, Spain, pp. 12–19 (2009)
22. Och, F.J., Ney, H.: Improved statistical alignment models. In: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL00), Hong Kong, China, pp. 440–447 (2000)
23. Postel, H.J.: Die Kölner Phonetik. Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse. In: IBM Nachrichten, 19, pp. 925–931 (1969)

Acknowledgements

The work presented in this paper was carried out within the project ‘Machine Learning Techniques for Modeling of Language Varieties’ (MLT4MLV - ICT10-049, 2011–2013), which was funded by the Vienna Science and Technology Fund (WWTF).

Author information

Authors and Affiliations

Austrian Research Institute for Artificial Intelligence (OFAI), Vienna, Austria
Friedrich Neubarth
ILCC, School of Informatics, University of Edinburgh, Edinburgh, Scotland
Barry Haddow
Nuance Communications Aachen, Aachen, Germany
Adolfo Hernández Huerta
Medical University of Vienna, Vienna, Austria
Harald Trost

Corresponding author

Correspondence to Friedrich Neubarth.

Editor information

Editors and Affiliations

Adam Mickiewicz University, Poznań, Poland
Zygmunt Vetulani
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI GmbH), Saarbrücken, Saarland, Germany
Hans Uszkoreit
Adam Mickiewicz University, Poznań, Poland
Marek Kubis

About this paper

Cite this paper

Neubarth, F., Haddow, B., Huerta, A.H., Trost, H. (2016). A Hybrid Approach to Statistical Machine Translation Between Standard and Dialectal Varieties. In: Vetulani, Z., Uszkoreit, H., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2013. Lecture Notes in Computer Science, vol 9561. Springer, Cham. https://doi.org/10.1007/978-3-319-43808-5_26

DOI: https://doi.org/10.1007/978-3-319-43808-5_26
Published: 30 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43807-8
Online ISBN: 978-3-319-43808-5
eBook Packages: Computer Science, Computer Science (R0)
