A Hybrid Approach to Statistical Machine Translation Between Standard and Dialectal Varieties

  • Friedrich Neubarth
  • Barry Haddow
  • Adolfo Hernández Huerta
  • Harald Trost
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9561)

Abstract

Statistical machine translation (SMT) for dialectal varieties usually suffers from data sparsity, but combining word-level and character-level models can yield good results even with little training data by exploiting the relative proximity between the two varieties. In this paper, we describe a specific problem, and its solution, that arises in translation between standard Austrian German and Viennese dialect. In general, a phrase-based approach to SMT cannot deal satisfactorily with complex lexical transformations and syntactic reordering. With sparse resources it becomes nearly impossible. These are typical cases where rule-based preprocessing of the source data is the preferable option, hence the hybrid character of the resulting system. One such case is the transformation of synthetic imperfect verb forms into perfect tense with a finite auxiliary and a past participle, which involves detecting clause boundaries and identifying the clause type. We present an approach that utilizes a full parse of the source sentences and discuss the problems that arise with such an approach. Within the developed SMT system, the models trained on preprocessed data unsurprisingly fare better than those trained on the original data, but sentences left unchanged by the preprocessing also gain slightly better scores. This shows that introducing a rule-based layer dealing with systematic non-local transformations increases the overall performance of the system, most probably due to higher accuracy in the alignment.
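
To make the imperfect-to-perfect transformation concrete, the following is a minimal sketch, not the authors' implementation. It assumes the clause has already been isolated and its type determined (in the paper this information comes from a full parse), and it uses a tiny hypothetical lexicon, `IMPERFECT_TO_PERFECT`, in place of real morphological analysis. The function name `to_perfect` and the labels `"main"` and `"sub"` are likewise illustrative assumptions.

```python
# Hypothetical toy lexicon: imperfect form -> (finite auxiliary, past participle).
# A real system would derive these from morphological analysis, not a lookup table.
IMPERFECT_TO_PERFECT = {
    "kaufte": ("hat", "gekauft"),
    "ging": ("ist", "gegangen"),
}


def to_perfect(tokens, clause_type):
    """Rewrite the first imperfect verb in a single clause as perfect tense.

    tokens      -- list of word tokens forming exactly one clause (assumed)
    clause_type -- "main" (verb-second) or "sub" (verb-final); assumed to be
                   supplied by an upstream parser
    """
    for i, tok in enumerate(tokens):
        if tok not in IMPERFECT_TO_PERFECT:
            continue
        aux, participle = IMPERFECT_TO_PERFECT[tok]
        before, after = tokens[:i], tokens[i + 1:]
        if clause_type == "main":
            # The auxiliary takes the finite-verb slot; the participle
            # moves to the end of the clause.
            return before + [aux] + after + [participle]
        if clause_type == "sub":
            # Verb-final clause: participle first, finite auxiliary last.
            return before + after + [participle, aux]
        raise ValueError(f"unknown clause type: {clause_type!r}")
    # No imperfect form found: return the clause unchanged.
    return list(tokens)


main_clause = to_perfect("er kaufte ein Buch".split(), "main")
sub_clause = to_perfect("dass er ein Buch kaufte".split(), "sub")
print(" ".join(main_clause))
print(" ".join(sub_clause))
```

The two calls illustrate why the clause type matters: the same verb form yields a different word order depending on whether the finite verb sits in second or final position, which is the kind of non-local reordering a phrase-based model trained on sparse data struggles to learn.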

Keywords

Statistical machine translation; Hybrid approaches to MT; Preprocessing in SMT; Language varieties; Dialects; Syntactic parsing

Notes

Acknowledgements

The work presented in this paper was carried out within the project ‘Machine Learning Techniques for Modeling of Language Varieties’ (MLT4MLV - ICT10-049, 2011–2013) which was funded by the Vienna Science and Technology Fund (WWTF).

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Friedrich Neubarth (1)
  • Barry Haddow (2)
  • Adolfo Hernández Huerta (3)
  • Harald Trost (4)

  1. Austrian Research Institute for Artificial Intelligence (OFAI), Vienna, Austria
  2. ILCC, School of Informatics, University of Edinburgh, Edinburgh, Scotland
  3. Nuance Communications Aachen, Aachen, Germany
  4. Medical University of Vienna, Vienna, Austria
