Morphological and Language-Agnostic Word Segmentation for NMT

  • Dominik Macháček
  • Jonáš Vidra
  • Ondřej Bojar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11107)


The state of the art in handling rich morphology in neural machine translation (NMT) is to break word forms into subword units so that the overall vocabulary of these units fits within the practical limits imposed by the NMT model and GPU memory capacity. In this paper, we compare two common but linguistically uninformed methods of subword construction (BPE and STE, the method implemented in the Tensor2Tensor toolkit) with two linguistically motivated methods: Morfessor and a novel method based on a derivational dictionary. Our experiments with German-to-Czech translation, where both languages are morphologically rich, show that, so far, the linguistically uninformed methods perform better. Furthermore, we identify a critical difference between BPE and STE and present a simple pre-processing step for BPE that considerably increases translation quality as measured by automatic metrics.
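As a rough illustration of what linguistically uninformed subword construction involves, the following minimal Python sketch learns BPE-style merges from a toy word-frequency table and applies them to segment a word. It is a generic rendering of the BPE idea, not the implementation used in the paper; the function names (learn_bpe, merge_pair, segment), the "</w>" end-of-word marker, the tie-breaking rule and the toy data are all assumptions made for illustration. It does not attempt to reproduce STE or the pre-processing step mentioned above.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by its concatenation."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: count} dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word marker
    # (the marker string is an illustrative choice).
    vocab = {tuple(word) + ("</w>",): count for word, count in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken lexicographically
        # (an assumption, chosen only to make the sketch deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges


def segment(word, merges):
    """Segment `word` by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # toy data
    learned = learn_bpe(toy_counts, num_merges=10)
    print(segment("lowest", learned))
```

The number of merges controls the resulting vocabulary size, which is the quantity that has to fit the model and memory limits: fewer merges yield shorter units and longer sequences, more merges yield units closer to whole words.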



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic
