Semi-automatic Quasi-morphological Word Segmentation for Neural Machine Translation

Zuters, Jānis; Strazds, Gus; Immers, Kārlis

doi:10.1007/978-3-319-97571-9_23

Part of the book series: Communications in Computer and Information Science (CCIS, volume 838)

Included in the following conference series:

International Baltic Conference on Databases and Information Systems


Abstract

This paper proposes the Prefix-Root-Postfix-Encoding (PRPE) algorithm, which performs close-to-morphological segmentation of words as part of text pre-processing in machine translation. PRPE is a cross-language algorithm that requires only minor tweaking to adapt it to a particular language, a property that makes it potentially useful for morphologically rich languages for which no morphological analysers are available. As a key part of the proposed algorithm we introduce the ‘Root alignment’ principle to extract potential sub-words from a corpus, as well as a special technique for constructing words from potential sub-words. We conducted experiments with two different neural machine translation (NMT) systems, training them on parallel corpora for English-Latvian and Latvian-English translation. Evaluation of translation quality showed improvements in BLEU scores when the data were pre-processed using the proposed algorithm, compared to a couple of baseline word segmentation algorithms. Although we were able to demonstrate improvements in both translation directions and for both NMT systems, the improvements were relatively minor, and our experiments show that machine translation with inflected languages remains challenging, especially when translating into a highly inflected language.
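To make the idea of prefix-root-postfix segmentation concrete, the following is a minimal sketch only. It is not the PRPE algorithm and does not implement the ‘Root alignment’ principle, neither of which is specified in the abstract. The affix lists, the minimum root length and the function name segment are hypothetical, invented purely for illustration; a real system would presumably derive its sub-word inventory from a corpus.

```python
from typing import List

# Hypothetical affix inventories, invented for illustration only.
# They are NOT taken from the paper or from the PRPE source code.
PREFIXES = ["un", "re"]
POSTFIXES = ["ing", "ed", "s"]
MIN_ROOT_LEN = 3  # assumed lower bound on the length of a root


def segment(word: str) -> List[str]:
    """Split a word into [prefix, root, postfix], omitting empty parts."""
    # Try the longest matching prefix that still leaves a long enough root.
    prefix = ""
    for candidate in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(candidate) and len(word) - len(candidate) >= MIN_ROOT_LEN:
            prefix = candidate
            break
    rest = word[len(prefix):]

    # Try the longest matching postfix under the same root-length constraint.
    postfix = ""
    for candidate in sorted(POSTFIXES, key=len, reverse=True):
        if rest.endswith(candidate) and len(rest) - len(candidate) >= MIN_ROOT_LEN:
            postfix = candidate
            break
    root = rest[:len(rest) - len(postfix)]

    return [part for part in (prefix, root, postfix) if part]


if __name__ == "__main__":
    for w in ["replaying", "cats", "red"]:
        print(w, segment(w))
```

With these toy lists, "replaying" is split into re, play and ing, while "red" is left whole because stripping either affix would leave a root shorter than the assumed minimum.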


Notes

1. Source code available at: https://github.com/zuters/prpe.
2. http://www.statmt.org/wmt17/translation-task.html.
3. https://github.com/rsennrich/subword-nmt.
4. https://github.com/EdinburghNLP/nematus.
5. https://github.com/facebookresearch/fairseq-py.
6. http://www.statmt.org/wmt17/results.html.
7. Statistical significance was estimated via bootstrap resampling using the script analysis/bootstrap-hypothesis-difference-significance.pl from the Moses MT system: https://github.com/moses-smt/mosesdecoder.
8. http://data.statmt.org/wmt17_systems/training.
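
The bootstrap resampling mentioned in note 7 can be illustrated with a small sketch. This is not the Moses script and does not reproduce its behaviour. As a simplifying assumption, it compares the sums of hypothetical per-sentence quality scores for two systems, whereas a BLEU-based test would recompute corpus-level BLEU on each resampled test set. The function name paired_bootstrap and its parameters are invented for illustration.

```python
import random
from typing import Sequence


def paired_bootstrap(scores_a: Sequence[float],
                     scores_b: Sequence[float],
                     n_samples: int = 1000,
                     seed: int = 0) -> float:
    """Return the fraction of resampled test sets on which A beats B.

    scores_a[i] and scores_b[i] are hypothetical per-sentence scores of
    two systems on the same test sentence i (a simplifying assumption).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement; the same indices are
        # used for both systems, which is what makes the test paired.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples


if __name__ == "__main__":
    a = [0.31, 0.42, 0.28, 0.39, 0.35]
    b = [0.30, 0.40, 0.29, 0.36, 0.33]
    print(paired_bootstrap(a, b))
```

A fraction close to 1 suggests that system A's advantage is unlikely to be an artefact of the particular test sentences; a commonly used threshold is 0.95.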


Acknowledgements

The research has been supported by the European Regional Development Fund within the research project “Neural Network Modelling for Inflected Natural Languages” No. 1.1.1.1/16/A/215, and the Faculty of Computing, University of Latvia.

Author information

Authors and Affiliations

University of Latvia, Raina blvd. 19, Riga, 1586, Latvia
Jānis Zuters, Gus Strazds & Kārlis Immers

Authors

Jānis Zuters
Gus Strazds
Kārlis Immers

Corresponding author

Correspondence to Jānis Zuters.

Editor information

Editors and Affiliations

Institute of Data Science and Digital Technologies, Vilnius University, Vilnius, Lithuania
Audrone Lupeikiene
Information Systems Department, Vilnius Gediminas Technical University, Vilnius, Lithuania
Olegas Vasilecas
Institute of Data Science and Digital Technologies, Vilnius University, Vilnius, Lithuania
Gintautas Dzemyda

About this paper

Cite this paper

Zuters, J., Strazds, G., Immers, K. (2018). Semi-automatic Quasi-morphological Word Segmentation for Neural Machine Translation. In: Lupeikiene, A., Vasilecas, O., Dzemyda, G. (eds) Databases and Information Systems. DB&IS 2018. Communications in Computer and Information Science, vol 838. Springer, Cham. https://doi.org/10.1007/978-3-319-97571-9_23

DOI: https://doi.org/10.1007/978-3-319-97571-9_23
Published: 15 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97570-2
Online ISBN: 978-3-319-97571-9
eBook Packages: Computer Science, Computer Science (R0)
