Abstract
The nature of human language and the lack of a spelling convention make historical documents hard for natural language processing to handle. Spelling normalization tackles this problem by adapting their spelling to modern standards in order to achieve orthographic consistency. In this work, we compare several character-based machine translation approaches and propose a method that uses modern documents to enrich neural machine translation models. We tested our proposal on four data sets and observed that the enriched models improved the normalization quality of the neural models. Statistical models, however, yielded better results.
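As an illustration of what "character-based" means here, the following minimal sketch shows one common way to prepare text for a character-level translation system: each sentence is split into space-separated characters, and the original word boundaries are replaced by a marker symbol so that they can be restored after decoding. The marker `@` and the function names `to_chars` and `from_chars` are assumptions made for this example; the abstract does not specify the preprocessing that was actually used.

```python
# Hypothetical word-boundary marker; an assumption for this sketch,
# not a detail taken from the paper.
SPACE = "@"


def to_chars(sentence: str, space: str = SPACE) -> str:
    """Split a sentence into space-separated characters.

    Original spaces are replaced by the marker so that word
    boundaries survive the split.
    """
    return " ".join(space if ch == " " else ch for ch in sentence)


def from_chars(tokens: str, space: str = SPACE) -> str:
    """Invert to_chars: join the characters and restore the spaces."""
    return "".join(" " if tok == space else tok for tok in tokens.split(" "))


if __name__ == "__main__":
    original = "vna cosa"
    split = to_chars(original)
    print(split)              # v n a @ c o s a
    print(from_chars(split))  # vna cosa
```

The round trip is only lossless if the marker never occurs in the input text, so a real pipeline would need to pick a symbol that is absent from the corpus.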
Acknowledgments
The research leading to these results has received funding from the European Union through Programa Operativo del Fondo Europeo de Desarrollo Regional (FEDER) from Comunitat Valenciana (2014–2020) under project Sistemas de fabricación inteligentes para la industria 4.0 (grant agreement IDIFEDER/2018/025); and from Ministerio de Economía y Competitividad (MINECO) under project MISMIS-FAKEnHATE (grant agreement PGC2018-096212-B-C31). We gratefully acknowledge the support of NVIDIA Corporation with the donation of a GPU used for part of this research.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Domingo, M., Casacuberta, F. (2019). Enriching Character-Based Neural Machine Translation with Modern Documents for Achieving an Orthography Consistency in Historical Documents. In: Cristani, M., Prati, A., Lanz, O., Messelodi, S., Sebe, N. (eds) New Trends in Image Analysis and Processing – ICIAP 2019. ICIAP 2019. Lecture Notes in Computer Science, vol 11808. Springer, Cham. https://doi.org/10.1007/978-3-030-30754-7_7
DOI: https://doi.org/10.1007/978-3-030-30754-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30753-0
Online ISBN: 978-3-030-30754-7
eBook Packages: Computer Science (R0)