
Convolve, Attend and Spell: An Attention-based Sequence-to-Sequence Model for Handwritten Word Recognition

  • Lei Kang
  • J. Ignacio Toledo
  • Pau Riba
  • Mauricio Villegas
  • Alicia Fornés
  • Marçal Rusiñol
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11269)

Abstract

This paper proposes Convolve, Attend and Spell, an attention-based sequence-to-sequence model for handwritten word recognition. The proposed architecture has three main parts: an encoder, consisting of a CNN and a bi-directional GRU; an attention mechanism that focuses on the pertinent features; and a decoder, formed by a one-directional GRU, able to spell out the corresponding word character by character. Compared with the recent state of the art, our model achieves competitive results on the IAM dataset without needing any pre-processing step, predefined lexicon, or language model. Code and additional results are available at https://github.com/omni-us/research-seq2seq-HTR.
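The three-stage pipeline described in the abstract can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the layer sizes, character-vocabulary size, start-token index, and greedy decoding loop are all assumptions made for the sketch. It shows the Convolve step (a small CNN turning the word image into a feature sequence), the encoder (a bi-directional GRU over that sequence), an additive attention over the encoder states, and the Spell step (a one-directional GRU decoder emitting one character per time step).

```python
# Illustrative sketch (assumed sizes, not the paper's exact architecture) of a
# CNN + bi-GRU encoder, additive attention, and GRU character decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqHTR(nn.Module):
    def __init__(self, n_chars=80, hidden=256):
        super().__init__()
        # Convolve: CNN that maps the word image to a horizontal feature sequence.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Encoder: bi-directional GRU over the width axis of the feature map.
        self.encoder = nn.GRU(64 * 16, hidden, bidirectional=True, batch_first=True)
        # Attend: additive (Bahdanau-style) attention over encoder states.
        self.att_enc = nn.Linear(2 * hidden, hidden)
        self.att_dec = nn.Linear(hidden, hidden)
        self.att_v = nn.Linear(hidden, 1)
        # Spell: one-directional GRU decoder, one character per step.
        self.embed = nn.Embedding(n_chars, hidden)
        self.decoder = nn.GRUCell(hidden + 2 * hidden, hidden)
        self.out = nn.Linear(hidden, n_chars)

    def forward(self, images, max_len=10):
        b = images.size(0)
        f = self.cnn(images)                  # (b, 64, H/4, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)  # (b, W/4, 64 * H/4)
        enc, _ = self.encoder(f)              # (b, W/4, 2*hidden)
        h = enc.new_zeros(b, self.decoder.hidden_size)
        tok = torch.zeros(b, dtype=torch.long)  # assumed <sos> index 0
        logits = []
        for _ in range(max_len):
            # Attention weights over the encoder sequence for the current state.
            score = self.att_v(torch.tanh(
                self.att_enc(enc) + self.att_dec(h).unsqueeze(1)))
            alpha = F.softmax(score, dim=1)   # (b, W/4, 1)
            ctx = (alpha * enc).sum(1)        # context vector, (b, 2*hidden)
            h = self.decoder(torch.cat([self.embed(tok), ctx], dim=1), h)
            step = self.out(h)
            logits.append(step)
            tok = step.argmax(1)              # greedy: feed prediction back in
        return torch.stack(logits, dim=1)     # (b, max_len, n_chars)

model = Seq2SeqHTR()
out = model(torch.randn(2, 1, 64, 128))       # two 64x128 grayscale word images
print(out.shape)                              # (2, 10, 80): batch, steps, chars
```

In training, the fed-back token would typically come from the ground truth (teacher forcing) rather than the greedy argmax used here for brevity.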

Notes

Acknowledgements

This work has been partially supported by the European Fund for Regional Development (EFRE), Pro FIT-Project “Vollautomatisierung der Wertschöpfungskette im Digitalisierungsprozess von Archivdaten” with support of IBB/EFRE in 2016/2017, the Spanish research projects TIN2014-52072-P and TIN2015-70924-C2-2-R, the grant FPU15/06264 from the Spanish Ministerio de Educación, Cultura y Deporte, the grant 2016-DI-087 from the Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya, the Ramón y Cajal fellowship RYC-2014-16831, the AGAUR Llavor project 2016LLAV00057, the CERCA Program/Generalitat de Catalunya and RecerCaixa (XARXES, 2016ACUP-00008), a research program from Obra Social “La Caixa” with the collaboration of the ACUP. We gratefully acknowledge the support of the NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

Supplementary material

Supplementary material 1 (mp4 920 KB)

Supplementary material 2 (txt 1 KB)


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain
  2. omni:us, Berlin, Germany
