Attention Recurrent Neural Networks for Image-Based Sequence Text Recognition

  • Guoqiang ZhongEmail author
  • Guohua Yue
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12046)


Image-based sequence text recognition is an important research direction in the field of computer vision. In this paper, we propose a new model called Attention Recurrent Neural Networks (ARNNs) for the image-based sequence text recognition. ARNNs embed the attention mechanism seamlessly into the recurrent neural networks (RNNs) through an attention gate. The attention gate generates a gating signal that is end-to-end trainable, which empowers the ARNNs to adaptively focus on the important information. The proposed attention gate can be applied to any recurrent networks, e.g., standard RNN, Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU). Experimental results on several benchmark datasets demonstrate that ARNNs consistently improves previous approaches on the image-based sequence text recognition tasks.


Attention gate Recurrent Neural Networks (RNNs) Image-based sequence text recognition 


  1. 1.
    Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR (2015)Google Scholar
  2. 2.
    Bengio, Y., Simard, P.Y., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157–166 (1994)CrossRefGoogle Scholar
  3. 3.
    Bissacco, A., Cummins, M., Netzer, Y., Neven, H.: PhotoOCR: reading text in uncontrolled conditions. In: ICCV, pp. 785–792 (2013)Google Scholar
  4. 4.
    Britz, D., Goldie, A., Luong, M., Le, Q.V.: Massive exploration of neural machine translation architectures. CoRR abs/1703.03906 (2017)Google Scholar
  5. 5.
    Cheng, Z., Bai, F., Xu, Y., Zheng, G., Pu, S., Zhou, S.: Focusing attention: towards accurate text recognition in natural images. In: ICCV, pp. 5086–5094 (2017)Google Scholar
  6. 6.
    Chevalier, G.: LARNN: linear attention recurrent neural network. CoRR abs/1808.05578 (2018)Google Scholar
  7. 7.
    Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP, pp. 1724–1734 (2014)Google Scholar
  8. 8.
    Chung, J., Gülçehre, Ç., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555 (2014)Google Scholar
  9. 9.
    Fu, J., Zheng, H., Mei, T.: Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In: CVPR, pp. 4476–4484 (2017)Google Scholar
  10. 10.
    Gao, Y., Chen, Y., Wang, J., Lu, H.: Reading scene text with attention convolutional sequence modeling. CoRR abs/1709.04303 (2017)Google Scholar
  11. 11.
    Ghosh, S.K., Valveny, E., Bagdanov, A.D.: Visual attention models for scene text recognition. In: ICDAR, pp. 943–948 (2017)Google Scholar
  12. 12.
    Goodfellow, I.J., Bulatov, Y., Ibarz, J., Arnoud, S., Shet, V.D.: Multi-digit number recognition from street view imagery using deep convolutional neural networks. In: ICLR (2014)Google Scholar
  13. 13.
    Graves, A., Fernández, S., Gomez, F.J., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: ICML, pp. 369–376 (2006)Google Scholar
  14. 14.
    Graves, A., Mohamed, A., Hinton, G.E.: Speech recognition with deep recurrent neural networks. In: ICASSP, pp. 6645–6649 (2013)Google Scholar
  15. 15.
    Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)CrossRefGoogle Scholar
  16. 16.
    Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. CoRR abs/1406.2227 (2014)Google Scholar
  17. 17.
    Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild with convolutional neural networks. IJCV 116(1), 1–20 (2016)MathSciNetCrossRefGoogle Scholar
  18. 18.
    Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: NIPS, pp. 2017–2025 (2015)Google Scholar
  19. 19.
    Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: ICCV, pp. 1457–1464 (2011)Google Scholar
  20. 20.
    Lee, C., Osindero, S.: Recursive recurrent nets with attention modeling for OCR in the wild. In: CVPR, pp. 2231–2239 (2016)Google Scholar
  21. 21.
    Li, S., Li, W., Cook, C., Zhu, C., Gao, Y.: Independently recurrent neural network (indrnn): building a longer and deeper RNN. In: CVPR, pp. 5457–5466 (2018)Google Scholar
  22. 22.
    Liu, G., Guo, J.: Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing 337, 325–338 (2019)CrossRefGoogle Scholar
  23. 23.
    Luong, M., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. CoRR abs/1508.04025 (2015)Google Scholar
  24. 24.
    Marti, U.V., Bunke, H.: The IAM-database: an english sentence database for offline handwriting recognition. IJDAR 5(1), 39–46 (2002)CrossRefGoogle Scholar
  25. 25.
    Mishra, A., Alahari, K., V. Jawahar, C.: Scene text recognition using higher order language priors. In: BMVC, September 2012Google Scholar
  26. 26.
    Mnih, V., Heess, N., Graves, A., Kavukcuoglu, K.: Recurrent models of visual attention. In: NIPS, pp. 2204–2212 (2014)Google Scholar
  27. 27.
    Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS (2011)Google Scholar
  28. 28.
    Neumann, L., Matas, J.: Real-time scene text localization and recognition. In: CVPR, pp. 3538–3545 (2012)Google Scholar
  29. 29.
    Palangi, H., et al.: Deep sentence embedding using the long short term memory network: Analysis and application to information retrieval. CoRR abs/1502.06922 (2015)Google Scholar
  30. 30.
    Santoro, A., et al.: Relational recurrent neural networks. In: NIPS, pp. 7310–7321 (2018)Google Scholar
  31. 31.
    Shang, L., Lu, Z., Li, H.: Neural responding machine for short-text conversation. In: ACL, pp. 1577–1586 (2015)Google Scholar
  32. 32.
    Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., Bai, X.: ASTER: an attentional scene text recognizer with flexible rectification. TPAMI 41(9), 2035–2048 (2018)CrossRefGoogle Scholar
  33. 33.
    Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. TPAMI 39(11), 2298–2304 (2017)CrossRefGoogle Scholar
  34. 34.
    Shi, B., Wang, X., Lyu, P., Yao, C., Bai, X.: Robust scene text recognition with automatic rectification. In: CVPR, pp. 4168–4176 (2016)Google Scholar
  35. 35.
    Song, S., Lan, C., Xing, J., Zeng, W., Liu, J.: An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In: AAAI, pp. 4263–4270 (2017)Google Scholar
  36. 36.
    Su, B., Lu, S.: Accurate scene text recognition based on recurrent neural network. In: ACCV, pp. 35–48 (2014)CrossRefGoogle Scholar
  37. 37.
    Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS, pp. 3104–3112 (2014)Google Scholar
  38. 38.
    Tian, Y., Hu, W., Jiang, H., Wu, J.: Densely connected attentional pyramid residual network for human pose estimation. Neurocomputing 347, 13–23 (2019)CrossRefGoogle Scholar
  39. 39.
    Tran, K.M., Bisazza, A., Monz, C.: Recurrent memory networks for language modeling. In: NAACL HLT, pp. 321–331 (2016)Google Scholar
  40. 40.
    Vaswani, A., et al.: Attention is all you need. In: NIPS, pp. 6000–6010 (2017)Google Scholar
  41. 41.
    Wang, K., Babenko, B., Belongie, S.J.: End-to-end scene text recognition. In: ICCV, pp. 1457–1464 (2011)Google Scholar
  42. 42.
    Wang, K., Belongie, S.J.: Word spotting in the wild. In: ECCV, pp. 591–604 (2010)CrossRefGoogle Scholar
  43. 43.
    Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: ICML, pp. 2048–2057 (2015)Google Scholar
  44. 44.
    Yao, C., Bai, X., Shi, B., Liu, W.: Strokelets: a learned multi-scale representation for scene text recognition. In: CVPR, pp. 4042–4049 (2014)Google Scholar
  45. 45.
    Zhang, P., Xue, J., Lan, C., Zeng, W., Gao, Z., Zheng, N.: Adding attentiveness to the neurons in recurrent neural networks. In: ECCV, pp. 136–152 (2018)Google Scholar
  46. 46.
    Zimmermann, M., Bunke, H.: Automatic segmentation of the IAM off-line database for handwritten English text. In: Object Recognition Supported by User Interaction for Service Robots, vol. 4, pp. 35–39, August 2002Google Scholar
  47. 47.
    Zintgraf, L.M., Cohen, T.S., Adel, T., Welling, M.: Visualizing deep neural network decisions: Prediction difference analysis. CoRR abs/1702.04595 (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. 1.Department of Computer Science and TechnologyOcean University of ChinaQingdaoChina

Personalised recommendations