Abstract
In this paper, we address the problem of recognizing characters of different scales in scene text recognition. We propose a novel scale-aware feature encoder (SAFE) designed specifically for encoding characters of different scales. SAFE is composed of a multi-scale convolutional encoder and a scale attention network. The multi-scale convolutional encoder extracts character features under multiple scales, and the scale attention network selects features from the most relevant scale(s). SAFE has two main advantages over the traditional single-CNN encoder used in current state-of-the-art text recognizers. First, it explicitly tackles the scale problem by extracting scale-invariant features from the characters, allowing the recognizer to devote more capacity to other challenges in scene text recognition, such as view distortion and poor image quality. Second, it can transfer the learning of feature encoding across different character scales. This is particularly important when the training set has a highly unbalanced distribution of character scales, as training on such a dataset biases the encoder towards extracting features at the predominant scale. To evaluate the effectiveness of SAFE, we design a simple text recognizer named scale-spatial attention network (S-SAN) that employs SAFE as its feature encoder, and carry out experiments on six public benchmarks. Experimental results demonstrate that S-SAN achieves state-of-the-art (or, in some cases, highly competitive) performance without any post-processing.
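To make the scale-attention idea in the abstract concrete, the following is a minimal NumPy sketch of how features from multiple scales can be fused with softmax attention weights. This is an illustrative toy, not the paper's exact architecture: the shapes, the scoring inputs, and the function names are assumptions introduced here for exposition.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scale_attention(feature_maps, scores):
    """Fuse per-scale feature maps with attention over scales.

    feature_maps: (S, W, C) array -- S scales, W positions, C channels
    scores:       (S, W) array -- unnormalised relevance of each scale
                  at each position
    Returns a (W, C) array: at every position, a convex combination of
    the S per-scale feature vectors.
    """
    w = softmax(scores, axis=0)                        # (S, W) weights
    return (w[..., None] * feature_maps).sum(axis=0)   # (W, C)

# Toy example: 3 scales, 4 positions, 2 channels.
F = np.arange(24, dtype=float).reshape(3, 4, 2)
s = np.zeros((3, 4))  # uniform scores -> plain average over the 3 scales
fused = scale_attention(F, s)
```

With uniform scores the fusion reduces to an average over scales; in the actual network the scores would be predicted from the image so that, per position, the most relevant scale(s) dominate.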
Notes
- 1.
Although the receptive field of a CNN is large, its effective region [24] responsible for computing each feature representation covers only a small fraction of it.
- 2.
We obtain \(\mathbf F _1'\) by actually down-sampling \(\mathbf F _1\).
References
Almazán, J., Gordo, A., Fornés, A., Valveny, E.: Word spotting and recognition with embedded attributes. IEEE Trans. Pattern Anal. Mach. Intell. 36(12), 2552–2566 (2014)
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate (2014). arXiv preprint: arXiv:1409.0473
Bai, F., Cheng, Z., Niu, Y., Pu, S., Zhou, S.: Edit probability for scene text recognition (2018). arXiv preprint: arXiv:1805.03384
Bissacco, A., Cummins, M., Netzer, Y., Neven, H.: PhotoOCR: reading text in uncontrolled conditions. In: IEEE International Conference on Computer Vision (2013)
Chen, L.C., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: scale-aware semantic image segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
Cheng, Z., Bai, F., Xu, Y., Zheng, G., Pu, S., Zhou, S.: Focusing attention: towards accurate text recognition in natural images (2017). arXiv preprint: arXiv:1709.02054v3
Cheng, Z., Liu, X., Bai, F., Niu, Y., Pu, S., Zhou, S.: Arbitrarily-oriented text recognition (2017). arXiv preprint: arXiv:1711.04226
Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: International Conference on Machine Learning (2006)
Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
He, P., Huang, W., Qiao, Y., Loy, C.C., Tang, X.: Reading scene text in deep convolutional sequences. In: AAAI Conference on Artificial Intelligence (2016)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (2015)
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. In: Workshop on Deep Learning, Advances in Neural Information Processing Systems (2014)
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Deep structured output learning for unconstrained text recognition. In: International Conference on Learning Representations (2015)
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild with convolutional neural networks. Int. J. Comput. Vis. 116(1), 1–20 (2016)
Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Advances in Neural Information Processing Systems (2015)
Jaderberg, M., Vedaldi, A., Zisserman, A.: Deep features for text spotting. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part IV. LNCS, vol. 8692, pp. 512–528. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10593-2_34
Karatzas, D., et al.: ICDAR 2015 competition on robust reading. In: IEEE International Conference on Document Analysis and Recognition (2015)
Karatzas, D., et al.: ICDAR 2013 robust reading competition. In: IEEE International Conference on Document Analysis and Recognition (2013)
Lee, C.Y., Bhardwaj, A., Di, W., Jagadeesh, V., Piramuthu, R.: Region-based discriminative feature pooling for scene text recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2014)
Lee, C.Y., Osindero, S.: Recursive recurrent nets with attention modeling for OCR in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
Liu, W., Chen, C., Wong, K.Y.K.: Char-Net: a character-aware neural network for distorted scene text recognition. In: AAAI Conference on Artificial Intelligence (2018)
Liu, W., Chen, C., Wong, K.Y.K., Su, Z., Han, J.: STAR-Net: a spatial attention residue network for scene text recognition. In: British Machine Vision Conference (2016)
Lucas, S.M., et al.: ICDAR 2003 robust reading competitions: entries, results, and future directions. Int. J. Doc. Anal. Recognit. 7(2–3), 105–122 (2005)
Luo, W., Li, Y., Urtasun, R., Zemel, R.: Understanding the effective receptive field in deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2016)
Mishra, A., Alahari, K., Jawahar, C.: Scene text recognition using higher order language priors. In: British Machine Vision Conference (2012)
Neumann, L., Matas, J.: Real-time scene text localization and recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)
Phan, T., Shivakumara, P., Tian, S., Tan, C.: Recognizing text with perspective distortion in natural scenes. In: IEEE International Conference on Computer Vision (2013)
Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2298–2304 (2016)
Shi, B., Wang, X., Lyu, P., Yao, C., Bai, X.: Robust scene text recognition with automatic rectification. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
Su, B., Lu, S.: Accurate scene text recognition based on recurrent neural network. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014, Part I. LNCS, vol. 9003, pp. 35–48. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16865-4_3
Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: IEEE International Conference on Computer Vision (2011)
Wang, K., Belongie, S.: Word spotting in the wild. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 591–604. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15549-9_43
Wang, T., Wu, D.J., Coates, A., Ng, A.Y.: End-to-end text recognition with convolutional neural networks. In: IEEE International Conference on Pattern Recognition (2012)
Yang, X., He, D., Zhou, Z., Kifer, D., Giles, C.L.: Learning to read irregular text with attention mechanisms. In: Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017 (2017)
Yao, C., Bai, X., Shi, B., Liu, W.: Strokelets: a learned multi-scale representation for scene text recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2014)
Zeiler, M.D.: ADADELTA: an adaptive learning rate method (2012). arXiv preprint: arXiv:1212.5701
Acknowledgments
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Liu, W., Chen, C., Wong, KY.K. (2019). SAFE: Scale Aware Feature Encoder for Scene Text Recognition. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science(), vol 11362. Springer, Cham. https://doi.org/10.1007/978-3-030-20890-5_13
Print ISBN: 978-3-030-20889-9
Online ISBN: 978-3-030-20890-5