
Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes

  • Pengyuan Lyu
  • Minghui Liao
  • Cong Yao
  • Wenhao Wu
  • Xiang Bai
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11218)

Abstract

Recently, models based on deep neural networks have dominated the fields of scene text detection and recognition. In this paper, we investigate the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images. We propose an end-to-end trainable neural network model for scene text spotting. The proposed model, named Mask TextSpotter, is inspired by the recently published Mask R-CNN. Unlike previous methods that also accomplish text spotting with end-to-end trainable deep neural networks, Mask TextSpotter takes advantage of a simple and smooth end-to-end learning procedure, in which precise text detection and recognition are achieved via semantic segmentation. Moreover, it is superior to previous methods in handling text instances of irregular shapes, for example, curved text. Experiments on ICDAR2013, ICDAR2015 and Total-Text demonstrate that the proposed method achieves state-of-the-art results in both scene text detection and end-to-end text recognition tasks.

Keywords

Scene text spotting, Neural network, Arbitrary shapes

Notes

Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2018YFB1004600), NSFC 61733007, and NSFC 61573160, and by support to Dr. Xiang Bai from the National Program for Support of Top-notch Young Professionals and the Program for HUST Academic Frontier Youth Team.

Supplementary material

Supplementary material 1 (pdf 128 KB)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Huazhong University of Science and Technology, Wuhan, China
  2. Megvii (Face++) Technology Inc., Beijing, China
