Abstract
Scene text detection is a valuable prerequisite for text recognition in natural scenes, but it remains a challenging problem. In this paper, we propose a unified deep neural network for scene text detection, composed of a Fully Convolutional Network (FCN) that generates a text saliency map and a Bounding box Regression Network (BRN) that predicts text bounding boxes. The FCN is trained with a hybrid loss function based on two types of pixel-wise ground truth masks, while the unified network is fine-tuned with a multi-task loss function. In post-processing, the predicted bounding boxes are scored by the saliency map and redundant boxes are eliminated via Non-Maximum Suppression (NMS) to improve the final detection results. Experimental results on the ICDAR2013 benchmark demonstrate that the proposed network achieves good text detection performance and processes images at 5 fps, faster than most existing text detection methods.
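The post-processing stage described above (scoring boxes by the saliency map, then NMS) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not state how a box is scored, so the mean saliency inside the box is assumed here, and the names `saliency_score`, `iou`, `nms`, `postprocess` and the thresholds `iou_threshold` and `min_score` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Axis-aligned box as (x1, y1, x2, y2); x2 and y2 are exclusive pixel indices.
Box = Tuple[int, int, int, int]


def saliency_score(saliency: Sequence[Sequence[float]], box: Box) -> float:
    """Score a box as the mean saliency of the pixels it covers.

    Assumption: the paper's exact scoring rule is not given in the abstract.
    """
    x1, y1, x2, y2 = box
    values = [saliency[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return sum(values) / len(values) if values else 0.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def postprocess(saliency: Sequence[Sequence[float]], boxes: Sequence[Box],
                iou_threshold: float = 0.5,
                min_score: float = 0.5) -> List[Tuple[Box, float]]:
    """Score boxes by saliency, discard weak ones, then suppress duplicates."""
    scores = [saliency_score(saliency, b) for b in boxes]
    candidates = [i for i, s in enumerate(scores) if s >= min_score]
    kept = nms([boxes[i] for i in candidates],
               [scores[i] for i in candidates], iou_threshold)
    return [(boxes[candidates[k]], scores[candidates[k]]) for k in kept]
```

For example, given a saliency map with one salient rectangle, a tight box and a slightly wider box around it, and a third box over background, `postprocess` would keep only the tight box: the background box falls below `min_score` and the wider box is suppressed because it overlaps the higher-scoring tight box.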
Acknowledgements
This work was supported by the Natural Science Foundation of China under Grant 61171138.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Li, Y., Ma, J. (2017). A Unified Deep Neural Network for Scene Text Detection. In: Huang, DS., Bevilacqua, V., Premaratne, P., Gupta, P. (eds) Intelligent Computing Theories and Application. ICIC 2017. Lecture Notes in Computer Science, vol 10361. Springer, Cham. https://doi.org/10.1007/978-3-319-63309-1_10
DOI: https://doi.org/10.1007/978-3-319-63309-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63308-4
Online ISBN: 978-3-319-63309-1
eBook Packages: Computer Science, Computer Science (R0)