Robust Scene Text Detection with Convolution Neural Network Induced MSER Trees

  • Weilin Huang
  • Yu Qiao
  • Xiaoou Tang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8692)


Maximally Stable Extremal Regions (MSERs) have achieved great success in scene text detection. However, the MSER operator is a low-level pixel operation, which inherently limits its capability for handling complex text information efficiently (e.g. connections between text or background components) and makes it difficult to distinguish text from background components. In this paper, we propose a novel framework that tackles this problem by leveraging the high capability of a convolutional neural network (CNN). In contrast to recent methods that use a set of low-level heuristic features, the CNN is capable of learning high-level features that robustly separate text components from text-like outliers (e.g. bikes, windows, or leaves). Our approach takes advantage of both MSERs and sliding-window based methods: the MSER operator dramatically reduces the number of windows scanned and enhances the detection of low-quality text, while the sliding window with the CNN is applied to correctly separate multiple characters that are connected within a single component. The proposed system achieves strong robustness against a number of extreme text variations and serious real-world problems. It was evaluated on the ICDAR 2011 benchmark dataset and achieved over 78% in F-measure, which is significantly higher than previous methods.
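The idea of sliding a window only inside an MSER component, rather than over the whole image, can be illustrated with a minimal Python sketch. This is not the authors' implementation: `score_fn` is a hypothetical stand-in for a CNN text/non-text scorer, and the square window size, the stride, the threshold and the suppression radius are all assumptions chosen for illustration. Peaks in the window scores are kept by a simple 1-D non-maximum suppression, and each surviving window is treated as one character candidate.

```python
from typing import Callable, List, Tuple

Window = Tuple[int, int]  # (x_start, x_end) columns inside one component


def sliding_windows(width: int, height: int, stride: int = 1) -> List[Window]:
    """Square windows (side = component height) slid left to right."""
    if width <= height:
        return [(0, width)]  # component is no wider than one window
    return [(x, x + height) for x in range(0, width - height + 1, stride)]


def local_maxima(scores: List[float], radius: int, threshold: float) -> List[int]:
    """1-D non-maximum suppression over a list of window scores.

    Keeps index i if its score reaches the threshold and is the largest
    within `radius` neighbours; on a plateau the leftmost index wins.
    """
    keep = []
    for i, s in enumerate(scores):
        if s < threshold:
            continue
        lo, hi = max(0, i - radius), min(len(scores), i + radius + 1)
        if s == max(scores[lo:hi]) and all(scores[j] < s for j in range(lo, i)):
            keep.append(i)
    return keep


def split_component(width: int, height: int,
                    score_fn: Callable[[Window], float],
                    stride: int = 1, threshold: float = 0.5) -> List[Window]:
    """Return the windows judged to contain one character each."""
    windows = sliding_windows(width, height, stride)
    scores = [score_fn(w) for w in windows]
    radius = max(1, height // (2 * stride))  # assumed: half a window width
    return [windows[i] for i in local_maxima(scores, radius, threshold)]


def toy_score(window: Window) -> float:
    """Hypothetical scorer: peaks when the window is centred on a character.

    A real system would crop the image patch and run a trained classifier.
    """
    centre = (window[0] + window[1]) / 2
    true_centres = [10, 30, 50]  # made-up character positions
    return max(0.0, 1.0 - min(abs(centre - c) for c in true_centres) / 10)


# A 60x20 component holding three touching characters.
print(split_component(60, 20, toy_score))  # [(0, 20), (20, 40), (40, 60)]
```

Because windows are generated only within components proposed by the region detector, the number of classifier evaluations scales with the total width of the candidate components rather than with the image area.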


Maximally Stable Extremal Regions (MSERs); convolutional neural network (CNN); text-like outliers; sliding-window



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Weilin Huang (1, 2)
  • Yu Qiao (1)
  • Xiaoou Tang (2, 1)
  1. Shenzhen Key Lab of Comp. Vis. and Pat. Rec., Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
  2. Department of Information Engineering, The Chinese University of Hong Kong, China
