Deep Features for Text Spotting

  • Max Jaderberg
  • Andrea Vedaldi
  • Andrew Zisserman
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8692)


The goal of this work is text spotting in natural images. This is divided into two sequential tasks: detecting words regions in the image, and recognizing the words within these regions. We make the following contributions: first, we develop a Convolutional Neural Network (CNN) classifier that can be used for both tasks. The CNN has a novel architecture that enables efficient feature sharing (by using a number of layers in common) for text detection, character case-sensitive and insensitive classification, and bigram classification. It exceeds the state-of-the-art performance for all of these. Second, we make a number of technical changes over the traditional CNN architectures, including no downsampling for a per-pixel sliding window, and multi-mode learning with a mixture of linear models (maxout). Third, we have a method of automated data mining of Flickr, that generates word and character level annotations. Finally, these components are used together to form an end-to-end, state-of-the-art text spotting system. We evaluate the text-spotting system on two standard benchmarks, the ICDAR Robust Reading data set and the Street View Text data set, and demonstrate improvements over the state-of-the-art on multiple measures.


Word Recognition Convolutional Neural Network Text Line Text Detection Street View 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
  2. 2.
  3. 3.
  4. 4.
  5. 5.
  6. 6.
    Alsharif, O., Pineau, J.: End-to-End Text Recognition with Hybrid HMM Maxout Models. In: ICLR (2014)Google Scholar
  7. 7.
    Anthimopoulos, M., Gatos, B., Pratikakis, I.: Detection of artificial and scene text in images and video frames. Pattern Analysis and Applications, 1–16 (2011)Google Scholar
  8. 8.
    Bissacco, A., Cummins, M., Netzer, Y., Neven, H.: PhotoOCR: Reading text in uncontrolled conditions. In: ICCV (2013)Google Scholar
  9. 9.
    Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: Proc. ICCV, vol. 2, pp. 105–112 (2001)Google Scholar
  10. 10.
    de Campos, T., Babu, B.R., Varma, M.: Character recognition in natural images, pp. 591–604 (2009)Google Scholar
  11. 11.
    Chen, H., Tsai, S., Schroth, G., Chen, D., Grzeszczuk, R., Girod, B.: Robust text detection in natural images with edge-enhanced maximally stable extremal regions. In: Proc. International Conference on Image Processing (ICIP), pp. 2609–2612 (2011)Google Scholar
  12. 12.
    Chen, X., Yuille, A.L.: Detecting and reading text in natural scenes. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2004, vol. 2, p. II–366. IEEE (2004)Google Scholar
  13. 13.
    Coates, A., Carpenter, B., Case, C., Satheesh, S., Suresh, B., Wang, T., Wu, D.J., Ng, A.Y.: Text detection and character recognition in scene images with unsupervised feature learning. In: 2011 International Conference on Document Analysis and Recognition (ICDAR), pp. 440–445. IEEE (2011)Google Scholar
  14. 14.
    Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 (2013)Google Scholar
  15. 15.
    Dutta, S., Sankaran, N., Sankar, K., Jawahar, C.: Robust recognition of degraded documents using character n-grams. In: International Workshop on Document Analysis Systems (DAS), pp. 130–134. IEEE (2012)Google Scholar
  16. 16.
    Epshtein, B., Ofek, E., Wexler, Y.: Detecting text in natural scenes with stroke width transform. In: Proc. CVPR, pp. 2963–2970. IEEE (2010)Google Scholar
  17. 17.
    Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing higher-layer features of a deep network. Tech. rep. University of Montreal (2009)Google Scholar
  18. 18.
    Felzenszwalb, P., Huttenlocher, D.: Pictorial structures for object recognition. IJCV 61(1) (2005)Google Scholar
  19. 19.
    Goel, V., Mishra, A., Alahari, K., Jawahar, C.: Whole is greater than sum of parts: Recognizing scene text words. In: 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 398–402. IEEE (2013)Google Scholar
  20. 20.
    Goodfellow, I.J., Bulatov, Y., Ibarz, J., Arnoud, S., Shet, V.: Multi-digit number recognition from street view imagery using deep convolutional neural networks. In: ICLR (2014)Google Scholar
  21. 21.
    Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. arXiv preprint arXiv:1302.4389 (2013)Google Scholar
  22. 22.
    Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)Google Scholar
  23. 23.
    Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 (2014)Google Scholar
  24. 24.
    Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., de las Heras, L.P., et al.: Icdar 2013 robust reading competition. In: 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 1484–1493. IEEE (2013)Google Scholar
  25. 25.
    Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS, vol. 1, p. 4 (2012)Google Scholar
  26. 26.
    Lalonde, M., Gagnon, L.: Key-text spotting in documentary videos using adaboost. In: Electronic Imaging 2006, p. 60641N. International Society for Optics and Photonics (2006)Google Scholar
  27. 27.
    LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)CrossRefGoogle Scholar
  28. 28.
    Lucas, S.M.: Icdar 2005 text locating competition results. In: Proceedings of the Eighth International Conference on Document Analysis and Recognition 2005, pp. 80–84. IEEE (2005)Google Scholar
  29. 29.
    Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: Proc. BMVC, pp. 384–393 (2002)Google Scholar
  30. 30.
    Mathieu, M., Henaff, M., LeCun, Y.: Fast training of convolutional networks through FFTs. CoRR abs/1312.5851 (2013)Google Scholar
  31. 31.
    Mishra, A., Alahari, K., Jawahar, C., et al.: Scene text recognition using higher order language priors. In: 23rd British Machine Vision Conference on BMVC 2012 (2012)Google Scholar
  32. 32.
    Neumann, L., Matas, J.: A method for text localization and recognition in real-world images. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494, pp. 770–783. Springer, Heidelberg (2011)CrossRefGoogle Scholar
  33. 33.
    Neumann, L., Matas, J.: Text localization in real-world images using efficiently pruned exhaustive search. In: Proc. ICDAR, pp. 687–691. IEEE (2011)Google Scholar
  34. 34.
    Neumann, L., Matas, J.: Real-time scene text localization and recognition. In: Proc. CVPR, vol. 3, pp. 1187–1190. IEEE (2012)Google Scholar
  35. 35.
    Neumann, L., Matas, J.: Scene text localization and recognition with oriented stroke detection. In: 2013 IEEE International Conference on Computer Vision (ICCV 2013), pp. 97–104. IEEE, California (2013)CrossRefGoogle Scholar
  36. 36.
    Novikova, T., Barinova, O., Kohli, P., Lempitsky, V.: Large-lexicon attribute-consistent text recognition in natural images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part VI. LNCS, vol. 7577, pp. 752–765. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  37. 37.
    Otsu, N.: A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1), 62–66 (1979)CrossRefMathSciNetGoogle Scholar
  38. 38.
    Ozuysal, M., Fua, P., Lepetit, V.: Fast keypoint recognition in ten lines of code. In: Proc. CVPR (2007)Google Scholar
  39. 39.
    Posner, I., Corke, P., Newman, P.: Using text-spotting to query the world. In: Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS (2010)Google Scholar
  40. 40.
    Quack, T.: Large scale mining and retrieval of visual data in a multimodal context. Ph.D. thesis, ETH Zurich (2009)Google Scholar
  41. 41.
    Rath, T., Manmatha, R.: Word spotting for historical documents. IJDAR 9(2-4), 139–152 (2007)CrossRefGoogle Scholar
  42. 42.
    Shahab, A., Shafait, F., Dengel, A.: Icdar 2011 robust reading competition challenge 2: Reading text in scene images. In: Proc. ICDAR, pp. 1491–1496. IEEE (2011)Google Scholar
  43. 43.
    Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. In: Workshop at International Conference on Learning Representations (2014)Google Scholar
  44. 44.
    Torralba, A., Murphy, K.P., Freeman, W.T.: Sharing features: efficient boosting procedures for multiclass object detection. In: Proc. CVPR, pp. 762–769 (2004)Google Scholar
  45. 45.
    Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: Proc. ICCV, pp. 1457–1464. IEEE (2011)Google Scholar
  46. 46.
    Wang, K., Belongie, S.: Word spotting in the wild. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 591–604. Springer, Heidelberg (2010)CrossRefGoogle Scholar
  47. 47.
    Wang, T., Wu, D.J., Coates, A., Ng, A.Y.: End-to-end text recognition with convolutional neural networks. In: 2012 21st International Conference on Pattern Recognition (ICPR), pp. 3304–3308. IEEE (2012)Google Scholar
  48. 48.
    Weinman, J.J., Butler, Z., Knoll, D., Feild, J.: Toward integrated scene text reading. IEEE Trans. Pattern Anal. Mach. Intell. 36(2), 375–387 (2014)CrossRefGoogle Scholar
  49. 49.
    Yang, H., Quehl, B., Sack, H.: A framework for improved video text detection and recognition. Int. Journal of Multimedia Tools and Applications, MTAP (2012)Google Scholar
  50. 50.
    Yi, C., Tian, Y.: Text string detection from natural scenes by structure-based partition and grouping. IEEE Transactions on Image Processing 20(9), 2594–2605 (2011)CrossRefMathSciNetGoogle Scholar
  51. 51.
    Yin, X.C., Yin, X., Huang, K.: Robust text detection in natural scene images. CoRR abs/1301.2628 (2013)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Max Jaderberg
    • 1
  • Andrea Vedaldi
    • 1
  • Andrew Zisserman
    • 1
  1. 1.Visual Geometry Group, Department of Engineering ScienceUniversity of OxfordUK

Personalised recommendations