Text Detection from Video Scenes

  • Tong Lu
  • Shivakumara Palaiahnakote
  • Chew Lim Tan
  • Wenyin Liu
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)


Text in video carries valuable information and is exploited in many content-based video applications. However, scene text detection has not been systematically explored, even though many optical character recognition (OCR) techniques have been developed over the past decades. This chapter introduces recent progress on scene text detection, especially over the past several years. It begins by discussing the visual saliency of scene text in order to describe the characteristics of text in natural scene images. Recent developments in scene text detection from video or images are then discussed, roughly categorized into bottom-up, top-down, statistical and learning-based, temporal or motion analysis, and hybrid approaches. Scene character recognition methods are introduced accordingly. Finally, several typical scene text datasets adopted in different applications are introduced for performance evaluation.


Keywords: Text Line, Optical Character Recognition, Text Region, Scene Image, Saliency Model



Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • Tong Lu (1)
  • Shivakumara Palaiahnakote (2)
  • Chew Lim Tan (3)
  • Wenyin Liu (4)
  1. Department of Computer Science and Technology, Nanjing University, Nanjing, China
  2. Faculty of CSIT, University of Malaya, Kuala Lumpur, Malaysia
  3. National University of Singapore, Singapore, Singapore
  4. Multimedia Software Engineering Research Center, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
