Multimedia Tools and Applications, Volume 77, Issue 22, pp 29323–29345

Spatiotemporal text localization for videos

  • Yuanqiang Cai
  • Weiqiang Wang (Email author)
  • Shao Huang
  • Jin Ma
  • Ke Lu
Article

Abstract

Text in videos carries rich semantic information that is useful for content-based video understanding and retrieval. Although many state-of-the-art methods have been proposed to detect text in images and videos, few works focus on spatiotemporal text localization in videos. In this paper, we present a spatiotemporal text localization method with improved detection efficiency and performance. Concretely, we propose a unified framework that consists of a sampling-and-recovery model (SaRM) and a divide-and-conquer model (DaCM). SaRM exploits the temporal redundancy of text to increase detection efficiency for videos. DaCM is designed to efficiently localize text in the spatial and temporal domains simultaneously. In addition, we construct a challenging video overlaid-text dataset named UCAS-STLData, which contains 57,070 frames with spatiotemporal ground truth. In the experiments, we comprehensively evaluate the proposed method on publicly available overlaid-text datasets and on UCAS-STLData. Compared with state-of-the-art methods for spatiotemporal text localization, the proposed method achieves a slight performance improvement together with a significant efficiency improvement.
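The abstract does not spell out how SaRM or DaCM work internally, so the following is only an illustrative sketch of the general idea of exploiting temporal redundancy, not the authors' method. It assumes a hypothetical per-frame detector `detect(i)` that returns a hashable description of the text in frame `i` (or `None` for no text), and it assumes that text does not disappear and reappear unchanged between two sampled frames. Frames are sampled every `step` frames; intervals whose endpoints agree are filled in without further detection, and intervals whose endpoints differ are split recursively to find where the text changes. The names `sample_and_recover`, `detect`, and `step` are all invented for this example.

```python
from typing import Callable, Dict, Hashable, List, Optional


def sample_and_recover(num_frames: int,
                       detect: Callable[[int], Optional[Hashable]],
                       step: int = 8) -> List[Optional[Hashable]]:
    """Label every frame while calling the (hypothetical) detector sparsely."""
    if num_frames <= 0:
        return []

    cache: Dict[int, Optional[Hashable]] = {}

    def run(i: int) -> Optional[Hashable]:
        # Call the detector at most once per frame index.
        if i not in cache:
            cache[i] = detect(i)
        return cache[i]

    labels: List[Optional[Hashable]] = [None] * num_frames

    def recover(lo: int, hi: int) -> None:
        first, last = run(lo), run(hi)
        labels[lo], labels[hi] = first, last
        if hi - lo <= 1:
            return
        if first == last:
            # Assumption: equal endpoints mean the text is unchanged in between.
            for i in range(lo + 1, hi):
                labels[i] = first
            return
        # Endpoints differ: split the interval and search both halves.
        mid = (lo + hi) // 2
        recover(lo, mid)
        recover(mid, hi)

    # Sampled anchor frames, always including the last frame.
    anchors = list(range(0, num_frames, step))
    if anchors[-1] != num_frames - 1:
        anchors.append(num_frames - 1)

    labels[0] = run(0)  # covers the single-frame case
    for lo, hi in zip(anchors, anchors[1:]):
        recover(lo, hi)
    return labels
```

In this sketch the detector runs on every sampled frame plus a logarithmic number of extra frames per text change, so the saving grows with how long each caption stays on screen. A caption that appears and vanishes entirely between two sampled frames would be missed, which is the price of the redundancy assumption.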

Keywords

Text localization, Spatiotemporal domain, Sampling-and-recovery, Divide-and-conquer, Overlaid text

Notes

Acknowledgments

This work is supported by the National Key R&D Program of China under Contract No. 2017YFB1002203 and by the National Natural Science Foundation of China (NSFC) under Grant No. 61772495.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Yuanqiang Cai (1)
  • Weiqiang Wang (1, Email author)
  • Shao Huang (1)
  • Jin Ma (1)
  • Ke Lu (2)
  1. School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China
  2. School of Engineering Science, University of Chinese Academy of Sciences, Beijing, China
