Extracting Textual Overlays from Social Media Videos Using Neural Networks

  • Conference paper
Computer Vision and Graphics (ICCVG 2018)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11114)


Abstract

Textual overlays are often used in social media videos, since viewers who watch them without sound would otherwise miss essential information conveyed in the audio stream. Extracting those overlays can therefore serve as an important source of metadata, e.g. for content classification or retrieval tasks. In this work, we present a robust method for extracting textual overlays from videos that builds on multiple neural network architectures. The proposed solution relies on several processing steps: keyframe extraction, text detection and text recognition. The main component of our system, the text recognition module, is inspired by a convolutional recurrent neural network architecture, and we improve its performance using a synthetically generated dataset of over 600,000 text images prepared by the authors specifically for this task. We also develop a filtering method that reduces the number of overlapping text phrases using the Levenshtein distance and further boosts the system’s performance. The final accuracy of our solution reaches over 80% and is on par with state-of-the-art methods.
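The overlap-filtering step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the `filter_overlapping` helper and the normalised-distance threshold of 0.3 are assumptions introduced here for clarity.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance
    # (insertions, deletions, substitutions), two-row variant.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def filter_overlapping(phrases, threshold=0.3):
    """Drop a phrase when it is nearly identical to the previously kept
    one, measured by Levenshtein distance normalised by the longer length.
    The 0.3 threshold is an illustrative choice."""
    kept = []
    for p in phrases:
        if kept:
            d = levenshtein(kept[-1], p) / max(len(kept[-1]), len(p), 1)
            if d < threshold:
                continue  # near-duplicate of the last kept phrase
        kept.append(p)
    return kept
```

In a video OCR pipeline of this kind, consecutive keyframes often yield slightly different recognitions of the same overlay; normalised edit distance lets near-duplicates collapse to a single phrase.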


Notes

  1. https://www.ffmpeg.org/
  2. https://github.com/MhLiao/TextBoxes
  3. https://github.com/phatpiglet/autocorrect/
  4. http://www.opencv.org


Acknowledgments

This work was partially funded by the Dean’s Grant no. II/2017/GD/1 of the Faculty of Electronics and Information Technology at Warsaw University of Technology.

Author information

Corresponding author: Adam Słucki.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Słucki, A., Trzciński, T., Bielski, A., Cyrta, P. (2018). Extracting Textual Overlays from Social Media Videos Using Neural Networks. In: Chmielewski, L., Kozera, R., Orłowski, A., Wojciechowski, K., Bruckstein, A., Petkov, N. (eds) Computer Vision and Graphics. ICCVG 2018. Lecture Notes in Computer Science, vol 11114. Springer, Cham. https://doi.org/10.1007/978-3-030-00692-1_25


  • DOI: https://doi.org/10.1007/978-3-030-00692-1_25


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00691-4

  • Online ISBN: 978-3-030-00692-1

  • eBook Packages: Computer Science (R0)
