Abstract
Textual overlays are commonly used in social media videos, since viewers who watch them without sound would otherwise miss essential information conveyed in the audio stream. Extracting these overlays can therefore serve as an important source of metadata, e.g. for content classification or retrieval tasks. In this work, we present a robust method for extracting textual overlays from videos that builds on multiple neural network architectures. The proposed solution relies on several processing steps: keyframe extraction, text detection and text recognition. The main component of our system, the text recognition module, is inspired by a convolutional recurrent neural network architecture, and we improve its performance using a synthetically generated dataset of over 600,000 images with text, prepared by the authors specifically for this task. We also develop a filtering method that reduces the number of overlapping text phrases using the Levenshtein distance and further boosts the system's performance. The final accuracy of our solution exceeds 80% and is on par with state-of-the-art methods.
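The abstract's filtering step, deduplicating overlapping text phrases by Levenshtein distance, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the normalized-distance threshold of 0.3 are assumptions chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance counting
    # insertions, deletions and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def dedupe_phrases(phrases: list[str], threshold: float = 0.3) -> list[str]:
    # Keep a phrase only if its normalized Levenshtein distance to every
    # already-kept phrase exceeds the threshold; near-duplicates arising
    # from the same overlay recognized in consecutive keyframes are dropped.
    kept: list[str] = []
    for p in phrases:
        if all(levenshtein(p, q) / max(len(p), len(q), 1) > threshold
               for q in kept):
            kept.append(p)
    return kept
```

For example, `dedupe_phrases(["BREAKING NEWS", "BREAKING NEWS!", "weather today"])` keeps only one copy of the near-duplicate first two phrases. The normalization by the longer string's length makes the threshold length-independent.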
Acknowledgments
This work was partially funded by the Dean's Grant no. II/2017/GD/1 of the Faculty of Electronics and Information Technology at Warsaw University of Technology.
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Słucki, A., Trzciński, T., Bielski, A., Cyrta, P. (2018). Extracting Textual Overlays from Social Media Videos Using Neural Networks. In: Chmielewski, L., Kozera, R., Orłowski, A., Wojciechowski, K., Bruckstein, A., Petkov, N. (eds) Computer Vision and Graphics. ICCVG 2018. Lecture Notes in Computer Science, vol. 11114. Springer, Cham. https://doi.org/10.1007/978-3-030-00692-1_25
Print ISBN: 978-3-030-00691-4
Online ISBN: 978-3-030-00692-1