Extracting Textual Overlays from Social Media Videos Using Neural Networks

  • Conference paper
Computer Vision and Graphics (ICCVG 2018)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11114)


Abstract

Textual overlays are often used in social media videos, since viewers who watch them without sound would otherwise miss essential information conveyed in the audio stream. Extracting those overlays can therefore serve as an important source of metadata, e.g. for content classification or retrieval tasks. In this work, we present a robust method for extracting textual overlays from videos that builds on multiple neural network architectures. The proposed solution relies on several processing steps: keyframe extraction, text detection and text recognition. The main component of our system, the text recognition module, is inspired by a convolutional recurrent neural network architecture, and we improve its performance using a synthetically generated dataset of over 600,000 text images prepared by the authors specifically for this task. We also develop a filtering method that reduces the number of overlapping text phrases using the Levenshtein distance and further boosts the system’s performance. The final accuracy of our solution reaches over 80% and is on par with state-of-the-art methods.
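The overlap-filtering step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the `filter_overlapping` helper and the normalised-distance threshold of 0.3 are assumptions introduced here for clarity.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance
    # (insertions, deletions, substitutions), two-row variant.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def filter_overlapping(phrases, threshold=0.3):
    """Drop a phrase when it is nearly identical to the previously kept
    one, measured by Levenshtein distance normalised by the longer length.
    The 0.3 threshold is an illustrative choice."""
    kept = []
    for p in phrases:
        if kept:
            d = levenshtein(kept[-1], p) / max(len(kept[-1]), len(p), 1)
            if d < threshold:
                continue  # near-duplicate of the last kept phrase
        kept.append(p)
    return kept
```

In a video OCR pipeline of this kind, consecutive keyframes often yield slightly different recognitions of the same overlay; normalised edit distance lets near-duplicates collapse to a single phrase.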


Notes

  1. https://www.ffmpeg.org/
  2. https://github.com/MhLiao/TextBoxes
  3. https://github.com/phatpiglet/autocorrect/
  4. http://www.opencv.org


Acknowledgments

This work was partially funded by the Dean’s Grant no. II/2017/GD/1 of the Faculty of Electronics and Information Technology at Warsaw University of Technology.

Author information

Corresponding author: Adam Słucki.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Słucki, A., Trzciński, T., Bielski, A., Cyrta, P. (2018). Extracting Textual Overlays from Social Media Videos Using Neural Networks. In: Chmielewski, L., Kozera, R., Orłowski, A., Wojciechowski, K., Bruckstein, A., Petkov, N. (eds) Computer Vision and Graphics. ICCVG 2018. Lecture Notes in Computer Science, vol 11114. Springer, Cham. https://doi.org/10.1007/978-3-030-00692-1_25


  • DOI: https://doi.org/10.1007/978-3-030-00692-1_25


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00691-4

  • Online ISBN: 978-3-030-00692-1

  • eBook Packages: Computer Science (R0)
