
Foveated convolutional neural networks for video summarization

Multimedia Tools and Applications

Abstract

With the proliferation of video data, video summarization is an ideal tool for users to browse video content rapidly. In this paper, we propose novel foveated convolutional neural networks for dynamic video summarization. We are the first to integrate gaze information into a deep learning network for video summarization. Foveated images are constructed based on subjects’ eye movements to represent the spatial information of the input video, and multi-frame motion vectors are stacked across several adjacent frames to convey motion cues. To evaluate the proposed method, experiments are conducted on two video summarization benchmark datasets. The experimental results validate the effectiveness of gaze information for video summarization, even though the eye movements were collected from subjects different from those who generated the summaries. Empirical validation also demonstrates that the proposed foveated convolutional neural networks achieve state-of-the-art performance on these benchmark datasets.
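To make the foveation step concrete, the sketch below shows one common way to construct a foveated frame from a recorded fixation point: the frame stays sharp near the gaze location and is blurred progressively with eccentricity. This is a minimal illustration, not the authors' implementation; the function name `foveate`, the eccentricity radii, and the blur levels are assumptions chosen for demonstration.

```python
# A minimal sketch of gaze-based foveation: blend progressively blurred
# copies of an RGB frame so that acuity falls off with distance from the
# fixation point. The radii and blur sigmas below are illustrative
# assumptions, not values taken from the paper.
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(frame, gaze_xy, radii=(40, 100, 200), sigmas=(1.0, 3.0, 8.0)):
    """frame: (H, W, 3) array; gaze_xy: fixation as (x, y) pixel coords."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Eccentricity of every pixel relative to the fixation point.
    dist = np.hypot(xs - gaze_xy[0], ys - gaze_xy[1])

    out = frame.astype(np.float64)  # astype copies, so `frame` is untouched
    for r, s in zip(radii, sigmas):
        # Blur spatially only (sigma 0 on the channel axis).
        blurred = gaussian_filter(frame.astype(np.float64), sigma=(s, s, 0))
        mask = (dist >= r)[..., None]  # pixels outside the current ring
        out = np.where(mask, blurred, out)
    return out.astype(frame.dtype)
```

A sequence of such foveated frames would serve as the spatial input described in the abstract, while motion vectors stacked over several adjacent frames supply the temporal input.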



Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61502311, No. 61620106008), the Natural Science Foundation of Guangdong Province (No. 2016A030310053), the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase) under Grant No. U1501501, the Shenzhen Emerging Industries of the Strategic Basic Research Project under Grant No. JCYJ20160226191842793, the Shenzhen High-level Overseas Talents Program, and the Tencent “Rhinoceros Birds” Scientific Research Foundation for Young Teachers of Shenzhen University.

Author information


Corresponding author

Correspondence to Jianmin Jiang.

Additional information

Jiaxin Wu and Sheng-hua Zhong contributed equally to this work.


About this article


Cite this article

Wu, J., Zhong, Sh., Ma, Z. et al. Foveated convolutional neural networks for video summarization. Multimed Tools Appl 77, 29245–29267 (2018). https://doi.org/10.1007/s11042-018-5953-1

