
Foveated convolutional neural networks for video summarization

Multimedia Tools and Applications

Abstract

With the proliferation of video data, video summarization is an ideal tool for users to browse video content rapidly. In this paper, we propose novel foveated convolutional neural networks for dynamic video summarization. We are the first to integrate gaze information into a deep learning network for video summarization. Foveated images are constructed based on subjects’ eye movements to represent the spatial information of the input video, and multi-frame motion vectors are stacked across several adjacent frames to convey motion cues. To evaluate the proposed method, experiments are conducted on two video summarization benchmark datasets. The experimental results validate the effectiveness of gaze information for video summarization, even though the eye movements were collected from subjects different from those who generated the summaries. Empirical validation also demonstrates that the proposed foveated convolutional neural networks achieve state-of-the-art performance on these benchmark datasets.
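To make the foveation step concrete, the sketch below shows one common way to construct a foveated frame from a recorded fixation point: the frame stays sharp near the gaze location and is blurred progressively with eccentricity. This is a minimal illustration, not the authors' implementation; the function name `foveate`, the eccentricity radii, and the blur levels are assumptions chosen for demonstration.

```python
# A minimal sketch of gaze-based foveation: blend progressively blurred
# copies of an RGB frame so that acuity falls off with distance from the
# fixation point. The radii and blur sigmas below are illustrative
# assumptions, not values taken from the paper.
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(frame, gaze_xy, radii=(40, 100, 200), sigmas=(1.0, 3.0, 8.0)):
    """frame: (H, W, 3) array; gaze_xy: fixation as (x, y) pixel coords."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Eccentricity of every pixel relative to the fixation point.
    dist = np.hypot(xs - gaze_xy[0], ys - gaze_xy[1])

    out = frame.astype(np.float64)  # astype copies, so `frame` is untouched
    for r, s in zip(radii, sigmas):
        # Blur spatially only (sigma 0 on the channel axis).
        blurred = gaussian_filter(frame.astype(np.float64), sigma=(s, s, 0))
        mask = (dist >= r)[..., None]  # pixels outside the current ring
        out = np.where(mask, blurred, out)
    return out.astype(frame.dtype)
```

A sequence of such foveated frames would serve as the spatial input described in the abstract, while motion vectors stacked over several adjacent frames supply the temporal input.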



Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61502311, No. 61620106008), the Natural Science Foundation of Guangdong Province (No. 2016A030310053), the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase) under Grant No. U1501501, the Shenzhen Emerging Industries of the Strategic Basic Research Project under Grant No. JCYJ20160226191842793, the Shenzhen High-level Overseas Talents Program, and the Tencent “Rhinoceros Birds” Scientific Research Foundation for Young Teachers of Shenzhen University.

Author information


Corresponding author

Correspondence to Jianmin Jiang.

Additional information

Jiaxin Wu and Sheng-hua Zhong contributed equally to this work.


About this article


Cite this article

Wu, J., Zhong, Sh., Ma, Z. et al. Foveated convolutional neural networks for video summarization. Multimed Tools Appl 77, 29245–29267 (2018). https://doi.org/10.1007/s11042-018-5953-1

