Multimedia Tools and Applications, Volume 77, Issue 22, pp 29245–29267

Foveated convolutional neural networks for video summarization

  • Jiaxin Wu
  • Sheng-hua Zhong
  • Zheng Ma
  • Stephen J. Heinen
  • Jianmin Jiang

Abstract

With the proliferation of video data, video summarization is an ideal tool for users to browse video content rapidly. In this paper, we propose novel foveated convolutional neural networks for dynamic video summarization. We are the first to integrate gaze information into a deep learning network for video summarization. Foveated images are constructed from subjects’ eye movements to represent the spatial information of the input video. Multi-frame motion vectors are stacked across several adjacent frames to convey motion cues. To evaluate the proposed method, experiments are conducted on two video summarization benchmark datasets. The experimental results validate the effectiveness of gaze information for video summarization, even though the eye movements were collected from subjects other than those who generated the summaries. Empirical validation also demonstrates that the proposed foveated convolutional neural networks achieve state-of-the-art performance on these benchmark datasets.
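
The two inputs named in the abstract, a foveated image built around a gaze point and motion vectors stacked across adjacent frames, can be illustrated with a minimal sketch. The abstract does not specify the foveation model, so the Gaussian falloff, the box blur, and all function and parameter names below (`box_blur`, `foveate`, `stack_motion`, `fovea_radius`, `blur_radius`) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def box_blur(img, radius):
    """Mean-filter a 2-D grayscale image with a (2r+1) x (2r+1) window."""
    if radius == 0:
        return img.astype(float)
    # Replicate border pixels so the output keeps the input size.
    padded = np.pad(img.astype(float), radius, mode="edge")
    h, w = img.shape
    k = 2 * radius + 1
    out = np.zeros((h, w), dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)


def foveate(img, gaze_xy, fovea_radius=20.0, blur_radius=3):
    """Keep the image sharp at the gaze point and blend toward a blurred
    copy as distance from the gaze point grows (assumed Gaussian falloff)."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gx, gy = gaze_xy
    dist = np.hypot(xs - gx, ys - gy)
    # Weight is 1 at the gaze point and decays toward 0 in the periphery.
    weight = np.exp(-(dist ** 2) / (2.0 * fovea_radius ** 2))
    return weight * img + (1.0 - weight) * box_blur(img, blur_radius)


def stack_motion(flows):
    """Stack L per-frame (H, W, 2) motion fields into one (H, W, 2*L) array."""
    return np.concatenate(flows, axis=-1)


# Usage with synthetic data: one 64x64 frame, gaze at (x=32, y=16),
# and motion fields from 5 adjacent frames.
rng = np.random.default_rng(0)
frame = rng.random((64, 64))
foveated = foveate(frame, gaze_xy=(32, 16))
motion_stack = stack_motion([rng.random((64, 64, 2)) for _ in range(5)])
```

In this sketch the foveated frame would serve as the spatial input and the stacked motion array as the temporal input; how the two are actually fused is not stated in the abstract.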

Keywords

Video summarization · Convolutional neural networks · Eye movement · Foveated image

Notes

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61502311, No. 61620106008), the Natural Science Foundation of Guangdong Province (No. 2016A030310053), the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase) under Grant (No. U1501501), the Shenzhen Emerging Industries of the Strategic Basic Research Project under Grant (No. JCYJ20160226191842793), the Shenzhen high-level overseas talents program, and the Tencent “Rhinoceros Birds” - Scientific Research Foundation for Young Teachers of Shenzhen University.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. The College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
  2. Smith-Kettlewell Eye Research Institute, San Francisco, USA
