Multimedia Tools and Applications

, Volume 77, Issue 22, pp 29231–29244 | Cite as

Hybrid convolutional neural networks and optical flow for video visual attention prediction

  • Meijun Sun
  • Ziqi Zhou
  • Dong Zhang
  • Zheng WangEmail author


In this paper, a convolutional neural networks (CNN) and optical flow based method is proposed for prediction of visual attention in the videos. First, a deep-learning framework is employed to extract spatial features in frames to replace those commonly used handcrafted features. The optical flow is calculated to obtain the temporal feature of the moving objects in video frames, which always draw audiences’ attentions. By integrating these two groups of features, a hybrid spatial temporal feature set is obtained and taken as the input of a support vector machine (SVM) to predict the degree of visual attention. Finally, two publicly available video datasets were used to test the performance of the proposed model, where the results have demonstrated the efficacy of the proposed approach.


Convolutional neural networks Optical flow Spatial temporal feature Visual attention 



The authors wish to acknowledge the support from the National Natural Science Foundation, China, under the grants [61572351] and [61772360].


  1. 1.
    Bak C, Erdem A, Erdem E (2016) Two-stream convolutional networks for dynamic saliency prediction. arXiv preprint arXiv:1607.04730Google Scholar
  2. 2.
    Baluja S, Pomerleau D (1994) Using a saliency map for active spatial selective attention: implementation & initial results. In: Proc. Neural Information Processing Systems, pp 451–458Google Scholar
  3. 3.
    Bengio Y (2009) Learning deep architectures for AI. Found Trends Mach Learn 2(1):1–127CrossRefGoogle Scholar
  4. 4.
    Berg DJ, Boehnke SE, Marino RA, Munoz DP, Itti L (2009) Free viewing of dynamic stimuli by humans and monkeys. J Vis 9(5):1–15CrossRefGoogle Scholar
  5. 5.
    Cheng G, Yang C, Yao X, Guo L, Han J (2018) When deep learning meets metric learning: remote sensing image scene classification via learning discriminative CNNs. IEEE Trans Geosci Remote Sens. CrossRefGoogle Scholar
  6. 6.
    Cui X, Liu Q, Metaxas D (2009) Temporal spectral residual: fast motion saliency detection. In: Proc. ACM Int. Multimedia Conf., pp 617–620Google Scholar
  7. 7.
    Dorr M, Martinetz T, Gegenfurtner KR, Barth E (2010) Variability of eye movements when viewing dynamic natural scenes. J Vis 10(10):28CrossRefGoogle Scholar
  8. 8.
    Fan J, Xu W, Wu Y, Gong Y (2010) Human tracking using convolutional neural networks. IEEE Trans Neural Netw 21(10):1610–1623CrossRefGoogle Scholar
  9. 9.
    Fragkiadaki K, Arbelaez P, Felsen P, Malik J (2015) Learning to segment moving objects in videos. In: IEEE Conference on Computer Vision and Pattern RecognitionGoogle Scholar
  10. 10.
    Goferman S, Zelnik-Manor L, Tal A (2012) Context-aware saliency detection. IEEE Trans Pattern Anal Mach Intell 34(10):1915–1926CrossRefGoogle Scholar
  11. 11.
    Han J, He S, Qian X, Wang D, Guo L, Liu T (2013) An object-oriented visual saliency detection framework based on sparse coding representations. IEEE Trans Circuits Syst Video Technol 23(12):2009–2021CrossRefGoogle Scholar
  12. 12.
    Han J, Zhou P, Zhang D, Cheng G, Guo L, Liu Z, Bu S, Wu J (2014) Efficient, simultaneous detection of multi-class geospatial targets based on visual saliency modeling and discriminative learning of sparse coding. ISPRS J PhotoGramm Remote Sens 89:37–48CrossRefGoogle Scholar
  13. 13.
    Han J, Sun L, Hu X, Han J, Shao L (2014) Spatial and temporal visual attention prediction in videos using eye movement data. Neurocomputing 145:140–153CrossRefGoogle Scholar
  14. 14.
    Han Y, Yang Y, Wu F, Hong R (2015) Compact and discriminative descriptor inference using multi-cues. IEEE Trans Image Process 24(12):5114–5126MathSciNetCrossRefGoogle Scholar
  15. 15.
    Han Y, Yang Y, Yan Y, Ma Z, Sebe N, Zhou X (2015) Semisupervised feature selection via spline regression for video semantic recognition. IEEE Trans Neural Netw Learning Syst 26(2):252–264MathSciNetCrossRefGoogle Scholar
  16. 16.
    Han J, Cheng G, Li Z, Zhang D (2017) A unified metric learning-based framework for co-saliency detection. IEEE Trans Circuits Syst Video Technol 99:1–1. CrossRefGoogle Scholar
  17. 17.
    Han J, Chen H, Liu N, Yan C, Li X (2017) CNNs-based RGB-D saliency detection via cross-view transfer and multi view fusion. IEEE Trans Cybern PP(99):1–13. CrossRefGoogle Scholar
  18. 18.
    Han J, Zhang D, Cheng G, Liu N, Xu D (2018) Advanced deep-learning techniques for salient and category-specific object detection: a survey. IEEE Signal Process Mag 35(1):84–100CrossRefGoogle Scholar
  19. 19.
    Han J, Quan R, Zhang D, Nie F (2018) Robust object co-segmentation using background prior. IEEE Trans Image Process 27(4):1639–1651MathSciNetCrossRefGoogle Scholar
  20. 20.
    Harel J, Koch C, Perona P (2007) Graph-based visual saliency. Adv. Neural Inf. Process. Syst., pp 545–552Google Scholar
  21. 21.
    Hou X, Zhang L (2007) Saliency detection: a spectral residual approach. In: Proc. IEEE Conf. Computer Vision and Pattern Recog., pp 1–8Google Scholar
  22. 22.
    Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231CrossRefGoogle Scholar
  23. 23.
    Jiang J et al (2011) Live: an integrated production and feedback system for intelligent and interactive tv broadcasting. IEEE Trans Broadcast 57(3):646–661CrossRefGoogle Scholar
  24. 24.
    Kim W, Jung C, Kim C (2011) Spatiotemporal saliency detection and its applications in static and dynamic scenes. IEEE Trans Circuits Syst Video Technol 21(4):446–456CrossRefGoogle Scholar
  25. 25.
    Koch C, Ullman S (1986) Shifts in selective visual attention: towards the underlying neural circuitry. Hum Neurobiol 4(4):219Google Scholar
  26. 26.
    Li G, Yu Y (2016) Visual saliency detection based on multiscale deep CNN features. IEEE Trans Image Process 25(11):5012–5024MathSciNetCrossRefGoogle Scholar
  27. 27.
    Lia H, Chenb J, Luc H, Chic Z (2017) CNN for saliency detection with low-level feature integration. Neurocomputing 226(22):212–220Google Scholar
  28. 28.
    Liu D, Shyu M (2013) Semantic retrieval for videos in non-static background using motion saliency and global features. IEEE Seventh International Conference on Semantic ComputingGoogle Scholar
  29. 29.
    Lu X, Yuan Y, Yan P (2013) Sparse coding for image denoising using spike and slab prior. Neurocomputing 106(6):12–20CrossRefGoogle Scholar
  30. 30.
    Lu X, Zheng X, Li X (2016) Latent semantic minimal hashing for image retrieval. IEEE Trans Image Process 26(1):355–368MathSciNetCrossRefGoogle Scholar
  31. 31.
    Ma Y, Hua X, Lu L, Zhang H (2005) A generic frame work of userattention model and its application in video summarization. IEEE Trans Multimedia 7(5):907–919CrossRefGoogle Scholar
  32. 32.
    Ma C, Huang JB, Yang X, Yang MH (2015) Hierarchical convolutional features for visual tracking. In: IEEE International Conference on Computer Vision, pp 3074–3082Google Scholar
  33. 33.
    Maioli C, Benaglio I, Siri S, Sosta K, Cappa S (2001) The integration of parallel and serial processing mechanisms in visual search: evidence from eye movement recordings. Eur J Neurosci 13(2):364–372Google Scholar
  34. 34.
    Milanese R (1993) Detecting salient regions in an image: from biological evidence to computer implementation. PhD thesis, Univ. GenevaGoogle Scholar
  35. 35.
    Muhl C, Nagai Y, Sagerer G (2007) On constructing a communicative space in HRI. In: Proc. 30th German Conf. Artificial Intelligence, pp 264–278Google Scholar
  36. 36.
    Ni Q, Gu X (2014) Video attention saliency mapping using pulse coupled neural network and optical flow. International Joint Conference on Neural NetworksGoogle Scholar
  37. 37.
    Rahtu E, Kannala J, Salo M, Heikkilä J (2010) Segmenting salient objects from images and videos. In: Proceedings of European Conference on Computer Vision, pp 366–379CrossRefGoogle Scholar
  38. 38.
    Ren J et al (2007) Efficient detection of temporally impulsive dirt impairments in archived films. Signal Process 87(3):541–551CrossRefGoogle Scholar
  39. 39.
    Ren J et al (2009) Hierarchical modeling and adaptive clustering for real-time summarization of rush videos. IEEE Trans Multimedia 11(5):906–917CrossRefGoogle Scholar
  40. 40.
    Ren J et al (2010) Fusion of intensity and inter-component chromatic difference for effective and robust colour edge detection. IET Image Process 4(4):294–301CrossRefGoogle Scholar
  41. 41.
    Rutishauser U, Walther D, Koch C, Perona P (2004) Is bottom-up attention useful for object recognition? In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp 37–44Google Scholar
  42. 42.
    Seo HJ, Milanfar P (2009) Static and space-time visual saliency detection by self- resemblance. J Vis 9(12):1–27CrossRefGoogle Scholar
  43. 43.
    Shen C, Zhao Q (2014) Learning to predict eye fixations for semantic contents using multi-layer sparse network. Neurocomputing 138(11):61–68CrossRefGoogle Scholar
  44. 44.
    Tsai YH, Zhong G, Yang MH (2016) Semantic co-segmentation in videos. In: European Conference on Computer Vision, pp 760–775CrossRefGoogle Scholar
  45. 45.
    Tsotsos JK, Culhane SM, Wai WYK, Lai Y, Davis N, Nuflo F (1995) Modeling visual attention via selective tuning. Artif Intell 78:507–545CrossRefGoogle Scholar
  46. 46.
    Vig E, Dorr M, Martinetz MT, Barth E (2012) Intrinsic dimensionality predicts the saliency of natural dynamic scenes. IEEE Trans Pattern Anal Mach Intell 34(6):1080–1091CrossRefGoogle Scholar
  47. 47.
    Wang N, Yeung DY (2013) Learning a deep compact image representation for visual tracking. In: Advances in Neural Information Processing Pystems, pp 809–817Google Scholar
  48. 48.
    Wang W, Shen J, Shao L (2015) Consistent video saliency using local gradient flow optimization and global refinement. IEEE TIP 24(11):4185–4196MathSciNetGoogle Scholar
  49. 49.
    Wang L, Ouyang W, Wang X, Lu H (2015) Visual tracking with fully convolutional networks. In: IEEE International Conference on Computer VisionGoogle Scholar
  50. 50.
    Wang W, Shen J, Shao L (2017) Deep learning for video saliency detection. arXivGoogle Scholar
  51. 51.
    Wu Z, Su L, Huang Q, Wu B, Li J (2016) Video saliency prediction with optimized optical flow and gravity center bias. IEEE International Conference on Multimedia & Expo, pp 1–6Google Scholar
  52. 52.
    Yao X, Han J, Zhang D, Nie F (2017) Revisiting co-saliency detection: a novel approach based on two-stage multi-view spectral rotation co-clustering. IEEE TIP 26(7):3196–3209MathSciNetGoogle Scholar
  53. 53.
    Yuan Y, Lv H, Lu X (2015) Semi-supervised change detection method for multi-temporal hyperspectral images. Neurocomputing 148(19):363–375CrossRefGoogle Scholar
  54. 54.
    Zabalza J et al (2015) Novel two-dimensional singular spectrum analysis for effective feature extraction and data classification in hyperspectral imaging. IEEE Trans Geosci Remote Sens 53(8):4418–4433CrossRefGoogle Scholar
  55. 55.
    Zhang K, Liu Q, Wu Y, Yang MH (2016) Robust visual tracking via convolutional networks without training. IEEE Trans Image Process 25(4):1779–1792MathSciNetGoogle Scholar
  56. 56.
    Zhang D, Meng D, Han J (2017) co-saliency detection via a self-paced multiple-instance learning framework. IEEE Trans Pattern Anal Mach Intell 39(5):865–878CrossRefGoogle Scholar
  57. 57.
    Zhang D, Han J, Jiang L, Ye S, Chang X (2017) revealing event saliency in unconstrained video collection. IEEE TIP 26(4):1746–1758MathSciNetGoogle Scholar
  58. 58.
    Zhao R, Ouyang W, Li H, Wang X (2015) Saliency detection by multi-context deep learning. CVPR, pp 1265-1274Google Scholar
  59. 59.
    Zhong S, Liu Y, Ren F, Zhang J, Ren T (2013) Video saliency detection via dynamic consistent spatio-temporal attention modelling. AAAIGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. 1.School of Computer Science and TechnologyTianjin UniversityTianjinChina
  2. 2.Tianjin University of Traditional Chinese MedicineTianjinChina
  3. 3.School of Computer SoftwareTianjin UniversityTianjinChina

Personalised recommendations