Visual Attention and Memory Augmented Activity Recognition and Behavioral Prediction

  • Nidhinandana Salian
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 950)


Visual attention based on saliency and human behavior analysis are two areas of research that have garnered much interest over the last two decades, and several recent developments have shown exceedingly promising results. In this paper, we review the evolution of systems for computational modeling of human visual attention and action recognition and hypothesize about their correlation and combined applications. We attempt to systematically compare and contrast each category of models and investigate the research directions that have shown the most potential in tackling major challenges relevant to these tasks. We also present a spatiotemporal saliency detection network augmented with bi-directional Long Short-Term Memory (LSTM) units for efficient activity localization and recognition that, to the best of our knowledge, is the first of its kind. Finally, we conjecture about a conceptual model of visual-attention-based networks for behavioral prediction in intelligent surveillance systems.
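The abstract names a saliency network augmented with bi-directional LSTM units but gives no architectural details. The following is therefore only a minimal sketch of the general idea, not the author's model. It assumes that each video frame is already described by feature vectors at a few spatial locations and by a saliency value per location; the frame features are pooled with saliency weights and the resulting sequence is read by a forward and a backward LSTM. All function names, sizes, and the random initialisation are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm_params(input_size, hidden_size, rng):
    """Random weights for one LSTM direction (hypothetical initialisation)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    # One weight matrix and bias per gate: input, forget, candidate, output.
    return {gate: {"W": matrix(hidden_size, input_size + hidden_size),
                   "b": [0.0] * hidden_size}
            for gate in ("i", "f", "g", "o")}


def lstm_step(params, x, h, c):
    """Standard LSTM cell update for a single time step."""
    z = x + h  # list concatenation of the input and the previous hidden state

    def affine(gate):
        W, b = params[gate]["W"], params[gate]["b"]
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(W, b)]

    i = [sigmoid(v) for v in affine("i")]
    f = [sigmoid(v) for v in affine("f")]
    g = [math.tanh(v) for v in affine("g")]
    o = [sigmoid(v) for v in affine("o")]
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new


def run_lstm(params, sequence, hidden_size):
    """Run one LSTM direction over the sequence, returning every hidden state."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    outputs = []
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
        outputs.append(h)
    return outputs


def bidirectional_lstm(fwd, bwd, sequence, hidden_size):
    """Concatenate forward-in-time and backward-in-time hidden states per frame."""
    forward = run_lstm(fwd, sequence, hidden_size)
    backward = run_lstm(bwd, sequence[::-1], hidden_size)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]


def saliency_pooled_features(frame_features, saliency):
    """Sum the per-location feature vectors, weighted by normalised saliency."""
    total = sum(saliency)
    dim = len(frame_features[0])
    return [sum(s / total * feat[d] for s, feat in zip(saliency, frame_features))
            for d in range(dim)]


# Toy data: 5 frames, 4 spatial locations, 3 features per location (all assumed).
rng = random.Random(0)
num_frames, num_locations, feat_dim, hidden = 5, 4, 3, 2
video = [[[rng.random() for _ in range(feat_dim)] for _ in range(num_locations)]
         for _ in range(num_frames)]
saliency_maps = [[rng.random() + 1e-6 for _ in range(num_locations)]
                 for _ in range(num_frames)]

sequence = [saliency_pooled_features(f, s) for f, s in zip(video, saliency_maps)]
fwd = make_lstm_params(feat_dim, hidden, rng)
bwd = make_lstm_params(feat_dim, hidden, rng)
states = bidirectional_lstm(fwd, bwd, sequence, hidden)
```

Each entry of `states` combines context from earlier and later frames, which is the property that makes a bi-directional recurrence attractive for localizing an activity inside a clip. A classifier over these per-frame states would be a separate component and is not sketched here.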


Visual attention, Activity-recognition, Behavioral-prediction



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Department of Computer Science Engineering, Manipal Institute of Technology, Manipal, India
