Detecting Engagement in Egocentric Video

  • Yu-Chuan Su
  • Kristen Grauman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9909)


In a wearable camera video, we see what the camera wearer sees. While this makes it easy to know roughly what he chose to look at, it does not immediately reveal when he was engaged with the environment. Specifically, at what moments did his focus linger, as he paused to gather more information about something he saw? Knowing the answer would benefit applications in video summarization and augmented reality, yet prior work focuses solely on the “what” question (estimating saliency, gaze) without considering the “when” (engagement). We propose a learning-based approach that uses long-term egomotion cues to detect engagement, specifically in browsing scenarios where one frequently takes in new visual information (e.g., shopping, touring). We introduce a large, richly annotated dataset for ego-engagement that is the first of its kind. Our approach outperforms a wide array of existing methods. We show that engagement can be detected well independent of both scene appearance and the camera wearer’s identity.
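To make the phrase "long-term egomotion cues" concrete, here is a minimal sketch of one way such a cue could be computed. It is not the authors' feature: it assumes a hypothetical per-frame motion magnitude (for example, the mean optical-flow magnitude of each frame) and summarizes it over a window spanning several frames on each side. The function name `long_term_motion_features`, the `half_width` parameter, and the choice of mean and standard deviation are all illustrative assumptions.

```python
from statistics import mean, pstdev


def long_term_motion_features(motion, half_width):
    """Summarize per-frame motion over a window centered on each frame.

    motion: list of non-negative floats, one per frame (hypothetical input,
            e.g. the mean optical-flow magnitude of the frame).
    half_width: number of frames included on each side of the center frame.
    Returns one (mean, std) pair per frame; windows are clipped at the ends.
    """
    features = []
    for t in range(len(motion)):
        lo = max(0, t - half_width)
        hi = min(len(motion), t + half_width + 1)
        window = motion[lo:hi]
        features.append((mean(window), pstdev(window)))
    return features


# Toy sequence: the wearer walks (high motion), pauses (low motion), walks again.
motion = [5.0] * 4 + [0.5] * 6 + [5.0] * 4
features = long_term_motion_features(motion, half_width=2)
```

In a learning-based pipeline, per-frame vectors like these would be passed to a classifier trained on annotated engagement intervals; the classifier itself is left out of this sketch.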


Keywords (machine-generated, not supplied by the authors): Ground truth · Random forest · Optical flow · Inertial sensor · Random forest classifier



This research is supported in part by ONR YIP N00014-12-1-0754.

Supplementary material

Supplementary material 1: 419978_1_En_28_MOESM1_ESM.pdf (PDF, 658 KB)



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. The University of Texas at Austin, Austin, USA
