Learning to Recognize Daily Actions Using Gaze

  • Alireza Fathi
  • Yin Li
  • James M. Rehg
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7572)


We present a probabilistic generative model for simultaneously recognizing daily actions and predicting gaze locations in videos recorded from an egocentric camera. We focus on activities requiring eye-hand coordination and model the spatio-temporal relationship between the gaze point, the scene objects, and the action label. Our model captures the fact that the distribution of both visual features and object occurrences in the vicinity of the gaze point is correlated with the verb-object pair describing the action. It explicitly incorporates known properties of gaze behavior from the psychology literature, such as the temporal delay between fixation and manipulation events. We present an inference method that can predict the best sequence of gaze locations and the associated action label from an input sequence of images. We demonstrate improvements in action recognition rates and gaze prediction accuracy relative to state-of-the-art methods, on two new datasets that contain egocentric videos of daily activities and gaze.


Keywords: Action Recognition, Humanoid Robot, Training Sequence, Action Label, Foreground Region
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Alireza Fathi (1)
  • Yin Li (1)
  • James M. Rehg (1)
  1. College of Computing, Georgia Institute of Technology, USA
