Object, Scene and Actions: Combining Multiple Features for Human Action Recognition

  • Nazli Ikizler-Cinbis
  • Stan Sclaroff
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6311)


In many cases, human actions can be identified not only by observing the human body in motion, but also by the properties of the surrounding scene and related objects. In this paper, we address this problem and propose an approach to human action recognition that integrates multiple feature channels from several entities, such as objects, scenes and people. We formulate the problem in a multiple instance learning (MIL) framework based on multiple feature channels. Using a discriminative approach, we join multiple feature channels embedded in the MIL space. Our experiments on the large YouTube dataset show that scene and object information can complement person features for human action recognition.
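As an illustration only, the idea of embedding bags of instances into a MIL space and joining several feature channels might be sketched as follows. This is a minimal sketch, not the authors' formulation: it assumes a MILES-style embedding in which each bag is described by its best Gaussian similarity to a set of concept instances, and it assumes the channels are joined by simple concatenation. The function names, the bandwidth `sigma`, the channel names and the toy data are all hypothetical.

```python
import math


def embed_bag(bag, concepts, sigma=1.0):
    """Map one bag (list of instance vectors) to a vector with one entry
    per concept instance: the highest Gaussian similarity between that
    concept and any instance in the bag (assumed MILES-style embedding)."""
    embedding = []
    for concept in concepts:
        best = max(
            math.exp(-sum((a - b) ** 2 for a, b in zip(inst, concept)) / sigma ** 2)
            for inst in bag
        )
        embedding.append(best)
    return embedding


def embed_multichannel(bags_by_channel, concepts_by_channel, sigma=1.0):
    """Embed each channel's bag separately and concatenate the results.
    Channels are visited in sorted order so the layout is deterministic."""
    joined = []
    for channel in sorted(bags_by_channel):
        joined.extend(embed_bag(bags_by_channel[channel],
                                concepts_by_channel[channel], sigma))
    return joined


# Hypothetical toy data: two channels with different feature dimensions.
concepts = {
    "person": [[0.0, 0.0], [1.0, 1.0]],  # two 2-D concept instances
    "object": [[2.0]],                   # one 1-D concept instance
}
video = {
    "person": [[0.0, 0.0], [5.0, 5.0]],  # candidate person instances
    "object": [[2.0], [3.0]],            # candidate object instances
}

video_vec = embed_multichannel(video, concepts)
print(video_vec)
```

The resulting fixed-length vector could then be passed to any discriminative classifier, such as a linear SVM. Which classifier and which joining scheme work best is not something this sketch settles.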


Keywords: Feature Channel; Candidate Object; Multiple Kernel Learning; Multiple Instance Learning; Scene Feature



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Nazli Ikizler-Cinbis (1)
  • Stan Sclaroff (1)
  1. Department of Computer Science, Boston University
