On Recognizing Actions in Still Images via Multiple Features

  • Fadime Sener
  • Cagdas Bas
  • Nazli Ikizler-Cinbis
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7585)


We propose a multi-cue approach for recognizing human actions in still images, in which relevant object regions are discovered and used in a weakly supervised manner. Our approach does not require any explicitly trained object detector or any part or attribute annotation. Instead, multiple instance learning is applied over sets of object hypotheses to represent the objects relevant to each action. We test our method on the extensive Stanford 40 Actions dataset [1] and achieve a significant performance gain over the state of the art. Our results show that using multiple object hypotheses within multiple instance learning is effective for human action recognition in still images, and that such an object representation is suitable for use in conjunction with other visual features.
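
As an illustration of how multiple instance learning can operate over sets of object hypotheses, the following minimal sketch treats each image as a bag of candidate-region feature vectors and embeds the bag by its best match to each candidate concept. This follows the embedded instance selection idea of MILES [17]; the abstract does not state that this exact formulation is used, so the function names, the toy two-dimensional features, and the `sigma` value are assumptions made only for illustration.

```python
import math


def bag_similarity(bag, concept, sigma=1.0):
    """Similarity of a bag to one concept: the closest instance decides."""
    best = 0.0
    for instance in bag:
        sq_dist = sum((a - b) ** 2 for a, b in zip(instance, concept))
        best = max(best, math.exp(-sq_dist / sigma ** 2))
    return best


def embed_bag(bag, concepts, sigma=1.0):
    """Map a variable-size bag to one fixed-length vector, one entry per concept."""
    return [bag_similarity(bag, concept, sigma) for concept in concepts]


# Hypothetical toy data: each image is a bag of object-hypothesis features.
bags = {
    "image_a": [[0.0, 0.0], [1.0, 1.0]],
    "image_b": [[5.0, 5.0]],
}

# As in MILES, every training instance serves as a candidate concept.
concepts = [instance for bag in bags.values() for instance in bag]

embedding_a = embed_bag(bags["image_a"], concepts)
embedding_b = embed_bag(bags["image_b"], concepts)
```

Each image ends up with a fixed-length vector regardless of how many object hypotheses it contains, so a standard classifier can be trained on the embeddings and, in principle, combined with other visual features.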


Action Recognition, Salient Object, Object Region, Human Action Recognition, Multiple Instance Learning


References

  1. Yao, B., Jiang, X., Khosla, A., Lin, A.L., Guibas, L.J., Fei-Fei, L.: Human action recognition by learning bases of action attributes and parts. In: International Conference on Computer Vision (ICCV), Barcelona, Spain (November 2011)
  2. Gupta, A., Kembhavi, A., Davis, L.S.: Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE TPAMI 31, 1775–1789 (2009)
  3. Yao, B., Fei-Fei, L.: Modeling mutual context of object and human pose in human-object interaction activities. In: CVPR, San Francisco, CA (June 2010)
  4. Prest, A., Schmid, C., Ferrari, V.: Weakly supervised learning of interactions between humans and objects. IEEE TPAMI 34, 601–614 (2012)
  5. Alexe, B., Deselaers, T., Ferrari, V.: What is an object? In: CVPR, San Francisco, USA (2010)
  6. Poppe, R.: A survey on vision-based human action recognition. Image and Vision Computing 28, 976–990 (2010)
  7. Weinland, D., Ronfard, R., Boyer, E.: A survey of vision-based methods for action representation, segmentation and recognition. CVIU 115, 224–241 (2011)
  8. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR (2008)
  9. Wang, Y., Jiang, H., Drew, M.S., Li, Z.N., Mori, G.: Unsupervised discovery of action classes. In: CVPR (2006)
  10. Thurau, C., Hlavac, V.: Pose primitive based human action recognition in videos or still images. In: CVPR (2008)
  11. Ikizler-Cinbis, N., Cinbis, R.G., Sclaroff, S.: Learning actions from the web. In: ICCV (2009)
  12. Yao, B., Fei-Fei, L.: Grouplet: a structured image representation for recognizing human and object interactions. In: CVPR, San Francisco, CA (June 2010)
  13. Desai, C., Ramanan, D., Fowlkes, C.: Discriminative models for static human-object interactions. In: Workshop on Structured Models in Computer Vision (2010)
  14. Delaitre, V., Sivic, J., Laptev, I.: Learning person-object interactions for action recognition in still images. In: NIPS (2011)
  15. Delaitre, V., Laptev, I., Sivic, J.: Recognizing human actions in still images: a study of bag-of-features and part-based representations. In: BMVC (2010)
  16. Yao, B., Khosla, A., Fei-Fei, L.: Combining randomization and discrimination for fine-grained image categorization. In: CVPR, Colorado Springs, USA (June 2011)
  17. Chen, Y., Bi, J., Wang, J.Z.: MILES: Multiple-instance learning via embedded instance selection. IEEE TPAMI 28, 1931–1947 (2006)
  18. Patron-Perez, A., Marszalek, M., Reid, I., Zisserman, A.: High five: Recognising human interactions in TV shows. In: BMVC (2010)
  19. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: CVPR (2001)
  20. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60, 91–110 (2004)
  21. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3D human pose annotations. In: ICCV (2009)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Fadime Sener (1)
  • Cagdas Bas (2)
  • Nazli Ikizler-Cinbis (2)

  1. Computer Engineering Department, Bilkent University, Ankara, Turkey
  2. Computer Engineering Department, Hacettepe University, Ankara, Turkey
