Learning Discriminative Space–Time Action Parts from Weakly Labelled Videos

Abstract

Current state-of-the-art action classification methods aggregate space–time features globally, from the entire video clip under consideration. However, the features extracted may in part be due to irrelevant scene context, or movements shared amongst multiple action classes. This motivates learning with local discriminative parts, which can help localise which parts of the video are significant. Exploiting spatio-temporal structure in the video should also improve results, just as deformable part models have proven highly successful in object recognition. However, whereas objects have clear boundaries, which means we can easily define a ground truth for initialisation, 3D space–time actions are inherently ambiguous and expensive to annotate in large datasets. It is therefore desirable to adapt pictorial star models to action datasets without location annotation, and to features invariant to changes in pose, such as bag-of-features and Fisher vectors, rather than low-level HoG. We thus propose local deformable spatial bag-of-features, in which local discriminative regions are split into a fixed grid of parts that are allowed to deform in both space and time at test time. In our experimental evaluation we demonstrate that by using local space–time action parts in a weakly supervised setting, we are able to achieve state-of-the-art classification performance, whilst being able to localise actions even in the most challenging video datasets.
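The pictorial star model mentioned above can be illustrated with a minimal sketch: a root placement in a space–time volume plus parts anchored on a grid around it, where each part may shift within a small window at a quadratic deformation cost. This is an invented toy implementation for intuition only (the function names and the dense per-cell score map are assumptions, not the authors' actual pipeline, which scores bag-of-features histograms of local regions):

```python
import numpy as np

def part_score(score_map, anchor, defo_cost, radius=1):
    """Best score of one part over small space-time displacements:
    appearance score at the displaced cell minus a quadratic penalty.
    score_map: (T, Y, X) array of per-cell appearance scores.
    anchor: (t, y, x) ideal part placement."""
    T, Y, X = score_map.shape
    t0, y0, x0 = anchor
    best = -np.inf
    for dt in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                t, y, x = t0 + dt, y0 + dy, x0 + dx
                if 0 <= t < T and 0 <= y < Y and 0 <= x < X:
                    s = score_map[t, y, x] - defo_cost * (dt * dt + dy * dy + dx * dx)
                    best = max(best, s)
    return best

def star_model_score(score_map, root, part_offsets, defo_cost=0.5):
    """Star model: root score plus independently deformed part scores."""
    total = score_map[root]
    for off in part_offsets:
        anchor = tuple(r + o for r, o in zip(root, off))
        total += part_score(score_map, anchor, defo_cost)
    return total
```

Because each part deforms independently given the root, the maximisation decomposes per part; in practice this brute-force search over displacements is replaced by generalised distance transforms for efficiency.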




Author information


Corresponding author

Correspondence to Michael Sapienza.

Electronic supplementary material

Supplementary material 1 (mp4 3913 KB)


Cite this article

Sapienza, M., Cuzzolin, F. & Torr, P.H. Learning Discriminative Space–Time Action Parts from Weakly Labelled Videos. Int J Comput Vis 110, 30–47 (2014). https://doi.org/10.1007/s11263-013-0662-8

Keywords

  • Action classification
  • Localisation
  • Multiple instance learning
  • Deformable part models
  • Space–time videos