Video Action Detection with Relational Dynamic-Poselets

  • Limin Wang
  • Yu Qiao
  • Xiaoou Tang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8693)


Action detection is of great importance in understanding human motion from video. Compared with action recognition, it not only recognizes the action type but also localizes the action's spatiotemporal extent. This paper presents a relational model for action detection, which first decomposes a human action into temporal “key poses” and then further into spatial “action parts”. Specifically, we start by clustering cuboids around each human joint into dynamic-poselets using a new descriptor. The cuboids from the same cluster share a consistent geometric and dynamic structure, and each cluster acts as a mixture of body parts. We then propose a sequential skeleton model to capture the relations among dynamic-poselets. This model unifies three tasks in a single framework: learning the composites of mixture dynamic-poselets, the spatiotemporal structures of action parts, and the local model for each action part. Our model not only localizes the action in a video stream but also enables detailed pose estimation of the actor. We formulate the model learning problem in a structured SVM framework and speed up model inference by dynamic programming. We conduct experiments on three challenging action detection datasets: the MSR-II dataset, the UCF Sports dataset, and the JHMDB dataset. The results show that our method outperforms state-of-the-art methods on these datasets.
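The abstract mentions that inference is sped up by dynamic programming but gives no details. As a rough illustration only, the sketch below runs max-sum dynamic programming over a simple chain of temporal key poses, where each key pose picks one candidate state (for instance a dynamic-poselet mixture type at a location). The chain structure, the function name `chain_max_sum`, and the `unary`/`pairwise` score tables are assumptions made for this example; the paper's sequential skeleton model is richer than a plain chain, and this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def chain_max_sum(
    unary: Sequence[Sequence[float]],
    pairwise: Sequence[Sequence[Sequence[float]]],
) -> Tuple[float, List[int]]:
    """Best-scoring state sequence on a chain (hypothetical simplification).

    unary[t][s]       -- local score of candidate state s at key pose t
    pairwise[t][p][s] -- relation score between state p at t and state s at t+1
    """
    best = list(unary[0])          # best[s]: top score of any prefix ending in s
    back: List[List[int]] = []     # back[t-1][s]: best predecessor of s at step t

    for t in range(1, len(unary)):
        new_best: List[float] = []
        pointers: List[int] = []
        for s, local in enumerate(unary[t]):
            # Choose the predecessor that maximizes prefix score + relation score.
            p_star = max(
                range(len(best)),
                key=lambda p: best[p] + pairwise[t - 1][p][s],
            )
            new_best.append(best[p_star] + pairwise[t - 1][p_star][s] + local)
            pointers.append(p_star)
        best = new_best
        back.append(pointers)

    # Backtrack from the best final state to recover the full assignment.
    last = max(range(len(best)), key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


# Toy example: 3 key poses, 2 candidate states each, a penalty of 1 for switching.
unary_scores = [[2.0, 0.0], [0.0, 3.0], [1.5, 0.0]]
switch_penalty = [[0.0, -1.0], [-1.0, 0.0]]
score, states = chain_max_sum(unary_scores, [switch_penalty, switch_penalty])
```

For T key poses with K candidates each, this costs O(T K²) instead of the O(Kᵀ) of exhaustive search, which is the general reason chain- and tree-structured models admit fast exact inference.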


Keywords: Action detection, dynamic-poselet, sequential skeleton model



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Limin Wang (1, 2)
  • Yu Qiao (2)
  • Xiaoou Tang (1, 2)
  1. Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong
  2. Shenzhen Key Lab of CVPR, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
