A Tree-Based Approach to Integrated Action Localization, Recognition and Segmentation

  • Zhuolin Jiang
  • Zhe Lin
  • Larry S. Davis
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6553)


A tree-based approach to integrated action segmentation, localization, and recognition is proposed. An action is represented as a sequence of joint HOG-flow descriptors extracted independently from each frame. During training, a set of action prototypes is first learned via k-means clustering, and a binary tree model is then constructed over the prototype set by hierarchical k-means clustering. Each tree node is characterized by a shape-motion descriptor and a rejection threshold, and an action segmentation mask is defined for each leaf node (which corresponds to a prototype). During testing, an action is localized by mapping each test frame to its nearest-neighbor prototype using a fast matching method that searches the learned tree, followed by global filtering refinement. An action is recognized by maximizing, over test frames, the sum of the joint probabilities of the action category and action prototype. Our approach does not explicitly rely on human tracking or background subtraction, and enables action localization and recognition under realistic and challenging conditions (such as crowded backgrounds). Experimental results show that our approach achieves recognition rates of 100% on the CMU action dataset and 100% on the Weizmann dataset.
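The tree construction and matching steps described in the abstract can be sketched in a minimal form. The sketch below assumes Euclidean distance, plain Lloyd's k-means, and greedy descent to the closer child at each internal node; it omits the rejection thresholds, segmentation masks, and global filtering refinement of the full method, and all function names are illustrative, not from the paper.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Plain Lloyd's k-means; stands in for the clustering step
    # used to learn action prototypes from per-frame descriptors.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

class Node:
    def __init__(self, center):
        self.center = center          # mean descriptor of this subtree
        self.left = self.right = None
        self.prototype = None         # set only at leaves

def build_tree(prototypes):
    # Binary tree over the prototype set via hierarchical 2-means.
    node = Node(prototypes.mean(axis=0))
    if len(prototypes) == 1:
        node.prototype = prototypes[0]
        return node
    _, labels = kmeans(prototypes, 2)
    if np.all(labels == labels[0]):
        # Degenerate split (e.g. duplicate points): stop here.
        node.prototype = prototypes[0]
        return node
    node.left = build_tree(prototypes[labels == 0])
    node.right = build_tree(prototypes[labels == 1])
    return node

def nearest_prototype(node, x):
    # Fast matching: descend toward the closer child at each level
    # instead of comparing x against every prototype.
    while node.prototype is None:
        dl = np.linalg.norm(x - node.left.center)
        dr = np.linalg.norm(x - node.right.center)
        node = node.left if dl <= dr else node.right
    return node.prototype
```

The descent visits O(log n) nodes per test frame rather than all n prototypes, which is the point of building the tree; the paper's rejection thresholds would additionally let a frame be discarded early at an internal node.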







Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Zhuolin Jiang (University of Maryland, College Park, USA)
  • Zhe Lin (Adobe Systems Incorporated, San Jose, USA)
  • Larry S. Davis (University of Maryland, College Park, USA)
