Pipelining Localized Semantic Features for Fine-Grained Action Recognition

  • Yang Zhou
  • Bingbing Ni
  • Shuicheng Yan
  • Pierre Moulin
  • Qi Tian
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8692)


In fine-grained action (object manipulation) recognition, it is important to encode object semantic (contextual) information, i.e., which object is being manipulated and how it is being operated. However, previous action recognition methods often represent this semantic information in a global and coarse way and therefore cannot cope with fine-grained actions. In this work, we propose a representation and classification pipeline that incorporates localized semantic information into every processing step for fine-grained action recognition. In the feature extraction stage, we explore the geometric information between local motion features and the surrounding objects. In the feature encoding stage, we develop a semantic-grouped locality-constrained linear coding (SG-LLC) method that captures the joint distribution of motion and object-in-use information. Finally, we propose a semantic-aware multiple kernel learning (SA-MKL) framework that uses the empirical joint distribution of action and object type for more discriminative action classification. Extensive experiments are performed on the large-scale and challenging MPII fine-grained cooking action dataset. The results show that incorporating localized semantic information throughout the action representation and classification pipeline significantly improves fine-grained action classification performance over existing methods.
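To make the feature encoding stage more concrete, the following is a minimal sketch of standard approximated locality-constrained linear coding (LLC), followed by one possible way to group codes by object label. It is not the authors' SG-LLC formulation, which the abstract does not specify. The function names `llc_code` and `grouped_llc_code`, the per-object codebooks, and the parameters `k` and `beta` are all assumptions made for illustration.

```python
import numpy as np


def llc_code(x, codebook, k=5, beta=1e-4):
    """Approximated LLC code of one feature x (D,) against a codebook (M, D)."""
    k = min(k, codebook.shape[0])
    # Select the k codewords closest to x (Euclidean distance).
    dists = np.sum((codebook - x) ** 2, axis=1)
    idx = np.argsort(dists)[:k]
    # Shift the selected codewords so that x sits at the origin.
    z = codebook[idx] - x
    cov = z @ z.T
    # Regularise the local covariance so the linear system is well posed.
    cov += beta * max(np.trace(cov), 1e-12) * np.eye(k)
    # Solve for the weights and enforce the sum-to-one constraint.
    w = np.linalg.solve(cov, np.ones(k))
    w /= w.sum()
    code = np.zeros(codebook.shape[0])
    code[idx] = w
    return code


def grouped_llc_code(x, group, codebooks, k=5):
    """Hypothetical semantic grouping (an assumption, not the paper's method):
    code x only against the codebook of its object group and write the result
    into that group's block of one concatenated code vector."""
    sizes = [cb.shape[0] for cb in codebooks]
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    code = np.zeros(offsets[-1])
    code[offsets[group]:offsets[group + 1]] = llc_code(x, codebooks[group], k)
    return code
```

In this sketch, a motion feature associated with object group `g` activates only the block of the final code vector that belongs to `g`, so pooled codes keep motion statistics separate per object type. How the actual SG-LLC couples motion and object-in-use information may differ.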


Keywords: Object Detection, Action Recognition, Motion Feature, Multiple Kernel Learning, Dense Trajectory



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Yang Zhou (1)
  • Bingbing Ni (2)
  • Shuicheng Yan (3)
  • Pierre Moulin (4)
  • Qi Tian (1)
  1. University of Texas at San Antonio, USA
  2. Advanced Digital Sciences Center, Singapore
  3. National University of Singapore, Singapore
  4. University of Illinois at Urbana-Champaign, USA
