Online Action Detection

  • Roeland De Geest
  • Efstratios Gavves
  • Amir Ghodrati
  • Zhenyang Li
  • Cees Snoek
  • Tinne Tuytelaars
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9909)


In online action detection, the goal is to detect the start of an action in a video stream as soon as it happens. For instance, if a child is chasing a ball, an autonomous car should recognize what is going on and respond immediately. This is a very challenging problem for four reasons. First, only partial actions are observed. Second, negative data vary widely. Third, the start of the action is unknown, so it is unclear over what time window the information should be integrated. Finally, real-world data exhibit large within-class variability. This problem has been addressed before, but only to some extent. Our contributions to online action detection are threefold. First, we introduce a realistic dataset composed of 27 episodes from 6 popular TV series. The dataset spans more than 16 hours of footage annotated with 30 action classes, totaling 6,231 action instances. Second, we analyze and compare various baseline methods, showing that this is a challenging problem for which none of the methods provides a good solution. Third, we analyze how performance changes under variations in viewpoint, occlusion, truncation, etc. We also introduce an evaluation protocol for fair comparison. The dataset, the baselines and the models will all be made publicly available to encourage (much needed) further research on online action detection on realistic data.
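The third difficulty named above, not knowing when an action started and therefore over what time window to integrate evidence, can be made concrete with a small streaming sketch. This is a hypothetical illustration, not one of the paper's baselines: the class `OnlineWindowDetector`, the function `detect_starts`, the mean pooling, the per-class linear scores and the threshold rule are all assumptions chosen for brevity. Each frame is scored using only the current frame and the most recent `window - 1` frames, and a start is reported the first time a class score rises above the threshold.

```python
from collections import deque
from typing import Deque, Dict, List, Sequence, Set, Tuple


class OnlineWindowDetector:
    """Hypothetical causal detector (illustration only, not the paper's model).

    Each incoming frame feature is pushed into a buffer holding at most
    `window` recent frames; the buffer is mean-pooled and scored with one
    linear model per action class. Only past and current frames are used.
    """

    def __init__(self, weights: Dict[str, Sequence[float]],
                 bias: Dict[str, float], window: int = 16,
                 threshold: float = 0.0) -> None:
        if window < 1:
            raise ValueError("window must be at least 1 frame")
        self.weights = {c: list(w) for c, w in weights.items()}
        self.bias = dict(bias)
        self.threshold = threshold
        # deque(maxlen=...) silently drops the oldest frame once it is full.
        self.buffer: Deque[List[float]] = deque(maxlen=window)

    def step(self, feature: Sequence[float]) -> Dict[str, float]:
        """Consume one frame and return a score per class for that frame."""
        self.buffer.append(list(feature))
        n = len(self.buffer)
        # Mean-pool the buffered features dimension by dimension.
        pooled = [sum(f[d] for f in self.buffer) / n for d in range(len(feature))]
        return {c: sum(wi * xi for wi, xi in zip(w, pooled)) + self.bias[c]
                for c, w in self.weights.items()}


def detect_starts(detector: OnlineWindowDetector,
                  stream: Sequence[Sequence[float]]) -> List[Tuple[int, str]]:
    """Emit (frame_index, class) when a class score first rises above threshold."""
    starts: List[Tuple[int, str]] = []
    active: Set[str] = set()
    for t, feature in enumerate(stream):
        for c, score in detector.step(feature).items():
            if score > detector.threshold and c not in active:
                starts.append((t, c))  # decision made at frame t, no look-ahead
                active.add(c)
            elif score <= detector.threshold:
                active.discard(c)  # class dropped below threshold; re-arm it
    return starts


if __name__ == "__main__":
    det = OnlineWindowDetector({"run": [1.0]}, {"run": -0.4}, window=2)
    frames = [[0.0], [0.0], [1.0], [1.0], [0.0], [0.0]]
    print(detect_starts(det, frames))  # [(2, 'run')]
```

With this toy stream, a 2-frame window reports the start at frame 2, while a 4-frame window reports it one frame later, at frame 3. A longer window averages over more evidence but reacts more slowly, which is one face of the window-length trade-off described above.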


Keywords: Action recognition, Evaluation, Online action detection

Supplementary material

Supplementary material 1: 419978_1_En_17_MOESM1_ESM.pdf (PDF, 262 KB)
Supplementary material 2: 419978_1_En_17_MOESM2_ESM.mp4 (MP4 video, 3.8 MB)


Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Roeland De Geest (1, corresponding author)
  • Efstratios Gavves (2)
  • Amir Ghodrati (1)
  • Zhenyang Li (2)
  • Cees Snoek (2)
  • Tinne Tuytelaars (1)
  1. ESAT-PSI, KU Leuven, Leuven, Belgium
  2. QUVA, University of Amsterdam, Amsterdam, Netherlands
