Spot On: Action Localization from Pointly-Supervised Proposals

  • Pascal Mettes
  • Jan C. van Gemert
  • Cees G. M. Snoek
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9909)


We strive for spatio-temporal localization of actions in videos. The state of the art relies on action proposals at test time and selects the best one with a classifier trained on carefully annotated boxes. Annotating action boxes in video is cumbersome, tedious, and error-prone. Rather than annotating boxes, we propose to annotate actions in video with points on a sparse subset of frames only. We introduce an overlap measure between action proposals and points and incorporate it into the objective of a non-convex Multiple Instance Learning optimization. Experimental evaluation on the UCF Sports and UCF 101 datasets shows that (i) spatio-temporal proposals can be used to train classifiers while retaining the localization performance, (ii) point annotations yield results comparable to box annotations while being significantly faster to annotate, and (iii) with a minimum amount of supervision our approach is competitive with the state of the art. Finally, we introduce spatio-temporal action annotations on the train and test videos of Hollywood2, resulting in Hollywood2Tubes, available at
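To make the idea of an overlap between action proposals and point annotations concrete, the following is a minimal sketch, not the measure defined in the paper. It assumes a proposal is a mapping from frame index to a box `(x_min, y_min, x_max, y_max)` and that point annotations are a sparse mapping from frame index to `(x, y)`. The names `point_overlap` and `best_proposal` are hypothetical, and the score is simply the fraction of annotated points that fall inside the proposal on the same frame.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]              # (x, y)


def point_overlap(tube: Dict[int, Box], points: Dict[int, Point]) -> float:
    """Illustrative score: fraction of annotated points inside the tube's box.

    This is an assumed stand-in for a proposal-to-point overlap measure,
    not the formulation used by the authors.
    """
    if not points:
        return 0.0
    hits = 0
    for frame, (x, y) in points.items():
        box = tube.get(frame)
        if box is None:
            continue  # the proposal does not cover this annotated frame
        x_min, y_min, x_max, y_max = box
        if x_min <= x <= x_max and y_min <= y <= y_max:
            hits += 1
    return hits / len(points)


def best_proposal(proposals: List[Dict[int, Box]], points: Dict[int, Point]) -> int:
    """Index of the proposal with the highest point overlap (first one on ties)."""
    scores = [point_overlap(tube, points) for tube in proposals]
    return max(range(len(scores)), key=scores.__getitem__)


# Usage: two proposals, points annotated on frames 0 and 5 only.
tube_a = {0: (10, 10, 50, 50), 5: (12, 10, 52, 50)}
tube_b = {0: (100, 100, 150, 150), 5: (100, 100, 150, 150)}
points = {0: (30, 30), 5: (40, 20)}
selected = best_proposal([tube_b, tube_a], points)
```

In a Multiple Instance Learning setting, a score of this kind could guide which proposal in each training video is treated as the positive instance. Note that under this simplified score a box covering the whole frame trivially contains every point, so a practical measure would likely also account for box size.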


Keywords: Action localization · Action proposals



This research is supported by the STW STORY project.

Supplementary material

Supplementary material 1 (pdf 870 KB)



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Pascal Mettes (1)
  • Jan C. van Gemert (2)
  • Cees G. M. Snoek (1)
  1. University of Amsterdam, Amsterdam, Netherlands
  2. Delft University of Technology, Delft, Netherlands
