
Predicting Action Tubes

  • Gurkirt Singh
  • Suman Saha
  • Fabio Cuzzolin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11131)

Abstract

In this work, we present a method to predict an entire ‘action tube’ (a set of temporally linked bounding boxes) in a trimmed video by observing only a smaller subset of it. Predicting where an action is going to take place in the near future is essential to many computer vision-based applications, such as autonomous driving or surgical robotics, and it has to be done in real time and in an online fashion. We propose a Tube Prediction network (TPnet) which jointly predicts the past, present and future bounding boxes along with their action classification scores. At test time TPnet is applied in a (temporal) sliding-window setting, and its predictions are fed into a tube estimation framework to construct/predict video-long action tubes not only for the observed part of the video but also for the unobserved part. Additionally, the proposed action tube predictor helps in completing action tubes for unobserved segments of the video. We quantitatively demonstrate the latter ability, and the fact that TPnet improves state-of-the-art detection performance, on one of the standard action detection benchmarks, the J-HMDB-21 dataset.
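As a rough illustration of the test-time procedure described above, the sketch below runs a predictor in a temporal sliding window over the observed frames, collects the boxes regressed for past, present and future frame indices, and chains them into a tube. This is only a minimal sketch: the `tpnet(frame)` interface, the offset sets, and the toy `greedy_link` helper are assumptions made for illustration, not the authors' actual implementation.

```python
# Minimal, illustrative sketch (not the authors' released code) of a
# sliding-window tube predictor in the spirit of TPnet.
import numpy as np

def predict_video_tubes(frames, tpnet,
                        past_offsets=(-4, -2), future_offsets=(2, 4, 8)):
    """Slide over the observed frames; at each step the (assumed) network
    regresses one box per past/present/future offset plus class scores,
    and the boxes are stored under the frame index they refer to, so that
    unobserved future frames also receive predictions."""
    offsets = tuple(past_offsets) + (0,) + tuple(future_offsets)
    per_frame = {}  # frame index -> list of (box, class_scores)
    for t, frame in enumerate(frames):
        boxes, scores = tpnet(frame)  # boxes: (len(offsets), 4), scores: (C,)
        for k, dt in enumerate(offsets):
            per_frame.setdefault(t + dt, []).append((boxes[k], scores))
    return per_frame

def greedy_link(per_frame):
    """Toy linker: keep the highest-scoring box per frame and chain them
    into a single tube. Real tube construction uses overlap-aware, online
    linking of detections rather than this naive rule."""
    tube = []
    for t in sorted(per_frame):
        box, scores = max(per_frame[t], key=lambda bs: bs[1].max())
        tube.append((t, box, int(scores.argmax())))
    return tube

if __name__ == "__main__":
    # Dummy stand-in network producing random boxes/scores for 21 classes.
    rng = np.random.default_rng(0)
    dummy_tpnet = lambda frame: (rng.uniform(0, 1, (6, 4)),
                                 rng.uniform(0, 1, 21))
    frames = [None] * 10  # placeholder for the observed portion of a video
    tube = greedy_link(predict_video_tubes(frames, dummy_tpnet))
```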

Supplementary material

Supplementary material 1: 478822_1_En_11_MOESM1_ESM.pdf (437 KB)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Oxford Brookes University, Oxford, UK