Get the Whole Action Event by Action Stage Classification

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11016)

Abstract

Spatiotemporal action localization in videos is a challenging problem and an essential part of video understanding. Impressive progress has been reported in the recent literature; however, current state-of-the-art approaches have not considered the scenario of broken actions, in which an action in an untrimmed video is no longer a continuous image sequence because of occlusion, shot changes, and similar interruptions. Consequently, one action is divided into two or more segments (sub-actions), and existing methods localize each of them as an independent action. To overcome this limitation, we introduce two major developments. First, we adopt a tube-based method to localize all sub-actions and classify each of them into one of three action stages with a CNN classifier: Start, Process, and End. Second, we propose a scheme to link the sub-actions into a complete action. As a result, our system is not only capable of performing spatiotemporal action localization in an online, real-time manner, but can also filter out irrelevant frames and integrate sub-actions into a single tube, achieving better robustness than existing methods.
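
The page includes no reference implementation, so below is a minimal sketch in Python of how the proposed stage-based linking scheme could work, assuming each detected sub-action tube carries an action class, a predicted stage (Start, Process, or End) from the CNN classifier, and a frame span. The names SubActionTube, ActionEvent, and link_sub_actions, as well as the max_gap threshold, are hypothetical illustrations rather than the authors' code.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class SubActionTube:
        label: str    # action class of the tube, e.g. "LongJump"
        stage: str    # predicted stage: "Start", "Process", or "End"
        t_start: int  # first frame index covered by the tube
        t_end: int    # last frame index covered by the tube

    @dataclass
    class ActionEvent:
        label: str
        tubes: List[SubActionTube] = field(default_factory=list)

    def link_sub_actions(tubes: List[SubActionTube],
                         max_gap: int = 50) -> List[ActionEvent]:
        """Greedily link temporally ordered sub-action tubes into events.

        An event opens at a Start tube, is extended by same-class tubes
        that begin within max_gap frames of the previous tube, and is
        closed by an End tube. max_gap is an assumed hyperparameter,
        not a value taken from the paper.
        """
        events: List[ActionEvent] = []
        open_event: Optional[ActionEvent] = None

        for tube in sorted(tubes, key=lambda t: t.t_start):
            if open_event is None:
                if tube.stage == "Start":   # only a Start tube opens an event
                    open_event = ActionEvent(tube.label, [tube])
                continue                    # tubes before any Start are ignored

            same_class = tube.label == open_event.label
            close_enough = tube.t_start - open_event.tubes[-1].t_end <= max_gap

            if same_class and close_enough:
                open_event.tubes.append(tube)
                if tube.stage == "End":     # an End tube completes the event
                    events.append(open_event)
                    open_event = None
            elif tube.stage == "Start":     # a new action supersedes a stale one
                open_event = ActionEvent(tube.label, [tube])

        return events

The single greedy pass consumes each tube once in temporal order, which is consistent with the abstract's claim of online, real-time operation: linking can proceed incrementally as new tubes are detected, and tubes that never fall inside a Start-to-End span are discarded as irrelevant.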

Supported by the National Natural Science Foundation of China (Grant No. 61373104), the Natural Science Foundation of Tianjin (Grant Nos. 16JCYBJC42300 and 17JCQNJC00100), the Science and Technology Commission of Tianjin Municipality (Grant No. 15JCYBJC16100), and the Program for Innovative Research Team in University of Tianjin (No. TD13-5032).

Author information

Corresponding author

Correspondence to Guanghao Jin.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Li, W., Wang, J., Wang, S., Jin, G. (2018). Get the Whole Action Event by Action Stage Classification. In: Yoshida, K., Lee, M. (eds.) Knowledge Management and Acquisition for Intelligent Systems. PKAW 2018. Lecture Notes in Computer Science (LNAI), vol. 11016. Springer, Cham. https://doi.org/10.1007/978-3-319-97289-3_18

  • DOI: https://doi.org/10.1007/978-3-319-97289-3_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-97288-6

  • Online ISBN: 978-3-319-97289-3
