Get the Whole Action Event by Action Stage Classification

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11016)

Abstract

Spatiotemporal action localization in videos is a challenging problem and an essential part of video understanding. Impressive progress has been reported in the recent literature; however, current state-of-the-art approaches have not considered the scenario of broken actions, in which an action in an untrimmed video is no longer a continuous image sequence because of occlusion, shot changes, and similar interruptions. Consequently, one action is divided into two or more segments (sub-actions), and existing methods localize each of them as an independent action. To overcome this limitation, we introduce two major developments. First, we adopt a tube-based method to localize all sub-actions and classify each of them into one of three action stages with a CNN classifier: Start, Process, and End. Second, we propose a scheme to link the sub-actions into a complete action. As a result, our system is not only capable of performing spatiotemporal action localization in an online, real-time manner, but can also filter out irrelevant frames and integrate sub-actions into a single tube, achieving better robustness than existing methods.
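
The page includes no reference implementation, so below is a minimal sketch in Python of how the proposed stage-based linking scheme could work, assuming each detected sub-action tube carries an action class, a predicted stage (Start, Process, or End) from the CNN classifier, and a frame span. The names SubActionTube, ActionEvent, and link_sub_actions, as well as the max_gap threshold, are hypothetical illustrations rather than the authors' code.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class SubActionTube:
        label: str    # action class of the tube, e.g. "LongJump"
        stage: str    # predicted stage: "Start", "Process", or "End"
        t_start: int  # first frame index covered by the tube
        t_end: int    # last frame index covered by the tube

    @dataclass
    class ActionEvent:
        label: str
        tubes: List[SubActionTube] = field(default_factory=list)

    def link_sub_actions(tubes: List[SubActionTube],
                         max_gap: int = 50) -> List[ActionEvent]:
        """Greedily link temporally ordered sub-action tubes into events.

        An event opens at a Start tube, is extended by same-class tubes
        that begin within max_gap frames of the previous tube, and is
        closed by an End tube. max_gap is an assumed hyperparameter,
        not a value taken from the paper.
        """
        events: List[ActionEvent] = []
        open_event: Optional[ActionEvent] = None

        for tube in sorted(tubes, key=lambda t: t.t_start):
            if open_event is None:
                if tube.stage == "Start":   # only a Start tube opens an event
                    open_event = ActionEvent(tube.label, [tube])
                continue                    # tubes before any Start are ignored

            same_class = tube.label == open_event.label
            close_enough = tube.t_start - open_event.tubes[-1].t_end <= max_gap

            if same_class and close_enough:
                open_event.tubes.append(tube)
                if tube.stage == "End":     # an End tube completes the event
                    events.append(open_event)
                    open_event = None
            elif tube.stage == "Start":     # a new action supersedes a stale one
                open_event = ActionEvent(tube.label, [tube])

        return events

The single greedy pass consumes each tube once in temporal order, which is consistent with the abstract's claim of online, real-time operation: linking can proceed incrementally as new tubes are detected, and tubes that never fall inside a Start-to-End span are discarded as irrelevant.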

Supported by the National Natural Science Foundation of China (Grant No. 61373104), the Natural Science Foundation of Tianjin (Grant Nos. 16JCYBJC42300 and 17JCQNJC00100), the Science and Technology Commission of Tianjin Municipality (Grant No. 15JCYBJC16100), and the Program for Innovative Research Team in University of Tianjin (No. TD13-5032).

Author information

Corresponding author

Correspondence to Guanghao Jin.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Li, W., Wang, J., Wang, S., Jin, G. (2018). Get the Whole Action Event by Action Stage Classification. In: Yoshida, K., Lee, M. (eds.) Knowledge Management and Acquisition for Intelligent Systems. PKAW 2018. Lecture Notes in Computer Science (LNAI), vol. 11016. Springer, Cham. https://doi.org/10.1007/978-3-319-97289-3_18

  • DOI: https://doi.org/10.1007/978-3-319-97289-3_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-97288-6

  • Online ISBN: 978-3-319-97289-3
