Abstract
Spatiotemporal action localization in untrimmed videos is a challenging problem and an essential part of video understanding. Recent literature reports impressive progress on action localization; however, current state-of-the-art approaches do not consider broken actions, in which an action in an untrimmed video is no longer a continuous image sequence because of occlusion, shot changes, etc. The action is thus divided into two or more segments (sub-actions), and existing methods localize each of them as an independent action. To overcome this limitation, we introduce two major developments. First, we adopt a tube-based method to localize all sub-actions and classify each of them into one of three action stages with a CNN classifier: Start, Process, and End. Second, we propose a scheme that links the sub-actions into a complete action. As a result, our system not only performs spatiotemporal action localization in an online, real-time manner, but also filters out irrelevant frames and integrates sub-actions into a single tube, yielding better robustness than existing methods.
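The linking scheme described above can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the `SubTube` structure, the `max_gap` tolerance, and the merging rules are all assumptions introduced here. It assumes each localized sub-action carries a predicted stage label, and assembles a "Start" tube, any following "Process" tubes, and a closing "End" tube into one complete action span.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SubTube:
    """A localized sub-action: a temporal span plus a predicted stage label.

    (Hypothetical structure for illustration; the paper's tubes also carry
    spatial boxes and class scores, omitted here for brevity.)
    """
    t_start: int   # first frame index of the tube
    t_end: int     # last frame index of the tube
    stage: str     # one of "Start", "Process", "End" (from the stage classifier)

def link_sub_actions(tubes: List[SubTube], max_gap: int = 50) -> List[Tuple[int, int]]:
    """Merge stage-labelled sub-action tubes into complete action spans.

    A "Start" tube opens a new action; subsequent tubes within `max_gap`
    frames extend it; an "End" tube closes it.  Tubes are assumed to belong
    to the same action class and the same actor track.
    """
    actions = []
    current = None  # (first_frame, last_frame) of the action being assembled
    for tube in sorted(tubes, key=lambda t: t.t_start):
        if tube.stage == "Start":
            if current is not None:            # unterminated action: emit as-is
                actions.append(current)
            current = (tube.t_start, tube.t_end)
        elif current is not None and tube.t_start - current[1] <= max_gap:
            current = (current[0], tube.t_end)  # extend with Process / End
            if tube.stage == "End":
                actions.append(current)
                current = None
    if current is not None:                     # video ended mid-action
        actions.append(current)
    return actions
```

For example, two sub-actions separated by a shot change (frames 0–90 and 100–130) would be merged into one action spanning frames 0–130, rather than being reported as two independent detections.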
Supported by the National Natural Science Foundation of China (Grant No. 61373104), the Natural Science Foundation of Tianjin (Grant Nos. 16JCYBJC42300 and 17JCQNJC00100), the Science and Technology Commission of Tianjin Municipality (Grant No. 15JCYBJC16100), and the Program for Innovative Research Team in University of Tianjin (No. TD13-5032).
© 2018 Springer Nature Switzerland AG
Li, W., Wang, J., Wang, S., Jin, G. (2018). Get the Whole Action Event by Action Stage Classification. In: Yoshida, K., Lee, M. (eds) Knowledge Management and Acquisition for Intelligent Systems. PKAW 2018. Lecture Notes in Computer Science(), vol 11016. Springer, Cham. https://doi.org/10.1007/978-3-319-97289-3_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97288-6
Online ISBN: 978-3-319-97289-3