Three-Stream Action Tubelet Detector for Spatiotemporal Action Detection in Videos

Wu, Yutang; Wang, Hanli; Li, Qinyu

doi:10.1007/978-3-030-00767-6_28

Yutang Wu^18,19,
Hanli Wang^18,19 &
Qinyu Li^18,20

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11165))

Included in the following conference series:

Pacific Rim Conference on Multimedia

2446 Accesses
1 Citations

Abstract

In recent years, human action detection in videos has gained wide attention. Instead of detection frame by frame, a model named action tubelet (ACT) detector detects human actions sequence by sequence and achieves remarkable performances on both accuracy and speed in the form of two streams. In this work, a three-stream action tubelet detector (three-stream ACT detector) is proposed which adds an extra pose stream to obtain more information about human actions and fuses three streams by weighted average compared to the two-stream architecture. The experimental results on the benchmark UCF-Sports, J-HMDB and UCF-101 datasets demonstrate that the proposed three-stream ACT detector framework is able to boost the performance of human action detection.

This work was supported in part by National Natural Science Foundation of China under Grants 61622115 and 61472281, Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005), Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing (17DZ2251600), and IBM Shared University Research Awards Program.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24673-2_3
Chapter Google Scholar
Girshick, R.: Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448 (2015)
Google Scholar
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587 (2014)
Google Scholar
Gkioxari, G., Malik, J.: Finding action tubes. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 759–768 (2015)
Google Scholar
Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.J.: Towards understanding action recognition. In: 2013 IEEE International Conference on Computer Vision (ICCV), pp. 3192–3199. IEEE (2013)
Google Scholar
Kalogeiton, V., Weinzaepfel, P., Ferrari, V., Schmid, C.: Action tubelet detector for spatio-temporal action localization. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 4405–4413 (2017)
Google Scholar
Lan, T., Wang, Y., Mori, G.: Discriminative figure-centric models for joint action localization and recognition. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 2003–2010 (2011)
Google Scholar
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Chapter Google Scholar
Oliveira, G.L., Burgard, W., Brox, T.: Efficient deep models for monocular road segmentation. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4885–4891 (2016)
Google Scholar
Peng, X., Schmid, C.: Multi-region two-stream R-CNN for action detection. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 744–759. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_45
Chapter Google Scholar
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788 (2016)
Google Scholar
Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525 (2017)
Google Scholar
Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
Ren, S., He, K., Girshick, R., Sun, J.: Fast R-CNN: Towards real-time object detection with region proposal networks. In: 2015 Advances in Neural Information Processing Systems (NIPS), pp. 91–99 (2015)
Google Scholar
Saha, S., Singh, G., Sapienza, M., Torr, P.H., Cuzzolin, F.: Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint arXiv:1608.01529 (2016)
Singh, G., Saha, S., Sapienza, M., Torr, P., Cuzzolin, F.: Online real-time multiple spatiotemporal action localisation and prediction. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3637–3646 (2017)
Google Scholar
Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
Zolfaghari, M., Oliveira, G.L., Sedaghat, N., Brox, T.: Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2923–2932 (2017)
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tongji University, Shanghai, 201804, People’s Republic of China
Yutang Wu, Hanli Wang & Qinyu Li
Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai, 200092, People’s Republic of China
Yutang Wu & Hanli Wang
Department of Computer Science, Lanzhou City University, Lanzhou, 730070, People’s Republic of China
Qinyu Li

Authors

Yutang Wu
View author publications
You can also search for this author in PubMed Google Scholar
Hanli Wang
View author publications
You can also search for this author in PubMed Google Scholar
Qinyu Li
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hanli Wang .

Editor information

Editors and Affiliations

Hefei University of Technology, Hefei, China
Richang Hong
National Chiao Tung University, Hsinchu, Taiwan
Wen-Huang Cheng
University of Tokyo, Tokyo, Japan
Toshihiko Yamasaki
Hefei University of Technology, Hefei, China
Meng Wang
City University of Hong Kong, Hong Kong, Hong Kong
Chong-Wah Ngo

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wu, Y., Wang, H., Li, Q. (2018). Three-Stream Action Tubelet Detector for Spatiotemporal Action Detection in Videos. In: Hong, R., Cheng, WH., Yamasaki, T., Wang, M., Ngo, CW. (eds) Advances in Multimedia Information Processing – PCM 2018. PCM 2018. Lecture Notes in Computer Science(), vol 11165. Springer, Cham. https://doi.org/10.1007/978-3-030-00767-6_28

Download citation

DOI: https://doi.org/10.1007/978-3-030-00767-6_28
Published: 19 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00766-9
Online ISBN: 978-3-030-00767-6
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics