STMP: Spatial Temporal Multi-level Proposal Network for Activity Detection

  • Guang Chen
  • Yuexian Zou
  • Can Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11295)

Abstract

We propose STMP, a network for activity detection in unconstrained scenes. STMP is a deep learning method that encodes effective multi-level spatiotemporal information and simultaneously performs accurate temporal activity localization and recognition. To encode meaningful spatial information and generate high-quality activity proposals at a fixed temporal scale, we introduce a spatial feature hierarchy. To handle activities of various durations, we further propose a temporal feature hierarchy that represents activities at different temporal scales. The core component of STMP is STFH, a unified network that implements the Spatial and Temporal Feature Hierarchy. On each level of STFH, an activity proposal detector is trained to detect activities at that level's inherent temporal scale, which allows STMP to make full use of multi-level spatiotemporal information. Most importantly, its pure and unified framework makes STMP a simple, fast, end-to-end trainable model. We evaluate STMP on two challenging activity detection datasets: it achieves state-of-the-art results on THUMOS’14 (about 9.3% absolute improvement over the previous state-of-the-art approach R-C3D [1]) and comparable results on ActivityNet v1.3.
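To make the multi-level design above concrete, the sketch below is a minimal, hypothetical PyTorch reading of the idea, not the authors' implementation: each level of a temporal feature hierarchy halves the temporal resolution with a stride-2 convolution, so deeper levels cover longer activities, and a small proposal head at every level predicts activity scores and (center, length) offsets for segments at that level's inherent scale. All module names, channel widths, and anchor counts here are illustrative assumptions.

```python
# Minimal sketch of a multi-level temporal proposal network (illustrative,
# not the STMP architecture): stride-2 temporal convolutions build a
# feature hierarchy, and one proposal head per level detects activities
# at that level's temporal scale.
import torch
import torch.nn as nn


class TemporalLevel(nn.Module):
    """One hierarchy level: a stride-2 temporal conv halves resolution."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              stride=2, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                 # x: (batch, channels, time)
        return self.relu(self.conv(x))


class ProposalHead(nn.Module):
    """Per-level detector: activity score + (center, length) offsets."""
    def __init__(self, channels, num_anchors=3):
        super().__init__()
        self.cls = nn.Conv1d(channels, num_anchors, kernel_size=3, padding=1)
        self.reg = nn.Conv1d(channels, num_anchors * 2, kernel_size=3, padding=1)

    def forward(self, x):
        return self.cls(x), self.reg(x)


class MultiLevelProposalNet(nn.Module):
    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        self.levels = nn.ModuleList(
            [TemporalLevel(channels) for _ in range(num_levels)])
        self.heads = nn.ModuleList(
            [ProposalHead(channels) for _ in range(num_levels)])

    def forward(self, feats):              # feats: clip features, (B, C, T)
        outputs = []
        x = feats
        for level, head in zip(self.levels, self.heads):
            x = level(x)                   # temporal scale doubles per level
            outputs.append(head(x))
        return outputs                     # one (scores, offsets) per scale


if __name__ == "__main__":
    net = MultiLevelProposalNet()
    clip_feats = torch.randn(2, 256, 64)   # e.g. pooled 3D-CNN features
    for i, (scores, offsets) in enumerate(net(clip_feats)):
        print(f"level {i}: scores {tuple(scores.shape)}, "
              f"offsets {tuple(offsets.shape)}")
```

The design choice this sketch reflects is that each resolution level owns its own detector, so short activities are matched at fine levels and long activities at coarse ones, rather than forcing a single detector to cover all durations.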

Keywords

Activity detection · Spatiotemporal feature hierarchy · Multi-level proposal detector

Acknowledgement

This paper was partially supported by the Shenzhen Science & Technology Fundamental Research Program (No: JCYJ20160330095814461) & Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467). Special acknowledgements are given to Aoto-PKUSZ Joint Research Center of Artificial Intelligence on Scene Cognition & Technology Innovation for its support.

References

  1. Xu, H., Das, A., Saenko, K.: R-C3D: region convolutional 3D network for temporal activity detection. In: The IEEE International Conference on Computer Vision (ICCV), p. 8 (2017)
  2. Girshick, R.: Fast R-CNN. arXiv preprint arXiv:1504.08083 (2015)
  3. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
  4. Lin, T., Zhao, X., Shou, Z.: Single shot temporal action detection. In: Proceedings of the 2017 ACM on Multimedia Conference, pp. 988–996. ACM (2017)
  5. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497. IEEE (2015)
  6. Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: The IEEE International Conference on Computer Vision (ICCV) (2017)
  7. Roerdink, J.B., Meijster, A.: The watershed transform: definitions, algorithms and parallelization strategies. Fundamenta Informaticae 41, 187–228 (2000)
  8. Buch, S., Escorcia, V., Shen, C., Ghanem, B., Niebles, J.C.: SST: single-stream temporal action proposals. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6373–6382. IEEE (2017)
  9. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: 2013 IEEE International Conference on Computer Vision (ICCV), pp. 3551–3558. IEEE (2013)
  10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
  12. Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal multiplier networks for video action recognition. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7445–7454. IEEE (2017)
  13. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
  14. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3D residual networks. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5534–5542. IEEE (2017)
  15. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
  16. Jiang, Y., et al.: THUMOS challenge: action recognition with a large number of classes (2014)
  17. Caba Heilbron, F., Escorcia, V., Ghanem, B., Carlos Niebles, J.: ActivityNet: a large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970 (2015)
  18. Ma, S., Sigal, L., Sclaroff, S.: Learning activity progression in LSTMs for activity detection and early detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1942–1950 (2016)
  19. Singh, B., Marks, T.K., Jones, M., Tuzel, O., Shao, M.: A multi-stream bi-directional recurrent neural network for fine-grained action detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1961–1970. IEEE (2016)
  20. Yeung, S., Russakovsky, O., Mori, G., Fei-Fei, L.: End-to-end learning of action detection from frame glimpses in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2678–2687 (2016)
  21. Yuan, J., Ni, B., Yang, X., Kassim, A.A.: Temporal action localization with pyramid of score distribution features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3093–3102 (2016)
  22. Shou, Z., Wang, D., Chang, S.-F.: Temporal action localization in untrimmed videos via multi-stage CNNs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1049–1058 (2016)
  23. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
  24. Oneata, D., Verbeek, J., Schmid, C.: The LEAR submission at Thumos 2014 (2014)
  25. Richard, A., Gall, J.: Temporal action detection using a statistical language model. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3131–3140 (2016)
  26. Shou, Z., Chan, J., Zareian, A., Miyazawa, K., Chang, S.-F.: CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1417–1426. IEEE (2017)
  27. Dai, X., Singh, B., Zhang, G., Davis, L.S., Chen, Y.Q.: Temporal context network for activity localization in videos. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5727–5736. IEEE (2017)
  28. Montes, A., Salvador, A., Pascual, S., Giro-i-Nieto, X.: Temporal activity detection in untrimmed videos with recurrent neural networks. arXiv preprint arXiv:1608.08128 (2016)
  29. Wang, R., Tao, D.: UTS at ActivityNet 2016. ActivityNet Large Scale Activity Recognition Challenge 2016, 8 (2016)
  30. Escorcia, V., Caba Heilbron, F., Niebles, J.C., Ghanem, B.: DAPs: deep action proposals for action understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 768–784. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_47

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. ADSPLAB, School of ECE, Peking University, Shenzhen, China
  2. Peng Cheng Laboratory, Shenzhen, China
