Weakly Supervised Temporal Action Detection with Shot-Based Temporal Pooling Network

  • Haisheng Su
  • Xu Zhao
  • Tianwei Lin
  • Haiping Fei
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11304)

Abstract

Weakly supervised temporal action detection in untrimmed videos is an important yet challenging task, where only video-level class labels are available during training for temporally locating actions in the videos. In this paper, we propose a novel architecture for this task. Specifically, we put forward an effective shot-based sampling method that generates a simplified yet representative feature sequence for action detection, in contrast to uniform sampling, which retains many irrelevant frames. Furthermore, to distinguish the action instances present in a video, we design a multi-stage Temporal Pooling Network (TPN) that predicts video categories and localizes class-specific action instances, respectively. Experiments conducted on the THUMOS14 dataset confirm that our method outperforms other state-of-the-art weakly supervised approaches.
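The shot-based sampling idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frame-difference boundary criterion, the L2 threshold, and mean pooling per shot are all assumptions chosen for clarity.

```python
import numpy as np

def shot_based_sampling(features: np.ndarray, diff_threshold: float = 0.5) -> np.ndarray:
    """Split a frame-level feature sequence into shots wherever the
    feature change between consecutive frames is large, then keep one
    representative (mean-pooled) feature per shot."""
    # L2 distance between consecutive frame-level features.
    diffs = np.linalg.norm(np.diff(features, axis=0), axis=1)
    # Declare a shot boundary where the change exceeds the threshold.
    boundaries = np.where(diffs > diff_threshold)[0] + 1
    shots = np.split(features, boundaries)
    # One representative feature per shot -> a much shorter sequence.
    return np.stack([shot.mean(axis=0) for shot in shots])
```

Compared with uniform sampling, which draws frames at a fixed stride regardless of content, grouping by visual change yields a shorter sequence in which each element corresponds to a coherent segment of the video.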

Keywords

Temporal action detection · Weak supervision · Shot-based sampling · Temporal pooling network · Class-specific

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Haisheng Su (1)
  • Xu Zhao (1)
  • Tianwei Lin (1)
  • Haiping Fei (2)

  1. Department of Automation, Shanghai Jiao Tong University, Shanghai, China
  2. Industrial Internet Innovation Center (Shanghai) Co., Ltd., Shanghai, China