A New Temporal Deconvolutional Pyramid Network for Action Detection

  • Xiangli Ji
  • Guibo Luo
  • Yuesheng Zhu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11364)

Abstract

Temporal action detection is a challenging task that aims to detect various action instances in untrimmed videos. Existing detection approaches are unable to localize the start and end times of action instances precisely. To address this issue, we propose a novel Temporal Deconvolutional Pyramid Network (TDPN), in which a Temporal Deconvolution Fusion (TDF) module in each pyramidal hierarchy is developed to construct strong semantic features at multiple temporal scales for detecting action instances of various durations. In the TDF module, the temporal resolution of high-level features is expanded by temporal deconvolution, and the expanded high-level features are then fused with low-level features to form strong semantic features. The fused semantic features at multiple temporal scales are used to predict action categories and boundary offsets simultaneously, which significantly improves detection performance. In addition, a strict label-assignment strategy is proposed for training to improve the precision of the temporal boundaries learned by the model. We evaluate our detection approach on two public datasets, THUMOS14 and MEXaction2. The experimental results demonstrate that our TDPN model achieves competitive performance on THUMOS14 and the best performance on MEXaction2 compared with other approaches.
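To make the TDF idea concrete, the following is a minimal PyTorch-style sketch of what such a module could look like: a 1-D transposed convolution doubles the temporal length of a high-level feature map, which is then fused with a low-level feature map of matching length. The channel widths, kernel sizes, and element-wise-sum fusion rule are illustrative assumptions, not the exact configuration described in the paper.

```python
import torch
import torch.nn as nn


class TDFModule(nn.Module):
    """Hypothetical sketch of a Temporal Deconvolution Fusion (TDF) module.

    Upsamples the high-level feature along the temporal axis with a
    transposed 1-D convolution and fuses it with the low-level feature
    (element-wise sum here, as an assumed fusion strategy).
    """

    def __init__(self, high_channels: int, low_channels: int, out_channels: int):
        super().__init__()
        # Doubles the temporal length of the high-level feature.
        self.deconv = nn.ConvTranspose1d(high_channels, out_channels,
                                         kernel_size=4, stride=2, padding=1)
        # Projects the low-level feature to the same channel width.
        self.lateral = nn.Conv1d(low_channels, out_channels, kernel_size=1)
        # Smooths the fused feature before category/offset prediction.
        self.fuse = nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, high_feat: torch.Tensor, low_feat: torch.Tensor) -> torch.Tensor:
        # high_feat: (batch, high_channels, T); low_feat: (batch, low_channels, 2T)
        up = self.deconv(high_feat)        # (batch, out_channels, 2T)
        lateral = self.lateral(low_feat)   # (batch, out_channels, 2T)
        return self.fuse(up + lateral)     # fused semantic feature at the finer scale


if __name__ == "__main__":
    tdf = TDFModule(high_channels=512, low_channels=256, out_channels=256)
    high = torch.randn(1, 512, 16)   # coarse temporal scale
    low = torch.randn(1, 256, 32)    # finer temporal scale
    print(tdf(high, low).shape)      # torch.Size([1, 256, 32])
```

In a pyramid, one such module would sit at each hierarchy level, so each fused output feeds both the prediction heads at that temporal scale and the next fusion step below it.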

Keywords

Action detection · Untrimmed videos · TDPN network

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Communication and Information Security Laboratory, Shenzhen Graduate School, Peking University, Shenzhen, China