Multimedia Tools and Applications

Volume 77, Issue 3, pp 3303–3316

Action detection based on tracklets with the two-stream CNN

  • Minwen Zhang
  • Chenqiang Gao
  • Qiang Li
  • Lan Wang
  • Jiayao Zhang

Abstract

Different from action recognition, which only needs to assign correct labels to video clips, action detection aims to recognize and localize actions in an unknown video. While action recognition has made good progress, action detection remains a challenging task. Inspired by the success of object detection and action recognition based on the powerful Convolutional Neural Network (CNN), in this paper a novel action detection method is proposed by embedding multiple object tracking into the action detection process. Firstly, we fine-tune the off-the-shelf Faster R-CNN model to detect people in frames. Then, a simple tracking-by-detection algorithm is adopted to obtain tracklets that keep temporal consistency. After that, we apply a temporal multi-scale sliding-window strategy to each tracklet to generate action proposals. Finally, each action proposal is fed into a fully connected neural network to complete the classification task. Here, features of the action proposal are obtained by the two-stream CNN. Experimental results show that our method outperforms the state-of-the-art methods on the J-HMDB and UCF Sports action detection datasets.
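The proposal-generation step described above, sliding temporal windows of several lengths along each tracklet, can be sketched as follows. This is an illustrative sketch only: the scale and stride values are assumptions for demonstration, not the settings reported in the paper.

```python
def sliding_window_proposals(tracklet_len, scales=(16, 32, 64), stride=8):
    """Generate (start, end) frame spans over a tracklet at multiple
    temporal scales. Each span is a candidate action proposal whose
    features would then be extracted by the two-stream CNN.

    Note: the scales and stride here are illustrative assumptions.
    """
    proposals = []
    for length in scales:
        # Skip scales longer than the tracklet itself.
        if length > tracklet_len:
            continue
        # Slide a window of this length along the tracklet.
        for start in range(0, tracklet_len - length + 1, stride):
            proposals.append((start, start + length))
    return proposals
```

For a 40-frame tracklet this yields four 16-frame windows and two 32-frame windows; the 64-frame scale is skipped because it exceeds the tracklet length.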

Keywords

Action detection · Action classification · Object tracking

Notes

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No.61571071), Wenfeng innovation and start-up project of Chongqing University of Posts and Telecommunications (No. WF201404), the Research Innovation Program for Postgraduate of Chongqing (No. CYS17222). The authors also thank NVIDIA corporation for the donation of GeForce GTX TITAN X GPU.


Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Minwen Zhang (1)
  • Chenqiang Gao (1)
  • Qiang Li (1)
  • Lan Wang (1)
  • Jiayao Zhang (1)
  1. Chongqing Key Laboratory of Signal and Information Processing, Chongqing University of Posts and Telecommunications, Chongqing, China
