Real Time Violence Detection Based on Deep Spatio-Temporal Features

  • Qing Xia
  • Ping Zhang
  • JingJing Wang
  • Ming Tian
  • Chun Fei
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10996)


Typical manually selected features are insufficient to reliably detect violent actions. In this paper, we present a violence detection model based on a bi-channels convolutional neural network (CNN) and support vector machines (SVMs). The major contributions are twofold: (1) we feed the original frames and the differential images into the proposed bi-channels CNN to obtain appearance features and motion features, respectively; (2) linear SVMs are adopted to classify the features, and a label fusion approach is proposed to improve detection performance by integrating the appearance and motion information. We compare the proposed model with several state-of-the-art methods on two datasets. The results are promising, and the proposed method achieves real-time performance of 30 fps.
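As a rough illustration of the two ideas named in the abstract, the following minimal Python sketch computes differential images and fuses two classifier outputs. It rests on assumptions that the abstract does not confirm: a differential image is taken here to be the absolute pixel-wise difference between consecutive grayscale frames, and label fusion is approximated by a weighted sum of two SVM decision values. The names differential_images, fuse_labels and motion_weight are hypothetical, and the CNN feature extraction and SVM training steps are omitted.

```python
from typing import List, Sequence

# A grayscale frame as rows of pixel intensities (0-255).
Frame = List[List[int]]


def differential_images(frames: Sequence[Frame]) -> List[Frame]:
    """Absolute pixel-wise difference between each pair of consecutive frames.

    Assumption: this is one common definition of a differential image;
    the paper's exact definition is not given in the abstract.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([
            [abs(c - p) for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ])
    return diffs


def fuse_labels(appearance_score: float, motion_score: float,
                motion_weight: float = 0.5) -> int:
    """Hypothetical fusion rule: weighted sum of two SVM decision values.

    Returns 1 (violent) when the fused score is positive, otherwise 0.
    """
    fused = (1.0 - motion_weight) * appearance_score + motion_weight * motion_score
    return 1 if fused > 0 else 0


# Three tiny 2x2 frames yield two differential images.
frames = [
    [[0, 10], [20, 30]],
    [[5, 10], [20, 25]],
    [[5, 50], [0, 25]],
]
diffs = differential_images(frames)

# The motion channel outvotes a weakly negative appearance score.
label = fuse_labels(appearance_score=-0.2, motion_score=0.8)
```

In this sketch a sequence of N frames produces N - 1 differential images, so the motion channel always sees one fewer input than the appearance channel. The equal weighting in fuse_labels is an arbitrary default for illustration, not a value taken from the paper.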


Violence detection, Bi-channels convolution neural network, Deep spatio-temporal features, Label fusion



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. School of Optoelectronic Science and Engineering of UESTC, University of Electronic Science and Technology of China, Chengdu, China
