WatchNet++: efficient and accurate depth-based network for detecting people attacks and intrusion

Abstract

We present an efficient and accurate people detection approach based on deep learning to detect people attacks and intrusion in video surveillance scenarios. Unlike approaches that rely on background segmentation and pre-processing techniques, which cannot distinguish people from other elements in the scene, we propose WatchNet++, a depth-based sequential network that localizes people in top-view depth images by predicting human body joints (such as the head and shoulders) and their pairwise connections (links). WatchNet++ comprises a set of prediction stages and up-sampling operations that progressively refine the predictions of joints and links, leading to more accurate localization results. To train the network with varied and abundant data, we also present a large synthetic dataset of depth images with human models, which is used to pre-train the network model. Domain adaptation to real data is then done by fine-tuning on a real dataset of depth images with people performing attacks and intrusion. The proposed approach is evaluated extensively on the detection of attacks in airlocks and on the counting of people indoors and outdoors, showing high detection scores and efficiency. The network runs at 10 FPS on CPU and 28 FPS on GPU.
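
As a rough illustration of the localization idea in the abstract, the sketch below assumes the network outputs a per-joint confidence map and that people are localized by taking local maxima above a threshold. The function `find_peaks`, the threshold value, and the toy `head_map` are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
from typing import List, Tuple


def find_peaks(conf_map: List[List[float]],
               threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (row, col) cells that reach the threshold and are not
    exceeded by any of their 8 neighbours (ties count as peaks)."""
    rows = len(conf_map)
    cols = len(conf_map[0]) if rows else 0
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = conf_map[r][c]
            if value < threshold:
                continue  # too weak to be a joint candidate
            is_peak = True
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    nr, nc = r + dr, c + dc
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and conf_map[nr][nc] > value:
                        is_peak = False
            if is_peak:
                peaks.append((r, c))
    return peaks


# Toy head-confidence map (assumed values) containing two people.
head_map = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0, 0.0],
    [0.0, 0.2, 0.1, 0.3, 0.1],
    [0.0, 0.0, 0.3, 0.8, 0.2],
    [0.0, 0.0, 0.0, 0.2, 0.4],
]

heads = find_peaks(head_map)
people_count = len(heads)
```

Under these assumptions, the number of detected head peaks gives a simple people count for a frame. A full pipeline would additionally use the predicted links to associate heads with shoulders.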

Notes

  1. http://www.blender.org

  2. http://www.makehuman.org/

  3. http://mocap.cs.cmu.edu/

  4. https://www.idiap.ch/dataset/unicity

Acknowledgements

The work was supported by Innosuisse, the Swiss innovation agency, through the UNICITY (3D scene understanding through machine learning to secure entrance zones) project.

Author information

Corresponding author

Correspondence to M. Villamizar.

About this article

Cite this article

Villamizar, M., Martínez-González, A., Canévet, O. et al. WatchNet++: efficient and accurate depth-based network for detecting people attacks and intrusion. Machine Vision and Applications 31, 41 (2020). https://doi.org/10.1007/s00138-020-01089-y

Keywords

  • Video surveillance
  • People detection
  • Convolutional network
  • Deep learning