\(\text{C}^{3}\text{Net}\): end-to-end deep learning for efficient real-time visual active camera control


The need for automated real-time visual systems in applications such as smart-camera surveillance, smart environments, and drones calls for improved methods of visual active monitoring and control. Traditionally, the active monitoring task has been handled by a pipeline of modules such as detection, filtering, and control. However, such pipelines are difficult to jointly optimize, and tuning their many parameters for real-time processing on resource-constrained systems is hard. In this paper, a deep Convolutional Camera Controller Neural Network is proposed that maps visual information directly to camera movement, providing an efficient solution to the active vision problem. It is trained end-to-end, without bounding-box annotations, to control a camera and follow multiple targets from raw pixel values. Evaluation in both a simulation framework and a real experimental setup indicates that the proposed solution is robust to varying conditions and achieves better monitoring performance than traditional approaches, both in the number of targets monitored and in effective monitoring time. The proposed approach is also computationally less demanding: it runs at over 10 FPS (\(\sim 4\times\) speedup) on an embedded smart camera, providing a practical and affordable solution to real-time active monitoring.
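The idea of replacing a detect-filter-control pipeline with a single network that maps raw pixels to a camera command can be sketched as follows. This is a minimal, illustrative NumPy toy, not the paper's architecture: the layer sizes, the single conv-plus-dense structure, and the nine discrete pan/tilt actions are all assumptions made for the example.

```python
import numpy as np

# Hypothetical discrete pan/tilt command set for illustration only.
ACTIONS = ["stay", "left", "right", "up", "down",
           "up-left", "up-right", "down-left", "down-right"]

rng = np.random.default_rng(0)

def conv2d(x, k):
    """Valid 2-D cross-correlation of a single-channel image with kernel k."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

class TinyCameraController:
    """Toy stand-in for an end-to-end convolutional camera controller:
    one forward pass from frame to camera command, no detector in between."""
    def __init__(self, size=32):
        self.kernel = rng.standard_normal((3, 3)) * 0.1
        feat = (size - 2) // 4  # spatial size after 3x3 conv + 4x4 avg pooling
        self.w = rng.standard_normal((feat * feat, len(ACTIONS))) * 0.1

    def __call__(self, frame):
        f = np.maximum(conv2d(frame, self.kernel), 0.0)          # conv + ReLU
        f = f[: (f.shape[0] // 4) * 4, : (f.shape[1] // 4) * 4]  # crop to 4x4 tiles
        f = f.reshape(f.shape[0] // 4, 4, f.shape[1] // 4, 4).mean(axis=(1, 3))
        logits = f.ravel() @ self.w                              # dense head
        return ACTIONS[int(np.argmax(logits))]                   # pan/tilt command

controller = TinyCameraController(size=32)
frame = rng.random((32, 32))   # stand-in grayscale frame in [0, 1]
command = controller(frame)
print(command)
```

In a trained system, the dense head's output would be learned from monitoring rewards or demonstrations rather than random weights, and the command would drive the pan-tilt unit directly each frame, which is what keeps the per-frame cost low enough for an embedded smart camera.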







This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 739551 (KIOS CoE) and by the Government of the Republic of Cyprus through the Directorate General for European Programmes, Coordination and Development.

Author information



Corresponding author

Correspondence to Christos Kyrkou.


Supplementary information

Below is the link to the electronic supplementary material.

Supplementary file1 (TXT 2 KB)

Supplementary file2 (MP4 459 KB)

Supplementary file3 (AVI 571 KB)

Supplementary file4 (AVI 1420 KB)

Supplementary file5 (AVI 1147 KB)

Supplementary file6 (AVI 340 KB)

Supplementary file7 (AVI 2892 KB)

Supplementary file8 (AVI 11225 KB)

Supplementary file9 (AVI 1647 KB)

Supplementary file10 (AVI 3247 KB)

Supplementary file11 (AVI 3246 KB)

Supplementary file12 (AVI 1675 KB)

Supplementary file13 (AVI 1459 KB)


About this article


Cite this article

Kyrkou, C. \(\text{C}^{3}\text{Net}\): end-to-end deep learning for efficient real-time visual active camera control. J Real-Time Image Proc (2021). https://doi.org/10.1007/s11554-021-01077-z



Keywords

  • Real-time active vision
  • Smart camera
  • Deep learning
  • End-to-end learning