RN-VID: A Feature Fusion Architecture for Video Object Detection

  • Hughes Perreault
  • Maguelonne Heritier
  • Pierre Gravel
  • Guillaume-Alexandre Bilodeau
  • Nicolas Saunier
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12131)

Abstract

Consecutive frames in a video are highly redundant; therefore, running single-frame detectors on every frame without reusing any information is quite wasteful. It is with this idea in mind that we propose RN-VID (standing for RetinaNet-VIDeo), a novel approach to video object detection. Our contributions are twofold. First, we propose a new architecture that uses information from nearby frames to enhance feature maps. Second, we propose a novel module to merge feature maps of the same dimensions using a re-ordering of channels and \(1 \times 1\) convolutions. We then demonstrate that RN-VID achieves better mean average precision (mAP) than corresponding single-frame detectors with little additional cost during inference.
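
To make the channel re-ordering concrete, here is a minimal TensorFlow/Keras sketch of such a fusion module. The function name, the interleaving order, and the output width are illustrative assumptions based on the description above, not the paper's exact specification.

import tensorflow as tf
from tensorflow.keras import layers

def fusion_module(feature_maps, out_channels):
    """Fuse same-shape feature maps from consecutive frames.

    Hypothetical sketch: channels are interleaved so that channel k
    of every frame sits next to its counterparts, then a 1x1
    convolution mixes them. Names and ordering are assumptions.
    """
    n = len(feature_maps)
    _, h, w, c = feature_maps[0].shape  # static spatial dims assumed
    # Stack along a new trailing axis: (batch, H, W, C, n_frames).
    x = tf.stack(feature_maps, axis=-1)
    # Flatten (C, n_frames) so corresponding channels become adjacent:
    # [c0_f0, c0_f1, ..., c1_f0, c1_f1, ...].
    x = tf.reshape(x, (-1, h, w, c * n))
    # A 1x1 convolution learns how to merge channels across frames.
    return layers.Conv2D(out_channels, kernel_size=1, activation="relu")(x)

# Toy usage: fuse feature maps from three consecutive frames.
frames = [tf.random.normal((2, 16, 16, 8)) for _ in range(3)]
fused = fusion_module(frames, out_channels=8)
print(fused.shape)  # (2, 16, 16, 8)

Interleaving, rather than plain concatenation, places channel k of every frame side by side, so each \(1 \times 1\) filter merges corresponding channels across frames.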

Keywords

Video object detection · Feature fusion · Road users · Traffic scenes

Acknowledgments

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [RDCPJ 508883 - 17], and the support of Genetec. The authors would like to thank Paule Brodeur for insightful discussions.

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Hughes Perreault (1)
  • Maguelonne Heritier (2)
  • Pierre Gravel (2)
  • Guillaume-Alexandre Bilodeau (1)
  • Nicolas Saunier (1)

  1. Polytechnique Montreal, Montreal, Canada
  2. Genetec, Montreal, Canada
