Aggregating Motion and Attention for Video Object Detection

  • Ruyi Zhang
  • Zhenjiang Miao
  • Cong Ma
  • Shanshan Hao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12047)


Video object detection plays a vital role in a wide variety of computer vision applications. To deal with challenges such as motion blur, varying viewpoints and poses, and occlusions, temporal association across frames must be established. One of the most typical ways to maintain frame association is to exploit optical flow between consecutive frames. However, using optical flow alone may lead to poor alignment across frames because of the gap between optical flow and high-level features. In this paper, we propose an Attention-Based Temporal Context module (ABTC) for more accurate frame alignment. We first extract two kinds of features for each frame using the ABTC module and a Flow-Guided Temporal Coherence module (FGTC). The features are then integrated and fed to the detection network to produce the final result. The ABTC and FGTC modules are complementary and work together to achieve higher detection quality. Experiments on the ImageNet VID dataset show that the proposed framework performs favorably against state-of-the-art methods.
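The two-branch idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names `attention_aggregate` and `fuse` are hypothetical, each frame is reduced to a single feature vector, the flow-aligned feature is taken as a given input rather than computed by warping, and the element-wise mean used for fusion is an assumption, since the abstract only says the features are "integrated".

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(reference, neighbors):
    """Weight neighbor-frame features by similarity to the reference frame.

    Uses scaled dot-product attention (an assumption; the abstract does not
    specify the attention form). `reference` is one feature vector and
    `neighbors` is a list of feature vectors of the same length.
    """
    scale = math.sqrt(len(reference))
    scores = [
        sum(r * n for r, n in zip(reference, feat)) / scale
        for feat in neighbors
    ]
    weights = softmax(scores)
    return [
        sum(w * feat[i] for w, feat in zip(weights, neighbors))
        for i in range(len(reference))
    ]


def fuse(attention_feat, flow_feat):
    """Hypothetical fusion step: element-wise mean of the two branches."""
    return [(a + f) / 2.0 for a, f in zip(attention_feat, flow_feat)]


if __name__ == "__main__":
    reference = [1.0, 0.0]
    neighbors = [[1.0, 0.0], [0.0, 1.0]]
    context = attention_aggregate(reference, neighbors)
    # flow_aligned stands in for a flow-warped neighbor feature.
    flow_aligned = [0.9, 0.1]
    print(fuse(context, flow_aligned))
```

In this toy setting, the neighbor that resembles the reference frame receives the larger attention weight, so the aggregated context leans toward it before being combined with the flow-aligned feature.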


Keywords: Video object detection · Optical flow · Self-attention · End-to-end



This work is supported by the NSFC 61672089, 61703436, 61572064, 61273274 and CELFA.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Ruyi Zhang (1), corresponding author
  • Zhenjiang Miao (1)
  • Cong Ma (1)
  • Shanshan Hao (1)
  1. Beijing Jiaotong University, Beijing, China
