MMA: Motion Memory Attention Network for Video Object Detection

  • Huai Hu
  • Wenzhong Wang
  • Aihua Zheng
  • Bin Luo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11902)

Abstract

Modern object detection frameworks such as Faster R-CNN achieve good performance on static images, benefiting from powerful feature representations. However, it remains challenging to detect tiny, vague, and deformable objects in videos. In this paper, we propose a Motion Memory Attention (MMA) network to tackle this issue by exploiting motion and temporal information. Specifically, our network contains two main parts: the dual stream and the memory attention module. The dual stream, composed of an appearance stream and a motion stream, is designed to improve the detection of tiny objects. Our motion stream can be embedded into any video object detection framework. In addition, we introduce the memory attention module to handle vague and deformable objects by utilizing temporal information and discriminative features. Our experiments demonstrate that detection performance improves significantly when the proposed algorithm is integrated with Faster R-CNN and YOLO\(_{v2}\).
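The abstract does not give the exact equations of the two components, but the described pipeline — fusing an appearance stream with a motion stream, then aggregating per-frame features through attention against a running memory — can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the sum-based fusion, the softmax similarity weighting, and the scalar sigmoid gate for the memory update are all hypothetical choices made for clarity.

```python
import numpy as np

def dual_stream_fuse(appearance, motion):
    """Hypothetical dual-stream fusion: element-wise sum of the
    per-frame appearance and motion feature vectors."""
    return appearance + motion

def memory_attention(frame_feats, memory, temperature=1.0):
    """Sketch of one memory-attention step: softmax attention over
    frame-to-memory similarity, then a gated memory update.

    frame_feats: (T, D) fused features for T frames
    memory:      (D,)   running temporal memory vector
    """
    scores = frame_feats @ memory / temperature       # (T,) similarities
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over frames
    aggregated = weights @ frame_feats                # (D,) attended feature
    # scalar sigmoid gate controlling how much new evidence enters memory
    gate = 1.0 / (1.0 + np.exp(-(memory @ aggregated)))
    new_memory = gate * aggregated + (1.0 - gate) * memory
    return new_memory, weights

rng = np.random.default_rng(0)
appearance = rng.normal(size=(4, 8))   # 4 frames, 8-dim features
motion = rng.normal(size=(4, 8))
memory = np.zeros(8)                   # empty memory at the first step

fused = dual_stream_fuse(appearance, motion)
memory, weights = memory_attention(fused, memory)
```

With an all-zero initial memory, every frame scores identically, so the first attention step weights all frames uniformly; subsequent steps sharpen toward frames that agree with the accumulated memory.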

Keywords

Video object detection · Dual stream · Memory attention module

Notes

Acknowledgements

This work was supported by the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) (201900046) and the National Natural Science Foundation of China (61472002).

References

  1. Chen, L.C., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: scale-aware semantic image segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  2. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
  3. Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: Advances in Neural Information Processing Systems (2016)
  4. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
  5. Han, W., et al.: Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465 (2016)
  6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  7. Hochreiter, S.: The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertainty Fuzziness Knowl. Based Syst. 6(02), 107–116 (1998)
  8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
  9. Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  10. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  11. Kang, K., et al.: T-CNN: tubelets with convolutional neural networks for object detection from videos. IEEE Trans. Circuits Syst. Video Technol. 28(10), 2896–2907 (2018)
  12. Lee, B., Erdenee, E., Jin, S., Nam, M.Y., Jung, Y.G., Rhee, P.K.: Multi-class multi-object tracking using changing point detection. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 68–83. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48881-3_6
  13. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
  14. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: International Conference on Machine Learning (2013)
  15. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  16. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (2015)
  17. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
  18. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017)
  19. Wang, L., Ouyang, W., Wang, X.: Visual tracking with fully convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision (2015)
  20. Wen, L., et al.: UA-DETRAC: a new benchmark and protocol for multi-object detection and tracking. arXiv preprint arXiv:1511.04136 (2015)
  21. Zhu, X., Dai, J., Yuan, L., Wei, Y.: Towards high performance video object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  22. Zhu, X., Wang, Y., Dai, J., Yuan, L., Wei, Y.: Flow-guided feature aggregation for video object detection. In: Proceedings of the IEEE International Conference on Computer Vision (2017)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Anhui University, Hefei, China