Advertisement

Convolutional LSTM Based Video Object Detection

  • Xiao Wang
  • Xiaohua XieEmail author
  • Jianhuang Lai
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11257)

Abstract

The state-of-the-art performance for object detection has been significantly improved over the past two years. Despite the effectiveness on still images, something stands in the way of transferring the powerful detection networks to videos object detection. In this work, we present a fast and accurate framework for video object detection that incorporates temporal and contextual information using convolutional LSTM [27]. Moreover, an Encoder-Decoder module is made up based on the convolutional LSTM to predict the feature map. It is an end-to-end learning framework and is general and flexible when combining with still-image detection networks. It achieves significant improvement on both speed and accuracy. Our method significantly improves upon strong single-frame baselines in ImageNet VID [21], especially for more challenging moving objects at high speed.

Keywords

Video object detection Convolutional LSTM Encoder-Decoder module 

References

  1. 1.
    Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. CoRR, abs/1605.06409 (2016)Google Scholar
  2. 2.
    Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766 (2015)Google Scholar
  3. 3.
    Girshick, R.: Fast R-CNN. arXiv preprint arXiv:1504.08083 (2015)
  4. 4.
    Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)Google Scholar
  5. 5.
    Han, W., et al.: Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465 (2016)
  6. 6.
    He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision, ICCV, pp. 2980–2988. IEEE (2017)Google Scholar
  7. 7.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)Google Scholar
  8. 8.
    Johnson, J., Karpathy, A., Fei-Fei, L.: DenseCap: fully convolutional localization networks for dense captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4565–4574 (2016)Google Scholar
  9. 9.
    Kang, K., et al.: T-CNN: tubelets with convolutional neural networks for object detection from videos. IEEE Trans. Circ. Syst. Video Technol. (2017)Google Scholar
  10. 10.
    Kang, K., Ouyang, W., Li, H., Wang, X.: Object detection from video tubelets with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 817–825 (2016)Google Scholar
  11. 11.
    Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. arXiv preprint arXiv:1708.02002 (2017)
  12. 12.
    Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-46448-0_2CrossRefGoogle Scholar
  13. 13.
    Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)Google Scholar
  14. 14.
    Luo, Z., Peng, B., Huang, D.-A., Alahi, A., Fei-Fei, L.: Unsupervised learning of long-term motion dynamics for videos. arXiv preprint arXiv:1701.01821, 2 (2017)
  15. 15.
    Milan, A., Rezatofighi, S.H., Dick, A.R., Reid, I.D., Schindler, K.: Online multi-target tracking using recurrent neural networks. In: AAAI, pp. 4225–4232 (2017)Google Scholar
  16. 16.
    Ng, J.Y.-H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: deep networks for video classification. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 4694–4702. IEEE (2015)Google Scholar
  17. 17.
    Ouyang, W., et al.: DeepID-Net: deformable deep convolutional neural networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2403–2412 (2015)Google Scholar
  18. 18.
    Real, E., Shlens, J., Mazzocchi, S., Pan, X., Vanhoucke, V.: YouTube-BoundingBoxes: a large high-precision human-annotated data set for object detection in video. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 7464–7473. IEEE (2017)Google Scholar
  19. 19.
    Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)Google Scholar
  20. 20.
    Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497 (2015)Google Scholar
  21. 21.
    Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)MathSciNetCrossRefGoogle Scholar
  22. 22.
    Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769 (2016)Google Scholar
  23. 23.
    Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using LSTMs. In: International Conference on Machine Learning, pp. 843–852 (2015)Google Scholar
  24. 24.
    Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)Google Scholar
  25. 25.
    Szegedy, C., et al.: Going deeper with convolutions. In: CVPR (2015)Google Scholar
  26. 26.
    Wang, L., Ouyang, W., Wang, X., Lu, H.: Visual tracking with fully convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3119–3127 (2015)Google Scholar
  27. 27.
    Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., Woo, W.-C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems, pp. 802–810 (2015)Google Scholar
  28. 28.
    Zhang, K., Chao, W.-L., Sha, F., Grauman, K.: Video summarization with long short-term memory. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 766–782. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-46478-7_47CrossRefGoogle Scholar
  29. 29.
    Zhu, X., Wang, Y., Dai, J., Yuan, L., Wei, Y.: Flow-guided feature aggregation for video object detection. arXiv preprint arXiv:1703.10025 (2017)
  30. 30.
    Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: Proceedings of CVPR, vol. 2, p. 7 (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. 1.School of Data and Computer ScienceSun Yat-sen UniversityGuangzhouChina
  2. 2.Guangdong Key Laboratory of Information Security TechnologyGuangzhouChina
  3. 3.Key Laboratory of Machine Intelligence and Advanced Computing of the Ministry of EducationGuangzhouChina

Personalised recommendations