Advertisement

Boundary Detector Encoder and Decoder with Soft Attention for Video Captioning

  • Tangming ChenEmail author
  • Qike Zhao
  • Jingkuan Song
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11809)

Abstract

The use of Recurrent Neural Networks and Convolutional Neural Networks for video captioning has received widespread attention, since the deep learning has developed rapidly. Based on classical encoder-decoder approach, we modify the encoding networks and decoding networks to improve the performance of the entire networks. In this paper, we introduce an encoding scheme that can detect the hierarchical structure of the input video. What’s more, we use soft attention mechanism which can learn to automatically select the relevant input frames from the input video to generate the description of the input video. Extensive experiments are conducted on two datasets: the Microsoft Video Description Corpus and the MSR-Video To Text. Three metrics, BLEU@4, METEOR and CIDEr are used to evaluate our approach. Experimental results demonstrate the effectiveness of our approach.

Keywords

LSTM Boundary detector Soft attention 

Notes

Acknowledgments

This work is supported by Major Scientific and Technological Special Project of Guizhou Province (20183002).

References

  1. 1.
    Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72 (2005)Google Scholar
  2. 2.
    Baraldi, L., Grana, C., Cucchiara, R.: Hierarchical boundary-aware neural encoder for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1657–1666 (2017)Google Scholar
  3. 3.
    Barbu, A., et al.: Video in sentences out. arXiv preprint arXiv:1204.2742 (2012)
  4. 4.
    Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. In: Advances in Neural Information Processing Systems, pp. 1171–1179 (2015)Google Scholar
  5. 5.
    Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013)
  6. 6.
    Chen, D.L., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 190–200 (2011)Google Scholar
  7. 7.
    Fang, H., et al.: From captions to visual concepts and back. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1473–1482 (2015)Google Scholar
  8. 8.
    Gao, L., Guo, Z., Zhang, H., Xu, X., Shen, H.T.: Video captioning with attention-based LSTM and semantic consistency. IEEE Trans. Multimed. 19(9), 2045–2055 (2017)CrossRefGoogle Scholar
  9. 9.
    Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM (1999)Google Scholar
  10. 10.
    Graves, A., Mohamed, A., Hinton, G.: Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649 (2013)Google Scholar
  11. 11.
    Guadarrama, S., et al.: Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2712–2719 (2013) Google Scholar
  12. 12.
    Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)CrossRefGoogle Scholar
  13. 13.
    Khan, M.U.G., Gotoh, Y.: Describing video contents in natural language. In: Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data, pp. 27–35 (2012)Google Scholar
  14. 14.
    Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318 (2002)Google Scholar
  15. 15.
    Raiko, T., Berglund, M., Alain, G., Dinh, L.: Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989 (2014)
  16. 16.
    Rohrbach, A., et al.: Movie description. Int. J. Comput. Vis. 123(1), 94–120 (2017)CrossRefGoogle Scholar
  17. 17.
    Schmidhuber, J., Wierstra, D., Gagliolo, M., Gomez, F.: Training recurrent networks by evolino. Neural Comput. 19(3), 757–779 (2007)CrossRefGoogle Scholar
  18. 18.
    Song, J., Guo, Z., Gao, L., Liu, W., Zhang, D., Shen, H.T.: Hierarchical LSTM with adjusted temporal attention for video captioning. arXiv preprint arXiv:1706.01231 (2017)
  19. 19.
    Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)Google Scholar
  20. 20.
    Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)Google Scholar
  21. 21.
    Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to sequence-video to text. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4534–4542 (2015)Google Scholar
  22. 22.
    Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729 (2014)
  23. 23.
    Wang, J., Wang, W., Huang, Y., Wang, L., Tan, T.: M3: multimodal memory modelling for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7512–7520 (2018)Google Scholar
  24. 24.
    Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: a large video description dataset for bridging video and language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5288–5296 (2016)Google Scholar
  25. 25.
    Yao, L., et al.: Describing videos by exploiting temporal structure. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4507–4515 (2015)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. 1.Guizhou Provincial Key Laboratory of Public Big DataGuiZhou UniversityGuiyangChina
  2. 2.University of Electronic Science and Technology of ChinaChengduChina

Personalised recommendations