Learning Generalized Video Memory for Automatic Video Captioning

  • Conference paper

Multi-disciplinary Trends in Artificial Intelligence (MIWAI 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11248)

Abstract

Recent video captioning methods have made great progress through deep learning approaches that combine convolutional neural networks (CNN) and recurrent neural networks (RNN). While some techniques use memory networks for sentence decoding, few works have leveraged the memory component to learn and generalize the temporal structure in video. In this paper, we propose a new method, named Generalized Video Memory (GVM), which utilizes a memory model to enhance video description generation. Based on a class of self-organizing neural networks, the GVM model is able to learn new video features incrementally. The learned generalized memory is further exploited to decode the associated sentences using an RNN. We evaluate our method on the YouTube2Text data set using BLEU and METEOR scores as standard benchmarks. Our results are competitive with other state-of-the-art methods.
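The full method is behind the access wall, but the abstract names its two components: a self-organizing memory that generalizes video features incrementally, and an RNN that decodes sentences from the learned memory. As a rough sketch of the first component, the Python below implements a generic Fuzzy ART-style self-organizing memory, the family of models GVM is described as building on; the class name, parameter values, and feature handling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class FuzzyARTMemory:
    """Illustrative Fuzzy ART-style memory: each node is a generalized
    template that is refined as similar feature vectors arrive."""

    def __init__(self, dim, rho=0.8, alpha=0.001, beta=1.0):
        self.dim = dim        # raw feature dimension (complement coding doubles it)
        self.rho = rho        # vigilance: higher values yield finer-grained memories
        self.alpha = alpha    # choice parameter of the category choice function
        self.beta = beta      # learning rate (1.0 corresponds to fast learning)
        self.weights = []     # one template vector per memory node

    def _code(self, x):
        # Complement coding: inputs in [0, 1] are paired with their complements,
        # so every coded vector has the same L1 norm (= dim).
        x = np.clip(np.asarray(x, dtype=float), 0.0, 1.0)
        return np.concatenate([x, 1.0 - x])

    def learn(self, x):
        """Present one feature vector; return the index of the node that stores it."""
        coded = self._code(x)
        # Rank existing nodes by the choice function T_j = |i ^ w_j| / (alpha + |w_j|),
        # where ^ is the element-wise (fuzzy) minimum.
        order = sorted(
            range(len(self.weights)),
            key=lambda j: -np.minimum(coded, self.weights[j]).sum()
                          / (self.alpha + self.weights[j].sum()))
        for j in order:
            match = np.minimum(coded, self.weights[j]).sum() / coded.sum()
            if match >= self.rho:  # resonance: generalize the existing template
                self.weights[j] = (self.beta * np.minimum(coded, self.weights[j])
                                   + (1.0 - self.beta) * self.weights[j])
                return j
        self.weights.append(coded)  # no node matches: recruit a new one
        return len(self.weights) - 1

# Usage: stream per-frame CNN features (rescaled to [0, 1]) into the memory.
memory = FuzzyARTMemory(dim=16)
for frame_feature in np.random.rand(50, 16):
    memory.learn(frame_feature)
print(f"learned {len(memory.weights)} generalized memory nodes")
```

In a captioning pipeline of this kind, the resulting memory templates, rather than raw frame features, would condition the RNN decoder; the paper's actual encoding and decoding details are not reproduced here.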



Acknowledgments

This research is supported in part by a research grant (DSOCL16006) from DSO National Laboratories, Singapore, and by a joint project funded by the ICT Virtual Organization of ASEAN Institutes and NICT (ASEAN IVO).

Author information

Correspondence to Poo-Hee Chang or Ah-Hwee Tan.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Chang, P.-H., Tan, A.-H. (2018). Learning Generalized Video Memory for Automatic Video Captioning. In: Kaenampornpan, M., Malaka, R., Nguyen, D., Schwind, N. (eds) Multi-disciplinary Trends in Artificial Intelligence. MIWAI 2018. Lecture Notes in Computer Science, vol. 11248. Springer, Cham. https://doi.org/10.1007/978-3-030-03014-8_16


  • DOI: https://doi.org/10.1007/978-3-030-03014-8_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-03013-1

  • Online ISBN: 978-3-030-03014-8

  • eBook Packages: Computer Science, Computer Science (R0)
