Adversarial Learning for Visual Storytelling with Sense Group Partition

  • Lingbo Mo
  • Chunhong Zhang
  • Yang Ji
  • Zheng Hu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11364)


Visual storytelling studies the generation of a paragraph that describes the content of a photo stream. Despite substantial progress in vision and language research, techniques for sequential vision-to-language generation are still far from perfect. Because of the limitations of training with maximum likelihood estimation, most existing models encourage high resemblance to the texts in the training database, which makes the descriptions overly rigid and lacking in diverse expressions. We therefore cast the task as a reinforcement learning problem and propose an Adversarial All-in-one Learning (AAL) framework that learns a reward model, which simultaneously incorporates the information of all images in the photo stream and all texts in the paragraph, and optimizes a generative model with the estimated reward. Specifically, in light of the linguistic reading theory that takes the sense group as the unit, we propose to generate the paragraph at the sense group level instead of the sentence level. Experiments on a widely used dataset show that our approach generates higher-quality descriptions than previous baselines.
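To make the idea of optimizing a generative model with an estimated reward concrete, the following is a minimal sketch of a single REINFORCE-style policy-gradient update for a softmax policy over a toy vocabulary. It is not the AAL framework itself: the function names (`softmax`, `reinforce_grad`, `policy_step`), the learning rate, and the scalar `reward` argument are assumptions made for illustration, with `reward` standing in for whatever score a learned reward model would assign to a sampled output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_grad(logits, action, reward):
    """Gradient of reward * log pi(action) with respect to the logits.

    For a softmax policy this equals reward * (onehot(action) - probs).
    `reward` is a hypothetical scalar, e.g. the output of a reward model.
    """
    probs = softmax(logits)
    return [
        reward * ((1.0 if i == action else 0.0) - p)
        for i, p in enumerate(probs)
    ]


def policy_step(logits, action, reward, lr=0.1):
    """One gradient-ascent step on the expected-reward objective."""
    grad = reinforce_grad(logits, action, reward)
    return [z + lr * g for z, g in zip(logits, grad)]
```

In this sketch, a positive reward raises the probability of the sampled token and a negative reward lowers it. In an adversarial setup, the reward would come from a model trained to distinguish human-written from generated text rather than from a fixed metric.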


Keywords: Vision and language · Sense group · Adversarial learning



This work is partially supported by the Funds for Creative Research Groups of China (No. 61421061) and the Natural Science Foundation of China (No. 61601046, No. 61602048).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
  2. Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing, China
