A Hierarchical Approach for Visual Storytelling Using Image Description

  • Md. Sultan Al Nahian
  • Tasmia Tasrin
  • Sagar Gandhi
  • Ryan Gaines
  • Brent Harrison
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11869)


One of the primary challenges of visual storytelling is developing techniques that can maintain the context of the story over long event sequences to generate human-like stories. In this paper, we propose a hierarchical deep learning architecture based on encoder-decoder networks to address this problem. To help our network maintain this context while also generating long and diverse sentences, we incorporate natural language image descriptions along with the images themselves to generate each story sentence. We evaluate our system on the Visual Storytelling (VIST) dataset [7] and show that our method outperforms state-of-the-art techniques on a suite of different automatic evaluation metrics. The empirical results from this evaluation demonstrate the necessity of the different components of our proposed architecture and show the effectiveness of the architecture for visual storytelling.


Visual storytelling · Deep learning · Natural language processing


  1. Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72 (2005)
  2. Cardona-Rivera, R.E., Li, B.: PLOTSHOT: generating discourse-constrained stories around photos. In: AIIDE (2016)
  3. Gonzalez-Rico, D., Fuentes-Pineda, G.: Contextualize, show and tell: a neural visual storyteller. arXiv preprint. arXiv:1806.00738 (2018)
  4. Harrison, B., Purdy, C., Riedl, M.O.: Toward automated story generation with Markov chain Monte Carlo methods and deep neural networks. In: Thirteenth Artificial Intelligence and Interactive Digital Entertainment Conference (2017)
  5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
  6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
  7. Huang, T.H.K., et al.: Visual storytelling. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1233–1239 (2016)
  8. Johnson, J., Karpathy, A., Fei-Fei, L.: DenseCap: fully convolutional localization networks for dense captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4565–4574 (2016)
  9. Kim, T., Heo, M.O., Son, S., Park, K.W., Zhang, B.T.: GLAC Net: glocal attention cascading networks for multi-image cued story generation. arXiv preprint. arXiv:1805.10973 (2018)
  10. Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for generating descriptive image paragraphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 317–325 (2017)
  11. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)
  12. Lukin, S.M., Hobbs, R., Voss, C.R.: A pipeline for creative visual storytelling. arXiv preprint. arXiv:1807.08077 (2018)
  13. Martin, L.J., et al.: Event representations for automated story generation with deep neural nets. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
  14. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics (2002)
  15. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint. arXiv:1409.1556 (2014)
  16. Smilevski, M., Lalkovski, I., Madjarov, G.: Stories for images-in-sequence by using visual and narrative components. In: Kalajdziski, S., Ackovska, N. (eds.) ICT 2018. CCIS, vol. 940, pp. 148–159. Springer, Cham (2018)
  17. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)
  18. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)
  19. Wang, X., Chen, W., Wang, Y.F., Wang, W.Y.: No metrics are perfect: adversarial reward learning for visual storytelling. arXiv preprint. arXiv:1804.09160 (2018)
  20. Yao, L., Peng, N., Weischedel, R.M., Knight, K., Zhao, D., Yan, R.: Plan-and-write: towards better automatic storytelling. In: CoRR. abs/1811.05701 (2018)
  21. Young, R.M., Ware, S.G., Cassell, B.A., Robertson, J.: Plans and planning in narrative generation: a review of plan-based approaches to the generation of story, discourse and interactivity in narratives. Sprache und Datenverarbeitung Spec. Issue Formal Comput. Models Narrative 37(1–2), 41–64 (2013)
  22. Yu, L., Bansal, M., Berg, T.L.: Hierarchically-attentive RNN for album summarization and storytelling. arXiv preprint. arXiv:1708.02977 (2017)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Md. Sultan Al Nahian (1)
  • Tasmia Tasrin (1)
  • Sagar Gandhi (1)
  • Ryan Gaines (1)
  • Brent Harrison (1)

  1. Department of Computer Science, University of Kentucky, Lexington, USA
