Learning Visual Storylines with Skipping Recurrent Neural Networks

  • Gunnar A. Sigurdsson
  • Xinlei Chen
  • Abhinav Gupta
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9909)


What does a typical visit to Paris look like? Do people first take photos of the Louvre and then the Eiffel Tower? Can we visually model a temporal event like “Paris Vacation” using current frameworks? In this paper, we explore how we can automatically learn the temporal aspects, or storylines, of visual concepts from web data. Previous attempts focus on consecutive image-to-image transitions and are unsuccessful at recovering the long-term underlying story. Unlike classic RNNs, our novel Skipping Recurrent Neural Network (S-RNN) model does not attempt to predict each and every data point in the sequence. Rather, S-RNN uses a framework that skips through the images in the photo stream to explore the space of all ordered subsets of the albums via an efficient sampling procedure. This approach reduces the negative impact of strong short-term correlations and recovers the latent story more accurately. We show how our learned storylines can be used to analyze, predict, and summarize photo albums from Flickr. Our experimental results provide strong qualitative and quantitative evidence that S-RNN is significantly better than other candidate methods, such as LSTMs, at learning long-term correlations and recovering latent storylines. Moreover, we show how storylines can help machines better understand and summarize photo streams by inferring a brief personalized story of each individual album.
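
To make the notion of an "ordered subset" of an album concrete, the following is a minimal sketch and not the authors' sampling procedure. It assumes an album is a time-ordered Python list and draws a subset uniformly at random, whereas the abstract describes a sampling procedure tied to the S-RNN model. The function name sample_ordered_subset and its parameters are hypothetical.

```python
import random


def sample_ordered_subset(album, k, rng=None):
    """Pick k items from a time-ordered album, keeping their original order.

    Assumption: uniform sampling, used here only for illustration. The
    paper's procedure is model-driven and is not reproduced here.
    """
    rng = rng or random.Random()
    if k > len(album):
        raise ValueError("k cannot exceed the album length")
    # Sample k distinct positions, then sort them so temporal order is kept.
    indices = sorted(rng.sample(range(len(album)), k))
    return [album[i] for i in indices]


if __name__ == "__main__":
    # Hypothetical album: photo identifiers listed in the order they were taken.
    album = [f"photo_{i:02d}" for i in range(20)]
    print(sample_ordered_subset(album, 5, random.Random(0)))
```

Each call skips over most of the photo stream and returns a short sequence whose consecutive elements may be far apart in the original album, which is the kind of candidate sequence a skipping model would consider.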


Visual Concept, Representative Event, Photo Album, Amazon Mechanical Turk, Future Image
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This research was supported by the Yahoo-CMU InMind program, ONR MURI N000014-16-1-2007, and a hardware grant from Nvidia. The authors would like to thank Olga Russakovsky and Christoph Dann for invaluable suggestions and advice, and all the anonymous reviewers for helpful advice on improving the manuscript.

Supplementary material

419978_1_En_5_MOESM1_ESM.pdf (6.4 MB)
Supplementary material 1 (pdf 6592 KB)



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Gunnar A. Sigurdsson (1)
  • Xinlei Chen (1)
  • Abhinav Gupta (1)
  1. Carnegie Mellon University, Pittsburgh, USA
