Coherent Multi-sentence Video Description with Variable Level of Detail

  • Anna Rohrbach
  • Marcus Rohrbach
  • Wei Qiu
  • Annemarie Friedrich
  • Manfred Pinkal
  • Bernt Schiele
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8753)

Abstract

Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description focus on generating single sentences and cannot vary the level of detail of the descriptions. In this paper, we address both limitations: we produce coherent multi-sentence descriptions of complex videos at a variable level of detail. To understand the difference between detailed and short descriptions, we collect and analyze a video description corpus with three levels of detail. We follow a two-step approach: we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from it. For our multi-sentence descriptions we model across-sentence consistency at the level of the SR by enforcing a consistent topic. Human judges rate our descriptions as more readable, correct, and relevant than those of related work.
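For a concrete picture of the two-step approach, the following is a minimal Python sketch. The function names, the per-segment score dictionaries, the template sentences, and the majority-vote topic heuristic are all illustrative assumptions, not the system described in the paper, which learns the SR from video and generates sentences with statistical machine translation.

from collections import Counter
from typing import Dict, List, Tuple

SR = Tuple[str, str, str]  # (activity, object, topic), e.g. ("cuts", "carrot", "salad")


def predict_sr(scores: Dict[str, Dict[str, float]]) -> SR:
    """Step 1: choose the highest-scoring label for each SR slot from
    (toy) precomputed visual classifier scores."""
    act, obj, topic = (max(scores[slot], key=scores[slot].get)
                       for slot in ("activity", "object", "topic"))
    return (act, obj, topic)


def enforce_topic(srs: List[SR]) -> List[SR]:
    """Across-sentence consistency: overwrite each segment's topic with the
    topic predicted most often over the whole video (a simple stand-in for
    consistency modeling at the SR level)."""
    topic = Counter(sr[2] for sr in srs).most_common(1)[0][0]
    return [(act, obj, topic) for act, obj, _ in srs]


def generate(sr: SR, detail: str = "detailed") -> str:
    """Step 2: realize a sentence from the SR. A fixed template stands in
    for the SMT-based generation; `detail` selects the level of detail."""
    act, obj, topic = sr
    if detail == "short":
        return f"The person prepares {topic}."
    return f"The person {act} the {obj}."


def describe(video_scores: List[Dict[str, Dict[str, float]]],
             detail: str = "detailed") -> List[str]:
    """Full pipeline: per-segment SR prediction, topic consistency,
    then one sentence per segment (or a single short summary)."""
    srs = enforce_topic([predict_sr(s) for s in video_scores])
    if detail == "short":
        return [generate(srs[0], "short")]
    return [generate(sr) for sr in srs]


if __name__ == "__main__":
    # Toy classifier scores for two segments of a cooking video.
    segments = [
        {"activity": {"washes": 0.7, "cuts": 0.3},
         "object": {"carrot": 0.8, "knife": 0.2},
         "topic": {"salad": 0.6, "soup": 0.4}},
        {"activity": {"cuts": 0.9, "washes": 0.1},
         "object": {"carrot": 0.9, "board": 0.1},
         "topic": {"salad": 0.6, "soup": 0.4}},
    ]
    print(describe(segments))           # ['The person washes the carrot.', 'The person cuts the carrot.']
    print(describe(segments, "short"))  # ['The person prepares salad.']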

Keywords

Language Model · Semantic Representation · Video Segment · Visual Recognition · Statistical Machine Translation

Notes

Acknowledgments

Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD).

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Anna Rohrbach (1), corresponding author
  • Marcus Rohrbach (1, 2)
  • Wei Qiu (1, 3)
  • Annemarie Friedrich (3)
  • Manfred Pinkal (3)
  • Bernt Schiele (1)

  1. Max Planck Institute for Informatics, Saarbrücken, Germany
  2. UC Berkeley EECS and ICSI, Berkeley, USA
  3. Department of Computational Linguistics, Saarland University, Saarbrücken, Germany
