Hierarchical Video Understanding

  • Farzaneh Mahdisoltani
  • Roland Memisevic
  • David Fleet
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)


We introduce a hierarchical architecture for video understanding that exploits the structure of real-world actions by capturing targets at different levels of granularity. We design the model so that it first learns simpler coarse-grained tasks and then moves on to more fine-grained targets. The model is trained with a joint loss over the different granularity levels. We report empirical results on the second release of the Something-Something dataset (used throughout this paper), which provides a hierarchy of targets: coarse-grained action groups, fine-grained action categories, and captions. Our experiments suggest that models that exploit targets at different levels of granularity achieve better performance on all levels.
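As an illustration of what a joint loss over granularity levels could look like, the following is a minimal sketch and not the authors' implementation. It assumes two classification levels (action groups and action categories), a weighted sum of their cross-entropy losses, and a linear schedule that shifts weight from the coarse task to the fine task during training. The function names, the weighting scheme, and the schedule are all hypothetical; the abstract does not specify them, and a captioning term would have to be added separately.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of a single example, computed from unnormalized logits."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def coarse_weight_at(step, total_steps):
    """Hypothetical schedule: weight on the coarse task decays linearly from 1 to 0."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - fraction


def joint_loss(coarse_logits, coarse_target, fine_logits, fine_target, coarse_weight):
    """Weighted sum of the coarse-level and fine-level losses (an assumed form)."""
    coarse = cross_entropy(coarse_logits, coarse_target)
    fine = cross_entropy(fine_logits, fine_target)
    return coarse_weight * coarse + (1.0 - coarse_weight) * fine


if __name__ == "__main__":
    # Toy example: 2 action groups, 4 action categories, halfway through training.
    w = coarse_weight_at(step=50, total_steps=100)
    loss = joint_loss([2.0, 0.5], 0, [0.1, 1.5, 0.3, 0.2], 1, coarse_weight=w)
    print(f"coarse weight = {w:.2f}, joint loss = {loss:.4f}")
```

With this schedule the early updates are dominated by the simpler coarse-grained task and later updates by the fine-grained one, which is one possible reading of "first learns simpler coarse-grained tasks, and then moves on"; other schedules or fixed weights would fit the description equally well.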


Keywords: Video understanding, Hierarchical models, Fine-grained targets, Video classification, Video captioning, Something-Something dataset



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Farzaneh Mahdisoltani (1), email author
  • Roland Memisevic (2)
  • David Fleet (1)

  1. University of Toronto, Toronto, Canada
  2. Twenty Billion Neurons, Toronto, Canada
