Towards a Fair Evaluation of Zero-Shot Action Recognition Using External Data

  • Alina Roitberg (email author)
  • Manuel Martinez
  • Monica Haurilet
  • Rainer Stiefelhagen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)


Zero-shot action recognition aims to classify actions not previously seen during training. This is achieved by learning a visual model for the seen source classes and establishing a semantic relationship to the unseen target classes, e.g., through the action labels. To draw a clear line between zero-shot and conventional supervised classification, the source and target categories must be disjoint. Ensuring this premise is not trivial, especially when the source dataset is external. In this work, we propose an evaluation procedure that enables fair use of external data for zero-shot action recognition. We empirically show that external sources tend to contain actions excessively similar to the target classes, which strongly influences performance and violates the zero-shot premise. To address this, we propose a corrective method that automatically filters out overly similar categories by exploiting the pairwise intra-dataset similarity of the labels. Our experiments on the HMDB-51 dataset demonstrate that zero-shot models consistently benefit from external sources even under our realistic evaluation, especially when the source categories of the internal and external domains are combined.
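
The label-filtering idea can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D label embeddings, the function names, and the threshold rule are all assumptions made for illustration. Here the threshold is taken to be the highest similarity observed between any source label and any target label inside the dataset itself, where the two sets are disjoint by construction. An external label is discarded when it is more similar to some target label than that.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def max_similarity(label, targets, emb):
    """Highest similarity between one label and any target label."""
    return max(cosine(emb[label], emb[t]) for t in targets)

def filter_external(external, internal_source, targets, emb):
    """Keep only external labels that are no closer to the targets than
    the closest internal source label is (assumed threshold rule)."""
    threshold = max(max_similarity(s, targets, emb) for s in internal_source)
    kept = [e for e in external if max_similarity(e, targets, emb) <= threshold]
    return kept, threshold

# Hypothetical toy embeddings; a real setup would use learned word vectors.
EMB = {
    "run": (1.0, 0.0),       # target (unseen) class
    "climb": (0.0, 1.0),     # target (unseen) class
    "walk": (0.8, 0.6),      # internal source class
    "jogging": (0.99, 0.1),  # external class, nearly identical to "run"
    "swim": (0.6, 0.6),      # external class, moderately related
    "cooking": (-0.6, -0.8), # external class, unrelated
}

kept, threshold = filter_external(
    external=["jogging", "swim", "cooking"],
    internal_source=["walk"],
    targets=["run", "climb"],
    emb=EMB,
)
print(kept, round(threshold, 3))
```

In this toy setup "jogging" is removed because it lies closer to the target "run" than any internal source label does, while "swim" and "cooking" are retained as admissible external source classes.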


Action recognition, Zero-shot learning



The research leading to these results has been partially funded by the German Federal Ministry of Education and Research (BMBF) within the PAKoS project.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Alina Roitberg (1), email author
  • Manuel Martinez (1)
  • Monica Haurilet (1)
  • Rainer Stiefelhagen (1)
  1. Karlsruhe Institute of Technology, Karlsruhe, Germany
