Abstract
With nearly one billion online videos viewed every day, recognition and search in video is an emerging frontier of computer vision research. While much effort has been devoted to collecting and annotating large-scale static image datasets containing thousands of image categories, human action datasets lag far behind: current action recognition databases contain on the order of ten different action categories collected under fairly controlled conditions. State-of-the-art performance on these datasets is now near ceiling, so new benchmarks need to be designed and created. To address this issue we collected the largest action video database to date, with 51 action categories and around 7,000 manually annotated clips in total, extracted from a variety of sources ranging from digitized movies to YouTube. The goal of this effort is to provide a tool for evaluating the performance of computer vision systems for action recognition and for exploring the robustness of these methods under various conditions such as camera motion, viewpoint, video quality, and occlusion.
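Benchmarks of this kind are typically scored with mean per-class accuracy, so that every action category contributes equally regardless of how many clips it has. As a minimal sketch (the category names below are hypothetical, not actual HMDB51 labels), the metric can be computed like this:

```python
from collections import defaultdict

def mean_class_accuracy(predictions, labels):
    """Average the per-category accuracies so each of the
    action categories counts equally, independent of clip count."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, true in zip(predictions, labels):
        total[true] += 1
        if pred == true:
            correct[true] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy example with three hypothetical action categories.
preds = ["run", "run", "wave", "clap", "clap", "wave"]
truth = ["run", "wave", "wave", "clap", "run", "clap"]
print(mean_class_accuracy(preds, truth))  # → 0.5
```

In practice a benchmark of this type would report this number averaged over its predefined train/test splits rather than over a single run.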
Acknowledgements
This report describes research done in part at the Center for Biological & Computational Learning, affiliated with MIBR, BCS, CSAIL at MIT. This research was sponsored by grants from DARPA (IPTO and DSO), NSF (NSF-0640097, NSF-0827427), and AFSOR-THRL (FA8650-05-C-7262). Additional support was provided by Adobe, a King Abdullah University of Science and Technology grant to B. DeVore, NEC, Sony, and the Eugene McDermott Foundation. This work was also supported by Brown University, the Center for Computation and Visualization, the Robert J. and Nancy D. Carney Fund for Scientific Innovation, DARPA (DARPA-BAA-09-31), and ONR (ONR-BAA-11-001). H.K. was supported by a grant from the Ministry of Science, Research and the Arts of Baden-Württemberg, Germany.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Kuehne, H., Jhuang, H., Stiefelhagen, R., Serre, T. (2013). HMDB51: A Large Video Database for Human Motion Recognition. In: Nagel, W., Kröner, D., Resch, M. (eds) High Performance Computing in Science and Engineering ‘12. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33374-3_41
DOI: https://doi.org/10.1007/978-3-642-33374-3_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33373-6
Online ISBN: 978-3-642-33374-3
eBook Packages: Mathematics and Statistics (R0)