Abstract
Action recognition field has recently grown dramatically due to its importance in many applications like smart surveillance, human–computer interaction, assisting aged citizens or web-video search and retrieval. Many research trials have tackled action recognition as an open problem. Different datasets are built to evaluate architectures variations. In this survey, different action recognition datasets are explored to highlight their ability to evaluate different models. In addition, for each dataset, a usage is proposed based on the content and format of data it includes, the number of classes and challenges it covers. On other hand, another exploration for different architectures is drawn showing the contribution of each of them to handle different action recognition problem challenges and the scientific explanation behind their results. An overall of 21 datasets is covered with 13 architectures that are shallow and deep models.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Shao, L., Jones, S., Li, X.: Efficient search and localization of human actions in video databases. IEEE Trans. Circuits Syst. Video Technol. 24(3), 504–512 (2014)
Wang, F., Xu, D., Lu, W., Xu, H.: Automatic annotation and retrieval for videos. In: Pacific-Rim Symposium on Image and Video Technology, pp. 1030–1040. Springer, Heidelberg (2006)
Hung, M.H., Pan, J.S.: A real-time action detection system for surveillance videos using template matching. J. Inf. Hiding Multimedia Signal Process. 6(6), 1088–1099 (2015)
Campo, E., Chan, M.: Detecting abnormal behaviour by real-time monitoring of patients. In: Proceedings of the AAAI-02 Workshop Automation as Caregiver, pp. 8–12 (2002)
Mumtaz, M., Habib, H. A.: Evaluation of Activity Recognition Algorithms for Employee Performance Monitoring. Int. J. Comput. Sci. Issues (IJCSI), 9(5), 203–210 (2012)
Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B., Pinkal, M.: Grounding action descriptions in videos. Trans. Assoc. Comput. Linguist. 1, 25–36 (2013)
Guo, G., Lai, A.: A survey on still image based human action recognition. Pattern Recogn. 47(10), 3343–3361 (2014)
Rodriguez, M.: Spatio-temporal maximum average correlation height templates in action recognition and video summarization (2010)
Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 2929–2936. IEEE (2009)
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: 2011 International Conference on Computer Vision, pp. 2556–2563. IEEE (2011)
Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 3, pp. 32–36. IEEE (2004)
Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos “in the wild”. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 1996–2003. IEEE (2009)
Reddy, K.K., Shah, M.: Recognizing 50 human action categories of web videos. Mach. Vis. Appl. 24(5), 971–981 (2013)
Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild (2012). arXiv:1212.0402
Li, L.J., Fei-Fei, L.: What, where and who? classifying events by scene and object recognition. In: 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8. IEEE (2007)
Jhuang, H., et al.: Towards understanding action recognition. In: Proceedings of the IEEE International Conference on Computer Vision (2013)
Rohrbach, M., Amin, S., Andriluka, M., Schiele, B.: A database for fine grained activity detection of cooking activities. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1194–1201. IEEE (2012)
http://www.murase.m.is.nagoya-u.ac.jp/KSCGR/. Accessed 29 Jan 2013
Escalera, S., Gonzà lez, J., Baró, X., Reyes, M., Lopes, O., Guyon, I., Escalante, H.: Multi-modal gesture recognition challenge 2013: dataset and results. In: Proceedings of the 15th ACM on International Conference on Multimodal Interaction, pp. 445–452. ACM (2013)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale Video Classification with Convolutional Neural Networks (2014)
Badler, N. I., O’Rourke, J., Platt, S., Morris, M. A.: Human movement understanding: a variety of perspectives. In: AAAI, pp. 53–55 (1980)
Dollár, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (pp. 65–72). IEEE (2005)
Klaser, A., Marszałek, M., Schmid, C. A spatio-temporal descriptor based on 3d-gradients. In: BMVC 2008–19th British Machine Vision Conference, pp. 275–1. British Machine Vision Association (2008)
Willems, G., Tuytelaars, T., Van Gool, L.: An efficient dense and scale-invariant spatio-temporal interest point detector. In: European Conference on Computer Vision, pp. 650–663. Springer, Heidelberg (2008)
Wang, H., Ullah, M. M., Klaser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal features for action recognition. In: BMVC 2009-British Machine Vision Conference, pp. 124–1. BMVA Press (2009)
Peng, X., Wang, L., Wang, X., Qiao, Y.: Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. Comput. Vis. Image Underst. (2016).
Dodge, S. F., Karam, L.J.: Is Bottom-Up Attention Useful for Scene Recognition? (2013). arXiv:1307.5702
Peng, X., Zou, C., Qiao, Y., Peng, Q.: Action recognition with stacked fisher vectors. In: European Conference on Computer Vision, pp. 581–595. Springer International Publishing (2014)
Fernando, B., Gavves, E., Oramas, J., Ghodrati, A., Tuytelaars, T.: Rank pooling for action recognition (2016)
Wang, L., Qiao, Y., Tang, X. Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4305–4314 (2015)
Bottou, L., Vapnik, V.: Local learning algorithms. Neural Comput. 4(6), 888–900 (1992)
Strasburger, H., Rentschler, I., Jüttner, M.: Peripheral vision and pattern recognition: a review. J. Vis. 11(5), 13–13 (2011)
Ni, B., Paramathayalan, V.R., Moulin, P.: Multiple granularity analysis for fine-grained action detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 756–763 (2014)
Freedman, R.G., Jung, H.T., Zilberstein, S.: Plan and activity recognition from a topic modeling perspective. In: ICAPS (2014)
Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., Poggio, T.: A quantitative theory of immediate visual recognition. Prog. Brain Res. 165, 33–56 (2007)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: Deep networks for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694–4702 (2015)
Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., Courville, A.: Describing videos by exploiting temporal structure. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4507–4515 (2015)
Salakhutdinov, R., Hinton, G.E.: Deep boltzmann machines. In: AISTATS, vol. 1, p. 3 (2009)
Taylor, G.W., Fergus, R., LeCun, Y., Bregler, C.: Convolutional learning of spatio-temporal features. In: European Conference on Computer Vision, pp. 140–153. Springer, Heidelberg (2010)
Le, Q. V.: Building high-level features using large scale unsupervised learning. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing pp. 8595–8598 (2013)
Sun, L., Jia, K., Chan, T.H., Fang, Y., Wang, G., Yan, S.: DL-SFA: deeply-learned slow feature analysis for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2632 (2014)
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009. pp. 248–255. IEEE (2009)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Berg, A.C.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments, vol. 1, no. 2, p. 3, Technical Report 07-49, University of Massachusetts, Amherst (2007)
Zhang, W., Sun, J., Tang, X.: Cat head detection-how to effectively exploit shape and texture features. In: European Conference on Computer Vision, pp. 802–816. Springer, Heidelberg (2008)
Keller, C. G., Enzweiler, M., Gavrila, D. M.: A new benchmark for stereo-based pedestrian detection. In: 2011 IEEE Intelligent Vehicles Symposium (IV), pp. 691–696. IEEE (2011)
Chen, D.L., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 190–200. Association for Computational Linguistics (2011)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics (2002)
Denkowski, M., Lavie, A.: Meteor universal: Language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation (2014)
Vedantam, R., Lawrence Zitnick, C., Parikh, D.: Cider: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)
Torabi, A., Pal, C., Larochelle, H., Courville, A.: Using descriptive video services to create a large data source for video annotation research (2015). arXiv:1503.01070
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG
About this chapter
Cite this chapter
Chawky, B.S., Elons, A.S., Ali, A., Shedeed, H.A. (2018). A Study of Action Recognition Problems: Dataset and Architectures Perspectives. In: Hassanien, A., Oliva, D. (eds) Advances in Soft Computing and Machine Learning in Image Processing. Studies in Computational Intelligence, vol 730. Springer, Cham. https://doi.org/10.1007/978-3-319-63754-9_19
Download citation
DOI: https://doi.org/10.1007/978-3-319-63754-9_19
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63753-2
Online ISBN: 978-3-319-63754-9
eBook Packages: EngineeringEngineering (R0)