Attention-Based Top-Down Single-Task Action Recognition in Still Images

Yang, Jinhai; Zhou, Xiao; Yang, Hua

doi:10.1007/978-981-15-3341-9_10

Jinhai Yang¹¹,
Xiao Zhou¹² &
Hua Yang¹¹

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1181))

Included in the following conference series:

International Forum on Digital TV and Wireless Multimedia Communications

697 Accesses

Abstract

Human action recognition via deep learning methods in still images has been an active research topic in computer vision recently . Different from the traditional action recognition based on videos or image sequences, a single image contains no temporal information or motion features for action characterization. In this study, we utilize a top-down action recognition strategy to analyze person instances in a scene respectively, on the task of detecting determine persons playing a cellphone. A YOLOv3 detector is applied to predict the human bounding boxes, and the HRNet (High Resolution Network) is used to regress the attention map centered on the area of playing a cellphone, taking the region of given human bounding box as the input. Experimental results on a custom dataset show that HRNet can reliably represent a person image to a heatmap where the region of interest (ROI) is highlighted. The accuracy of the proposed framework exceeds the performance of all the evaluated naive classification models, i.e., Densenet, inception_v3 and shufflenet_v2.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 79.99; Price excludes VAT (USA)

Softcover Book: USD 99.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Ba, J., Mnih, V., Kavukcuoglu, K.: Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755 (2014)
Desai, C., Ramanan, D.: Detecting actions, poses, and objects with relational phraselets. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 158–172. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_12
Chapter Google Scholar
Diba, A., Mohammad Pazandeh, A., Pirsiavash, H., Van Gool, L.: DeepCamp: deep convolutional action & attribute mid-level patterns. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3557–3565 (2016)
Google Scholar
Du, W., Wang, Y., Qiao, Y.: Recurrent spatial-temporal attention network for action recognition in videos. IEEE Trans. Image Process. 27(3), 1347–1360 (2017)
Article MathSciNet Google Scholar
Du, W., Wang, Y., Qiao, Y.: RPAN: an end-to-end recurrent pose-attention network for action recognition in videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3725–3734 (2017)
Google Scholar
Gkioxari, G., Girshick, R., Malik, J.: Contextual action recognition with r* CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1080–1088 (2015)
Google Scholar
Guo, G., Lai, A.: A survey on still image based human action recognition. Pattern Recogn. 47(10), 3343–3361 (2014)
Article Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
Google Scholar
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
Google Scholar
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
Google Scholar
Kwak, S., Cho, M., Laptev, I.: Thin-slicing for pose: learning to understand pose without explicit pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4938–4947 (2016)
Google Scholar
Liu, L., Tan, R.T., You, S.: Loss guided activation for action recognition in still images. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11365, pp. 152–167. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20873-8_10
Chapter Google Scholar
Ma, N., Zhang, X., Zheng, H.T., Sun, J.: ShuffleNet v2: practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131 (2018)
Chapter Google Scholar
Mallya, A., Lazebnik, S.: Learning models for actions and person-object interactions with transfer to question answering. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_25
Chapter Google Scholar
Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
Rodríguez, N.D., Cuéllar, M.P., Lilius, J., Calvo-Flores, M.D.: A survey on ontologies for human behavior recognition. ACM Comput. Surv. (CSUR) 46(4), 43 (2014)
Article Google Scholar
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
Google Scholar
Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using LSTMs. In: International Conference on Machine Learning, pp. 843–852 (2015)
Google Scholar
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212 (2019)
Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International Conference on Machine Learning, pp. 1139–1147 (2013)
Google Scholar
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
Google Scholar
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)
Google Scholar
Wang, L., Xiong, Y., Wang, Z., Qiao, Y.: Towards good practices for very deep two-stream convnets. arXiv preprint arXiv:1507.02159 (2015)
Wang, Y., Zhou, L., Qiao, Y.: Temporal hallucinating for action recognition with few still images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5314–5322 (2018)
Google Scholar
Xu, H., Saenko, K.: Ask, attend and answer: exploring question-guided spatial attention for visual question answering. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 451–466. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_28
Chapter Google Scholar
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)
Google Scholar
Yao, B., Fei-Fei, L.: Action recognition with exemplar based 2.5D graph matching. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 173–186. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_13
Chapter Google Scholar

Download references

Acknowledgement

This work was supported in part by National Natural Science Foundation of China (NSFC, Grant No. 61771303 and 61671289), Science and Technology Commission of Shanghai Municipality (STCSM, Grant Nos. 17DZ1205602, 18DZ1200-102, 18DZ2270700), and SJTUYitu/Thinkforce Joint laboratory for visual computing and application. Director is funded by National Engineering Laboratory for Public Safety Risk Perception and Control by Big Data PSRPC.

Author information

Authors and Affiliations

Institution of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China
Jinhai Yang & Hua Yang
Suzhou Keensense Technology Co., Ltd., Suzhou, China
Xiao Zhou

Authors

Jinhai Yang
View author publications
You can also search for this author in PubMed Google Scholar
Xiao Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Hua Yang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hua Yang .

Editor information

Editors and Affiliations

Shanghai Jiao Tong University, Shanghai, China
Guangtao Zhai
Shanghai Jiao Tong University, Shanghai, China
Jun Zhou
Shanghai Jiao Tong University, Shanghai, China
Hua Yang
Shanghai University, Shanghai, China
Ping An
Shanghai Jiao Tong University, Shanghai, China
Xiaokang Yang

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Yang, J., Zhou, X., Yang, H. (2020). Attention-Based Top-Down Single-Task Action Recognition in Still Images. In: Zhai, G., Zhou, J., Yang, H., An, P., Yang, X. (eds) Digital TV and Wireless Multimedia Communication. IFTC 2019. Communications in Computer and Information Science, vol 1181. Springer, Singapore. https://doi.org/10.1007/978-981-15-3341-9_10

Download citation

DOI: https://doi.org/10.1007/978-981-15-3341-9_10
Published: 16 February 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3340-2
Online ISBN: 978-981-15-3341-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics