Attention-Based Top-Down Single-Task Action Recognition in Still Images

  • Jinhai Yang
  • Xiao Zhou
  • Hua Yang
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1181)

Abstract

Human action recognition in still images via deep learning has recently been an active research topic in computer vision. Unlike traditional action recognition based on videos or image sequences, a single image contains no temporal information or motion features for characterizing an action. In this study, we adopt a top-down action recognition strategy that analyzes each person instance in a scene individually, on the task of detecting persons playing a cellphone. A YOLOv3 detector predicts the human bounding boxes, and HRNet (High-Resolution Network) takes each cropped bounding-box region as input and regresses an attention map centered on the area of cellphone use. Experimental results on a custom dataset show that HRNet can reliably map a person image to a heatmap in which the region of interest (ROI) is highlighted. The accuracy of the proposed framework exceeds that of all the evaluated naive classification models, i.e., DenseNet, Inception-v3, and ShuffleNet-v2.
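
To make the two-stage pipeline concrete, the following is a minimal sketch in PyTorch-style Python. The detector and HRNet regressor are passed in as callables; their names, the crop helper, and the 256×192 input size are illustrative assumptions rather than the authors' released implementation.

    # Minimal sketch of the top-down pipeline: detect persons, then score
    # each crop by the peak of an attention heatmap over the phone region.
    # `detector` and `hrnet` are hypothetical callables standing in for
    # YOLOv3 and HRNet; this is not the authors' released code.
    import torch
    import torchvision.transforms as T
    from PIL import Image

    def crop_person(image: Image.Image, box):
        """Crop one detected person region (x1, y1, x2, y2) from the frame."""
        x1, y1, x2, y2 = map(int, box)
        return image.crop((x1, y1, x2, y2))

    def recognize_phone_use(image_path, detector, hrnet, threshold=0.5):
        """Return (box, score, is_playing) for every detected person."""
        image = Image.open(image_path).convert("RGB")
        to_tensor = T.Compose([T.Resize((256, 192)), T.ToTensor()])
        results = []
        with torch.no_grad():
            boxes = detector(image)           # person boxes from the detector
            for box in boxes:
                crop = to_tensor(crop_person(image, box)).unsqueeze(0)
                heatmap = hrnet(crop)         # (1, 1, H, W) attention map
                score = heatmap.max().item()  # peak response marks the ROI
                results.append((box, score, score > threshold))
        return results

Under this sketch, a person is flagged as playing a cellphone when the peak heatmap response in their crop exceeds a threshold, which is how a per-person attention map can stand in for a whole-image classifier.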

Keywords

Still image action recognition · Attention mechanism · High-resolution representation · Behavior analysis

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (NSFC, Grant Nos. 61771303 and 61671289), the Science and Technology Commission of Shanghai Municipality (STCSM, Grant Nos. 17DZ1205602, 18DZ1200-102, and 18DZ2270700), and the SJTU Yitu/Thinkforce Joint Laboratory for Visual Computing and Application, and was also funded by the National Engineering Laboratory for Public Safety Risk Perception and Control by Big Data (PSRPC).

References

  1. Ba, J., Mnih, V., Kavukcuoglu, K.: Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755 (2014)
  2. Desai, C., Ramanan, D.: Detecting actions, poses, and objects with relational phraselets. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 158–172. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_12
  3. Diba, A., Mohammad Pazandeh, A., Pirsiavash, H., Van Gool, L.: DeepCamp: deep convolutional action & attribute mid-level patterns. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3557–3565 (2016)
  4. Du, W., Wang, Y., Qiao, Y.: Recurrent spatial-temporal attention network for action recognition in videos. IEEE Trans. Image Process. 27(3), 1347–1360 (2017)
  5. Du, W., Wang, Y., Qiao, Y.: RPAN: an end-to-end recurrent pose-attention network for action recognition in videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3725–3734 (2017)
  6. Gkioxari, G., Girshick, R., Malik, J.: Contextual action recognition with R*CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1080–1088 (2015)
  7. Guo, G., Lai, A.: A survey on still image based human action recognition. Pattern Recogn. 47(10), 3343–3361 (2014)
  8. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
  9. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
  10. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
  11. Kwak, S., Cho, M., Laptev, I.: Thin-slicing for pose: learning to understand pose without explicit pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4938–4947 (2016)
  12. Liu, L., Tan, R.T., You, S.: Loss guided activation for action recognition in still images. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11365, pp. 152–167. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20873-8_10
  13. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: ShuffleNet v2: practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131 (2018)
  14. Mallya, A., Lazebnik, S.: Learning models for actions and person-object interactions with transfer to question answering. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_25
  15. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
  16. Rodríguez, N.D., Cuéllar, M.P., Lilius, J., Calvo-Flores, M.D.: A survey on ontologies for human behavior recognition. ACM Comput. Surv. (CSUR) 46(4), 43 (2014)
  17. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
  18. Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using LSTMs. In: International Conference on Machine Learning, pp. 843–852 (2015)
  19. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212 (2019)
  20. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International Conference on Machine Learning, pp. 1139–1147 (2013)
  21. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
  22. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)
  23. Wang, L., Xiong, Y., Wang, Z., Qiao, Y.: Towards good practices for very deep two-stream convnets. arXiv preprint arXiv:1507.02159 (2015)
  24. Wang, Y., Zhou, L., Qiao, Y.: Temporal hallucinating for action recognition with few still images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5314–5322 (2018)
  25. Xu, H., Saenko, K.: Ask, attend and answer: exploring question-guided spatial attention for visual question answering. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 451–466. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_28
  26. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)
  27. Yao, B., Fei-Fei, L.: Action recognition with exemplar based 2.5D graph matching. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 173–186. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_13

Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  1. Institution of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China
  2. Suzhou Keensense Technology Co., Ltd., Suzhou, China
