Attention-Based Top-Down Single-Task Action Recognition in Still Images

  • Conference paper
Digital TV and Wireless Multimedia Communication (IFTC 2019)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1181)

Abstract

Human action recognition in still images via deep learning has recently been an active research topic in computer vision. Unlike traditional action recognition based on videos or image sequences, a single image contains no temporal information or motion features for characterizing an action. In this study, we adopt a top-down action recognition strategy that analyzes each person instance in a scene separately, applied to the task of detecting persons playing with a cellphone. A YOLOv3 detector predicts the human bounding boxes, and HRNet (High-Resolution Network) regresses an attention map centered on the cellphone-interaction area, taking the cropped region of each human bounding box as input. Experimental results on a custom dataset show that HRNet reliably maps a person image to a heatmap in which the region of interest (ROI) is highlighted. The accuracy of the proposed framework exceeds that of all evaluated naive classification models, i.e., DenseNet, Inception-v3, and ShuffleNet v2.
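As a concrete illustration of this two-stage flow, the short Python sketch below wires a person detector, a heatmap regressor, and a per-person thresholding decision together. It is only a data-flow sketch: detect_persons and regress_attention_map are hypothetical stand-ins for a YOLOv3 detector and an HRNet heatmap head, and the peak-above-threshold decision rule is an assumption for illustration, not the paper's exact method.

    import numpy as np

    def detect_persons(image: np.ndarray) -> list[tuple[int, int, int, int]]:
        # Stand-in for a YOLOv3 person detector (hypothetical interface).
        # Returns boxes as (x1, y1, x2, y2); here, a single dummy full-frame box.
        h, w = image.shape[:2]
        return [(0, 0, w, h)]

    def regress_attention_map(crop: np.ndarray, size: int = 64) -> np.ndarray:
        # Stand-in for the HRNet heatmap head (hypothetical interface).
        # A real model would peak on the cellphone-interaction region;
        # a flat low-confidence map keeps the sketch runnable.
        return np.full((size, size), 0.1, dtype=np.float32)

    def classify_scene(image: np.ndarray, threshold: float = 0.5):
        # Per-person decision (assumed rule): flag "playing a cellphone"
        # when the peak of the regressed heatmap exceeds the threshold.
        decisions = []
        for (x1, y1, x2, y2) in detect_persons(image):
            heatmap = regress_attention_map(image[y1:y2, x1:x2])
            decisions.append(((x1, y1, x2, y2), float(heatmap.max()) > threshold))
        return decisions

    if __name__ == "__main__":
        dummy = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder image
        print(classify_scene(dummy))  # -> [((0, 0, 640, 480), False)]

Because the strategy is top-down, the threshold applies per person rather than per image, so a crowded scene yields one independent decision per detected box.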


References

  1. Ba, J., Mnih, V., Kavukcuoglu, K.: Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755 (2014)

  2. Desai, C., Ramanan, D.: Detecting actions, poses, and objects with relational phraselets. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 158–172. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_12

  3. Diba, A., Mohammad Pazandeh, A., Pirsiavash, H., Van Gool, L.: DeepCamp: deep convolutional action & attribute mid-level patterns. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3557–3565 (2016)

  4. Du, W., Wang, Y., Qiao, Y.: Recurrent spatial-temporal attention network for action recognition in videos. IEEE Trans. Image Process. 27(3), 1347–1360 (2017)

  5. Du, W., Wang, Y., Qiao, Y.: RPAN: an end-to-end recurrent pose-attention network for action recognition in videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3725–3734 (2017)

  6. Gkioxari, G., Girshick, R., Malik, J.: Contextual action recognition with R*CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1080–1088 (2015)

  7. Guo, G., Lai, A.: A survey on still image based human action recognition. Pattern Recogn. 47(10), 3343–3361 (2014)

  8. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)

  9. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)

  10. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)

  11. Kwak, S., Cho, M., Laptev, I.: Thin-slicing for pose: learning to understand pose without explicit pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4938–4947 (2016)

  12. Liu, L., Tan, R.T., You, S.: Loss guided activation for action recognition in still images. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11365, pp. 152–167. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20873-8_10

  13. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: ShuffleNet v2: practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131 (2018)

  14. Mallya, A., Lazebnik, S.: Learning models for actions and person-object interactions with transfer to question answering. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_25

  15. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)

  16. Rodríguez, N.D., Cuéllar, M.P., Lilius, J., Calvo-Flores, M.D.: A survey on ontologies for human behavior recognition. ACM Comput. Surv. (CSUR) 46(4), 43 (2014)

  17. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)

  18. Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: International Conference on Machine Learning, pp. 843–852 (2015)

  19. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212 (2019)

  20. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International Conference on Machine Learning, pp. 1139–1147 (2013)

  21. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)

  22. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)

  23. Wang, L., Xiong, Y., Wang, Z., Qiao, Y.: Towards good practices for very deep two-stream convnets. arXiv preprint arXiv:1507.02159 (2015)

  24. Wang, Y., Zhou, L., Qiao, Y.: Temporal hallucinating for action recognition with few still images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5314–5322 (2018)

  25. Xu, H., Saenko, K.: Ask, attend and answer: exploring question-guided spatial attention for visual question answering. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 451–466. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_28

  26. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)

  27. Yao, B., Fei-Fei, L.: Action recognition with exemplar based 2.5D graph matching. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 173–186. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_13

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (NSFC, Grant Nos. 61771303 and 61671289), the Science and Technology Commission of Shanghai Municipality (STCSM, Grant Nos. 17DZ1205602, 18DZ1200102, and 18DZ2270700), and the SJTU-Yitu/ThinkForce Joint Laboratory for Visual Computing and Application. This work was also funded by the National Engineering Laboratory for Public Safety Risk Perception and Control by Big Data (PSRPC).

Author information

Corresponding author

Correspondence to Hua Yang.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Yang, J., Zhou, X., Yang, H. (2020). Attention-Based Top-Down Single-Task Action Recognition in Still Images. In: Zhai, G., Zhou, J., Yang, H., An, P., Yang, X. (eds) Digital TV and Wireless Multimedia Communication. IFTC 2019. Communications in Computer and Information Science, vol 1181. Springer, Singapore. https://doi.org/10.1007/978-981-15-3341-9_10

  • DOI: https://doi.org/10.1007/978-981-15-3341-9_10

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-3340-2

  • Online ISBN: 978-981-15-3341-9

  • eBook Packages: Computer Science, Computer Science (R0)
