Loss Guided Activation for Action Recognition in Still Images

Liu, Lu; Tan, Robby T.; You, Shaodi

doi:10.1007/978-3-030-20873-8_10

Lu Liu¹⁸,
Robby T. Tan^18,19 &
Shaodi You^20,21

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 11365))

Included in the following conference series:

Asian Conference on Computer Vision

2453 Accesses
16 Citations

Abstract

One significant problem of deep-learning based human action recognition is that it can be easily misled by the presence of irrelevant objects or backgrounds. Existing methods commonly address this problem by employing bounding boxes on the target humans as part of the input, in both training and testing stages. This requirement of bounding boxes as part of the input is needed to enable the methods to ignore irrelevant contexts and extract only human features. However, we consider this solution is inefficient, since the bounding boxes might not be available. Hence, instead of using a person bounding box as an input, we introduce a human-mask loss to automatically guide the activations of the feature maps to the target human who is performing the action, and hence suppress the activations of misleading contexts. We propose a multi-task deep learning method that jointly predicts the human action class and human location heatmap. Extensive experiments demonstrate our approach is more robust compared to the baseline methods under the presence of irrelevant misleading contexts. Our method achieves 94.06% and 40.65% (in terms of mAP) on Stanford40 and MPII dataset respectively, which are 3.14% and 12.6% relative improvements over the best results reported in the literature, and thus set new state-of-the-art results. Additionally, unlike some existing methods, we eliminate the requirement of using a person bounding box as an input during testing.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
This backbone feature is shared by both two branches.
2.
The person bounding box coordinates are given by the dataset.

References

Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, pp. 3686–3693 (2014)
Google Scholar
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 248–255. IEEE (2009)
Google Scholar
Desai, C., Ramanan, D.: Detecting actions, poses, and objects with relational phraselets. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 158–172. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_12
Chapter Google Scholar
Girdhar, R., Ramanan, D.: Attentional pooling for action recognition. In: NIPS (2017)
Google Scholar
Girshick, R.: Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448. IEEE (2015)
Google Scholar
Gkioxari, G., Girshick, R., Dollár, P., He, K.: Detecting and recognizing human-object interactions. arXiv preprint arXiv:1704.07333 (2017)
Gkioxari, G., Girshick, R., Malik, J.: Actions and attributes from wholes and parts. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2470–2478 (2015)
Google Scholar
Gkioxari, G., Girshick, R., Malik, J.: Contextual action recognition with R*CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1080–1088 (2015)
Google Scholar
Gupta, S., Malik, J.: Visual semantic role labeling. arXiv preprint arXiv:1505.04474 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Google Scholar
Hoai, M.: Regularized max pooling for image categorization. In: Proceedings of the British Machine Vision Conference. BMVA Press (2014)
Google Scholar
Maji, S., Bourdev, L., Malik, J.: Action recognition from a distributed representation of pose and appearance. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3177–3184. IEEE (2011)
Google Scholar
Mallya, A., Lazebnik, S.: Learning models for actions and person-object interactions with transfer to question answering. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 414–428. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_25
Chapter Google Scholar
Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1717–1724. IEEE (2014)
Google Scholar
Pishchulin, L., Andriluka, M., Schiele, B.: Fine-grained activity recognition with holistic and pose based features. In: Jiang, X., Hornegger, J., Koch, R. (eds.) GCPR 2014. LNCS, vol. 8753, pp. 678–689. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11752-2_56
Chapter Google Scholar
Ranjan, R., Patel, V.M., Chellappa, R.: HyperFace: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Trans. Pattern Anal. Mach. Intell. 41, 121–135 (2017)
Article Google Scholar
Sharma, G., Jurie, F., Schmid, C.: Expanded parts model for semantic description of humans in still images. IEEE Trans. Pattern Anal. Mach. Intell. 39(1), 87–101 (2017)
Article Google Scholar
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
Google Scholar
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-ResNet and the impact of residual connections on learning. In: AAAI, vol. 4, p. 12 (2017)
Google Scholar
Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. Int. J. Comput. Vis. 104(2), 154–171 (2013)
Article Google Scholar
Yang, H., Tianyi Zhou, J., Zhang, Y., Gao, B.B., Wu, J., Cai, J.: Exploit bounding box annotations for multi-label object recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 280–288 (2016)
Google Scholar
Yao, B., Fei-Fei, L.: Action recognition with exemplar based 2.5D graph matching. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 173–186. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_13
Chapter Google Scholar
Yao, B., Jiang, X., Khosla, A., Lin, A.L., Guibas, L., Fei-Fei, L.: Human action recognition by learning bases of action attributes and parts. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 1331–1338. IEEE (2011)
Google Scholar
Zhang, Y., Cheng, L., Wu, J., Cai, J., Do, M.N., Lu, J.: Action recognition in still images with minimum annotation efforts. IEEE Trans. Image Process. 25(11), 5479–5490 (2016)
Article MathSciNet Google Scholar
Zhao, Z., Ma, H., You, S.: Single image action recognition using semantic body part actions. In: 2017 IEEE International Conference on Computer Vision (ICCV), Venice, pp. 3411–3419 (2017)
Google Scholar
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929. IEEE (2016)
Google Scholar
Zhuang, B., Liu, L., Shen, C., Reid, I.: Towards context-aware interaction recognition for visual relationship detection. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 589–598. IEEE (2017)
Google Scholar

Download references

Acknowledgement

This research is supported by the National Research Foundation, Prime Ministers Office, Singapore under its Strategic Capability Research Centres Funding Initiative. R.T. Tan’s work is supported in part by Yale-NUS College Start-Up Grant. Lu Liu is supported by Yale-NUS College PhD Scholarship.

Author information

Authors and Affiliations

ECE Department, National University of Singapore, Singapore, Singapore
Lu Liu & Robby T. Tan
Yale-NUS College, Singapore, Singapore
Robby T. Tan
DATA61-CSIRO, Canberra, Australia
Shaodi You
Australian National University, Canberra, Australia
Shaodi You

Authors

Lu Liu
View author publications
You can also search for this author in PubMed Google Scholar
Robby T. Tan
View author publications
You can also search for this author in PubMed Google Scholar
Shaodi You
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Lu Liu .

Editor information

Editors and Affiliations

IIIT Hyderabad, Hyderabad, India
C.V. Jawahar
ANU, Canberra, ACT, Australia
Hongdong Li
Simon Fraser University, Burnaby, BC, Canada
Greg Mori
ETH Zurich, Zurich, Zürich, Switzerland
Konrad Schindler

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 16140 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Liu, L., Tan, R.T., You, S. (2019). Loss Guided Activation for Action Recognition in Still Images. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science(), vol 11365. Springer, Cham. https://doi.org/10.1007/978-3-030-20873-8_10

Download citation

DOI: https://doi.org/10.1007/978-3-030-20873-8_10
Published: 26 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20872-1
Online ISBN: 978-3-030-20873-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics