Pairwise Body-Part Attention for Recognizing Human-Object Interactions

  • Hao-Shu Fang
  • Jinkun Cao
  • Yu-Wing Tai
  • Cewu Lu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11214)


In human-object interaction (HOI) recognition, conventional methods consider the human body as a whole and pay uniform attention to the entire body region. They ignore the fact that a human normally interacts with an object using only some parts of the body. In this paper, we argue that different body parts should receive different amounts of attention in HOI recognition, and that the correlations between body parts should also be considered, since body parts always work collaboratively. We propose a new pairwise body-part attention model that learns to focus on the crucial parts and their correlations for HOI recognition. The model introduces a novel attention-based feature selection method and a feature representation scheme that captures pairwise correlations between body parts. Our proposed approach achieves a 10% relative improvement (36.1 mAP → 39.9 mAP) over the state-of-the-art results in HOI recognition on the HICO dataset. We will make our model and source code publicly available.
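The core idea described above can be illustrated with a minimal sketch. This is not the authors' actual architecture (which uses learned deep features and network layers); all names, the scoring vector, and the toy features are illustrative stand-ins. The sketch forms every pair of body-part features, scores each pair with a linear function, normalizes the scores into an attention distribution, and keeps only the most-attended pairs, attention-weighted:

```python
import math
from itertools import combinations

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def pairwise_part_attention(part_feats, w_score, top_k=3):
    """Score every body-part pair, softmax the scores, and keep the
    top-k pairs' concatenated features weighted by their attention.

    part_feats : list of P equal-length feature vectors, one per body part
    w_score    : length-2D scoring vector (a stand-in for a learned layer)
    """
    pairs = list(combinations(range(len(part_feats)), 2))
    # Pairwise representation: concatenate the two parts' features.
    pair_feats = [part_feats[i] + part_feats[j] for i, j in pairs]
    # One scalar relevance score per pair (linear scoring for illustration).
    scores = [sum(f * w for f, w in zip(pf, w_score)) for pf in pair_feats]
    attn = softmax(scores)  # attention over all P*(P-1)/2 pairs
    ranked = sorted(range(len(pairs)), key=lambda k: attn[k], reverse=True)
    keep = ranked[:top_k]   # indices of the most attended pairs
    # Attention-weighted features of the selected pairs.
    selected = [[attn[k] * f for f in pair_feats[k]] for k in keep]
    return selected, attn, [pairs[k] for k in keep]

# Toy example: 5 "body parts" with 4-dim features of increasing magnitude.
feats = [[0.1 * (p + 1)] * 4 for p in range(5)]
w = [1.0] * 8
sel, attn, kept = pairwise_part_attention(feats, w, top_k=3)
print(len(sel), len(attn), kept[0])  # 3 10 (3, 4)
```

With this toy scoring, the pair with the largest-magnitude features, (3, 4), receives the most attention; in the paper's setting the selection is learned end to end so that, e.g., the hand-object pair dominates for "hold" interactions.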


Keywords: Human-object interactions · Body-part correlations · Attention model



This work is supported in part by the National Key R&D Program of China (No. 2017YFA0700800), the National Natural Science Foundation of China under Grant 61772332, and SenseTime Ltd.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Shanghai Jiao Tong University, Shanghai, China
  2. Tencent YouTu Lab, Shanghai, China
