Graph-based method for human-object interactions detection


Abstract

Human-object interaction (HOI) detection is a new branch of visual relationship detection and plays an important role in image understanding. Because image content is complex and diverse, HOI detection remains a difficult challenge. Unlike most current HOI detection methods, which rely only on the pairwise information of a human and an object, we propose a graph-based HOI detection method that models context and global structure information. First, to better exploit the relations between humans and objects, the detected humans and objects are treated as nodes of a fully connected undirected graph, which is then pruned to obtain an HOI graph that preserves only the edges connecting human and object nodes. Second, to obtain more robust features for the human and object nodes, two different attention-based feature extraction networks are proposed, which model global and local contexts respectively. Finally, a graph attention network is introduced to iteratively pass messages between the nodes of the HOI graph and detect potential HOIs. Experiments on the V-COCO and HICO-DET datasets verify the effectiveness of the proposed method and show that it outperforms many existing methods.
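The two graph steps the abstract describes — pruning a fully connected graph of detected instances down to human-object edges, then refining node features with graph-attention message passing — can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the features, dimensions, and random values are made up, and the single-head attention step follows the standard GAT formulation of Velickovic et al. rather than the paper's exact networks.

```python
import numpy as np

def build_hoi_graph(num_humans, num_objects):
    """Build the pruned HOI graph: start from a fully connected
    undirected graph over all detected instances, then keep only
    edges that connect a human node to an object node."""
    n = num_humans + num_objects
    adj = np.ones((n, n)) - np.eye(n)           # fully connected, no self-loops
    is_human = np.array([True] * num_humans + [False] * num_objects)
    # prune: an edge survives only if its two endpoints differ in type
    keep = is_human[:, None] != is_human[None, :]
    return adj * keep

def gat_layer(h, adj, W, a, slope=0.2):
    """One graph-attention message-passing step: attention logits over
    neighbours, softmax-normalised along each row, then a weighted sum
    of linearly transformed neighbour features."""
    z = h @ W                                    # (n, d') transformed features
    n = z.shape[0]
    e = np.zeros((n, n))
    for i in range(n):                           # e_ij = LeakyReLU(a^T [z_i || z_j])
        for j in range(n):
            e[i, j] = np.concatenate([z[i], z[j]]) @ a
    e = np.where(e > 0, e, slope * e)            # LeakyReLU
    e = np.where(adj > 0, e, -1e9)               # mask non-edges before softmax
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)   # attention coefficients per row
    return att @ z                               # aggregated node features

# toy example: 2 detected humans, 3 detected objects, 4-d node features
rng = np.random.default_rng(0)
adj = build_hoi_graph(2, 3)
h = rng.normal(size=(5, 4))
W = rng.normal(size=(4, 4))
a = rng.normal(size=(8,))
h_new = gat_layer(h, adj, W, a)
print(adj)          # bipartite pattern: human-object edges only
print(h_new.shape)  # (5, 4)
```

Applying `gat_layer` repeatedly corresponds to the iterative message passing in the abstract: each round lets every human node aggregate evidence from the objects it is connected to, and vice versa.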




Author information

Affiliations

Authors

Contributions

XIA Li-min provided the concept and edited the draft of the manuscript. WU Wei conducted the literature review and wrote the first draft of the manuscript.

Corresponding author

Correspondence to Li-min Xia 夏利民.

Additional information

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Foundation item

Project(51678075) supported by the National Natural Science Foundation of China; Project(2017GK2271) supported by the Hunan Provincial Science and Technology Department, China


About this article


Cite this article

Xia, Lm., Wu, W. Graph-based method for human-object interactions detection. J. Cent. South Univ. 28, 205–218 (2021). https://doi.org/10.1007/s11771-021-4597-x


Key words

  • human-object interactions
  • visual relationship
  • context information
  • graph attention network
