Abstract
Human-object interaction (HOI) detection is a new branch of visual relationship detection that plays an important role in image understanding. Because of the complexity and diversity of image content, detecting HOIs remains a challenging problem. Unlike most current HOI detection methods, which rely only on the pairwise information of a human and an object, we propose a graph-based HOI detection method that models context and global structure information. First, to better exploit the relations between humans and objects, the detected humans and objects are treated as nodes of a fully connected undirected graph, which is then pruned into an HOI graph that preserves only the edges connecting human and object nodes. Second, to obtain more robust features for the human and object nodes, two different attention-based feature extraction networks are proposed, modeling global and local context respectively. Finally, a graph attention network is introduced to pass messages iteratively between the nodes of the HOI graph and detect potential HOIs. Experiments on the V-COCO and HICO-DET datasets verify the effectiveness of the proposed method and show that it outperforms many existing methods.
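The pipeline the abstract describes can be illustrated with a minimal sketch: prune a fully connected graph so that only human-object edges survive, then run one attention-weighted message-passing step over the pruned graph. This is not the authors' implementation; the feature dimension, the dot-product attention, and the residual update are simplifying assumptions made only to show the structure of the approach.

```python
# Illustrative sketch only: build a bipartite HOI graph by pruning, then
# perform one attention-weighted message-passing step (in the spirit of a
# graph attention network). All numeric details here are assumptions.
import numpy as np

def build_hoi_adjacency(is_human):
    """Adjacency of the pruned HOI graph: keep edge (i, j) only when one
    endpoint is a human node and the other is an object node."""
    is_human = np.asarray(is_human, dtype=bool)
    n = len(is_human)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and is_human[i] != is_human[j]:
                adj[i, j] = True
    return adj

def attention_message_pass(feats, adj):
    """One message-passing step: each node aggregates its neighbours'
    features, weighted by a softmax over dot-product similarities."""
    out = feats.copy()
    for i in range(feats.shape[0]):
        nbrs = np.flatnonzero(adj[i])
        if nbrs.size == 0:
            continue
        scores = feats[nbrs] @ feats[i]            # similarity to each neighbour
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                   # softmax attention weights
        out[i] = feats[i] + weights @ feats[nbrs]  # residual update
    return out

# Two humans and two objects: after pruning, human-human and
# object-object edges are absent, so the graph is bipartite.
adj = build_hoi_adjacency([True, True, False, False])
print(adj.astype(int))
```

Iterating `attention_message_pass` corresponds to the repeated message passing the abstract describes: with each round, a node's feature incorporates context from nodes farther away in the HOI graph.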
Contributions
XIA Li-min provided the concept and edited the draft of the manuscript. WU Wei conducted the literature review and wrote the first draft of the manuscript.
Additional information
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Foundation item
Project(51678075) supported by the National Natural Science Foundation of China; Project(2017GK2271) supported by the Hunan Provincial Science and Technology Department, China
Cite this article
Xia, Lm., Wu, W. Graph-based method for human-object interactions detection. J. Cent. South Univ. 28, 205–218 (2021). https://doi.org/10.1007/s11771-021-4597-x