Learning Human-Object Interactions by Graph Parsing Neural Networks

  • Siyuan Qi
  • Wenguan Wang
  • Baoxiong Jia
  • Jianbing Shen
  • Song-Chun Zhu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11213)


This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images and videos. We introduce the Graph Parsing Neural Network (GPNN), a framework that incorporates structural knowledge while remaining differentiable end-to-end. For a given scene, GPNN infers a parse graph that includes (i) the HOI graph structure, represented by an adjacency matrix, and (ii) the node labels. Within a message passing inference framework, GPNN iteratively computes the adjacency matrices and node labels. We extensively evaluate our model on three HOI detection benchmarks covering images and videos: the HICO-DET, V-COCO, and CAD-120 datasets. Our approach significantly outperforms state-of-the-art methods, verifying that GPNN scales to large datasets and applies to spatial-temporal settings.
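The iterative inference described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how edges, messages, state updates, or labels are computed, so the four steps below use fixed toy functions (pairwise similarity, weighted sum, tanh, linear argmax) where GPNN would use learned networks. The names `parse_graph`, `readout_weights`, and the example features are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def parse_graph(node_feats, readout_weights, steps=3):
    """Toy iterative parse: returns (soft adjacency matrix, node labels)."""
    n = len(node_feats)
    hidden = [list(f) for f in node_feats]  # one hidden state per node
    adj = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        # Link step (assumed form): soft edge weight from pairwise similarity.
        adj = [[sigmoid(dot(hidden[i], hidden[j])) if i != j else 0.0
                for j in range(n)] for i in range(n)]
        # Message step (assumed form): adjacency-weighted sum of other nodes' states.
        messages = [[sum(adj[i][j] * hidden[j][k] for j in range(n))
                     for k in range(len(hidden[i]))] for i in range(n)]
        # Update step (assumed form): squash old state plus incoming message.
        hidden = [[math.tanh(h + m) for h, m in zip(hidden[i], messages[i])]
                  for i in range(n)]
    # Readout step (assumed form): linear score per class, label = argmax.
    labels = []
    for h in hidden:
        scores = [dot(w, h) for w in readout_weights]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return adj, labels

# Hypothetical scene: three nodes (say, one human and two objects), 2-D features.
feats = [[1.0, 0.0], [0.8, 0.1], [-1.0, 0.5]]
weights = [[1.0, 0.0], [-1.0, 0.0]]  # two made-up label classes
adjacency, node_labels = parse_graph(feats, weights)
```

The point of the sketch is the loop structure: the adjacency matrix is re-estimated from the current node states at every step, and the node states are in turn updated through that adjacency, so structure and labels are inferred jointly rather than in separate stages.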


Human-object interaction, Message passing, Graph parsing, Neural networks



The authors thank Prof. Ying Nian Wu from the UCLA Statistics Department for helpful comments on this work. This research is supported by DARPA XAI N66001-17-2-4029, ONR MURI N00014-16-1-2007, ARO W911NF1810296, and N66001-17-2-3602.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. University of California, Los Angeles, Los Angeles, USA
  2. International Center for AI and Robot Autonomy (CARA), Los Angeles, USA
  3. Beijing Institute of Technology, Beijing, China
  4. Peking University, Beijing, China
  5. Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates
