Iterative Visual Relationship Detection via Commonsense Knowledge Graph

  • Hai Wan
  • Jialing Ou
  • Baoyi Wang
  • Jianfeng Du
  • Jeff Z. Pan
  • Juan Zeng
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12032)


Visual relationship detection, i.e., discovering the interactions between pairs of objects in an image, plays a significant role in image understanding. However, most recent works consider only visual features, ignoring the implicit effect of common sense. Motivated by iterative visual reasoning in image recognition, we propose a novel model, named Iterative Visual Relationship Detection with Commonsense Knowledge Graph (IVRDC), that takes advantage of common sense in the form of a knowledge graph. Our model consists of two modules: a feature module that predicts predicates from visual and semantic features with a bi-directional RNN, and a commonsense knowledge module that constructs a task-specific commonsense knowledge graph for predicate prediction. After iteratively combining the predictions of both modules, IVRDC updates its memory and the commonsense knowledge graph. The final predictions are made by aggregating the results of all iterations with an attention mechanism. Our experiments on the Visual Relationship Detection (VRD) dataset and the Visual Genome (VG) dataset demonstrate that the proposed model is competitive.
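The iterative scheme described above can be sketched in a few lines. This is only an illustrative toy, not the authors' implementation: `feature_module` and `knowledge_module` are hypothetical stand-ins (simple linear scorers) for the bi-directional RNN and the knowledge-graph propagation, and the memory-update and attention rules are assumptions chosen to mirror the abstract's description.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def feature_module(memory, W_f):
    # stand-in for the bi-directional RNN over visual/semantic features
    return softmax(memory @ W_f)


def knowledge_module(memory, W_k):
    # stand-in for propagation over the commonsense knowledge graph
    return softmax(memory @ W_k)


def ivrdc_predict(x, n_iters=3, n_predicates=5, seed=0):
    """Toy IVRDC-style loop: fuse two modules, update memory each
    iteration, then attend over the per-iteration predictions."""
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    W_f = rng.standard_normal((d, n_predicates))   # feature-module weights
    W_k = rng.standard_normal((d, n_predicates))   # knowledge-module weights
    W_m = rng.standard_normal((n_predicates, d))   # memory-update weights

    memory = x.copy()
    per_iter = []
    for _ in range(n_iters):
        p_f = feature_module(memory, W_f)
        p_k = knowledge_module(memory, W_k)
        p = 0.5 * (p_f + p_k)              # combine predictions of both modules
        per_iter.append(p)
        memory = memory + p @ W_m          # feed the fused prediction back

    preds = np.stack(per_iter)             # (n_iters, n_predicates)
    attn = softmax(preds.max(axis=1))      # attention weights over iterations
    return attn @ preds                    # final predicate distribution
```

Because each per-iteration prediction is a probability distribution and the attention weights sum to one, the returned vector is itself a valid distribution over predicates.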


Keywords: Commonsense knowledge graph · Visual relationship detection · Visual Genome



This paper was supported by the National Natural Science Foundation of China (No. 61375056, 61876204, 61976232, and 51978675), Guangdong Province Natural Science Foundation (No. 2017A070706010 (soft science), 2018A030313086), All-China Federation of Returned Overseas Chinese Research Project (17BZQK216), Science and Technology Program of Guangzhou (No. 201804010496, 201804010435).



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
  2. School of Information Science and Technology / School of Cyber Security, Guangdong University of Foreign Studies, Guangzhou, China
  3. Department of Computing Science, The University of Aberdeen, Aberdeen, UK
  4. School of Geography and Planning, Sun Yat-sen University, Guangzhou, China
