Exploiting Attention for Visual Relationship Detection

  • Tongxin Hu
  • Wentong Liao
  • Michael Ying Yang
  • Bodo Rosenhahn
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11824)


Visual relationship detection aims to predict the categories of predicates and object pairs and to localize the object pairs. Recognizing the relationships between individual objects is important for describing visual scenes in static images. In this paper, we propose a novel end-to-end framework for the visual relationship detection task. First, we design a spatial attention model that specializes predicate features. Compared to a standard RoI-pooling layer, this structure significantly improves Predicate Classification performance. Second, to extract the relative spatial configuration, we propose mapping simple geometric representations to a high-dimensional space, which boosts relationship detection accuracy. Third, we implement a feature embedding model with a bi-directional RNN that treats subject, predicate and object as a time sequence. We evaluate our method on three tasks. The experiments demonstrate that our method achieves competitive results compared to state-of-the-art methods.
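The second contribution, mapping simple geometric representations to a higher dimension, can be illustrated with a minimal sketch. The abstract does not specify which geometric quantities or which mapping are used, so everything here is an assumption made for illustration: the five features (normalized center offsets, log size ratios, IoU), the function names `box_pair_geometry`, `make_projection` and `project`, the output size of 64, and the use of a single randomly initialized linear layer with ReLU in place of a learned one.

```python
import math
import random


def box_pair_geometry(subj, obj):
    """Relative geometry of two boxes given as (x1, y1, x2, y2).

    Assumed feature set (not taken from the paper): center offsets
    normalized by subject size, log width/height ratios, and IoU.
    """
    sx1, sy1, sx2, sy2 = subj
    ox1, oy1, ox2, oy2 = obj
    sw, sh = sx2 - sx1, sy2 - sy1
    ow, oh = ox2 - ox1, oy2 - oy1

    # Offset of the object center from the subject center.
    dx = ((ox1 + ox2) - (sx1 + sx2)) / (2.0 * sw)
    dy = ((oy1 + oy2) - (sy1 + sy2)) / (2.0 * sh)

    # Scale change between the two boxes.
    dw = math.log(ow / sw)
    dh = math.log(oh / sh)

    # Intersection over union of the two boxes.
    iw = max(0.0, min(sx2, ox2) - max(sx1, ox1))
    ih = max(0.0, min(sy2, oy2) - max(sy1, oy1))
    inter = iw * ih
    union = sw * sh + ow * oh - inter
    return [dx, dy, dw, dh, inter / union]


def make_projection(in_dim, out_dim, seed=0):
    """Randomly initialized linear layer; a stand-in for learned weights."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases


def project(features, weights, biases):
    """Map low-dimensional geometry to a higher dimension (linear + ReLU)."""
    return [max(0.0, sum(w * f for w, f in zip(row, features)) + b)
            for row, b in zip(weights, biases)]


subject_box = (0.0, 0.0, 10.0, 10.0)
object_box = (5.0, 5.0, 15.0, 15.0)
geometry = box_pair_geometry(subject_box, object_box)
weights, biases = make_projection(in_dim=len(geometry), out_dim=64)
embedding = project(geometry, weights, biases)
```

In a trained model the projection weights would be learned jointly with the rest of the network; the random initialization above only shows the shape of the computation, from a 5-dimensional geometric descriptor to a 64-dimensional feature.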



The work is funded by DFG (German Research Foundation) YA 351/2-1 and RO 4804/2-1 within SPP 1894. The authors gratefully acknowledge the support. The authors also acknowledge NVIDIA Corporation for the donated GPUs.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Tongxin Hu (1), corresponding author
  • Wentong Liao (1)
  • Michael Ying Yang (2)
  • Bodo Rosenhahn (1)
  1. Leibniz University Hannover, Hanover, Germany
  2. University of Twente, Enschede, Netherlands
