Learning visual features for relational CBIR


Recent works in deep-learning research highlighted remarkable relational reasoning capabilities of some carefully designed architectures. In this work, we employ a relationship-aware deep learning model to extract compact visual features used relational image descriptors. In particular, we are interested in relational content-based image retrieval (R-CBIR), a task consisting in finding images containing similar inter-object relationships. Inspired by the relation networks (RN) employed in relational visual question answering (R-VQA), we present novel architectures to explicitly capture relational information from images in the form of network activations that can be subsequently extracted and used as visual features. We describe a two-stage relation network module (2S-RN), trained on the R-VQA task, able to collect non-aggregated visual features. Then, we propose the aggregated visual features relation network (AVF-RN) module that is able to produce better relationship-aware features by learning the aggregation directly inside the network. We employ an R-CBIR ground-truth built by exploiting scene-graphs similarities available in the CLEVR dataset in order to rank images in a relational fashion. Experiments show that features extracted from our 2S-RN model provide an improved retrieval performance with respect to standard non-relational methods. Moreover, we demonstrate that the features extracted from the novel AVF-RN can further improve the performance measured on the R-CBIR task, reaching the state-of-the-art on the proposed dataset.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6


  1. 1.

    Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D (2015) VQA: visual question answering. CoRR arXiv:1505.00468

  2. 2.

    Belilovsky E, Blaschko MB, Kiros JR, Urtasun R, Zemel R (2017) Joint embeddings of scene graphs and images. ICLR

  3. 3.

    Cai H, Zheng VW, Chang KC (2017) A comprehensive survey of graph embedding: problems, techniques and applications. CoRR arXiv:1709.07604

  4. 4.

    Dai B, Zhang Y, Lin D (2017) Detecting visual relationships with deep relational networks. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 3298–3308. IEEE

  5. 5.

    Gordo A, Almazan J, Revaud J, Larlus D (2016) End-to-end learning of deep visual representations for image retrieval. arXiv preprint arXiv:1610.07940

  6. 6.

    Hu R, Andreas J, Rohrbach M, Darrell T, Saenko K (2017) Learning to reason: end-to-end module networks for visual question answering. In: The IEEE international conference on computer vision (ICCV)

  7. 7.

    Johnson J, Hariharan B, van der Maaten L, Fei-Fei L, Zitnick CL, Girshick R (2017) Clevr: a diagnostic dataset for compositional language and elementary visual reasoning

  8. 8.

    Johnson J, Hariharan B, van der Maaten L, Hoffman J, Fei-Fei L, Lawrence Zitnick C, Girshick R (2017) Inferring and executing programs for visual reasoning. In: The IEEE international conference on computer vision (ICCV)

  9. 9.

    Johnson J, Krishna R, Stark M, Li LJ, Shamma D, Bernstein M, Fei-Fei L (2015) Image retrieval using scene graphs. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3668–3678

  10. 10.

    Kahou SE, Atkinson A, Michalski V, Kádár Á, Trischler A, Bengio Y (2017) Figureqa: an annotated figure dataset for visual reasoning. CoRR arXiv:1710.07300

  11. 11.

    Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li LJ, Shamma DA, Bernstein M, Fei-Fei L (2016) Visual genome: connecting language and vision using crowdsourced dense image annotations

  12. 12.

    Kuznetsova A, Rom H, Alldrin N, Uijlings JRR, Krasin I, Pont-Tuset J, Kamali S, Popov S, Malloci M, Duerig T, Ferrari V (2018) The open images dataset V4: unified image classification, object detection, and visual relationship detection at scale. CoRR arXiv:1811.00982

  13. 13.

    Lu C, Krishna R, Bernstein M, Fei-Fei L (2016) Visual relationship detection with language priors. In: European conference on computer vision

  14. 14.

    Lu P, Ji L, Zhang W, Duan N, Zhou M, Wang J (2018) R-VQA: learning visual relation facts with semantic attention for visual question answering. In: SIGKDD 2018

  15. 15.

    Malinowski M, Fritz M (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. In: Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger K (eds) Advances in neural information processing systems 27. Curran Associates Inc, pp 1682–1690

  16. 16.

    Mascharka D, Tran P, Soklaski R, Majumdar A (2018) Transparency by design: closing the gap between performance and interpretability in visual reasoning. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  17. 17.

    Melucci M (2007) On rank correlation in information retrieval evaluation. SIGIR Forum 41(1):18–33. https://doi.org/10.1145/1273221.1273223

    Article  Google Scholar 

  18. 18.

    Messina N, Amato G, Carrara F, Falchi F, Gennaro C (2019) Learning relationship-aware visual features. In: Leal-Taixé L, Roth S (eds) Computer vision: ECCV 2018 workshops. Springer, Cham, pp 486–501

    Google Scholar 

  19. 19.

    Peyre J, Laptev I, Schmid C, Sivic J (2017) Weakly-supervised learning of visual relations. In: ICCV 2017—international conference on computer vision 2017. Venice, Italy. https://hal.archives-ouvertes.fr/hal-01576035

  20. 20.

    Qi M, Li W, Yang Z, Wang Y, Luo J (2018) Attentive relational networks for mapping images to scene graphs. CoRR arXiv:1811.10696

  21. 21.

    Raposo D, Santoro A, Barrett DGT, Pascanu R, Lillicrap TP, Battaglia PW (2017) Discovering objects and their relations from entangled scene representations. CoRR arXiv:1702.05068

  22. 22.

    Ren M, Kiros R, Zemel R (2015) Exploring models and data for image question answering. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R (eds) Advances in neural information processing systems 28. Curran Associates Inc, pp 2953–2961

  23. 23.

    Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R (eds) Advances in neural information processing systems 28. Curran Associates Inc, pp 91–99

  24. 24.

    Riesen K, Bunke H (2009) Approximate graph edit distance computation by means of bipartite graph matching. Image Vis Comput 27(7):950–959. https://doi.org/10.1016/j.imavis.2008.04.004

    Article  Google Scholar 

  25. 25.

    Santoro A, Raposo D, Barrett DG, Malinowski M, Pascanu R, Battaglia P, Lillicrap T (2017) A simple neural network module for relational reasoning. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30. Curran Associates Inc, pp 4967–4976

  26. 26.

    Tolias G, Sicre R, Jégou H (2015) Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879

  27. 27.

    Yang J, Lu J, Lee S, Batra D, Parikh D (2018) Graph R-CNN for scene graph generation. CoRR arXiv:1808.00191

  28. 28.

    Yang Z, He X, Gao J, Deng L, Smola AJ (2015) Stacked attention networks for image question answering. CoRR arXiv:1511.02274

  29. 29.

    Yao T, Pan Y, Li Y, Mei T (2018) Exploring visual relationship for image captioning. CoRR arXiv:1809.07041

  30. 30.

    Zhang J, Kalantidis Y, Rohrbach M, Paluri M, Elgammal AM, Elhoseiny M (2018) Large-scale visual relationship understanding. CoRR arXiv:1804.10660

  31. 31.

    Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A., Fergus R (2015) Simple baseline for visual question answering. CoRR arXiv:1512.02167

Download references


This work was partially supported by Automatic Data and documents Analysis to enhance human-based processes (ADA), CUP CIPE D55F17000290009, and by the AI4EU project, funded by the EC (H2020—Contract no. 825619). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

Author information



Corresponding author

Correspondence to Nicola Messina.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Messina, N., Amato, G., Carrara, F. et al. Learning visual features for relational CBIR. Int J Multimed Info Retr 9, 113–124 (2020). https://doi.org/10.1007/s13735-019-00178-7

Download citation


  • Content-based image retrieval
  • Deep learning
  • Relational reasoning
  • Relation networks
  • Deep features