Learning visual features for relational CBIR

Messina, Nicola; Amato, Giuseppe; Carrara, Fabio; Falchi, Fabrizio; Gennaro, Claudio

doi:10.1007/s13735-019-00178-7

Regular Paper
Published: 14 September 2019

Volume 9, pages 113–124 (2020)

International Journal of Multimedia Information Retrieval

Abstract

Recent work in deep learning has highlighted the remarkable relational reasoning capabilities of some carefully designed architectures. In this work, we employ a relationship-aware deep learning model to extract compact visual features that are used as relational image descriptors. In particular, we are interested in relational content-based image retrieval (R-CBIR), the task of finding images that contain similar inter-object relationships. Inspired by the relation networks (RN) employed in relational visual question answering (R-VQA), we present novel architectures that explicitly capture relational information from images in the form of network activations, which can subsequently be extracted and used as visual features. We describe a two-stage relation network module (2S-RN), trained on the R-VQA task, that is able to collect non-aggregated visual features. Then, we propose the aggregated visual features relation network (AVF-RN) module, which produces better relationship-aware features by learning the aggregation directly inside the network. We employ an R-CBIR ground truth built by exploiting the scene-graph similarities available in the CLEVR dataset in order to rank images in a relational fashion. Experiments show that features extracted from our 2S-RN model provide improved retrieval performance with respect to standard non-relational methods. Moreover, we demonstrate that the features extracted from the novel AVF-RN further improve the performance measured on the R-CBIR task, reaching the state of the art on the proposed dataset.
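
As an illustration of the general idea, the following is a minimal sketch, in Python with NumPy, of how a relation-network-style descriptor could be computed and used for retrieval. It is not the 2S-RN or AVF-RN architecture described above. It assumes each image is already represented as a set of object vectors, uses a single randomly initialised layer in place of a trained pairwise function, sums the pairwise activations as in the original relation network formulation, and ranks images by cosine similarity. All names (`pairwise_relation_features`, `rank_by_cosine`, the dimensions and the random data) are hypothetical.

```python
import numpy as np


def pairwise_relation_features(objects, w, b):
    """Apply a one-layer pairwise function g to every ordered object pair
    and sum the activations (RN-style aggregation; weights are assumed)."""
    n, _ = objects.shape
    left = np.repeat(objects, n, axis=0)    # o_i, each repeated n times
    right = np.tile(objects, (n, 1))        # o_j, cycling through all objects
    pairs = np.concatenate([left, right], axis=1)
    hidden = np.maximum(pairs @ w + b, 0.0)  # ReLU activations of g(o_i, o_j)
    return hidden.sum(axis=0)                # order-invariant descriptor


def rank_by_cosine(query, database):
    """Return database indices sorted by decreasing cosine similarity."""
    q = query / (np.linalg.norm(query) + 1e-12)
    db = database / (np.linalg.norm(database, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(db @ q), kind="stable")


# Toy data: 5 "images", each with 4 objects described by 8-dim vectors.
rng = np.random.default_rng(0)
obj_dim, feat_dim = 8, 16
w = rng.normal(size=(2 * obj_dim, feat_dim))  # stand-in for trained weights
b = np.zeros(feat_dim)
images = [rng.normal(size=(4, obj_dim)) for _ in range(5)]

features = np.stack([pairwise_relation_features(im, w, b) for im in images])
ranking = rank_by_cosine(features[0], features)  # image 0 used as the query
```

Because the sum runs over all ordered pairs, the descriptor does not depend on the order in which the objects are listed, which is the property that makes such features usable as a fixed-size image signature.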

Acknowledgements

This work was partially supported by Automatic Data and documents Analysis to enhance human-based processes (ADA), CUP CIPE D55F17000290009, and by the AI4EU project, funded by the EC (H2020—Contract no. 825619). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

Author information

Authors and Affiliations

Via G. Moruzzi, 1, 56124, Pisa, Italy
Nicola Messina, Giuseppe Amato, Fabio Carrara, Fabrizio Falchi & Claudio Gennaro

Corresponding author

Correspondence to Nicola Messina.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Messina, N., Amato, G., Carrara, F. et al. Learning visual features for relational CBIR. Int J Multimed Info Retr 9, 113–124 (2020). https://doi.org/10.1007/s13735-019-00178-7

Received: 15 April 2019
Revised: 20 July 2019
Accepted: 04 September 2019
Published: 14 September 2019
Issue Date: June 2020
DOI: https://doi.org/10.1007/s13735-019-00178-7
