I2DFormer+: Learning Image to Document Summary Attention for Zero-Shot Image Classification

Naeem, Muhammad Ferjad; Xian, Yongqin; Gool, Luc Van; Tombari, Federico

doi:10.1007/s11263-024-02053-3

Published: 24 April 2024

International Journal of Computer Vision (2024)

Muhammad Ferjad Naeem ORCID: orcid.org/0000-0001-7455-7280¹,
Yongqin Xian²,
Luc Van Gool¹ &
…
Federico Tombari²

Abstract

Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name. However, word embeddings extracted from pre-trained language models do not necessarily capture visual similarities, resulting in poor zero-shot performance. In this work, we argue that online textual documents, e.g., Wikipedia, contain rich visual descriptions of object classes and can therefore be used as powerful unsupervised side information for ZSL. To this end, we propose I2DFormer+, a novel transformer-based ZSL framework that jointly learns to encode images and documents by aligning both modalities in a shared embedding space. I2DFormer+ utilizes our novel Document Summary Transformer (DSTransformer), a text transformer that learns to encode a sequence of text into a fixed set of summary tokens. These summary tokens are utilized by a cross-modal attention module that learns fine-grained interactions between image patches and the summary of the document. Consequently, I2DFormer+ not only learns highly discriminative document embeddings that capture visual similarities but also gains the ability to explain which regions of the image are important for the decision. Quantitatively, we demonstrate that I2DFormer+ significantly outperforms previous unsupervised semantic embeddings under both zero-shot and generalized zero-shot learning settings on three public datasets. Qualitatively, we show that our method leads to highly interpretable results. Furthermore, we scale our model to the large-scale zero-shot learning setting and show state-of-the-art performance on two challenging ImageNet benchmarks.
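The two mechanisms named in the abstract, compressing a document into a fixed set of summary tokens and letting image patches attend to those tokens, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the use of query vectors for summarization, the single-head attention without learned projections, the function names `cross_attention` and `summarize_document`, the dimensions, and the final pooling into a score are all assumptions made for illustration, with random arrays standing in for learned features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    Returns the attended features and the attention weights.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return weights @ keys_values, weights


def summarize_document(doc_tokens, summary_queries):
    """Hypothetical stand-in for a document summarizer: a fixed number of
    query vectors attend over a variable-length token sequence."""
    summary_tokens, _ = cross_attention(summary_queries, doc_tokens)
    return summary_tokens


rng = np.random.default_rng(0)
dim, num_summary, num_patches = 16, 4, 9

# Random stand-ins for learned quantities (assumptions, not real features).
summary_queries = rng.normal(size=(num_summary, dim))
patches = rng.normal(size=(num_patches, dim))
short_doc = rng.normal(size=(50, dim))    # a 50-token document
long_doc = rng.normal(size=(120, dim))    # a 120-token document

# Documents of different lengths map to the same fixed number of summary tokens.
short_summary = summarize_document(short_doc, summary_queries)
long_summary = summarize_document(long_doc, summary_queries)

# Each image patch attends to the summary tokens of one document.
attended, patch_to_summary = cross_attention(patches, short_summary)

# Illustrative pooling into a single image-document compatibility score.
score = float((attended * patches).sum(axis=-1).mean())
```

In this sketch each row of `patch_to_summary` is a distribution over summary tokens for one patch, which is the kind of quantity one could inspect to see which image regions interact with which part of the summary.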

Author information

Authors and Affiliations

Computer Vision Lab, ETH Zürich, Zurich, Switzerland
Muhammad Ferjad Naeem & Luc Van Gool
Google, Zurich, Switzerland
Yongqin Xian & Federico Tombari

Corresponding author

Correspondence to Yongqin Xian.

Additional information

Communicated by Vittorio Murino.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Naeem, M.F., Xian, Y., Gool, L.V. et al. I2DFormer+: Learning Image to Document Summary Attention for Zero-Shot Image Classification. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02053-3

Received: 14 March 2023
Accepted: 08 March 2024
Published: 24 April 2024
DOI: https://doi.org/10.1007/s11263-024-02053-3
