1 Introduction

Large-scale pre-training of neural networks has recently resulted in the construction of a multitude of foundation models for Language (Devlin et al., 2018; Radford et al., 2019) and Vision & Language (V&L) understanding (Radford et al., 2021; Jia et al., 2021; Yu et al., 2022; Alayrac et al., 2022). Unlike the previous generation of neural networks, such models can better capture the distribution of the world, from which new favorable properties and characteristics emerge. Of particular interest to this work are V&L models trained with contrastive learning (i.e. CLIP-like models (Radford et al., 2021; Jia et al., 2021; Li et al., 2021; Yao et al., 2021; Yu et al., 2022)), which have enabled seamless few-shot and even zero-shot adaptation to new downstream tasks and datasets. Specifically, this paper proposes a simple yet highly effective way to drastically improve soft prompt learning for the few-shot adaptation of a V&L model to a given downstream task.

Similarly to their NLP counterparts (Radford et al., 2021; Lester et al., 2021; Li & Liang, 2021), prompt engineering and learning have emerged as some of the most powerful techniques for adapting a V&L model to new tasks. Initially, in Radford et al. (2021), a set of manually-defined hand-engineered templates (or prompts) like a photo of a {cls_name}, or a black and white photo of a {cls_name} were passed through the text encoder of the V&L model to create class-specific weights for category cls_name that can be used for zero-shot recognition. Following research in NLP (Lester et al., 2021; Li & Liang, 2021), subsequent work (Zhou et al., 2022, 2022a) has proposed replacing the manually picked templates with a sequence of learnable vectors, also coined soft prompts, which are fed as input to the text encoder along with the class name cls_name. The soft prompts are learned from a few training examples, with the parameters of the entire V&L model kept frozen. The whole process can be seen as parameter-efficient fine-tuning of the V&L model on a small training dataset.

However, a clearly identifiable problem with prompt learning is base class overfitting: while the accuracy on the classes used for training (base classes) significantly increases, the accuracy on (novel) classes unseen during training significantly drops. This is to some extent expected, as soft prompts are learned from few examples belonging to the base classes. Notably, on novel classes, direct, zero-shot recognition using hand-engineered prompts outperforms all existing soft prompt learning methods.

In addition to this, for adaptation, all prior works assume the existence of paired vision and language data. Herein, we seek to relax this setting and advance the idea of vision-language adaptation without images, i.e. using solely language data, namely the class names of interest.

Key ideas: Firstly, to alleviate base class overfitting, we propose a solution motivated by the following observation: prompt learning improves the accuracy on base classes, but prompt engineering is significantly better on novel classes. We therefore propose to learn the soft prompts by adding a cross-entropy text-to-text loss that enforces the learned prompts to be close, in embedding space, to the textual ones, thus exploiting the intrinsic information captured by the text encoder. The proposed text-to-text loss enables language-only optimization for vision-language adaptation for the first time. This is in contrast with prior soft-prompt learning methods that only capture vision-language interactions.

Secondly, as CLIP learns a joint shared representation for the two domains, i.e. vision and language, one can approximate, to some extent, the vision domain with language (limited by the induced contrastive domain gap). Hence, by exploiting this, we devise a prompt learning framework for vision language adaptation that can learn solely based on the class names.

Key contributions: Based on the above, we propose a novel framework for soft prompt learning which we call Language-Aware Soft Prompting (LASP) trained either with labeled vision-language data or solely in the language domain (LASP-Z). Our main contributions within the LASP framework are as follows:

  • We propose, for the first time, language-only optimization for vision-language adaptation. Specifically, we propose a novel text-to-text cross-entropy loss that maximizes the probability of the learned prompts being correctly classified with respect to the hand-engineered ones and show its effectiveness in terms of alleviating base-class overfitting.

  • To increase the representation capacity of the prompts, and inspired by grouped convolution and multi-head attention, we propose a grouped language-aware prompt representation where each group of prompts specializes to a different subset of the pre-defined manual templates.

  • We identify a visual-language misalignment introduced by prompt learning and LASP, which impacts generalization. More importantly, we propose a re-calibration mechanism based on (a) Layer Normalization fine-tuning and (b) learning a class-agnostic bias to address it.

  • Thanks to our language-only learning framework, we propose training LASP with virtual classes by including, during training, class names for which no visual samples are available. Importantly, we show that this further increases the robustness of the learned prompts.

  • Finally, by capitalizing on our language-only optimization framework, we present a zero-shot variant of LASP where no visual samples at all are available for the downstream adaptation task and show its superiority over CLIP with prompt engineering. Effectively, this accomplishes vision-language adaptation without vision data.

Main results: Our methods set a new state-of-the-art for few-shot and zero-shot image classification on 11 datasets, significantly outperforming all prior soft prompting works. Importantly, we present, for the first time, a prompt learning method that, for the majority of the test datasets (8 out of 11), outperforms the very strong baseline of CLIP with hand-crafted prompts for the recognition of novel classes (i.e. the zero-shot setting). Moreover, our zero-shot V&L adaptation approach, LASP-Z, improves upon zero-shot CLIP without requiring any images at train time.

2 Related Work

Contrastive Vision-Language Models: Recently, large-scale V&L pre-training with contrastive learning has been used to train foundation models resulting in robust representations, transferable to new tasks under both few-shot and zero-shot settings (Radford et al., 2021; Jia et al., 2021; Li et al., 2021; Yao et al., 2021; Yu et al., 2022). Such networks consist of a vision encoder (typically a ViT (Dosovitskiy et al., 2020)) and a Transformer-based text encoder (Vaswani et al., 2017). Highly parameterized instantiations of such architectures are trained on large corpora of image-caption pairs (e.g. Radford et al. (2021) uses 400M and Jia et al. (2021) 1B pairs) using contrastive learning. We used CLIP (Radford et al., 2021) as the foundation model for our method.

Domain generalization aims to learn models that generalize to out-of-domain data. Current approaches attempt to perform data alignment (Hu et al., 2020; Mahajan et al., 2021; Shao et al., 2019), augmentation (Shi et al., 2020; Zhou et al., 2023), meta-learning (Balaji et al., 2018; Dou et al., 2019), self-supervised learning (Albuquerque et al., 2020) or reinforcement learning (Laskin et al., 2020; Yarats et al., 2021). As our approach can generalize outside of the source data, either via few-shot adaptation (LASP) or, more extremely, using solely the class names (LASP-Z), it can also be considered a domain-generalization method.

Zero/few-shot learning is concerned with the construction of models that can be adapted to downstream tasks using few or even no labeled samples. Both scenarios are currently dominated by large-scale contrastively pretrained vision-language models (Radford et al., 2021; Yao et al., 2021), a line of work which our approach builds upon too. While a full review goes beyond the scope of this work, we note that this is a vast research field (Nichol et al., 2018; Rajeswaran et al., 2019; Li et al., 2017), and refer the reader to (Song et al., 2023).

Prompt Learning is about adapting pre-trained foundational models to (downstream) tasks, typically in a zero-shot or few-shot setting. First proposed in the context of Language Models (LMs), prompting was initially about prepending hand-crafted instructions/examples to the task input so that the LM generates the appropriate output conditioned on the input (Radford et al., 2019; Brown et al., 2020). In (Schick & Schutze, 2020a, b), the main idea is to reformulate the downstream task as a cloze task using hand-crafted patterns (or templates), thus avoiding the need to train a task-specific classifier. As finding the optimal patterns is laborious, recent works have attempted to address this by learning a set of soft (continuous) prompts (Lester et al., 2021; Li & Liang, 2021).

In V&L foundation models, like CLIP, the class names are used to create hand-crafted prompts (Radford et al., 2021) that are fed as input to the text encoder, enabling zero-shot visual recognition. CoOp (Zhou et al., 2022) extends work on soft prompt optimization to the V&L domain by learning a set of M prompts which are used as input to the text encoder alongside the class name. The prompts are learned by minimizing the classification error on a training set consisting of the given base classes. One major limitation of CoOp is weak generalization: the learned prompts overfit the base classes and do not work well when tested on novel classes. To alleviate this, CoCoOp (Zhou et al., 2022a) proposes a dynamic version of Zhou et al. (2022) where a small network is trained to produce a visual feature from the input image that is added to the learned prompts, hence making them input-specific (i.e. dynamic). ProDA (Lu et al., 2022) adopts a probabilistic approach by modelling the distribution of the prompts at the output of the text encoder as a multivariate Gaussian distribution. The estimated mean is used during inference. UPL (Huang et al., 2022) uses CLIP to generate pseudo-labels on the target dataset and then self-training to learn the soft prompts. Finally, ProGrad (Zhu et al., 2022) aims to adapt the V&L model to each target domain by encouraging it “not to forget” CLIP’s zero-shot predictions using a KL visual-text loss between CLIP’s logits and their model’s logits (i.e. they use visual features). The prompt weights are then updated in the direction perpendicular to the CLIP gradients. In contrast, our loss is a pure text-to-text loss, further allowing for the incorporation of virtual classes. Unlike (Zhu et al., 2022), we outperform CLIP on novel classes.

The proposed LASP framework alleviates base class overfitting and significantly improves upon the previously reported best results without resorting to a dynamic approach as in CoCoOp (Zhou et al., 2022a). In its basic version, LASP deploys a text-to-text loss that enforces the learned prompts to be “close” to a set of manually defined textual prompts in the text encoder space. Importantly, the basic LASP can be extended in three important ways: (1) by allowing the incorporation of virtual classes, i.e. novel class name information for which no (visual) training data is available (LASP-V); this is shown to significantly improve the robustness of the learned prompts at no extra cost during inference; (2) by allowing the use of a grouped prompt representation within the proposed language-aware training, which is shown to increase the representation capacity of the learned prompts; (3) by performing further optimization of the visual encoder so that the visual and text embeddings are re-aligned, resulting in significant accuracy gains. Finally, we present a zero-shot variant of LASP where no training images at all are available for the downstream adaptation task. Notably, our approach is very efficient (as efficient as Zhou et al. (2022)), as opposed to Zhou et al. (2022a), which requires recomputing all the class-related text embeddings every time a new image is to be classified.

3 Method

3.1 Background

Prompt engineering enables zero-shot visual recognition using V&L models trained with contrastive learning (CLIP in this work) as follows: Given a set \({\mathcal {V}}\) of C class names, \(\texttt {class\_name}_c\), \(c \in \{1,\dots , C\}\), a prompt, i.e. a manually designed template concatenated with the class name, like \(h_c=\)a photo of a \(\{\texttt {class\_name}_c\}\), is passed through the V&L’s text encoder \(g_T(.)\) to compute the class-specific text feature (weight) \(\textbf{t}^h_c = g_T(h_c)\). Moreover, an image \({\textbf {x}}\) to be classified is passed through the V&L’s image encoder \(g_I(.)\) to compute the image-specific feature \({\textbf {f}} = g_I(\textbf{x})\). A probability distribution over the class labels is given by:

$$\begin{aligned} P_h(y|\textbf{x}) = \frac{\exp \Bigl ( \texttt {cos}(\textbf{t}^h_y, \textbf{f}) / \tau \Bigr ) }{\sum _{c=1}^{C} \exp \Bigl (\texttt {cos}(\textbf{t}^h_c, \textbf{f})/\tau \Bigr )}, \end{aligned}$$
(1)

where \(\tau \) is a temperature factor and \(\texttt {cos}\) the cosine similarity. Finally, the class for \({\textbf {x}}\) is given by \(\tilde{y}=\arg \max _{y} P_h(y|\textbf{x})\). Note that, to compute \(\textbf{t}^h_c\), no training with class-specific image data is required, thus enabling zero-shot recognition for any given class name.
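For concreteness, the zero-shot classification rule of Eq. 1 can be sketched in a few lines of PyTorch. This is a minimal illustration only: `encode_text` and `encode_image` stand for the frozen CLIP text and image encoders returning feature tensors, and the function names, template and temperature value are placeholder assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_probs(encode_text, encode_image, class_names, image, tau=0.01):
    """Eq. 1: softmax over the cosine similarities between the image feature f
    and the hand-crafted, class-specific text features t^h_c."""
    prompts = [f"a photo of a {name}" for name in class_names]   # h_c
    t = F.normalize(encode_text(prompts), dim=-1)                # (C, d) -> t^h_c
    f = F.normalize(encode_image(image), dim=-1)                 # (1, d) -> f
    logits = (f @ t.t()) / tau                                   # cosine similarity / tau
    return logits.softmax(dim=-1)                                # P_h(y | x)


# predicted class: zero_shot_probs(...).argmax(dim=-1)
```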

Soft prompt learning (Lester et al., 2021; Li & Liang, 2021; Zhou et al., 2022) is concerned with parameter-efficient fine-tuning of a pre-trained V&L model by learning a sequence of M learnable vectors \(\textbf{p}_m\in {\mathbb {R}}^{d}, m \in \{1,\dots ,M\}\), using a few labeled samples. Specifically, the manually picked prompt \(h_c\) is replaced by a new learnable one \(\textbf{r}_c\) formed by concatenating the sequence of \(\textbf{p}_m\) with the word embedding \({\textbf {w}}_c\) of \(\texttt {class\_name}_c\), that is: \(\textbf{r}_c = \{{\textbf {p}}_1, {\textbf {p}}_2, \dots , {\textbf {p}}_M, {\textbf {w}}_c\}\), and, finally, a class-specific text feature \(\textbf{t}^r_c = g_T(\textbf{r}_c)\) is obtained. A probability distribution over the class labels is:

$$\begin{aligned} P_r(y|\textbf{x}) = \frac{\exp \Bigl ( \texttt {cos}(\textbf{t}^r_y, \textbf{f}) / \tau \Bigr ) }{\sum _{c=1}^{C} \exp \Bigl (\texttt {cos}(\textbf{t}^r_c, \textbf{f})/\tau \Bigr )}. \end{aligned}$$
(2)

The prompts can be learned by minimizing the cross-entropy loss:

$$\begin{aligned} {\mathcal {L}}_{VL} = - \sum _{c=1}^C \log P_r(c|\textbf{x}) y_{c}. \end{aligned}$$
(3)

Note that the V&L model remains entirely frozen during training. Moreover, as the soft prompts are typically shared across all classes, they can be directly used for zero-shot evaluation on additional novel classes.
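The vanilla soft prompt learning setup of Eqs. 2-3 can be sketched as follows (PyTorch-style). This is a simplified sketch, not the CoOp implementation: each class name is assumed to be represented by a single frozen word embedding, and `encode_prompt` stands for a helper that runs the frozen CLIP text encoder directly on a sequence of input embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftPrompts(nn.Module):
    """M learnable context vectors shared across classes, prepended to each
    (frozen) class-name word embedding w_c, as in Eq. 2."""

    def __init__(self, M, dim, class_word_embeddings):
        super().__init__()
        self.ctx = nn.Parameter(0.02 * torch.randn(M, dim))       # p_1..p_M
        self.register_buffer("w", class_word_embeddings)          # (C, dim), frozen

    def forward(self):
        C = self.w.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)             # (C, M, dim)
        return torch.cat([ctx, self.w.unsqueeze(1)], dim=1)       # r_c, shape (C, M+1, dim)


def vl_loss(prompts, encode_prompt, image_features, labels, tau=0.01):
    """Eq. 3: cross-entropy between image features and the prompt-conditioned
    text features t^r_c."""
    t = F.normalize(encode_prompt(prompts()), dim=-1)             # (C, d)
    f = F.normalize(image_features, dim=-1)                       # (B, d)
    return F.cross_entropy((f @ t.t()) / tau, labels)
```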

3.2 Language-Aware Soft Prompting (LASP)

Despite its strong performance on base classes, vanilla soft prompt learning (see Sect. 3.1) under-performs on novel classes (i.e. the zero-shot setting). While CoCoOp (Zhou et al., 2022a) partially alleviates this by conditioning on the image feature, its accuracy for the zero-shot setting still trails that of CLIP with hand-crafted prompts. Moreover, it requires passing the prompts for all classes through the text encoder every time a new image is to be classified.

In this work, we propose, for the first time, language-only optimization for vision-language downstream adaptation. This is in contrast with prior soft-prompt learning methods that only capture vision-language interactions. Specifically, since the hand-engineered textual prompts outperform the learnable soft prompts in the zero-shot setting, we propose, in order to avoid base-class overfitting and strengthen generalizability, that the learnable prompts be trained so that they can be correctly classified in the language space where the class weights are given by the textual prompts. In other words, the model is forced to correctly classify the learnable prompts into the corresponding hand-crafted ones.

To this end, a second cross-entropy loss is used to minimize the distance between the encoded learned soft prompts and the encoded textual ones. Specifically, recall that \(\textbf{t}^h_c = g_T(h_c)\) is the class weight for class c obtained by encoding the corresponding textual prompt. Assuming that L manually defined textual prompts are available, we have \(\textbf{t}^{h,l}_c, \;l=1,\dots ,L.\) Moreover, \(\textbf{t}^r\) is an encoded learnable prompt to be classified into one of the C classes. Finally, the probability of prompt \(\textbf{t}^r\) being classified as class y is:

$$\begin{aligned} P_{rh}(y|\mathbf {t^r}) = \frac{1}{L}\sum _{l=1}^L \frac{\exp \Bigl ( \texttt {cos}(\textbf{t}_{y}^{h,l}, \textbf{t}^r) / \tau \Bigr ) }{\sum _{c=1}^{C} \exp \Bigl (\texttt {cos}(\textbf{t}_{c}^{h,l}, \textbf{t}^r)/\tau \Bigr )}. \end{aligned}$$
(4)

The language-aware training loss is computed similarly to the vision-language loss:

$$\begin{aligned} {\mathcal {L}}_{TT} = - \sum _{c=1}^C \log P_{rh}(c|\textbf{t}^r) y_{c}, \end{aligned}$$
(5)

with the overall training objective defined as:

$$\begin{aligned} {\mathcal {L}} = \alpha _{VL} {\mathcal {L}}_{VL} + \alpha _{TT} {\mathcal {L}}_{TT}, \end{aligned}$$
(6)

where \(\alpha _{VL}\) and \(\alpha _{TT}\) are user-defined scaling coefficients controlling the magnitude of the \({\mathcal {L}}_{VL}\) and \({\mathcal {L}}_{TT}\) losses, respectively. Overall, we call the proposed learning formulation Language-Aware Soft Prompting (LASP). See also Fig. 1.
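The sketch below illustrates how Eqs. 4-6 can be computed with plain tensor operations, assuming the L hand-crafted templates have been pre-encoded into a tensor `t_h` of shape (L, C, d) and `t_r` holds the C encoded learnable prompts; all names, shapes and the temperature value are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F


def lasp_loss(t_r, t_h, logits_vl, labels, alpha_vl=1.0, alpha_tt=20.0, tau=0.01):
    """t_r: (C, d) encoded learnable prompts; t_h: (L, C, d) encoded hand-crafted
    prompts; logits_vl: (B, C) image-text logits for the standard L_VL term."""
    t_r = F.normalize(t_r, dim=-1)
    t_h = F.normalize(t_h, dim=-1)

    # Eq. 4: classify each learnable prompt against the hand-crafted class weights
    # of every template l, then average the class distributions over the L templates.
    sims = torch.einsum("cd,lkd->lck", t_r, t_h) / tau      # (L, C, C): prompt c vs class k
    p_rh = sims.softmax(dim=-1).mean(dim=0)                 # (C, C)

    # Eq. 5: the target of the c-th learnable prompt is class c itself.
    targets = torch.arange(t_r.shape[0], device=t_r.device)
    loss_tt = F.nll_loss(p_rh.clamp_min(1e-8).log(), targets)

    # Eq. 6: weighted sum of the vision-language and text-to-text losses.
    loss_vl = F.cross_entropy(logits_vl, labels)
    return alpha_vl * loss_vl + alpha_tt * loss_tt
```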

Fig. 1 Overall idea. While standard prompt learning is based on image-text interactions (\({\mathcal {L}}_{VL}\) loss; Eq. 3), LASP additionally models text-text interactions using the proposed Text-to-Text loss \({\mathcal {L}}_{TT}\) (Eq. 5). There are G groups of learned prompts \(\textbf{p}_i^j\) passed through the text encoder to form G text embeddings \(\textbf{t}_j\) summarizing the input. The \({\mathcal {L}}_{TT}\) loss is then applied over the different groups of text embeddings and the textual prompts. Moreover, to alleviate data distribution shift and visual-language misalignment, the LN layers of the visual encoder are fine-tuned and the embeddings are “corrected” in the output space by the learnable vector \(\textbf{b}\), shared across all classes. The text encoder remains entirely frozen. Notably, LASP can be trained with virtual classes by including, during training, class names for which no visual samples are available

Interpretations: LASP can be interpreted in a number of ways:

LASP as a regularizer: Although the learned prompts constitute a small number of parameters, especially in the few-shot setting, the resulting models (prompts) are prone to overfitting to base classes (Zhou et al., 2022). As the proposed language-aware loss encourages the learned prompts to be close in embedding space to the textual ones, LASP can be naturally viewed as a regularizer that prevents the learned prompt-conditioned features from diverging too much from the hand-crafted ones.

LASP as language-based augmentation: Current soft prompt learning methods restrict augmentation to the vision domain, where random transformations, such as rotation, colour jittering or scaling, increase the robustness of the system, especially for cases with a limited number of training samples. However, no augmentations are performed in the language domain. Ideally, we want the prompt-conditioned text embedding to be robust too, capturing the full space of each class. In practice, we can achieve this by targeted prompting, where we specify certain characteristics and/or apply text-based transformations to the class name, e.g. “a sketch of a dog” or “a rotated photo of a dog”.

At train time, as reflected by Eq. 4, we compute the class label distribution per l-th template and then average over all templates. Hence, we opt not to mix across templates during training, as we want the model to focus solely on class information. For example, the model could more easily distinguish between “a sketch of a dog” and “a photo of a wolf” than between “a sketch of a dog” and “a sketch of a wolf”, as in the former case the style could be used as an additional cue. We validated this in preliminary experiments (intermixing the templates was found to hurt performance).

LASP for discriminative class centroids: By optimizing w.r.t. both image and text, our method produces class centroids that are more discriminative and have a higher separation margin. This can be visualized in Fig. 2, where we plot the cosine distance between the embeddings of each class. Our approach learns class centroids that have a higher cosine distance than those of our baseline, CoOp.

LASP as data-free distillation: Typically, knowledge distillation requires a training set of images where a teacher network provides a training signal for the student (Hinton, 2015). LASP’s text-to-text loss can also be interpreted as data-free distillation (i.e. it does not use any image data) where the learnable prompts define the “samples”. As CLIP learns a joint vision-language space, similar concepts are close together across both domains. Hence, optimizing against a concept or object in the language domain, using the proposed loss, should also translate into a step in the visual domain, improving the classification of images.

Fig. 2 Cosine distance between the class embeddings produced by the CLIP text encoder on EuroSAT, DTD, Flowers102 and Food101 for LASP and CoOp. Class centroids situated further apart are more separable, as the underlying image features are identical across both methods. Brighter colors indicate larger distances. The numbers shown are the average cosine distance between the classes

3.3 Grouped LASP

Grouped convolutions (Krizhevsky et al., 2017) and multi-head attention (Vaswani et al., 2017) have been shown to learn strong representations. The groups or the number of heads, respectively, can also be interpreted as a set of experts that are then combined to produce a strong feature. Drawing inspiration from this, we propose a grouped prompt representation, where each group is optimized with respect to a separate subset of textual prompts. Effectively, the prompts of each group will learn a transformation specialized to its corresponding subset (analogous to the aforementioned techniques, which also specialize to a part of the signal). In particular, we split the set of L templates into G equally sized sub-sets. Moreover, each sub-set is associated with a sequence of M prompts \(\textbf{r}^g_c = \{{\textbf {p}}^g_1, \dots , {\textbf {p}}^g_M, {\textbf {w}}_c\}, g=1,\dots ,G\), each producing a class-specific text feature \(\textbf{t}^{r,g}_c = g_T(\textbf{r}^g_c)\). Finally, our text-to-text loss in Eq. 5 becomes:

$$\begin{aligned} {\mathcal {L}}_{TT-G} = - \sum _{g=1}^G\sum _{c=1}^C \log P_{rh}^g(c|\textbf{t}^g) y_c, \end{aligned}$$
(7)

with \(P_{rh}^g\) computed for each group similarly to Eq. 4. At test time, the final result is computed by taking the average of the cosine similarity scores between each group and the visual feature \(\textbf{f}\).
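Under the same assumptions as the previous sketch, the grouped loss of Eq. 7 simply repeats the per-group computation of Eq. 4 over G prompt sets and G template subsets; tensor names are again illustrative only.

```python
import torch
import torch.nn.functional as F


def grouped_tt_loss(t_r_groups, t_h, G, tau=0.01):
    """Eq. 7. t_r_groups: (G, C, d) encoded prompts, one set per group;
    t_h: (L, C, d) encoded hand-crafted templates, split into G equal subsets."""
    C = t_r_groups.shape[1]
    targets = torch.arange(C, device=t_h.device)
    t_h_groups = t_h.chunk(G, dim=0)                        # G tensors of shape (L/G, C, d)

    loss = 0.0
    for g in range(G):
        t_r = F.normalize(t_r_groups[g], dim=-1)            # (C, d)
        t_hg = F.normalize(t_h_groups[g], dim=-1)           # (L/G, C, d)
        sims = torch.einsum("cd,lkd->lck", t_r, t_hg) / tau
        p = sims.softmax(dim=-1).mean(dim=0)                # per-group version of Eq. 4
        loss = loss + F.nll_loss(p.clamp_min(1e-8).log(), targets)
    return loss
```

At test time, as noted above, the G per-group cosine-similarity scores against the visual feature are simply averaged.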

Fig. 3 LASP-Z is a pure zero-shot variant of LASP which does not use any visual samples for downstream adaptation. LASP-Z operates exclusively within the language domain by optimizing the prompts using the proposed Text-to-Text loss \({\mathcal {L}}_{TT}\) (Eqs. 5, 8). During training, in order to explicitly alleviate the domain gap and explore the vicinity of the class embeddings, we propose (a) to augment the input space using a series of randomly sampled adjectives/attributes and (b) to augment the output space by injecting additive Gaussian noise into the class embedding. Note that, in the figure above, the text encoder remains entirely frozen and its weights are shared

3.4 Re-aligning LASP

Combating data distribution shift: for some downstream tasks, it is possible that there is a data distribution shift between the downstream image dataset and the one used by CLIP during training. Hence, we would like this aspect to be captured by the downstream adaptation method. To this end, some optimization of the visual encoder can be performed; nevertheless, this can very easily result in base class overfitting if, after training, the V&L embeddings are pushed away from the joint space learned by CLIP. For example, preliminary results with visual adapters have shown that they hurt zero-shot accuracy. In contrast, we found that Layer Normalization (LN) (Ba et al., 2016) fine-tuning is a much more robust way to adapt the visual encoder. Overall, we propose fine-tuning the LN layers of the CLIP visual encoder as a way to combat distribution shift.

Combating V&L misalignment: Because, after LN fine-tuning, the vision and language embeddings are no longer guaranteed to be aligned, we also propose to learn a “correction” at the output of the text encoder in the form of a learnable offset (bias) that aims to re-align the two modalities. Let \(\textbf{W}\) be the set of weights of the linear classifier obtained by passing the learned prompts through the text encoder. We propose to learn a vector \(\textbf{b}\in {\mathbb {R}}^d\) that is simply added to \(\textbf{W}\), i.e. \(\textbf{W} = \textbf{W} + \textbf{b}\). Importantly, the learned offset is shared among all classes, and in this way it can be readily applied to novel classes too.
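In PyTorch terms, the two re-alignment steps amount to a few lines. The sketch below assumes a standard `nn.Module` visual encoder and a (C, d) matrix `W` of prompt-derived class weights; it is an illustration under those assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn


def unfreeze_layernorm_only(visual_encoder: nn.Module):
    """(a) Fine-tune only the LayerNorm affine parameters of the visual encoder;
    everything else stays frozen."""
    for p in visual_encoder.parameters():
        p.requires_grad_(False)
    for m in visual_encoder.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad_(True)


class ClassAgnosticBias(nn.Module):
    """(b) A single learnable offset b shared across all classes (novel ones included),
    added to the text-derived classifier weights W."""

    def __init__(self, dim):
        super().__init__()
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, W):                 # W: (C, d)
        return W + self.b                 # W = W + b
```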

3.5 LASP with Virtual Classes (LASP-V)

A direct observation that can be drawn from Eq. 4 is that, in practice, we do not have to use only the class names for which we have labelled image data, as the value of \({\mathcal {L}}_{TT}\) is independent of the input image. To this end, we propose to learn the prompts using both annotated image-text pairs and class names outside of the base set (for which we have no images available). We call this setting training LASP with virtual classes. Our setting combines the best of both worlds: the guidance from the few annotated image samples and the zero-shot generalizability of language-based training. As our results show, LASP with virtual classes can significantly improve the robustness of the learned prompts. We refer to this variant of our method as LASP-V.

Note that training with virtual classes does not violate the zero-shot setting (Xian et al., 2017). Moreover, from a practical perspective, if the novel class names are not known during initial training, the model can simply be retrained in a zero-shot manner when they become available.
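As an illustration (with made-up class names), adding virtual classes only changes which names feed the text-to-text loss; the vision-language loss still sees base classes only:

```python
# Hypothetical example: base classes come with labeled images,
# virtual classes contribute class names only.
base_classes = ["golden retriever", "tabby cat"]        # used by both L_VL and L_TT
virtual_classes = ["border collie", "siamese cat"]      # used by L_TT only (no images)

tt_class_names = base_classes + virtual_classes         # builds t^h and t^r for Eqs. 4-5
vl_class_names = base_classes                           # builds the classifier for Eq. 3
```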

4 Zero-Shot LASP (LASP-Z)

The LASP framework presented so far combines vision-language and language-language optimization for both few-shot and in-domain zero-shot accuracy. However, as \({\mathcal {L}}_{VL}\) and \({\mathcal {L}}_{TT}\) can be applied independently, one can fully transition from the few-shot setting to the zero-shot one, where no visual examples at all are available, just the class names. Herein, we study this zero-shot setting, introducing LASP-Z, a visual data-free approach capable of zero-shot in-domain specialization.

Not entirely surprisingly, a naive, direct application of Eq. 5 is highly prone to overfitting: Firstly, there is an implicit domain gap between the vision and language modalities within the CLIP embedding space (Liang et al., 2022), hence overly specializing to the textual data amplifies and further exposes this dissimilarity. Secondly, for image data, it is common practice to apply random train-time augmentations with the goal of alleviating overfitting. In fact, image augmentation has been shown to be a key component in many state-of-the-art self-supervised representation learning methods (Chen et al., 2020; Caron et al., 2020). Hence, to optimize Eq. 5 without overfitting, one should aim at applying inexpensive, yet effective, augmentations in the language domain.

Specifically, we define two “augmentation” inducing functions, applied pre- and post-encoding: \(f_{pre}(.)\) and \(f_{post}(.)\), respectively. \(f_{pre}\) inserts an adjective or attribute, sampled from a predefined set, before the class name (e.g. large, small, rotated, pixelated, colorful, etc.). This explores, in essence, class-generic appearance variations directly in the text domain, analogous to the image ones. Moreover, \(f_{post}(\textbf{t}) = \textbf{t} + \textbf{x}, \textbf{x} \sim {\mathcal {N}}(\mu ,\sigma ^2)\), adds to the text feature descriptor \(\textbf{t}\) a noise vector sampled from a normal distribution. Depending on its magnitude, this allows the model to explore the immediate vicinity of the prompt in the CLIP embedding space, increasing the chance of matching points located in the proximity of true visual samples and mitigating, to some extent, the domain gap.
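A minimal sketch of the two augmentation functions is given below; the adjective pool, prompt template and noise scale are illustrative assumptions, not the exact choices used in the experiments.

```python
import random
import torch

# Illustrative adjective/attribute pool; the exact list is a design choice.
ADJECTIVES = ["large", "small", "rotated", "pixelated", "colorful", "blurry"]


def f_pre(class_name, template="a photo of a {}"):
    """Input-space augmentation: insert a randomly sampled adjective/attribute
    before the class name."""
    return template.format(f"{random.choice(ADJECTIVES)} {class_name}")


def f_post(t, noise_scale=0.1):
    """Output-space augmentation: perturb the encoded text feature with additive
    Gaussian noise, exploring the vicinity of the class embedding."""
    return t + noise_scale * torch.randn_like(t)
```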

Recall that \(\textbf{t}^{h,l}_c = g_T(h^l_c)\) is a feature descriptor obtained by encoding the \(c\)-th class name with the l-th predefined textual template (\(l=1,\dots ,L\)), and that \(\textbf{t}^r\) is an encoded learnable prompt to be classified into one of the C classes. Then, the probability of prompt \(\textbf{t}^r\) being classified as class y is:

$$\begin{aligned} P_{rh}(y|\mathbf {t^r}) = \frac{1}{L}\sum _{l=1}^L \frac{\exp \Bigl ( \texttt {cos}({\hat{\textbf{t}}}_{y}^{h,l}, \textbf{t}^r) / \tau \Bigr ) }{\sum _{c=1}^{C} \exp \Bigl (\texttt {cos}({\hat{\textbf{t}}}_{c}^{h,l}, \textbf{t}^r)/\tau \Bigr )}, \end{aligned}$$
(8)

where \({\hat{\textbf{t}}}_{y}^{h,l} = f_{post}\Big (g_T(f_{pre}(h_{y}^{l}))\Big )\). We call this variant of LASP Zero-shot LASP (LASP-Z), as no visual samples at all are used for the downstream adaptation. See Fig. 3 for an overview of LASP-Z.

5 Experiments

Following (Radford et al., 2019; Zhou et al., 2022a), we mainly evaluated the accuracy of our approach on generalization to novel classes (i.e. zero-shot recognition) on 11 datasets in total. Each dataset is split into two equal partitions with disjoint classes, named base and new. We trained our model using text-image pairs from the base classes and tested on both base and new classes. To further analyze the performance of our approach, we also report results for the cross-dataset transfer and domain generalization settings.

Datasets: We used 11 datasets in total, namely: ImageNet (Deng et al., 2009), Caltech101 (Fei-Fei et al., 2004), Oxford-Pets (Parkhi et al., 2012), Stanford Cars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), FGVC Aircraft (Maji et al., 2013), SUN397 (Xiao et al., 2010), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019) and UCF-101 (Soomro et al., 2012).

Models: For all experiments, unless otherwise specified, we used a pretrained CLIP model with a ViT-B/16 image encoder, \(M=4\) learnable prompts and 16 samples per class. The number of groups G (when used) is set to 3. In all experiments, we report the average across 3 runs.

Fig. 4 Comparison between LASP and CoCoOp in terms of number of FLOPs. While the inference cost of LASP remains largely constant with respect to the number of classes, CoCoOp’s cost increases linearly (from \(\approx \)20 GFLOPs for 1 class to over 2500 GFLOPs for 1,000 classes)

Training: For LASP and LASP-V, we largely followed the training procedure described in CoOp (Zhou et al., 2022) and CoCoOp (Zhou et al., 2022a) (i.e. same image augmentation, SGD with an initial learning rate of 0.002 and a cosine annealing scheduler with 1 epoch of warm-up). In Eq. 6, \(\alpha _{VL}\) was set to 1 and \(\alpha _{TT}\) to 20. The number of textual templates L was set to 34. The templates were taken from CoOp and CLIP. For LASP-Z, as no images are used during training, we increased the scheduler length to the equivalent of 50 epochs and re-adjusted the learning rate to 0.08. All training and testing was done on a single NVIDIA V100 GPU (except for ImageNet, where 4 GPUs were used). The code was implemented using PyTorch (Paszke et al., 2017).

Methods compared: We report the performance of LASP and its improved version trained with virtual classes (LASP-V). For LASP-V, only the class names of the novel classes are used during training as virtual classes. We also study the impact of adding other types of virtual classes. The direct baseline our method is compared with is CoOp (Zhou et al., 2022), as we add the proposed components on top of it. Note that both methods have exactly the same inference pipeline (our method simply adds a text-to-text loss during training). We also compare with ProDA (Lu et al., 2022) and with CoCoOp (Zhou et al., 2022a), which conditions the prompts on image features and hence incurs significant additional computation during inference. See also Fig. 4 for a comparison.

5.1 Comparison with State-of-the-Art

Standard setting of Zhou et al. (2022a): Table 1 compares our approach with the current state-of-the-art. We conclude:

  • Conclusion 1: In terms of harmonic mean, LASP outperforms all methods by a large margin. It outperforms, on average, the second best (ProDA) by \(>2\%\). The improvement on specific datasets is even larger (e.g. \(>3\%\) on Flowers102, \(>11\%\) on EuroSAT, \(>3\%\) on UCF101).

  • Conclusion 2: On the novel classes, LASP outperforms all methods by a large margin. It is the first reported method to outperform CLIP, by 0.68% (note, however, that CLIP performs very poorly on the base classes). It also outperforms ProDA (third best) by \(>2.5\%\). Again, compared to ProDA, the improvement on specific datasets is even larger (e.g. \(>5\%\) on Flowers102, \(>3\%\) on Food101, \(>11\%\) on EuroSAT, \(>6\%\) on UCF101).

  • Conclusion 3: On new classes, LASP with virtual classes has a significant impact on specific datasets. These include datasets with informative class names, like EuroSAT and DTD, where the improvement over LASP is \(\sim 5.5\%\) and \(\sim 4.0\%\), respectively.

Table 1 Comparison with the state-of-the-art on 11 datasets
Table 2 Performance analysis for LASP-Z
Table 3 Comparison with the state-of-the-art for the generalized zero-shot setting

Generalized zero-shot setting: The current evaluation protocol used in Zhou et al. (2022), Zhou et al. (2022a) computes accuracy considering the base and new classes in isolation. That is, the two disjoint sets, consisting of \(C_{base}\) and \(C_{novel}\) classes (i.e., \(C=C_{base}+C_{novel}\)), are evaluated using a \(C_{base}\)-way and a \(C_{novel}\)-way classifier, respectively. A more realistic evaluation protocol should consider the classes of both subsets, base and novel, jointly, as in practice one would expect to run the same classifier across the combined set. In this instance, a C-way classifier that includes the class prototypes from both the base and new subsets is used when evaluating either of them. Beyond increasing the difficulty, this setting better exposes cases where overfitting to base classes occurs.
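Concretely, the generalized evaluation simply concatenates the base and novel class prototypes into a single classifier before scoring; a minimal sketch (with illustrative tensor names) is:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def generalized_accuracy(f, labels, W_base, W_new, tau=0.01):
    """f: (N, d) image features; labels: indices into the combined class list;
    W_base: (C_base, d) and W_new: (C_novel, d) text-derived class prototypes."""
    W = F.normalize(torch.cat([W_base, W_new], dim=0), dim=-1)   # single C-way classifier
    logits = (F.normalize(f, dim=-1) @ W.t()) / tau
    return (logits.argmax(dim=-1) == labels).float().mean()
```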

We report results using this setting in Table 3. To ground the results, as no pretrained models were available, we retrained CoCoOp using the official code released by the authors. As can be observed, the same conclusions previously drawn using the protocol proposed in Zhou et al. (2022a) hold true.

Cross-Dataset Transfer setting: Following (Zhou et al., 2022a), we measure how well the soft prompts learned on ImageNet perform when evaluated on different datasets. In this setting, the training is performed on images from all 1,000 classes, using 16 images for each class. As the results from Table 5 show, our approach surpasses CoOp by 2.5% while outperforming the more computationally demanding CoCoOp (0.8% better on average).

Domain generalization setting: Following the encouraging results reported in Zhou et al. (2022), Zhou et al. (2022a) on domain generalization, herein, we evaluate whether our approach can improve the quality of the learned prompts under domain shift too. To this end, we trained LASP on all classes from ImageNet (16-shot setting) and evaluated the learned prompts on datasets with class names compatible with those of ImageNet, but a different data distribution. Following (Zhou et al., 2022), we used ImageNet (Deng et al., 2009) as the source dataset, and ImageNetV2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021) and ImageNet-R (Hendrycks et al., 2021) as the test datasets.

As the results from Table 6 show, with the exception of ImageNet-V2, our approach outperforms all prior work, showing strong domain generalization capabilities.

5.2 Zero-Shot Adaptation Setting

Departing from the few-shot adaptation experiments of the previous section, herein, we evaluate the zero-shot V&L learning capabilities of the proposed image-free LASP-Z on the same set of 11 datasets used for few-shot evaluation. While the base/new partitions are no longer meaningful in this case, as no images are used, we report results preserving the data split structure to facilitate comparisons across different settings (i.e. few-shot and zero-shot adaptation). The results reported in Table 2 show that our zero-shot adaptation approach improves upon CLIP by +1.14% on average across the 11 datasets, outperforming it on 8/11 datasets by up to 4% (on EuroSAT). Moreover, Table 2 shows the importance of the proposed augmentations in LASP-Z. As can be observed, without the augmentations, the accuracy of LASP-Z significantly deteriorates. Overall, we conclude:

  • Conclusion 4: Zero-shot LASP (LASP-Z) significantly outperforms CLIP for the zero-shot adaptation setting. For this purpose, the proposed language-based augmentations are necessary.

Table 4 Effect of different LASP components
Table 5 Comparison with state-of-the-art for the cross-dataset transfer setting

5.3 Ablation Studies

Effect of different LASP components: LASP proposes a number of contributions which are evaluated incrementally. The starting point is the proposed Text-to-Text loss of Eq. 5. On top of this, we incrementally apply the grouped prompt representation (Eq. 7) and then the re-alignment module (Sect. 3.4). This gives rise to LASP. Finally, we add virtual classes, giving rise to LASP-V. Our baseline is CoOp. From the results of Table 4, we conclude:

  • Conclusion 5: Our idea in its plain form (Text-to-Text loss) outperforms its direct baseline (CoOp) by a large margin. Specifically, it improves upon CoOp by \(\sim 4.5\%\) on average, demonstrating its effectiveness.

  • Conclusion 6: All components are needed to obtain high accuracy.

Effect of size and content of the textual prompts: Herein, we study the effect of the size L and the content of the set of textual prompts used by our method in Eq. 4. For simplicity, we report results using our Text-to-Text loss (Eq. 5) only. The number of hand-crafted templates is increased to 100 by including the rest of the prompts defined in CLIP (Radford et al., 2021), and reduced to 1 by using only the following template: a photo of {}. Random templates are produced by sampling grammatically plausible random sentences that contain incoherent words, with lengths between 5 and 20 words. The class names are inserted at the end of these random templates. All variations use the same training scheduler and hyperparameters, except for the case of random templates, where \(\alpha _{TT}=5\).

Table 10 shows our results. We importantly note that the accuracy on the base classes remains similar across all settings (not shown in the table). Moreover, we conclude:

  • Conclusion 7: The exact choice of the templates might not be so significant for the few-shot setting.

  • Conclusion 8: For the case of novel classes, both the number and the content of the templates are important to obtain high accuracy.

Effect of type of loss: In Table 7, we vary the choice of loss in LASP, i.e. we replace the Cross-Entropy (CE) with an \(L_2\) and \(L_1\) loss. Again, for simplicity, we report results using our Text-to-Text loss (Eq. 5), only.

  • Conclusion 9: The proposed CE loss based formulation outperforms other losses for LASP.

Table 6 Comparison with state-of-the-art for the domain generalization setting

Effect of out-of-domain distractors: Motivated by the recent work of Ren et al. (2022), suggesting that CLIP’s performance drops as the number of classes used for testing increases, we introduce a new evaluation setting: Firstly, we select 4 test datasets with clearly disjoint domains: EuroSAT (10 satellite terrain types), Food101 (101 food names), Flowers102 (102 flower names) and OxfordPets (37 dog and cat breed names). At test time, we define the classifier over the union of the classes of all 4 datasets (250 classes in total). Note that LASP-V is the only method that benefits from knowledge of this expanded vocabulary during training. From Table 8, we can conclude:

  • Conclusion 10: The models are somewhat robust to out-of-domain distractors. Specifically, the drop in accuracy is moderate (typically 1-2%). The exception is EuroSAT where the number of classes increases \(25\times \). Importantly, LASP-V manages to largely recover the lost accuracy.

Effect of in-domain distractors: Expanding on the idea from the previous section, herein, we test the performance of the current soft prompting methods with in-domain distractors. Unlike the out-of-domain distractors, the in-domain distractors are selected such that they are closely related to the current dataset/classes, being part of the same super-category. We performed experiments on two datasets: Food101 and Flowers102. For Flowers102, we added 65 new class names while, for Food101, we added 53. Note again that, with the exception of LASP-V, these classes are only used at test time as distractors, expanding the C-way classifier by 65 and 53 classes, respectively. From the results of Table 9, we conclude:

  • Conclusion 11: In-domain distractors significantly increase the problem difficulty. Specifically, the drop in accuracy is large (4-7%). LASP-V manages to recover part of the lost accuracy.

Table 7 Effect of type of loss
Table 8 Effect of out-of-domain distractors
Table 9 Effect of in-domain distractors
Table 10 Effect of dictionary size and content on new classes
Table 11 Impact of noise value \(\tau \) on the overall performance of LASP-Z
Table 12 Impact of number of augmentations on the overall performance of LASP-Z
Table 13 Effect of text-based positional augmentations

Effect of text-based augmentations: As detailed in Sect. 3.2, one way to view the proposed text-to-text component is as a direct extension of image-style augmentations to the language domain. To explore this, we construct a variation of the Oxford Pets dataset in which all test images are rotated by \(90^{\circ }\) or \(180^{\circ }\). We select Oxford Pets as rotated pet images are far from the natural distribution of photos. During the training of the LASP model, the images are kept under their original rotation (i.e. none), while the textual prompts are augmented with extra keywords such as: “rotate”, “upside down”, “angled”, etc. Based on the results of Table 13, we can conclude:

  • Conclusion 12: Text-based augmentations are a viable solution for increased robustness.

Effect of noise level and transformation on LASP-Z: To alleviate the issue that the text features are not a perfect proxy for the vision domain, we explore the points located in their vicinity, hence increasing the likelihood of overlapping with the data distribution of the vision domain. This is achieved, in practice, by adding Gaussian noise in the output space or by adjusting the prompts in the input space. In Table 11, we analyse the impact of the noise magnitude s (i.e., \(\textbf{x} \sim s{\mathcal {N}}(\mu ,\sigma ^2)\)) on the performance of the model. While the model is overall resilient to the exact value of s, removing it completely leads to performance inferior to that of CLIP. We conclude that adding noise not only helps bridge the domain gap, but also alleviates overfitting.

Similar results can be observed when varying the number of augmentations. Here, we note that a higher number leads to better results as, intuitively, more augmentations allow for the exploration of more points around the class centroid.

Can LASP-Z be used as initialisation for LASP? LASP-Z tries to fully leverage the joint vision-language embedding space learned by CLIP, moving the optimization process fully into the text domain. While text alone is a good proxy for representing visual data, due to the domain gap that naturally occurs as part of the contrastive training, it is not a full substitute for visual data. As a result, when initialising LASP/LASP-V from LASP-Z weights, we observed no additional gains, as the visual samples already provide the cues offered by the text-only training.

6 Conclusions

In this paper, we introduced LASP, a language-aware soft prompting method for V&L adaptation that is shown to outperform prior work by a large margin. Specifically, we made the following contributions: Firstly, we introduced a novel text-to-text loss that largely alleviates the problem of base-class overfitting. Secondly, we proposed a grouped language-aware prompting for learning more specialized and stronger prompt representations. Thirdly, we identified a visual-language misalignment within LASP and proposed a re-calibration mechanism to address it. Fourthly, we showed that our approach, unlike prior work, is amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available, which significantly increases the robustness of the learned prompts. Fifthly, we presented a zero-shot variant of LASP (LASP-Z) where no visual samples at all are available for the downstream task and showed its superiority over CLIP. We hope that LASP/LASP-V/LASP-Z will serve as a strong baseline for future work in the area of few-shot adaptation of V&L models.