1 Introduction

Vision Transformers (VTs) are increasingly emerging as the dominant Computer Vision architecture, an alternative to standard Convolutional Neural Networks (CNNs). VTs are inspired by the Transformer network (Vaswani et al., 2017), which is the de facto standard in Natural Language Processing (NLP) (Kenton & Toutanova, 2019; Radford & Narasimhan, 2018) and is based on multi-head attention layers transforming the input tokens (e.g., language words) into a set of final embedding tokens. Dosovitskiy et al. (2021) proposed an analogous processing paradigm, called ViT, where word tokens are replaced by image patches, and self-attention layers are used to model global pairwise dependencies over all the input tokens. As a consequence, differently from CNNs, where the convolutional kernels have a spatially limited receptive field, ViT has a dynamic receptive field, which is given by its attention maps (Naseer et al., 2021). However, ViT heavily relies on huge training datasets (e.g., JFT-300M (Dosovitskiy et al., 2021), a proprietary dataset of 303 million images), and underperforms CNNs when trained on ImageNet-1K (\(\sim\) 1.3 million images) (Russakovsky et al., 2015) or on smaller datasets (Dosovitskiy et al., 2021; Raghu et al., 2021). The main reason why Transformers are more data-hungry than CNNs is that they lack some inductive biases embedded in the CNN architecture, such as translation equivariance and locality, obtained through local filters, parameter sharing and spatial pooling. Hence, Transformers do not generalize well when trained on insufficient amounts of data (Dosovitskiy et al., 2021). To alleviate this problem and mitigate the need for a huge quantity of training data, a recent line of research explores the possibility of reintroducing typical CNN mechanisms in VTs (Yuan et al., 2021; Liu et al., 2021; Wu et al., 2021; Yuan et al., 2021; Xu et al., 2021; Li et al., 2021; Hudson & Zitnick, 2021; Hassani et al., 2023). The main idea behind these “hybrid” VTs is that convolutional layers, mixed with the VT self-attention layers, help to embed a local inductive bias in the VT architecture, i.e., to encourage the network to focus on local properties of the image domain. However, the disadvantage of this paradigm is that it requires drastic architectural changes to the original ViT, which has now become a de facto standard in different vision and vision-language tasks (Sect. 2). Moreover, as emphasised by Li et al. (2022), one of the main advantages of CNNs is the large independence of the pre-training from the downstream tasks, which allows a uniform backbone, pre-trained only once, to be used for different vision tasks (e.g., classification, object detection, etc.). Conversely, the adoption of specific VT architectures breaks this independence and makes it difficult to use the same pre-trained backbone for different downstream tasks (Li et al., 2022).

In this paper, we follow an orthogonal (and relatively simpler) direction: rather than changing the ViT architecture, we propose to include a local inductive bias using an additional pretext task during training, which implicitly “teaches” the network the connectedness property of objects in an image. Specifically, we maximize the probability of producing attention maps whose highest values are clustered in a few local regions (of variable size), based on the idea that, most of the time, an object is represented by one or very few spatially connected regions in the input image. This pretext task exploits a locality principle characteristic of natural images, and extracts additional (self-supervised) information from images without the need for architectural changes.

Fig. 1
ViT attention maps obtained using the [CLS] token query and thresholded to keep \(60\%\) of the mass, following (Caron et al., 2021). (a) Original image. (b) Standard supervised learning. (c) DINO. (d) Training using SAR. In (a), the two largest connected components correspond to the main object (a bear, occluded by a tree), while the third and fourth largest connected components correspond to the specular reflection of the bear in the river

Our work is inspired by the findings presented in (Caron et al., 2021; Bao et al., 2022; Naseer et al., 2021), in which the authors show that VTs, trained using self-supervision (Caron et al., 2021; Bao et al., 2022) or shape-distillation (Naseer et al., 2021), can spontaneously develop attention maps with a semantic segmentation structure. For instance, Caron et al. (2021) show that the last-layer attention maps of ViT, when trained with their self-supervised DINO algorithm, can be thresholded and used to segment the most important foreground objects of the input image without any pixel-level annotation during training (see Fig. 1c). Similar findings are shown in (Bao et al., 2022; Naseer et al., 2021). Interestingly, however, Caron et al. (2021) show that the same ViT architectures, when trained with supervised methods, produce much more spatially disordered attention maps (Fig. 1b). This is confirmed by Naseer et al. (2021), who observed that the attention maps of ViT, trained with supervised protocols, have a widely spread structure over the whole image. The reason why “blob”-like attention maps spontaneously emerge when VTs are trained with some algorithms but not with others is still unclear. However, in this paper we build on top of these findings and propose a spatial entropy loss function which explicitly encourages the emergence of locally structured attention maps (Fig. 1d), independently of the main algorithm used for training. Note that our goal is not to use the attention maps obtained in this way to extract segmentation structure from images or for other post-processing steps. Instead, we use the proposed spatial entropy loss to introduce an object-based local inductive bias in VTs. Since real-life objects usually correspond to one or very few connected image regions, the attention maps of a VT head should likewise concentrate most of their highest values on spatially clustered regions. The possible discrepancy between this inductive bias (an object in a given image is a mostly connected structure) and the actual spatial entropy measured in each VT head provides a self-supervised signal which alleviates the need for huge supervised training datasets, without changing the ViT architecture.

The second contribution of this paper is based on the empirical results presented by Raghu et al. (2021), who showed that VTs are more influenced by the skip connections than CNNs and, specifically, that in the last blocks of ViT, the patch token representations (see Sect. 3) are mostly influenced by the skip connection path. This means that, in the last blocks of ViT, the self-attention layers have a relatively small influence on the final token embeddings. Since our spatial entropy is measured on the last-block attention maps, we propose to remove the skip connections in the last block (only). We empirically show that this minor architectural change is beneficial for ViT, both when used jointly with our spatial entropy loss and when used with a standard training procedure.

Our regularization method, which we call SAR (Spatial Attention-based Regularization), can be easily plugged into existing VTs (including hybrid VTs) without drastic changes to their architecture. SAR can be applied to different scenarios, jointly with a main-task loss function. For instance, when used in a supervised classification task, the main loss is the (standard) cross entropy, used jointly with our spatial entropy loss. We empirically show that SAR is sample-efficient: it can significantly boost the accuracy of ViT in different downstream tasks such as classification, object detection or segmentation, and it is particularly useful when training with small-medium datasets.

Our experiments also show that SAR is beneficial when plugged into hybrid VT architectures, especially when the training data are scarce.

In summary, our main contributions are the following: (1) We propose to embed a local inductive bias in ViT using spatial entropy as an alternative to re-introducing convolutional mechanisms in the VT architecture. (2) We propose to remove the last-block skip connections, empirically showing that this is beneficial for the patch token representations. (3) Using extensive experiments, we show that SAR improves the accuracy of different VT architectures, and it is particularly helpful when supervised training data are scarce.

2 Related work

Vision Transformers One of the very first fully-Transformer architectures for Computer Vision is iGPT (Chen et al., 2020), in which each image pixel is represented as a token. However, due to the quadratic computational complexity of Transformer networks (Vaswani et al., 2017), iGPT can only operate with very small-resolution images. This problem has been largely alleviated by ViT (Dosovitskiy et al., 2021), where the input tokens are \(p \times p\) image patches (Sect. 3). The success of ViT has inspired several similar Vision Transformer (VT) architectures in different application domains, such as image classification (Dosovitskiy et al., 2021; Touvron et al., 2021; Yuan et al., 2021; Liu et al., 2021; Wu et al., 2021; Yuan et al., 2021; Li et al., 2021; Xu et al., 2021; d’Ascoli et al., 2021), object detection (Li et al., 2022; Carion et al., 2020; Zhu et al., 2021; Dai et al., 2021), segmentation (Strudel et al., 2021; Rao et al., 2022), human pose estimation (Zheng et al., 2021), object tracking (Meinhardt et al., 2022), video processing (Neimark et al., 2021; Li et al., 2022), image generation (Jiang et al., 2021; Hudson & Zitnick, 2021; Ramesh et al., 2022; Chang et al., 2022), point cloud processing (Guo et al., 2021; Zhao et al., 2021), vision-language foundation models (Bao et al., 2022; Alayrac et al., 2022; Wang et al., 2022) and many others. However, the lack of the typical CNN local inductive biases makes VTs need more data for training (Dosovitskiy et al., 2021; Raghu et al., 2021). For this reason, many recent works address this problem by proposing hybrid architectures, which reintroduce typical convolutional mechanisms into the VT design (Yuan et al., 2021; Liu et al., 2021; Wu et al., 2021; Yuan et al., 2021; Xu et al., 2021; Li et al., 2021; d’Ascoli et al., 2021; Hudson & Zitnick, 2021; Li et al., 2022; Hassani et al., 2023). In contrast, we propose a different and simpler solution in which, rather than changing the VT architecture, we introduce an object-based local inductive bias (Sect. 1) by means of a pretext task based on spatial entropy minimization.

Note that our proposal is different from object-centric learning (Locatello et al., 2020; Goyal et al., 2020; Didolkar et al., 2021; Engelcke et al., 2021; Sajjadi et al., 2022; Herzig et al., 2022; Kang et al., 2022), where the goal is to use discrete objects (usually obtained using a pre-trained object detector or a segmentation approach) for object-based reasoning and modular/causal inference. In fact, although the attention maps produced by SAR can potentially be thresholded and used as discrete objects, our goal is not to segment the image patches or to use the patch clusters for further processing steps, but to exploit the clustering process as an additional self-supervised loss function which helps to reduce the need for labeled training samples (Sect. 1).

Self-supervised learning Most of the self-supervised approaches with still images and ResNet backbones (He et al., 2016) impose a semantic consistency between different views of the same image, where the views are obtained with data-augmentation techniques. These works can be roughly grouped into contrastive learning (van den Oord et al., 2018; Hjelm et al., 2019; Chen et al., 2020; He et al., 2020; Tian et al., 2020; Wang & Isola, 2020; Dwibedi et al., 2021), clustering methods (Bautista et al., 2016; Zhuang et al., 2019; Ji et al., 2019; Caron et al., 2018; Asano et al., 2020; Gansbeke et al., 2020; Caron et al., 2020, 2021), asymmetric networks (Grill et al., 2020; Chen & He, 2021) and feature-decorrelation methods (Ermolov et al., 2021; Zbontar et al., 2021; Bardes et al., 2022; Hua et al., 2021).

Recently, different works have used VTs for self-supervised learning. For instance, Chen et al. (2021) empirically tested different representatives of the above categories using VTs, and also proposed MoCo-v3, a contrastive approach based on MoCo (He et al., 2020) but without the queue of past samples. DINO (Caron et al., 2021) is an online clustering method which is one of the current state-of-the-art self-supervised approaches using VTs. BEiT (Bao et al., 2022) adopts the typical “masked-word” NLP pretext task (Kenton & Toutanova, 2019), but it needs to pre-extract a vocabulary of visual words using the discrete VAE pre-trained in (Ramesh et al., 2021). Other recent works which use a “masked-patch” pretext task are: (He et al., 2022; Xie et al., 2022; Wei et al., 2022; Dong et al., 2023; Hua et al., 2023; Chen et al., 2024; Bachmann et al., 2022; El-Nouby et al., 2021; Zhou et al., 2022; Kakogeorgiou et al., 2022).

Yun (2022) use the assumption that adjacent patches usually belong to the same object in order to collect positive patches for a contrastive learning approach. Our inductive bias shares a similar intuitive idea but, rather than a contrastive method where positive pairs are compared with negative ones, our self-supervised loss is based on the proposed spatial entropy, which groups together patches even when they are not pairwise adjacent. Generally speaking, in this paper we do not propose a fully self-supervised algorithm; rather, we use self-supervision (we extract information from samples without additional manual annotation) to speed up convergence in a supervised scenario and decrease the quantity of annotated information needed for training. In the Appendix, we also show that SAR can be plugged on top of both MoCo-v3 and DINO, boosting the accuracy of both.

Similarly to this paper, Liu et al. (2021) propose a VT regularization approach based on predicting the geometric distance between patch tokens. In contrast, we use the highest-value connected regions in the VT attention maps to extract additional unsupervised information from images, and the two regularization methods can potentially be used jointly. Li et al. (2020) compute the gradients of a ResNet with respect to the image pixels to get an attention (saliency) map. This map is thresholded and used to mask out the most salient pixels. Minimizing the classification loss on this masked image encourages the attention on the non-masked image to include most of the useful information. Our approach is radically different and much simpler, because we do not need to manually set the thresholding value and we require only one forward and one backward pass per image.

Spatial entropy There are many definitions of spatial entropy (Razlighi & Kehtarnavaz, 2009; Altieri et al., 2018). For instance, Batty (1974) normalizes the probability of an event occurring in a given zone by the area of that zone, this way accounting for unequal space partitions. In (Shah et al., 2020), the Shannon entropy, defined over a histogram of a grayscale image, is extended to include different hyperspectral bands. Ceci et al. (2019) use a Parzen-window density estimation to include spatial information in their spatial entropy loss, which accounts for the vicinity of different spatially-located sensors. Li et al. (2020) propose an entropy measure to evaluate the spatiotemporal regularity of tensor data, based on the conditional probability of the similarities between tensor sub-blocks. In (Tupin et al., 2000), spatial entropy is defined over a Markov Random Field describing the image content, but its computation is very expensive (Razlighi & Kehtarnavaz, 2009). In contrast, our spatial entropy loss can be efficiently computed and is differentiable; thus, it can be easily used as an auxiliary regularization task in existing VTs.

3 Background

Given an input image I, ViT (Dosovitskiy et al., 2021) splits I into a grid of \(K \times K\) non-overlapping patches, and each patch is linearly projected into a (learned) input embedding space. The input of ViT is this set of \(n = K^2\) patch tokens, jointly with a special token, called the [CLS] token, which is used to represent the whole image. Following a standard Transformer network (Vaswani et al., 2017), ViT transforms these \(n + 1\) tokens into the corresponding \(n + 1\) final token embeddings using a sequence of L Transformer blocks. Each block is composed of LayerNorm (LN), Multi-headed Self-Attention (MSA) and MLP layers, plus skip connections. Specifically, if the token embedding sequence in the \((l-1)\)-th layer is \(\pmb {z}^{l-1} = [ \pmb {z}_{CLS}; \pmb {z}_1;... \pmb {z}_n ]\), then:

$$\begin{aligned} \pmb {z}' = \text{ MSA }(\text{ LN }(\pmb {z}^{l-1})) + \pmb {z}^{l-1}, \qquad l=1,\ldots , L \end{aligned}$$
(1)
$$\begin{aligned} \pmb {z}^l = \text{ MLP }(\text{ LN }(\pmb {z}')) + \pmb {z}', \qquad l=1,\ldots , L \end{aligned}$$
(2)

where the addition (\(+\)) denotes a skip (or “identity”) connection, which is used both in the MSA (Eq. 1) and in the MLP (Eq. 2) layer. The MSA layer is composed of H different heads, and, in the h-th head (\(1 \le h \le H\)), each token embedding \(\pmb {z}_i \in \mathbb {R}^d\) is projected into a query (\(\pmb {q}_i^h\)), a key (\(\pmb {k}_i^h\)) and a value (\(\pmb {v}_i^h\)). Given query (\(Q^h\)), key (\(K^h\)) and value (\(V^h\)) matrices containing the corresponding elements, the h-th self-attention matrix (\(A^h\)) is given by:

$$\begin{aligned} A^h = \text{ softmax } \left( \frac{Q^h (K^h)^T}{\sqrt{d}} \right) . \end{aligned}$$
(3)

Using \(A^h\), each head outputs a weighted sum of the values in \(V^h\). The final MSA layer output is obtained by concatenating all the head outputs and then projecting each token embedding into a d-dimensional space. Finally, the last-layer (L) class token embedding \(\pmb {z}_{CLS}^{L}\) is fed to an MLP head, which computes a posterior distribution over the set of the target classes and the whole network is trained using a standard cross-entropy loss (\(\mathcal{L}_{ce}\)). Some hybrid VTs (see Sect. 2) such as CvT (Wu et al., 2021) and PVT (Wang et al., 2021), progressively subsample the number of patch tokens, leading to a final \(k \times k\) patch token grid (\(k \le K\)). In the rest of this paper, we generally refer to a spatially arranged grid of final patch token embeddings with a \(k \times k\) resolution.
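To make Eqs. 1-3 concrete, the following is a minimal sketch of one Transformer block, assuming PyTorch. The class name, the argument names and the default sizes are illustrative assumptions, not the authors' implementation or the official ViT code.

```python
# Minimal sketch of one ViT block (Eqs. 1-3), assuming PyTorch.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d=384, heads=6, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        # nn.MultiheadAttention internally computes Eq. 3 for all H heads
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, mlp_ratio * d), nn.GELU(),
                                 nn.Linear(mlp_ratio * d, d))

    def forward(self, z):                                    # z: (batch, n + 1, d), [CLS] first
        h = self.ln1(z)
        z = self.attn(h, h, h, need_weights=False)[0] + z    # Eq. 1: MSA + skip connection
        z = self.mlp(self.ln2(z)) + z                        # Eq. 2: MLP + skip connection
        return z
```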

4 Method

Generally speaking, an object usually corresponds to one or very few connected regions of a given image. For instance, the bear in Fig. 1, despite being occluded by a tree, occupies only two distinct connected regions of the image. Our goal is to exploit this natural image inductive bias and penalize those attention maps which do not lead to a spatial clustering of their largest values. Intuitively, if we compare Fig. 1b with Fig. 1c, we observe that, in the latter case (in which DINO was used for training), the attention maps are more “spatially ordered”, i.e. there are fewer and bigger “blobs” (obtained after thresholding the map values (Caron et al., 2021)). Since an image is usually composed of a few main objects, each of which typically corresponds to one or very few connected regions of tokens, during training we penalize those attention maps which produce a large number of small blobs. We use this as an auxiliary pretext task which extracts information from images without additional annotation, by exploiting the assumption that spatially close tokens should preferably belong to the same cluster.

4.1 Spatial entropy loss

For each head of the last Transformer block, we compute a similarity map \(S^h\) (\(1 \le h \le H\), see Sect. 3) by comparing the [CLS] token query (\(\pmb {q}_{CLS}^h\)) with all the patch token keys (\(\pmb {k}_{x,y}^h\), where \((x,y) \in \{1,...,k\}^2\)):

$$\begin{aligned} S_{x,y}^h = <\pmb {q}_{CLS}^h, \pmb {k}_{x,y}^h>/\sqrt{d}, \quad (x,y) \in \{1,...,k\}^2, \end{aligned}$$
(4)

where \(<\pmb {a}, \pmb {b}>\) is the dot product between \(\pmb {a}\) and \(\pmb {b}\). \(S^h\) is extracted from the self-attention map \(A^h\) by selecting the [CLS] token as the only query and before applying the \(\text{ softmax }\) (see Sect. 4.3 for a discussion about this choice). \(S^h\) is a \(k \times k\) matrix corresponding to the final \(k \times k\) spatial grid of patches (Sect. 3), and (x, y) corresponds to the “coordinates” of a patch token in this grid.
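As an illustration of Eq. 4, the similarity maps can be computed from the last-block per-head [CLS] queries and patch keys as in the sketch below (assuming PyTorch; the tensor shapes and the function name are our assumptions, not the authors' code).

```python
# Sketch of Eq. 4. q_cls: (H, d) per-head [CLS] queries; keys: (H, k*k, d) per-head
# patch keys (the [CLS] key excluded). Shapes are illustrative assumptions.
import torch

def similarity_maps(q_cls, keys, k):
    d = q_cls.shape[-1]
    S = torch.einsum('hd,hpd->hp', q_cls, keys) / d ** 0.5   # dot products, scaled by sqrt(d)
    return S.view(-1, k, k)                                  # one k x k map S^h per head
```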

Fig. 2
A schematic illustration of the spatial entropy. (a) The original image. (b) The thresholded similarity map \(B^h\) (zero values shown in black) for a specific head. (c) The 8-connectivity relation used to group non-zero elements in \(B^h\). (d) The resulting two connected components (\(C_1\) and \(C_2\)), each component shown with a different colour. In this case, \(C^h = \{C_1, C_2 \}\)

In order to extract a set of connected regions containing the largest values in \(S^h\), we zero-out those elements of \(S^h\) which are smaller than the mean value \(m = 1/n \sum _{(x,y) \in \{1,...,k\}^2} S^h_{x,y}\):

$$\begin{aligned} B_{x,y}^h = \text{ ReLU }(S_{x,y}^h - m), \quad (x,y) \in \{1,...,k\}^2, \end{aligned}$$
(5)

where thresholding using m corresponds to retaining half of the total “mass” of Eq. 4. We can now use a standard algorithm (Grana et al., 2010) to extract the connected components from \(B^h\), obtained using an 8-connectivity relation between non-zero elements in \(B^h\) (see Fig. 2):

$$\begin{aligned} C^h = \{C_1, ..., C_{h_r} \} = \text{ ConnectedComponents }(B^h). \end{aligned}$$
(6)

\(C_j\) (\(1 \le j \le h_r\)) in \(C^h\) is the set of coordinates (\(C_j = \{ (x_1, y_1),..., (x_{n_j}, y_{n_j})\}\)) of the j-th connected component, whose cardinality (\(n_j\)) is variable, and so is the total number of components (\(h_r\)). Given \(C^h\), we define the spatial entropy of \(S^h\) (\(\mathcal {H}(S^h)\)) as follows:

$$\begin{aligned} \mathcal {H}(S^h) = - \sum _{j=1}^{h_r} P^h (C_j) \log P^h (C_j), \end{aligned}$$
(7)
$$\begin{aligned} P^h (C_j) = \frac{1}{|B^h |} \sum _{(x,y) \in C_j} B_{x,y}^h, \end{aligned}$$
(8)

where \(|B^h |= \sum _{(x,y) \in \{1,...,k\}^2} B_{x,y}^h\). Importantly, in Eq. 8, the probability (\(P^h (C_j)\)) of each region \(C_j\) is computed using all its elements, and this is what distinguishes it from a non-spatial entropy, computed directly over all the individual elements in \(S^h\) without considering the connectivity relation. Note that the smaller the number of components \(h_r\), or the less uniformly distributed the probability values \(P^h (C_1),... P^h (C_{h_r})\), the lower \(\mathcal {H}(S^h)\). Using Eq. 7, the spatial entropy loss is defined as:

$$\begin{aligned} \mathcal {L}_{se} = \frac{1}{H} \sum _{h=1}^H \mathcal {H}(S^h) . \end{aligned}$$
(9)

\(\mathcal {L}_{se}\) is used jointly with the main task loss. For instance, in the case of supervised training, we use \(\mathcal {L}_{tot} = \mathcal {L}_{ce} + \lambda \mathcal {L}_{se}\), where \(\lambda\) is a weight used to balance the two losses.
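The whole loss (Eqs. 5-9) can be implemented compactly. The following is a minimal sketch under the assumption that PyTorch and SciPy are used; function and variable names are ours, not the authors'. As discussed in Sect. 4.3, the connected components are computed on a detached copy and only used as pooling masks, so the loss remains differentiable with respect to the similarity maps.

```python
# Sketch of the spatial entropy loss (Eqs. 5-9). S is the (H, k, k) stack of
# per-head similarity maps from Eq. 4.
import torch
import numpy as np
from scipy.ndimage import label

EIGHT_CONN = np.ones((3, 3), dtype=int)              # 8-connectivity structuring element

def spatial_entropy_loss(S, eps=1e-8):
    loss = 0.0
    for s in S:                                      # one k x k similarity map per head
        b = torch.relu(s - s.mean())                 # Eq. 5: zero out values below the mean
        comp, n_comp = label(b.detach().cpu().numpy() > 0, structure=EIGHT_CONN)  # Eq. 6
        if n_comp == 0:
            continue
        mass = b.sum() + eps                         # |B^h|
        entropy = 0.0
        for j in range(1, n_comp + 1):
            mask = torch.from_numpy(comp == j).to(b.device)
            p = b[mask].sum() / mass                 # Eq. 8: probability of component C_j
            entropy = entropy - p * torch.log(p + eps)   # Eq. 7: spatial entropy
        loss = loss + entropy
    return loss / S.shape[0]                         # Eq. 9: average over the H heads

# Supervised training then minimizes L_tot = L_ce + lambda * L_se (lambda = 0.01 in the paper).
```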

4.2 Removing the skip connections

Raghu et al. (2021) empirically showed that, in the last blocks of ViT, the patch token representations are mostly propagated from the previous layers through the skip connections (Sect. 1). We presume this is (partially) due to the fact that only the [CLS] token is used as input to the classification MLP head (Sect. 3); thus, during training, the last-block patch token embeddings are usually neglected. Moreover, Raghu et al. (2021) show that the effective receptive field (Luo et al., 2016) of each block, when computed after the MSA skip connections, is much smaller than the effective receptive field computed before the MSA skip connections. Both empirical observations lead to the conclusion that the MSA skip connections in the last blocks may be detrimental for the representation capacity of the final patch token embeddings. This problem is emphasized when using our spatial entropy loss, since the latter is computed using the attention maps of the last-block MSA (Sect. 4.1). For these reasons, we propose to remove the MSA skip connections in the last block (L). Specifically, in the L-th block, we replace Eq. 1-2 with:

$$\begin{aligned} \pmb {z}' = \text{ MSA }(\text{ LN }(\pmb {z}^{L-1})) , \end{aligned}$$
(10)
$$\begin{aligned} \pmb {z}^L = \text{ MLP }(\pmb {z}') + \pmb {z}' . \end{aligned}$$
(11)

Note that, in addition to removing the MSA skip connections (Eq. 10), we also remove the subsequent LN (Eq. 11), because we empirically observed that this further improves the VT accuracy (see Sect. 5.1).
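In code, the change amounts to rewriting the forward pass of the last block only. The sketch below reuses the illustrative Block class from Sect. 3 and is our assumption of how the modification maps onto code, not the authors' implementation.

```python
# Sketch of the modified last block (Eqs. 10-11): the MSA skip connection and the
# LN preceding the MLP are removed; the MLP skip connection is kept.
class LastBlock(Block):                  # Block is the illustrative class sketched in Sect. 3
    def forward(self, z):
        h = self.ln1(z)
        z = self.attn(h, h, h, need_weights=False)[0]   # Eq. 10: MSA, no skip connection
        return self.mlp(z) + z                          # Eq. 11: LN removed, MLP skip kept
```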

4.3 Discussion

In this section, we discuss and motivate the choices made in Sect. 4.1 and Sect. 4.2. First, we use \(S^h\), extracted before the \(\text{ softmax }\) (Eq. 3), because, with the \(\text{ softmax }\), the network can “cheat” by increasing the norms of the vectors \(\pmb {q}_{CLS}\) and \(\pmb {k}_{x,y}\) (\((x,y) \in \{1,...,k\}^2\)). As a result, the dot product \(<\pmb {q}_{CLS}, \pmb {k}_{x,y}>\) also largely increases, and the \(\text{ softmax }\) operation (based on the exponential function) enormously exaggerates the difference between the elements in \(S^h\), generating a very peaked distribution which zeroes out non-maximum (x, y) elements. We observed that, when using the \(\text{ softmax }\), the VT is able to minimize Eq. 9 by producing single-peak similarity maps which have zero entropy, each being composed of only one connected component with a single token (i.e., \(h_r = 1\) and \(n_j = 1\)).

Second, the spatial entropy (Eq. 7) is computed for each head separately and then averaged (Eq. 9) to allow each head to focus on different image regions. Note that, although computing the connected components (Eq. 6) is a non-differentiable operation, \(C^h\) is only used to “pool” the values of \(B^h\) (Eq. 8), and each \(C_j\) can be implemented as a binary mask (more details in the Appendix, where we also compare \(\mathcal {L}_{se}\) with other solutions). It is also important to note that, although a smaller number of connected components (\(h_r\)) can decrease \(\mathcal {L}_{se}\), this does not force the VT to always produce a single connected component (i.e., \(h_r = 1\)), because of the contribution of the main task loss (e.g., \(\mathcal {L}_{ce}\)). For instance, Fig. 1d shows four big connected components, which correctly correspond to the non-occluded parts of the bear and their reflections in the river, respectively.

Finally, we remove the MSA skip connections only in the last block (Eqs. 10, 11) because, according to the results reported in (Raghu et al., 2021), removing the skip connections in the ViT intermediate blocks leads to an accuracy drop. In contrast, in Sect. 5.1 we show that our strategy, which keeps the ViT architecture unchanged apart from the last block, is beneficial even when used without our spatial entropy loss. Similarly, in preliminary experiments in which we used the spatial entropy loss also in other intermediate layers (\(l < L\)), we did not observe any significant improvement. In the rest of this paper, we refer to our full method, SAR, as the combination of the spatial entropy loss (Sect. 4.1) and the removal of the last-block MSA skip connection and LN (Sect. 4.2).

5 Experiments

In Sect. 5.1 we analyse the contribution of the spatial entropy loss and the skip connection removal. In Sect. 5.2 we show that SAR improves ViT in different training–testing scenarios and with different downstream tasks. In Sect. 5.3 we analyse the properties of the attention maps generated using SAR. In the Appendix, we provide additional experiments using multi-label classification and other tasks, and we show how SAR can be used jointly with fully self-supervised learning approaches. We train the models using a maximum of 8 NVIDIA V100 32GB GPUs for the most computationally intensive experiments. For other experiments (e.g. transfer learning), we scale down to lower resources. In the Appendix we report a detailed list of the computational hardware utilized in each experimental setting.

5.1 Ablation study

In this section, we analyse the influence of the \(\lambda\) value (Sect. 4.1), the removal of the skip connections and the LN in the last ViT block (Sect. 4.2), and the use of the spatial entropy loss (Sect. 4.1). In all the ablation experiments, we use ImageNet-100 (IN-100) (Tian et al., 2020; Wang & Isola, 2020), which is a subset of 100 classes of ImageNet, and ViT-S/16, a 22 million parameter ViT (Dosovitskiy et al., 2021) trained with \(224 \times 224\) resolution images and \(14 \times 14\) patch tokens (\(k = 14\)) with a patch resolution of \(16 \times 16\) (Touvron et al., 2021). Moreover, in all the experiments in this section, we adopt the training protocol and the data-augmentations described in (Liu et al., 2021). Note that these data-augmentations include, among other things, the use of Mixup (Zhang et al., 2018) and CutMix (Yun et al., 2019) (which are also used in all the supervised classification experiments of Sect. 5.2), and this shows that our entropy loss can be used jointly with “image-mixing” techniques.

In Table 1a, we train all the models from scratch for 100 epochs and show the impact on the test set accuracy of different values of \(\lambda\). In the experiments of this table, we use our loss function (\(\mathcal {L}_{tot} = \mathcal {L}_{ce} + \lambda \mathcal {L}_{se}\)) and remove both the skip connections and the LN in the last block (Eqs. 10, 11); thus, the column \(\lambda = 0\) corresponds to the result reported in Table 1c, Row “C” (see below). In the rest of the paper, we use the best \(\lambda\) value obtained with this setting (IN-100, 100 epochs, etc.), i.e. \(\lambda = 0.01\), for all the other datasets, training scenarios (e.g., training from scratch, fine-tuning, fully self-supervised learning, etc.) and VT architectures (e.g., ViT, CvT, PVT, etc.). In fact, although a higher accuracy could very likely be obtained by tuning \(\lambda\), our goal is to show that SAR is an easy-to-use regularization approach, even without tuning its only hyperparameter.

Table 1 IN-100. (a) Influence of the spatial entropy loss weight \(\lambda\) (100 training epochs). (b) Influence of the number of epochs. (c) Analysis of the different components of SAR (100 epochs)

In Table 1 (c), we train all the models from scratch for 100 epochs, and Row “A” corresponds to our run of the original ViT-S/16 (Eqs. 1-2). When we remove the MSA skip connections (Row “B”), we observe a \(+0.42\) point improvement, which becomes \(+1.5\) if we also remove the LN (Row “C”). This experiment confirms that the last-block patch tokens can learn more useful representations if we inhibit the MSA identity path (Eqs. 10-11). However, if we also remove the skip connections in the subsequent MLP layer (Row “D”), the results are inferior to the baseline. Finally, when we use the spatial entropy loss with the original architecture (Row “E”), the improvement is marginal, but using \(\mathcal {L}_{se}\) jointly with Eqs. 10-11 (full model, Row “F”), the accuracy boost with respect to the baseline is much stronger. Table 1 (b) compares training with 100 and 300 epochs and shows that, in the latter case, SAR reaches a much higher relative improvement with respect to the baseline (+4.42).

5.2 Main results

Sample efficiency In order to show that SAR can alleviate the need for large labeled datasets (Sect. 1), we follow a recent trend of works (Liu et al., 2021; El-Nouby et al., 2021; Cao & Wu, 2021) in which VTs are trained from scratch on small-medium datasets (without pre-training on ImageNet). Specifically, we strictly follow the training protocol proposed by El-Nouby et al. (2021), where 5,000 epochs are used to train ViT-S/16 directly on each target dataset. The results are shown in Table 2, which also provides the number of training and testing samples of each dataset, jointly with the accuracy values of the baseline (ViT-S/16, trained in a standard way, without SAR), both reported from El-Nouby et al. (2021). Table 2 shows that SAR can drastically improve the ViT-S/16 accuracy on these small-medium datasets, with an improvement ranging from +18.17 to +30.78 points. These results, jointly with the results obtained on IN-100 (Table 1 (b)), show that SAR is particularly effective in boosting the performance of ViT when labeled training data are scarce.

We further analyze the impact of the amount of training data using different subsets of IN-100 with different sampling ratios (ranging from 25 to 75%, with images randomly selected). We use the same training protocol as in Table 1 (b) (e.g., 100 training epochs, etc.) and we test on the whole IN-100 validation set. Table 3 shows the results, confirming that, with less data, the accuracy boost obtained using SAR can significantly increase (e.g., with 75% of the data we have a 10.5 point improvement). In the same table, we compare SAR with the Dense Relative Localization (DRLoc) loss (Liu et al., 2021), which, similarly to SAR, is based on an auxiliary self-supervised task used to regularize VT training (Sect. 2). DRLoc encourages the VT to learn spatial relations within an image by predicting the relative distance between the (x, y) positions of randomly sampled output embeddings from the \(k \times k\) grid of the last layer L. Table 3 shows that SAR largely outperforms DRLoc, especially in a low-data regime (e.g., with 75% of the data, the difference between SAR and DRLoc is 6.9 points).

Table 2 Training from scratch on small-medium datasets. The baseline (ViT-S/16) results are reported from (El-Nouby et al., 2021)
Table 3 IN-100 experiments with different sampling ratios (100 epochs).  These results were obtained by us using the publicly available code taken from (Liu et al., 2021)

Training on ImageNet-1K We extend the previous results by training ViT on ImageNet-1K (IN-1K), and comparing SAR with the baseline (ViT-S/16, trained in a standard way, without SAR) and with DRLoc. Table 4 shows that SAR can boost the accuracy of ViT by almost 1 point without any additional learnable parameters or drastic architectural changes, and this gain is higher than that of DRLoc. The relative improvement is smaller than the one obtained with smaller datasets, likely because regularization techniques are usually most effective with small(er) datasets (Balestriero et al., 2022). Nevertheless, Fig. 3 shows that SAR can be used jointly with large datasets to significantly speed up training. For instance, ViT-S/16 + SAR, with 100 epochs, achieves almost the same accuracy as the baseline trained with 150 epochs, while we surpass the final baseline accuracy (79.8% at epoch 300) with only 250 training epochs (79.9% at epoch 250). From a computational point of view, reducing the number of epochs needed for convergence by one sixth on a large dataset may be a significant acceleration, also considering that, on average, the overall computational overhead of SAR (with non-optimized code) is only +2.9% (further details in Sect. A). Finally, the two regularization approaches (SAR and DRLoc) can potentially be combined, but we leave this for future work.

Table 4 IN-1K experiments (300 training epochs)
Fig. 3
IN-1K, validation set accuracy with respect to the number of training epochs

Table 5 Object detection on PASCAL VOC 2007, evaluated using mean Average Precision (mAP)
Table 6 Semantic segmentation on PASCAL VOC 2007, evaluated using mean Intersection over Union (mIoU)
Table 7 Transfer learning results (100 epochs fine-tuning). The first row corresponds to a standard fine-tuning protocol, while the other configurations include SAR either in the pre-training or in the fine-tuning stage

Transfer learning with object detection and image segmentation tasks We further analyze the quality of the models pre-trained on IN-1K using object detection and semantic segmentation downstream tasks. Specifically, we use ViTDet (Li et al., 2022), a recently proposed object detection/segmentation framework in which a (standard) pre-trained ViT backbone is adapted only at fine-tuning time in order to generate a feature pyramid to be used for multi-scale object detection or image segmentation. Note that, as mentioned in Sect. 1, hybrid approaches which are based on ad hoc architectures are not suitable for this framework, because they need to redesign their backbone and introduce a feature pyramid also in the pre-training stage (Li et al., 2022; Wang et al., 2021). Conversely, we use the pre-trained networks whose results are reported in Table 4, where the baseline is ViT-S/16 and our approach corresponds to ViT-S/16 + SAR. For the object detection task, following (Girshick, 2015), we use the trainval sets of PASCAL VOC 2007 and 2012 (Everingham et al., 2010) (16.5K training images) to fine-tune the two models using ViTDet, and the test set of PASCAL VOC 2007 for evaluation. The results, reported in Table 5, show that the model pre-trained using SAR outperforms the baseline by more than 2 points, an increment even larger than the boost obtained in the classification task used during pre-training (Table 4). Similarly, for the segmentation task, we use the PASCAL VOC-12 trainval set for fine-tuning and the PASCAL VOC 2007 test set for evaluation. Table 6 shows that the model pre-trained with SAR achieves an improvement of more than 2.5 mIoU points compared to the baseline. These detection and segmentation improvements confirm that the local inductive bias introduced in ViT using SAR can be very useful for localization tasks, especially when the fine-tuning data are scarce, as in PASCAL VOC.

Transfer learning with different fine-tuning protocols In this battery of experiments, we evaluate SAR in a transfer learning scenario with classification tasks. We adopt the four datasets used in Dosovitskiy et al. (2021); Touvron et al. (2021); Chen et al. (2021); Caron et al. (2021): CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), Oxford Flowers102 (Nilsback & Zisserman, 2008), and Oxford-IIIT-Pets (Everingham et al., 2010). The standard transfer learning protocol consists of pre-training on IN-1K and then fine-tuning on each dataset. This corresponds to the first row in Table 7, where the IN-1K pre-trained model is ViT-S/16 in Table 4. The next three rows show different pre-training/fine-tuning configurations, in which we use SAR in one of the two phases or in both (see the Appendix for more details). All the configurations lead to an overall improvement of the accuracy with respect to the baseline, and show that SAR can be used flexibly. For instance, SAR can be used when fine-tuning a VT trained in a standard way, without the need to re-train it on ImageNet.

Out-of-distribution testing We test the robustness of our ViT trained with SAR when the testing distribution is different from the training distribution. Specifically, following (Bai et al., 2021), we use two different testing sets: (1) ImageNet-A (Hendrycks et al., 2021), which contains real-world images collected from challenging scenarios (e.g., occlusions, fog scenes, etc.), and (2) ImageNet-C (Hendrycks & Dietterich, 2019), which is designed to measure the model robustness against common image corruptions.

Table 8 Out-of-distribution testing on ImageNet-A (IN-A) and ImageNet-C (IN-C)

Note that training is done only on IN-1K. Thus, in Table 8, ViT-S/16 and ViT-S/16 + SAR correspond to the models we trained on IN-1K, whose results on the IN-1K standard validation set are reported in Table 4. ImageNet-A and ImageNet-C are used only for testing, hence they are useful to assess the behaviour of a model when evaluated on a distribution different from the training distribution (Bai et al., 2021). The results reported in Table 8 show that SAR can significantly improve the robustness of ViT (note that, with the mCE metric, the lower the better (Bai et al., 2021)). We presume that this is a side-effect of our spatial entropy loss minimization, which leads to heads that usually focus on the foreground objects, therefore reducing the dependence on the variability of the background appearance.

Different VT architectures Finally, we show that SAR can be used with VTs of different capacities and with architectures different from ViT. For this purpose, we plug SAR into the following VT architectures: ViT-S/16 (Touvron et al., 2021), T2T (Yuan et al., 2021), PVT (Wang et al., 2021) and CvT (Wu et al., 2021). Specifically, T2T, PVT and CvT are hybrid architectures, which use typical CNN mechanisms to introduce a local inductive bias into the VT training (Sects. 1 and 2). We omit other common frameworks such as, for instance, Swin (Liu et al., 2021) because of the lack of a [CLS] token in their architecture. Although the [CLS] token used, e.g., in Sect. 4.1 to compute \(S^h\) can potentially be replaced by a vector obtained by average-pooling all the patch embeddings, we leave this for future investigations. Moreover, for computational reasons, we focus on small-medium capacity VTs (see Table 10 for details on the number of parameters of each VT). Importantly, for each tested method, we use the original training protocol developed by the corresponding authors, including, e.g., the learning rate schedule, the batch size, the VT-specific hyperparameter values and the data-augmentation type used to obtain the corresponding published results, both when we train the baseline and when we train using SAR. Moreover, as usual (Sect. 5.1), we keep the only SAR hyperparameter fixed (\(\lambda = 0.01\)). Although better results could likely be obtained by adopting the common practice of hyperparameter tuning (including the VT-specific hyperparameters), our goal is to show that SAR can be easily used in different VTs, increasing their final testing accuracy. The results reported in Table 9 and Table 10 show that SAR improves all the tested VTs, independently of their specific architecture, model capacity or training protocol. Note that both PVT and CvT have a final grid resolution of \(7 \times 7\), which is smaller than the \(14 \times 14\) grid used in ViT and T2T, and this probably has a negative impact on our spatial entropy loss.

Overall, the results reported in Tables 9 and 10: (1) Confirm that SAR is mostly useful with smaller datasets (the relative improvements on IN-100 are significantly larger than those obtained on IN-1K). (2) Show that the object-based inductive bias introduced when training with SAR is (partially) complementary to the local bias embedded in the hybrid VT architectures, as witnessed by the positive boost obtained when these VTs are used jointly with SAR. (3) Show that, on IN-1K, the accuracy of ViT-S/16 + SAR is comparable with that of the hybrid VTs (without SAR). However, the advantage of ViT-S/16 + SAR is its simplicity: it does not require drastic changes to the original ViT architecture, which is quickly becoming a de facto standard in many vision and vision-language tasks (Sect. 2).

Table 9 IN-100 experiments with different VTs (100 training epochs). The results of the baselines have been obtained by us using the corresponding publicly available code
Table 10 IN-1K experiments with different VTs (300 training epochs). All results but ours are reported from the corresponding paper. The number of parameters is in millions
Fig. 4
A qualitative comparison between the attention maps generated by ViT-S/16 and ViT-S/16 + SAR. For each image, we show all the 6 attention maps (\(A^h\)) corresponding to the 6 last-block heads, computed using only the [CLS] token query

5.3 Attention map analysis

This section qualitatively and quantitatively analyses the attention maps obtained using SAR. Note that, as mentioned in Sects. 1 and 2, we do not directly use the attention map clusters for segmentation tasks or as input to a post-processing step. Thus, the goal of this analysis is to show that the spatial entropy loss minimization effectively results in attention maps with spatial clusters, leaving their potential use for segmentation-based post-processing as future work.

Table 11 A comparison of the segmentation properties of the attention maps on PASCAL VOC-12, measuring the Jaccard similarity

Figure 4 visually compares the attention maps obtained with ViT-S/16 and ViT-S/16 + SAR. As expected, standard training generates attention maps with a widely spread structure. Conversely, using SAR, a semantic segmentation structure clearly emerges. In the Appendix, we show additional results.

For a quantitative analysis, we follow the protocol used in (Caron et al., 2021; Naseer et al., 2021), where the Jaccard similarity is used to compare the ground-truth segmentation masks of the objects in PASCAL VOC-12 with the thresholded attention masks of the last ViT block. Specifically, the attention maps of all the heads are thresholded to keep 60% of the mass, and the head with the highest Jaccard similarity with the ground-truth is selected (Caron et al., 2021; Naseer et al., 2021). Table 11 shows that SAR significantly improves the segmentation results, quantitatively confirming the qualitative analysis in Fig. 4.
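For reference, the 60%-of-mass thresholding and the Jaccard similarity can be computed as in the following sketch. This is our assumption of how the protocol maps onto code (not the authors' or DINO's evaluation script); attn is one head's flattened [CLS] attention map over the patch grid, gt is the ground-truth mask downsampled to the same grid, and the head with the highest Jaccard is then selected.

```python
# Sketch of the evaluation protocol: keep the smallest set of patches covering 60%
# of the attention mass, then compare with the ground truth via Jaccard similarity.
import torch

def mass_threshold(attn, keep=0.6):
    flat = attn.flatten()
    vals, idx = flat.sort(descending=True)
    cum = vals.cumsum(0) / vals.sum()
    n_keep = int((cum < keep).sum().item()) + 1      # smallest prefix covering 60% of the mass
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[idx[:n_keep]] = True
    return mask.view_as(attn)

def jaccard(pred, gt):                               # pred, gt: boolean masks of equal shape
    inter = (pred & gt).sum().float()
    union = (pred | gt).sum().float()
    return (inter / union).item()
```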

6 Conclusions

In this paper we proposed SAR, a regularization method which exploits the connectedness property of objects to introduce a local inductive bias into the VT training. By penalizing spatially disordered attention maps, an additional self-supervised signal can be extracted from the sample images, thereby reducing the reliance on large numbers of labeled training samples. Using different downstream tasks and training–testing protocols (including fine-tuning and out-of-distribution testing), we showed that SAR can significantly boost the accuracy of a ViT backbone, especially when the training data are scarce. Although SAR can also be used jointly with hybrid VTs, its main advantage over the latter is the possibility of being easily plugged into the original ViT backbone, whose architecture is widely adopted in many vision and vision-language tasks.

Future work SAR can be extended to VTs for videos. In fact, a “temporal inductive bias” contained in videos is that natural objects usually move smoothly and, thus, can be represented by a few connected 3D regions in, e.g., a sequence of T consecutive frames. Eq. 4 can then be extended, e.g., by comparing the [CLS] token query with all the patch token keys contained in these T frames, keeping the rest of the algorithm unchanged.

Another promising direction for future work is combining SAR with DRLoc (Liu et al., 2021): they are both training regularization approaches for VTs, and their joint use could lead to a further improvement of the sample efficiency.

Limitations Since training VTs is very computationally expensive, in our experiments we used only small/medium capacity VTs. We leave the extension of our empirical analysis to larger capacity VTs for future work. For the same computational reasons, we have not tuned hyperparameters on the datasets. However, we believe that the SAR accuracy improvement, obtained in all the tested scenarios without hyperparameter tuning, further shows its robustness and ease of use.