
1 Introduction

Deep metric learning has been actively researched recently. In deep metric learning, the feature embedding function is modeled as a deep neural network, which embeds input images into a feature embedding space subject to a desired condition: the feature embeddings of similar images should be close to each other while those of dissimilar images should be far apart. To satisfy this condition, many loss functions based on the distances between embeddings have been proposed [3, 4, 6, 14, 25, 27, 28, 29, 33, 37]. Deep metric learning has been successfully applied to the image retrieval task on popular benchmarks such as the CARS-196 [13], CUB-200-2011 [35], Stanford online products [29], and in-shop clothes retrieval [18] datasets.

Ensemble is a widely used technique in which multiple learners are trained and combined into a model that performs better than any individual learner. For deep metric learning, an ensemble concatenates the feature embeddings learned by multiple learners, which often yields a better embedding space under the given constraints on distances between image pairs. The keys to a successful ensemble are high performance of the individual learners as well as diversity among them. Several methods have been proposed to achieve this [22, 39]. However, there has not been much research on architectures designed to induce diversity of feature embeddings in deep metric learning.

Our contribution is a novel framework that encourages diversity in feature embeddings. To this end, we design an architecture with multiple attention modules, one per learner. By attending to different locations for different learners, diverse feature embedding functions are trained. They are regularized with a divergence loss, which aims to differentiate the feature embeddings produced by different learners. Equipped with this, we present the M-way attention-based ensemble (ABE-M), which learns feature embeddings with M diverse attention masks. The proposed architecture is shown in Fig. 1(b). We compare our model to an M-heads ensemble baseline [16], in which a different feature embedding function is trained for each learner (Fig. 1(a)), and experimentally demonstrate that the proposed ABE-M gives significantly better results with fewer parameters.

Fig. 1. Difference between the M-heads ensemble and the attention-based ensemble. Both assume shared parameters for the bottom layers (S). (a) In the M-heads ensemble, different feature embedding functions are trained for different learners (\(G_1, G_2, G_3\)). (b) In the attention-based ensemble, a single feature embedding function (G) is trained while each learner learns a different attention module (\(A_1, A_2, A_3\))

2 Related Works

Deep Metric Learning and Ensemble. The aim of deep metric learning is to find an embedding function \(f: \mathcal {X} \rightarrow \mathcal {Y}\) which maps samples x from a data space \(\mathcal {X}\) to a feature embedding space \(\mathcal {Y}\) so that \(f({x}_i)\) and \(f({x}_j)\) are closer in some metric when \({x}_i\) and \({x}_j\) are semantically similar. To achieve this goal, contrastive [4, 6] and triplet [25, 37] losses have been proposed for deep metric learning. More recently, advanced losses have been introduced, such as the lifted structured loss [29], histogram loss [33], N-pair loss [27], and clustering losses [14, 28].

There has also been research on networks incorporating ensemble techniques, which report better performance than single networks. Earlier deep learning approaches are based on directly averaging the same network trained with different initializations [15, 24] or with different subsets of the training samples [31, 32]. Following these works, parameter sharing was introduced by Bachman et al. [2] under the name pseudo-ensembles. Another parameter-sharing ensemble approach is proposed by Lee et al. [16]. Dropout [30] can be interpreted as an ensemble approach that combines an exponential number of highly correlated networks. In addition, Veit et al. [34] state that residual networks behave like ensembles of relatively shallow networks. Recently, ensemble techniques have been applied to deep metric learning as well. Yuan et al. [39] propose to ensemble a set of models with different complexities in a cascaded manner. They train deeply supervised cascaded networks in which easier examples are handled by earlier layers of the networks while harder examples are further exploited in later layers. Opitz et al. [22] use online gradient boosting to train each learner in the ensemble and try to reduce correlation among learners by re-weighting the training samples. Opitz et al. [21] propose an efficient averaging strategy with a novel DivLoss that encourages diversity among the individual learners.

Attention Mechanism. Attention mechanisms have been used in various computer vision problems. Earlier works utilize RNN architectures for attention modeling [1, 19, 26]. These RNN-based attention models solve classification tasks by detecting object parts: they sequentially select attention regions from images and then learn feature representations for each part. Besides RNN approaches, Liu et al. [17] propose fully convolutional attention networks, which adopt hard attention from a region generator, and Zhao et al. [40] propose diversified visual attention networks, which use different scalings or croppings of input images for different attention masks. In contrast, our ABE-M is able to learn diverse attention masks without relying on a region generator. In addition, ABE-M uses soft attention, so the parameter update is straightforward via backpropagation in a fully gradient-based way, while the previous approaches in [1, 17, 19, 26, 40] use hard attention, which requires policy gradient estimation.

Jaderberg et al. [11] propose spatial transformer networks, which model the attention mechanism using parameterized image transformations. Unlike the aforementioned approaches, their model is differentiable and thus can be trained in a fully gradient-based way. However, their attention is limited to a set of predefined, parameterized transformations and cannot yield arbitrary attention masks.

3 Attention-Based Ensemble

3.1 Deep Metric Learning

Let \(f:\mathcal {X} \rightarrow \mathcal {Y}\) be an isometric embedding function between metric spaces \(\mathcal {X} \) and \(\mathcal {Y}\), where \(\mathcal {X} \) is an \(N_\mathcal {X} \)-dimensional metric space with an unknown metric function \(d_\mathcal {X} \) and \(\mathcal {Y}\) is an \(N_\mathcal {Y}\)-dimensional metric space with a known metric function \(d_\mathcal {Y} \). For example, \(\mathcal {Y}\) could be a Euclidean space with the Euclidean distance or the unit sphere in a Euclidean space with the angular distance.

Our goal is to approximate f with a deep neural network from a dataset \(\mathcal {D}=\{(x^{(1)},x^{(2)},d_\mathcal {X} (x^{(1)},x^{(2)} ))\,|\,x^{(1)},x^{(2)}\in \mathcal {X} \}\) of samples from \(\mathcal {X} \). In case we cannot obtain samples of the metric \(d_\mathcal {X} \), we use the label information of a labeled dataset as relative constraints on the metric \(d_\mathcal {X} \). For example, from a dataset \(\mathcal {D}_\mathcal {C}=\{(x,c)\,|\,x\in \mathcal {X} ,c\in \mathcal {C}\}\), where \(\mathcal {C}\) is the set of labels, the contrastive metric constraint for \((x_i,c_i ),(x_j,c_j )\in \mathcal {D}_\mathcal {C} \) could be defined as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} d_\mathcal {X} (x_i,x_j ) =0 , &{}\text {if}\,c_i=c_j ; \\ d_\mathcal {X} (x_i,x_j )>m_c, &{}\text {if}\,c_i\ne c_j, \end{array}\right. } \end{aligned}$$
(1)

where \(m_c\) is an arbitrary margin. The triplet metric constraint for \((x_i,c_i ),\) \((x_j,c_j ),\) \((x_k,c_k )\) \(\in \mathcal {D}_\mathcal {C} \) could be defined as follows:

$$\begin{aligned} d_\mathcal {X} (x_i,x_j) +m_t <d_\mathcal {X} (x_i,x_k), ~~c_i=c_j\,\text {and}\,c_i \ne c_k , \end{aligned}$$
(2)

where \(m_t\) is a margin. Note that these metric constraints are some choices of how to model \(d_\mathcal {X} \), not those of how to model f.

An embedding function f is an isometric, or distance-preserving, embedding if for every \(x_i,x_j \in \mathcal {X}\) we have \(d_\mathcal {X} (x_i,x_j )=d_\mathcal {Y} (f(x_i),f(x_j) )\). In order to obtain an isometric embedding function f, we optimize f so that the points embedded into \(\mathcal {Y}\) produce exactly the same metric, or obey the same metric constraints, as \(d_\mathcal {X}\).

3.2 Ensemble for Deep Metric Learning

A classical ensemble for deep metric learning averages the metrics of multiple embedding functions. We define the ensemble metric function \(d_{\mathrm {ensemble}} \) for deep metric learning as follows:

$$\begin{aligned} d_{\mathrm {ensemble},{(f_1,\dots ,f_{M})}} (x_i,x_j )=\frac{1}{M} \sum _{m=1}^{M} d_\mathcal {Y} (f_m(x_i ),f_m(x_j)), \end{aligned}$$
(3)

where \(f_m\) is an independently trained embedding function, which we call a learner.
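As a concrete illustration, the following is a minimal NumPy sketch of Eq. (3), assuming each learner is given as a callable that maps an image to its embedding vector and that \(d_\mathcal {Y}\) is the Euclidean distance; it is an illustration, not part of the original implementation.

```python
import numpy as np

def ensemble_distance(x_i, x_j, learners):
    """Eq. (3): average the per-learner distances d_Y(f_m(x_i), f_m(x_j))
    over M independently trained embedding functions f_m."""
    dists = [np.linalg.norm(f(x_i) - f(x_j)) for f in learners]  # d_Y taken as Euclidean
    return float(np.mean(dists))
```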

In addition to the classical ensemble, we can consider an ensemble of two-step embedding functions. Consider a function \(s:\mathcal {X} \rightarrow \mathcal {Z}\) which is an isometric embedding between metric spaces \(\mathcal {X} \) and \(\mathcal {Z}\), where \(\mathcal {X} \) is an \(N_\mathcal {X}\)-dimensional metric space with an unknown metric function \(d_\mathcal {X} \) and \(\mathcal {Z}\) is an \(N_\mathcal {Z} \)-dimensional metric space with an unknown metric function \(d_\mathcal {Z}\). We also consider an isometric embedding \(g:\mathcal {Z} \rightarrow \mathcal {Y}\), where \(\mathcal {Y}\) is an \(N_\mathcal {Y}\)-dimensional metric space with a known metric function \(d_\mathcal {Y}\). If we combine them into one function \(b(x)=g(s(x)), x \in \mathcal {X} \), the combined function is also an isometric embedding \(b:\mathcal {X} \rightarrow \mathcal {Y}\) between metric spaces \(\mathcal {X} \) and \(\mathcal {Y}\).

Like the parameter sharing ensemble [16], with multiple independently trained \(g_m\) and a single s, we can obtain multiple embedding functions \(b_m:\mathcal {X} \rightarrow \mathcal {Y}\) as follows:

$$\begin{aligned} b_m(x)=g_m(s(x)) . \end{aligned}$$
(4)

We are interested in another case, where there are multiple embedding functions \(b_m:\mathcal {X} \rightarrow \mathcal {Y}\) with multiple \(s_m\) and a single g as follows:

$$\begin{aligned} b_m(x)=g(s_m(x)) . \end{aligned}$$
(5)

Note that a point in \(\mathcal {X} \) can be embedded into multiple points in \(\mathcal {Y}\) by the multiple learners. In Eq. (5), \(s_m\) does not have to preserve the label information; it only has to preserve the metric. In other words, a point with a given label could be mapped to multiple locations in \(\mathcal {Z}\) by the multiple \(s_m\) and would finally be mapped to multiple locations in \(\mathcal {Y}\). If this were an ensemble of classification models, where g approximates the distribution of the labels, all \(s_m\) would have to be label-preserving functions because the outputs of \(s_m\) become the inputs of the single classification model g.

For the embedding function of Eq. (5), we want each \(s_m\) to attend to a different aspect of the data x in \(\mathcal {X}\) while maintaining a single embedding function g which disentangles the complex manifold \(\mathcal {Z}\) into Euclidean space. By exploiting the fact that a point x in \(\mathcal {X}\) can be mapped to multiple locations in \(\mathcal {Y}\), we can encourage each \(s_m\) to map x to a distinctive point \(z_m\) in \(\mathcal {Z}\). Given an isometric embedding \(g:\mathcal {Z} \rightarrow \mathcal {Y}\), if we enforce the points \(y_m\) in \(\mathcal {Y}\) mapped from x to be far from each other, the points \(z_m\) in \(\mathcal {Z}\) mapped from x will be far from each other as well. Note that we cannot apply this divergence constraint to \(z_m\) directly because the metric \(d_\mathcal {Z}\) in \(\mathcal {Z}\) is unknown. We train each \(b_m\) to be an isometric function between \(\mathcal {X} \) and \(\mathcal {Y}\) while applying the divergence constraint among the \(y_m\) in \(\mathcal {Y}\). If we were to apply the divergence constraint to classical ensemble models or multi-head ensemble models, it would not necessarily induce diversity because each \(f_m\) or \(g_m\) could arbitrarily compose a different metric space in \(\mathcal {Y}\) (refer to the experimental results in Sect. 6.2). With the attention-based ensemble, the union of the metric spaces induced by the multiple \(s_m\) is mapped by a single embedding function g.

Fig. 2. Illustration of the feature embedding space and the divergence loss. Different car brands are represented as different colors: red, green and blue. The feature embeddings of each learner are depicted as squares with different mask patterns. The divergence loss pulls apart the feature embeddings of different learners for the same input (Color figure online)

3.3 Attention-Based Ensemble Model

As one implementation of Eq. (5), we propose the attention-based ensemble model, which is mainly composed of two parts: a feature extraction module F(x) and an attention module A(x). For feature extraction, we assume a general multi-layer perceptron model as follows:

$$\begin{aligned} F(x)=h_l(h_{l-1}( \cdots (h_2(h_1(x))))) \end{aligned}$$
(6)

We break it into two parts at a branching point i: \(S(\cdot )\) includes \(h_i\), \(h_{i-1}, \dots , h_{1}\), and \(G(\cdot )\) includes \(h_l\), \(h_{l-1}\), \(\dots \), \(h_{i+1}\). We call \(S(\cdot )\) the spatial feature extractor and \(G(\cdot )\) the global feature embedding function, with respect to the output of each function. For the attention module, we also assume a general multi-layer perceptron model, which outputs a three-dimensional blob with channel, width, and height as an attention mask. Each element of the attention mask is assumed to have a value between 0 and 1. Given the aforementioned two modules, the combined embedding function \(B_m(x)\) for learner \(m\) is defined as follows:

$$\begin{aligned} B_m(x) = G(S(x) \circ A_m(S(x)) ), \end{aligned}$$
(7)

where \(\circ \) denotes element-wise product (Fig. 1(b)).

Note that the same feature extraction module is shared across the different learners, while each learner has its own attention module \(A_m(\cdot )\). The attention function \(A_m(S(x))\) outputs an attention mask of the same size as the output of S(x). This attention mask is applied to the output feature of S(x) with an element-wise product. The attended feature \(S(x) \circ A_m(S(x))\) is then fed into the global feature embedding function \(G(\cdot )\) to generate an embedding feature vector. If all the elements of the attention mask are 1, the model \(B_m(x)\) reduces to a conventional multi-layer perceptron model.
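The forward pass of Eq. (7) can be summarized with the following PyTorch-style sketch; the constructor arguments are placeholders for the shared spatial feature extractor, the per-learner attention modules, and the shared global feature embedding function, and this is an illustration rather than the authors' Caffe implementation.

```python
import torch.nn as nn

class ABEM(nn.Module):
    """Attention-based ensemble: shared S(.) and G(.), per-learner A_m(.)."""
    def __init__(self, spatial_extractor, attention_modules, global_embedding):
        super().__init__()
        self.S = spatial_extractor                 # shared spatial feature extractor S(.)
        self.A = nn.ModuleList(attention_modules)  # one attention module A_m(.) per learner
        self.G = global_embedding                  # shared global feature embedding G(.)

    def forward(self, x):
        s = self.S(x)                              # spatial feature map S(x)
        embeddings = []
        for A_m in self.A:
            mask = A_m(s)                          # attention mask in [0, 1], same size as s
            embeddings.append(self.G(s * mask))    # Eq. (7): B_m(x) = G(S(x) o A_m(S(x)))
        return embeddings                          # list of M embeddings, one per learner
```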

3.4 Loss

The loss for training the aforementioned attention model is defined as:

$$\begin{aligned} L(\{(x_i,c_i)\}) = \sum _mL_{\mathrm {metric}, (m)}(\{(x_i,c_i)\}) + \lambda _{\mathrm {div}} L_{\mathrm {div}}(\{x_i\}), \end{aligned}$$
(8)

where \(\{(x_i,c_i)\}\) is the set of all training samples and their labels, \(L_{\mathrm {metric}, (m)}(\cdot )\) is the loss for the isometric embedding of the \(m\)-th learner, \(L_{\mathrm {div}}(\cdot )\) is a regularizing term which diversifies the feature embeddings of the learners \(B_m(x)\), and \(\lambda _{\mathrm {div}}\) is a weighting parameter that controls the strength of the regularizer. More specifically, the divergence loss \(L_{\mathrm {div}}\) is defined as follows:

$$\begin{aligned} L_{\mathrm {div}}(\{x_i\}) = \sum _i \sum _{p,q} \max (0, m_{\mathrm {div}} - d_{\mathcal {Y}}(B_p(x_i), B_q(x_i))^2), \end{aligned}$$
(9)

where \(\{x_i\}\) is the set of all training samples, \(d_{\mathcal {Y}}\) is the metric in \(\mathcal {Y}\), and \(m_{\mathrm {div}}\) is a margin. A pair \((B_p(x_i), B_q(x_i))\) consists of the feature embeddings of a single image produced by two different learners. We call such a pair a self pair from now on, while positive and negative pairs refer to pairs of feature embeddings with the same and different labels, respectively.
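A hedged PyTorch sketch of the divergence loss in Eq. (9) is given below; it assumes `embeddings` is the list of M per-learner embedding tensors of shape (batch, dim) for the same batch of images and that \(d_\mathcal {Y}\) is the Euclidean distance.

```python
import itertools
import torch.nn.functional as F

def divergence_loss(embeddings, margin_div=1.0):
    """Eq. (9): hinge on the squared distance between the two embeddings of the
    same image produced by different learners (a 'self pair')."""
    loss = 0.0
    for B_p, B_q in itertools.combinations(embeddings, 2):  # all learner pairs (p, q)
        d_sq = ((B_p - B_q) ** 2).sum(dim=1)                 # squared Euclidean distance per image
        loss = loss + F.relu(margin_div - d_sq).sum()        # hinged, summed over the batch
    return loss
```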

The divergence loss encourages each learner to attend to a different part of the input image by increasing the distance between the points to which the input image is embedded (Fig. 2). Since the learners share the same feature extraction module, the only differentiating part is the attention module. Note that our proposed loss is not applied directly to the attention masks. In other words, the attention masks of different learners may overlap. It is also possible for some attention masks to focus on a small region while others focus on a larger region that includes it.

4 Implementation

We perform all our experiments using GoogLeNet [32] as the base architecture. As shown in Fig. 3, we use the network up to the max pooling layer following the inception(3b) block as our spatial feature extractor \(S(\cdot )\) and the remaining network as our global feature embedding function \(G(\cdot )\). In our implementation, we simplify the attention module \(A_m(\cdot )\) as \(A_m'(C(\cdot ))\), where \(C(\cdot )\) consists of inception(4a) to inception(4e) from GoogLeNet and is shared among all M learners, and \(A_m'(\cdot )\) consists of a convolution layer with 480 kernels of size 1 \(\times \) 1 to match the output of \(S(\cdot )\) for the element-wise product. This is for efficiency in terms of memory and computation time. Since \(C(\cdot )\) is shared across different learners, the forward and backward propagation time, memory usage, and number of parameters are reduced compared to having a separate \(A_m(\cdot )\) for each learner (without any shared part). Our preliminary experiments showed no performance drop with this choice of implementation.
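The simplified attention module \(A_m(\cdot ) = A_m'(C(\cdot ))\) can be sketched as follows; the shared trunk `C` stands for inception(4a) to inception(4e), each learner owns only the 1 \(\times \) 1 convolution, and squashing the mask into [0, 1] with a sigmoid is our assumption, since the text only states that mask values lie in that range.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """A_m(s) = A'_m(C(s)): shared trunk C plus a per-learner 1x1 convolution."""
    def __init__(self, shared_trunk, trunk_out_channels, mask_channels=480):
        super().__init__()
        self.C = shared_trunk                          # shared across learners (inception 4a-4e)
        self.A_prime = nn.Conv2d(trunk_out_channels, mask_channels, kernel_size=1)

    def forward(self, s):
        return torch.sigmoid(self.A_prime(self.C(s)))  # mask in [0, 1] (sigmoid is assumed)
```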

We study the effects of different branching points and of the depth of the attention module in Sect. 6.3. We use the contrastive loss [3, 4, 6] as our distance metric loss function, which is defined as follows:

$$\begin{aligned} \begin{aligned} L_{\mathrm {metric}, (m)}(\{(x_i,c_i)\})&= \frac{1}{N}\sum _{i,j} (1-y_{i,j})[m_{c}-D_{m,i,j}^2]_+ + y_{i,j} D_{m,i,j}^2, \\ D_{m,i,j}&=d_{\mathcal {Y}}(B_m(x_i), B_m(x_j)), \end{aligned} \end{aligned}$$
(10)

where \(\{(x_i,c_i)\}\) is the set of all training samples and corresponding labels, N is the number of training pairs, \(y_{i,j}\) is a binary indicator of whether the label \(c_i\) is equal to \(c_j\), \(d_{\mathcal {Y}}\) is the Euclidean distance, \([\cdot ]_+\) denotes the hinge function \(\max (0,\cdot )\), and \(m_{c}\) is the margin for the contrastive loss. Both margins, \(m_{c}\) and \(m_\mathrm {div}\) (in Eq. 9), are set to 1.
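For reference, a minimal PyTorch sketch of the per-learner contrastive loss in Eq. (10); `emb_i` and `emb_j` are assumed to hold the embeddings of the two images of each pair from one learner, and `y` is 1 for positive pairs and 0 for negative pairs.

```python
import torch.nn.functional as F

def contrastive_loss(emb_i, emb_j, y, margin_c=1.0):
    """Eq. (10) for a single learner m, averaged over the pairs in the batch."""
    d_sq = ((emb_i - emb_j) ** 2).sum(dim=1)       # squared Euclidean distance D_{m,i,j}^2
    pos_term = y * d_sq                            # pull positive pairs (y = 1) together
    neg_term = (1 - y) * F.relu(margin_c - d_sq)   # push negative pairs (y = 0) beyond the margin
    return (pos_term + neg_term).mean()            # 1/N sum over pairs
```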

Fig. 3. The implementation of attention-based ensemble (ABE-M) using GoogLeNet

We implement the proposed ABE-M method using the Caffe [12] framework. During training, the network is initialized from a network pre-trained on the ImageNet ILSVRC dataset [24]. The final layer of the network and the convolution layer of the attention module are randomly initialized as proposed by Glorot et al. [5]. For the optimizer, we use stochastic gradient descent with a momentum of 0.9, and we select the base learning rate by tuning on a validation set of each dataset.

We follow earlier works [29, 38] for preprocessing and, unless stated otherwise, use an input image size of 224 \(\times \) 224. All training and testing images are scaled such that their longer side is 256, keeping the aspect ratio fixed and padding the shorter side, to obtain 256 \(\times \) 256 images. During training, we randomly crop images to 224 \(\times \) 224 and then randomly flip them horizontally. During testing, we use the center crop. We subtract the channel-wise mean of the ImageNet dataset from the images. For the training and testing images of the cropped datasets, we follow the approach in [38]. For the CARS-196 [13] cropped dataset, cropped images scaled to 256 \(\times \) 256 are used, while for the CUB-200-2011 [35] cropped dataset, cropped images scaled to 256 \(\times \) 256 with fixed aspect ratio and the shorter side padded are used.

We run our experiments on an nVidia Tesla M40 GPU (24 GB of GPU memory), which limits our batch size to 64 for the ABE-8 model. Unless stated otherwise, we use a batch size of 64 for our experiments. We sample our mini-batches by first randomly sampling 32 images and then adding positive pairs for the first 16 images and negative pairs for the next 16 images, making a mini-batch of size 64. Unless mentioned otherwise, we report the results of our method using an embedding size of 512, which makes the embedding size of each individual learner 512/M.
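The mini-batch construction described above can be sketched as follows; this is a plain-Python illustration under the assumption that every class contains at least two images, not the sampling code actually used.

```python
import random

def sample_minibatch(images, labels, num_anchors=32):
    """32 anchors; a positive partner for the first 16 and a negative partner
    for the last 16, giving a mini-batch of 64 images."""
    anchors = random.sample(range(len(images)), num_anchors)
    batch, pair_is_positive = [], []
    for rank, i in enumerate(anchors):
        want_positive = rank < num_anchors // 2
        candidates = [j for j in range(len(images))
                      if j != i and (labels[j] == labels[i]) == want_positive]
        j = random.choice(candidates)              # assumes at least one candidate exists
        batch.extend([images[i], images[j]])
        pair_is_positive.append(want_positive)
    return batch, pair_is_positive
```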

5 Evaluation

We use all commonly used image retrieval datasets for our experiments and the Recall@K metric for our evaluation. During testing, we compute the feature embeddings of all test images from our network. For every test image, we then retrieve the top K most similar images from the test set, excluding the test image itself. The recall score for that test image is 1 if at least one of the K retrieved images has the same label as the test image. We average over the whole test set to obtain Recall@K. We evaluate the model every 1,000 iterations and report the results for the iteration with the highest Recall@1.
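The Recall@K computation described above corresponds to the following NumPy sketch, which uses a brute-force pairwise distance matrix and therefore assumes the test set fits in memory.

```python
import numpy as np

def recall_at_k(embeddings, labels, k):
    """A query scores 1 if any of its K nearest neighbors (excluding itself)
    shares its label; scores are averaged over the test set."""
    embeddings, labels = np.asarray(embeddings), np.asarray(labels)
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # exclude the query image itself
    knn = np.argsort(dists, axis=1)[:, :k]          # indices of the K nearest neighbors
    hits = (labels[knn] == labels[:, None]).any(axis=1)
    return hits.mean()
```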

We demonstrate the effectiveness of the proposed ABE-M method on all the datasets commonly used in image retrieval tasks. We follow the same train-test split as [29] for a fair comparison with other works.

  • CARS-196 [13] dataset contains images of 196 different classes of cars and is primarily used for our experiments. The dataset is split into 8,144 training images and 8,041 testing images (98 classes in both).

  • CUB-200-2011 [35] dataset consists of 11,788 images of 200 different bird species. We use the first 100 classes for training (5,864 images) and the remaining 100 classes for testing (5,924 images).

  • Stanford online products (SOP) [29] dataset has 22,634 classes with 120,053 product images. 11,318 classes are used for training (59,551 images) while the other 11,316 classes are used for testing (60,502 images).

  • In-shop clothes retrieval [18] dataset contains 11,735 classes of clothing items with 54,642 images. Following a similar protocol to [29], we use 3,997 classes for training (25,882 images) and the other 3,985 classes for testing (28,760 images). The test set is partitioned into a query set of 3,985 classes (14,218 images) and a retrieval database set of 3,985 classes (12,612 images).

Since the CARS-196 and CUB-200-2011 datasets also provide bounding boxes, we report results using both the original and the cropped images for a fair comparison.

6 Experiments

6.1 Comparison of ABE-M with M-heads

To show the effectiveness of our ABE-M method, we first compare the performance of ABE-M and the M-heads ensemble (Fig. 1(a)) with varying ensemble embedding sizes (denoted with superscripts) on the CARS-196 dataset. As shown in Table 1 and Fig. 4, our method outperforms the M-heads ensemble by a significant margin. The number of model parameters for ABE-M is much smaller than that of the M-heads ensemble because the global feature embedding function \(G(\cdot )\) is shared among learners. However, ABE-M requires more FLOPs because of the extra computation of the attention modules; this difference becomes increasingly insignificant as M grows.

ABE-1 contains only one attention module, hence it is not an ensemble and does not use the divergence loss. ABE-1 performs similarly to 1-head. We also report the performance of the individual learners of each ensemble. From Table 1, we can see that the performance of the ABE-\(M^{512}\) ensemble increases with increasing M. The performance of the individual learners also increases with increasing M despite the decrease in the embedding size of individual learners (512/M). The same increase is not seen for M-heads. Further, comparing ABE-1\(^{64}\), ABE-2\(^{128}\), ABE-4\(^{256}\) and ABE-8\(^{512}\), where all individual learners have embedding size 64, we see a clear increase in the recall of individual learners with increasing M.

Table 1. Recall@K(%) comparison with baseline on CARS-196. Superscript denotes ensemble embedding size
Fig. 4. Recall@1 comparison with the baseline on CARS-196 as a function of (a) number of parameters and (b) FLOPs. Both ABE-M and M-heads have an embedding size of 512

Fig. 5. Histograms of the cosine similarity of positive (blue), negative (red), and self (green) pairs trained with different methods. A self pair refers to a pair of feature embeddings of the same image from different learners. (a) Attention-based ensemble (ABE-8) using the proposed loss, (b) attention-based ensemble (ABE-8) without the divergence loss, (c) 8-heads ensemble, (d) 8-heads ensemble with the divergence loss. In the case of the attention-based ensemble, the divergence loss is necessary for each learner to be trained to produce different features by attending to different locations; without it, all learners learn very similar embeddings. Meanwhile, in the case of the M-heads ensemble, applying the divergence loss has no effect. (Color figure online)

Table 2. Recall@K(%) comparison in ABE-M ensemble without divergence loss \(L_\mathrm {div}\) on CARS-196
Fig. 6. The attention masks learned by each learner of ABE-8 on CARS-196. Due to space limitations, results from only three of the eight learners and three of the 480 channels are illustrated. Each column shows the result for a different input image. Different learners attend to different parts of the car such as the upper part, the bottom part, the roof, the tires, the lights, and so on

6.2 Effects of Divergence Loss

ABE-M Without Divergence Loss. To analyze the effectiveness of the divergence loss in ABE-M, we conduct experiments without it on CARS-196 and show the results in Table 2. As we can see, ABE-M without the divergence loss performs similarly to its individual learners, whereas with the divergence loss there is a significant gain in the ensemble performance of ABE-M compared to its individual learners.

We also calculate the cosine similarity between positive, negative, and self pairs and plot it in Fig. 5. With the divergence loss (Fig. 5(a)), the learners learn diverse embedding functions, which leads to a decrease in the cosine similarity of self pairs. Without the divergence loss (Fig. 5(b)), all learners converge to very similar embedding functions, so that the cosine similarity of self pairs is close to 1. This could be because all learners end up learning similar attention masks, which leads to similar embeddings for all of them.

We visualize the learned attention masks of ABE-8 on CARS-196 in Fig. 6. Due to space limitations, results from only three of the eight learners and three of the 480 channels are illustrated. The figure shows that different learners attend to different parts for the same channel. Qualitatively, our proposed loss successfully diversifies the attention masks produced by different learners. They attend to different parts of the car such as the upper part, the bottom part, the roof, the tires, the lights, and so on. In the 350th channel, for instance, learner 1 focuses on the bottom part of the car, learner 2 on the roof, and learner 3 on the upper part including the roof. At the bottom of Fig. 6, the mean of the attention masks across all channels shows that the learned embedding function focuses more on object areas than on the background.

Divergence Loss in M-heads. We show the results of experiments with the 8-heads ensemble and the divergence loss in Table 3. We can see that the divergence loss does not improve the performance of 8-heads. From Fig. 5(c), we notice that the cosine similarities of self pairs are already close to zero for M-heads, and Fig. 5(d) shows that the divergence loss does not affect the cosine similarity of self pairs significantly. As mentioned in Sect. 3.2, we hypothesize this is because each \(G_m(\cdot )\) can arbitrarily compose a different metric space in \(\mathcal {Y}\).

Table 3. Recall@K(%) comparison in M-heads ensemble with divergence loss \(L_\mathrm {div}\) on CARS-196

6.3 Ablation Study

To analyze the importance of various aspects of our model, we perform experiments with the ABE-8 model on the CARS-196 dataset, varying a few hyperparameters at a time while keeping the others fixed. (More ablation studies can be found in the supplementary material.)

Sensitivity to Depth of Attention Module. We demonstrate the effect of the depth of the attention module by changing the number of inception blocks in it. To be able to take the element-wise product of the attention mask with the input of the attention module, the dimensions of the attention mask must match the input dimensions of the attention module; for this reason, we remove all pooling layers in the attention module. Figure 7(a) shows Recall@1 while varying the number of inception blocks in the attention module from 1 (inception(4a)) to 7 (inception(4a) to inception(5b)) in GoogLeNet. We can see that the attention module with 5 inception blocks (inception(4a) to inception(4e)) performs best.

Fig. 7. Recall@1 while varying hyperparameters and architectures: (a) number of inception blocks used for attention module \(A_k(\cdot )\), (b) branching point of attention module, and (c) weight \(\lambda _\mathrm {div}\). Here, inception(3a) is abbreviated as in(3a)

Sensitivity to Branching Point of Attention Module. The branching point of the attention module is where we split the network between the spatial feature extractor \(S(\cdot )\) and the global feature embedding function \(G(\cdot )\). To analyze the choice of branching point, we keep the number of inception blocks in the attention module the same (i.e., 5) and change the branching point from pool2 to inception(4b). From Fig. 7, we see that pool3 performs best with our architecture.

We carry out this experiment with batch size 40 for all branching points. For the ABE-M model, the memory requirement of \(G(\cdot )\) is M times that of an individual learner. Since an earlier branching point increases the depth of \(G(\cdot )\) while decreasing the depth of \(S(\cdot )\), it consequently increases the memory requirement of the whole network. Due to the memory constraints of the GPU, we start the experiments from the branching point pool2 and adjust the batch size accordingly.

Sensitivity to \(\lambda _\mathrm {div}\). Figure 7 shows the effect of \(\lambda _\mathrm {div}\) on Recall@K for the ABE-M model. We can see that \(\lambda _\mathrm {div}=1\) performs best and that lower values degrade the performance quickly.

6.4 Comparison with State of the Art

We compare the results of our approach with current state-of-the-art techniques. Our model performs best on all the major benchmarks for image retrieval. Tables 4, 6 and 7 compare the results with previous methods such as LiftedStruct [29], HDC [39], Margin [38], BIER [22], and A-BIER [22] on the CARS-196 [13], CUB-200-2011 [35], SOP [29], and in-shop clothes retrieval [18] datasets. Results on the cropped datasets are listed in Table 5.

Table 4. Recall@K(%) score on CUB-200-2011 and CARS-196
Table 5. Recall@K(%) score on CUB-200-2011 (cropped) and CARS-196 (cropped)
Table 6. Recall@K(%) score on Stanford online products dataset (SOP)
Table 7. Recall@K(%) score on in-shop clothes retrieval dataset

7 Conclusion

In this work, we present a new framework for ensembles in the domain of deep metric learning. It uses an attention-based architecture that attends to parts of the image, and we use multiple such attention-based learners for our ensemble. Since an ensemble benefits from diverse learners, we further introduce a divergence loss to diversify the feature embeddings learned by each learner. The divergence loss encourages the attended parts of the image to differ across learners. Experimental results demonstrate that the divergence loss not only increases the performance of the ensemble but also increases each individual learner's performance compared to the baseline. We demonstrate that our method outperforms current state-of-the-art techniques by a significant margin on several image retrieval benchmarks including the CARS-196 [13], CUB-200-2011 [35], SOP [29], and in-shop clothes retrieval [18] datasets.