1 Introduction

Deep metric learning aims to learn a similarity or distance metric that enjoys a small intra-class variation and a large inter-class variation [41]. Triplet loss is a popular loss function for deep metric learning and has achieved great success in many computer vision tasks, such as fine-grained image classification [38], image retrieval [16, 21], person re-identification [5, 13], and face recognition [30, 33]. Recently, deep metric learning approaches employing triplet loss have attracted a lot of attention due to their efficiency in dealing with an enormous number of labels, e.g., the extreme multi-label classification problem [31]. More specifically, for conventional classification approaches, the number of parameters increases linearly with the number of labels, and it is impractical to learn an N-way softmax classifier with millions of labels [28]. With triplet loss, however, deep metric learning can efficiently handle an extreme multi-label classification problem by learning a compact embedding, which is known as large margin nearest neighbor (LMNN) classification [41]. As a result, deep metric learning exploiting triplet loss is very efficient for applications with an enormous number of labels, e.g., the number of objects in image retrieval [16], or the number of identities in face recognition [33] and person re-identification [13].

To learn a discriminative feature embedding, triplet loss maximizes the margin between the intra-class distance and the inter-class distance. As a result, for each triplet \((x^a, x^p, x^n)\), where \(x^a\) is called the anchor point, \(x^p\) is called the positive point (having the same label as \(x^a\)), and \(x^n\) is called the negative point (having a different label), the intra-class distance \(d(x^a, x^p)\) will be smaller than the inter-class distance \(d(x^a, x^n)\) in the learned embedding space. As the number of triplets grows cubically with the size of the training data, triplet selection is thus indispensable for efficient training with triplet loss. Specifically, triplet selection usually works in an online manner, i.e., triplets are constructed within each mini-batch [33]; we describe a typical pipeline of deep metric learning using triplet loss in Fig. 1.

Fig. 1. The pipeline of triplet-loss-based deep metric learning. In the first stage, a mini-batch is sampled from the training data, usually containing k identities with several images per identity. A deep neural network is then used to learn a feature embedding, e.g., a 128-D feature vector. In the third stage, a subset of triplets is selected using a triplet selection method. Lastly, the loss is evaluated on the selected triplets.

However, the performance of triplet loss is heavily influenced by the triplet selection method [5, 13]: training with randomly selected triplets often fails to converge, while training with only the hardest triplets often leads to a bad local solution [33]. To ensure fast convergence, it is crucial to select “good” hard triplets [33], and a variety of triplet selection methods have been designed for different applications [13, 16, 33, 38]. Although selecting hard triplets leads to fast convergence, it risks introducing a selection bias, which is a fundamental problem for learning. A triplet selection method thus needs to balance the trade-off between mining hard triplets and introducing selection bias. Instead of struggling with this trade-off through careful triplet selection, we address the problem by directly minimizing the selection bias. More specifically, let \(D_\mathcal {T}\) denote all possible triplets and \(D_\mathcal {S}\) denote the subset of triplets selected from \(D_\mathcal {T}\). If the triplet selection is unbiased, \(D_\mathcal {S}\) and \(D_\mathcal {T}\) share the same distribution. Otherwise, we can correct the bias in triplet selection by minimizing the distribution shift between \(D_\mathcal {S}\) and \(D_\mathcal {T}\).

Fig. 2. An example illustrating the distribution shift in triplet selection. In online triplet selection, all triplets \(D_\mathcal {T}\) are constructed from each mini-batch and induce a dataset \(\hat{D}_\mathcal {T}\). The selected triplets \(D_\mathcal {S}\) similarly induce a dataset \(\hat{D}_\mathcal {S}\). We evaluate the distribution shift between \(D_\mathcal {S}\) and \(D_\mathcal {T}\) using the distribution shift between \(\hat{D}_\mathcal {S}\) and \(\hat{D}_\mathcal {T}\).

The problem of distribution shift falls within the scope of domain adaptation [3, 15], which arises when a predictor is learned on a source domain \(\mathcal {S}\) but deployed on a different target domain \(\mathcal {T}\). In learning with triplet loss, the model is trained using the selected triplets \(D_\mathcal {S}\), while the target is to learn a model for all possible triplets \(D_\mathcal {T}\). To measure the distribution shift between \(D_\mathcal {S}\) and \(D_\mathcal {T}\), we define a set of triplet-induced data: given a set of triplets, e.g., \(D_\mathcal {S}\), the triplet-induced data \(\hat{D}_{\mathcal {S}}\) is defined as follows:

$$\begin{aligned} \hat{D}_{\mathcal {S}} = \{(x^a_i, y^a_i),(x^p_i, y^p_i),(x^n_i, y^n_i)|~\forall (x^a_i, x^p_i, x^n_i) \in D_{\mathcal {S}}\}, \end{aligned}$$
(1)

where each \(y_i\) is the label of the corresponding \(x_i\). The induced data \(\hat{D}_{\mathcal {T}}\) is defined similarly. We give an example of the distribution shift between \(D_\mathcal {S}\) and \(D_\mathcal {T}\) in Fig. 2. Distribution matching approaches, which learn a domain-invariant representation, have been widely employed to deal with distribution shift [2, 3, 29]. Because triplet loss often involves a large number of labels, and inspired by the methods in [10, 47], we minimize the distribution shift between \(\hat{D}_\mathcal {S}\) and \(\hat{D}_\mathcal {T}\) by learning a conditional invariant representation \(\varPhi (X)\), i.e., \( P^{\mathcal {S}}(\varPhi (X)|Y) \approx P^{\mathcal {T}}(\varPhi (X)|Y)\), where X and Y denote the random variables for data and label, respectively. More specifically, we propose a distribution matching loss employing Maximum Mean Discrepancy (MMD) [15], which measures the difference between \(P^{\mathcal {S}}(\varPhi (X)|Y)\) and \(P^{\mathcal {T}}(\varPhi (X)|Y)\). As a result, we learn a discriminative and conditional invariant embedding by jointly training with the triplet loss and the distribution matching loss.

In this paper, we first introduce the problem of triplet selection bias in learning with triplet loss. We then address this problem by reducing the distribution shift between the triplet-induced data \(\hat{D}_{\mathcal {S}}\) and \(\hat{D}_{\mathcal {T}}\). As the proposed distribution matching loss adaptively corrects the distribution shift, we refer to this new variant of triplet loss as the adapted triplet loss. Lastly, we conduct a number of experiments on MNIST [22] and Fashion-MNIST [44] for image classification, and on CARS196 [19], CUB200-2011 [37], and Stanford Online Products [28] for image retrieval. The experimental results demonstrate the effectiveness of the proposed method.

2 Related Work

Deep Metric Learning and Triplet Loss. Many problems in machine learning and computer vision depend heavily on learning a distance metric [41]. Inspired by the great success of deep learning [20], deep neural networks have been widely used to learn a discriminative feature embedding [14, 38]. Deep metric learning employing triplet loss attracted a lot of attention due to its impressive performance in FaceNet [33] for face verification and recognition. Since then, triplet loss has been widely used to learn discriminative embeddings for a variety of applications, such as image classification [38] and image retrieval [11, 16, 21, 46, 48]. The majority of applications of triplet loss lie in visual object recognition, such as action recognition [32], vehicle recognition [25], place recognition [1], 3D pose recognition [42], face recognition [8, 30, 33], and person re-identification [4, 5, 9, 13, 24, 45].

Triplet Selection Methods. Triplet selection is key to the success of triplet loss, and a variety of triplet selection methods have been used in different applications [6, 14, 30, 33, 38, 39]. More specifically, in the deep ranking model proposed by [38], triplets are selected according to a pair-wise relevance score. In [39], the top k triplets in each mini-batch are selected based on the margin \(d(x^a, x^p) - d(x^a, x^n)\). The method in [14] selects only hard triplets, i.e., \(d(x^a, x^p) < d(x^a, x^n)\), while [30, 33] select semi-hard triplets that violate the triplet constraint, i.e., \(d(x^a, x^p) + \alpha < d(x^a, x^n)\), where \(\alpha \) is a positive scalar. Unlike [33], which defines semi-hard triplets using moderate negatives, [34] selects semi-hard triplets based on moderate positives. [6] proposes an online hard negative mining method for triplet selection to boost the performance of triplet loss. [13] proposes a batch-hard triplet selection method, which first selects a set of hard anchor-positive pairs and then selects the hardest negatives within the mini-batch. Recently, [43] proposed a weighted sampling method to address the importance of sampling in deep metric learning.

Domain Adaptation. Domain adaptation methods can be divided into four categories according to their assumptions about how the distribution shifts across domains. (1) Covariate shift [15] assumes that the marginal distribution P(X) changes across domains while the conditional distribution P(Y|X) stays the same. (2) Model shift [40] assumes that both P(X) and P(Y|X) independently change across domains. (3) Target shift [47] assumes that the marginal distribution P(Y) shifts while P(X|Y) stays the same. (4) Generalized target shift [10, 23, 26] assumes that both P(Y) and P(X|Y) independently change. Since triplet loss is widely used for extreme multi-label classification problems, we model the triplet selection bias by the change of P(X|Y) in this paper.

3 Formulation

In this section, we first introduce triplet loss for deep metric learning and a widely used triplet selection method, i.e., semi-hard triplet selection [33]. We then formulate the problem of triplet selection bias as a distribution shift on triplet-induced data. To minimize the distribution shift, we propose a distribution matching loss, which works jointly with the triplet loss to adaptively correct the distribution shift. As a result, we refer to this new triplet loss as the adapted triplet loss.

3.1 Triplet Loss for Deep Metric Learning

Let X, Y denote two random variables, indicating data and label, respectively. Let D denote a set of training data sampled from P(X, Y), i.e., \(D = \{ (x_i, y_i)|~(x_i, y_i) \sim P(X,Y) \}\). Metric learning aims to learn a distance function that assigns a small (or large) distance to a pair of similar (or dissimilar) examples. A widely used distance metric, i.e., the Mahalanobis distance, is defined as follows:

$$\begin{aligned} d_K^2(x_i, x_j) = (x_i - x_j)^{\top } K (x_i - x_j), \end{aligned}$$
(2)

where K is a symmetric positive semi-definite matrix. As K can be decomposed as \(K = L^\top L\), we then have

$$\begin{aligned} d_K^2(x_i, x_j) = \Vert L(x_i-x_j) \Vert _2^2 = \Vert x'_i-x'_j \Vert _2^2, \end{aligned}$$
(3)

where \(x'_i = Lx_i\) and \(x'_j=Lx_j\). Inspired by this, deep metric learning uses deep neural networks to learn a feature embedding \(x' = \varPhi (x)\), which generalizes the linear transformation \(x' = Lx\) to a non-linear transformation \(\varPhi (x)\). That is, the learned distance metric is

$$\begin{aligned} d^2_K(x_i, x_j) = ||\varPhi (x_i) - \varPhi (x_j)||_2^2. \end{aligned}$$
(4)

To learn a discriminative feature embedding \(\varPhi (x)\), i.e., one whose intra-class distances are smaller than its inter-class distances [41], triplet loss is defined as follows:

$$\begin{aligned} \mathcal {L}_{triplet}^* = \sum \limits _{(x^a, x^p, x^n)\in D_\mathcal {T}} [d_K^2(x^a, x^p) - d_K^2(x^a, x^n) + \alpha ]_+, \end{aligned}$$
(5)

where \([\cdot ]_+ = \max (0, \cdot )\), \(\alpha \ge 0\) is the margin, and \(D_\mathcal {T}\) is a set of triplets constructed from the original training data D, i.e.,

$$\begin{aligned} D_\mathcal {T} = \{(x^a, x^p, x^n) |~y^a = y^p~\text {and}~y^a \not = y^n\}. \end{aligned}$$
(6)
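For concreteness, the following is a minimal NumPy sketch of Eq. (5); the embedding matrix `emb` (whose rows are \(\varPhi(x_i)\)) and the triplet index arrays are hypothetical placeholders for whatever the network and the triplet construction step produce.

```python
import numpy as np

def triplet_loss(emb, anchors, positives, negatives, alpha=0.2):
    """Eq. (5): hinge loss on the squared-distance margin over a set of triplets.

    emb: (N, d) array of embeddings Phi(x); anchors/positives/negatives are
    equal-length integer index arrays defining the triplets.
    """
    d_ap = np.sum((emb[anchors] - emb[positives]) ** 2, axis=1)  # d_K^2(x^a, x^p)
    d_an = np.sum((emb[anchors] - emb[negatives]) ** 2, axis=1)  # d_K^2(x^a, x^n)
    return np.sum(np.maximum(0.0, d_ap - d_an + alpha))
```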

3.2 Triplet Selection Bias

Let \(D_{\mathcal {T}}\) denote all possible triplets constructed within a mini-batch and \(D_{\mathcal {S}}\) denote the subset of selected triplets, i.e., \(D_{\mathcal {S}} \subseteq D_{\mathcal {T}}\). More specifically, given a mini-batch of training data with k identities and c images per identity, there will be \(k(k-1)c^2(c-1)\) possible triplets in total. As the number of triplets grows cubically with the size of the training data, triplet loss is usually evaluated using only a selected subset of all triplets. A typical triplet selection method, used in [33] and referred to as semi-hard triplet selection, can be described as follows: it uses all possible anchor-positive pairs, i.e., \(kc(c-1)\) pairs in total, and for each anchor-positive pair \((x^a, x^p)\), a semi-hard negative \(x^n\) is randomly selected from all negatives under the constraint

$$\begin{aligned} d_K^2(x^a, x^p) \le d_K^2(x^a, x^n) < d_K^2(x^a, x^p) + \alpha . \end{aligned}$$
(7)

That is, triplet loss is evaluated on \(D_{\mathcal {S}}\), i.e.,

$$\begin{aligned} \mathcal {L}_{triplet} = \sum \limits _{(x^a,x^p,x^n)\in D_\mathcal {S}} [d_K^2(x^a, x^p) - d_K^2(x^a, x^n) + \alpha ]_+. \end{aligned}$$
(8)
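A minimal sketch of this semi-hard selection rule (Eq. (7)), assuming the mini-batch embeddings `emb` and integer labels `labels` are given; it enumerates all anchor-positive pairs and randomly picks one negative satisfying the constraint, mirroring the procedure described above.

```python
import numpy as np

def select_semihard_triplets(emb, labels, alpha=0.2, rng=np.random):
    """Return (a, p, n) index triplets satisfying Eq. (7) within a mini-batch."""
    d2 = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=2)  # pairwise d_K^2
    triplets = []
    for a in range(len(labels)):
        for p in np.where(labels == labels[a])[0]:
            if p == a:
                continue
            negs = np.where(labels != labels[a])[0]
            # semi-hard: d^2(a, p) <= d^2(a, n) < d^2(a, p) + alpha
            ok = negs[(d2[a, negs] >= d2[a, p]) & (d2[a, negs] < d2[a, p] + alpha)]
            if len(ok) > 0:
                triplets.append((a, p, rng.choice(ok)))
    return triplets
```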

As a result, there will always be a distribution shift between \(D_{\mathcal {S}}\) and \(D_{\mathcal {T}}\). To measure this shift, we define the triplet-induced data \(\hat{D}_{\mathcal {S}}\) for \(D_{\mathcal {S}}\) as follows:

$$\begin{aligned} \hat{D}_{\mathcal {S}} = \{(x^a, y^a),(x^p, y^p),(x^n, y^n)|~\forall (x^a, x^p, x^n) \in D_{\mathcal {S}}\}. \end{aligned}$$
(9)

Similarly, we define \(\hat{D}_{\mathcal {T}}\) as the data induced by \(D_{\mathcal {T}}\). If \(D_{\mathcal {S}}\) and \(D_{\mathcal {T}}\) share the same distribution \(P(x^a, x^p, x^n)\), we then have, \(\forall x \in \hat{D}_{\mathcal {S}}\),

$$\begin{aligned} \begin{aligned} P^{\mathcal {S}}(x)&= \sum \limits _{i \in \{a,p,n\}} P(x^i)*\mathbf {1}\{x=x^i\} = P^{\mathcal {T}}(x), \end{aligned} \end{aligned}$$
(10)

where \(P^{\mathcal {S}}(x)\) and \(P^{\mathcal {T}}(x)\) are the probability density functions of \(\hat{D}_{\mathcal {S}}\) and \(\hat{D}_{\mathcal {T}}\), respectively. That is, there will be no distribution shift between the triplet-induced data \(\hat{D}_{\mathcal {S}}\) and \(\hat{D}_{\mathcal {T}}\). As a result, we use the difference between \(\hat{D}_{\mathcal {S}}\) and \(\hat{D}_{\mathcal {T}}\) as a measure of the distribution shift in triplet loss.
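As a small illustration of Eq. (9), a hypothetical helper that flattens triplets back into labeled examples; duplicates are intentionally kept, since repeated points shape the induced distribution.

```python
def induced_data(triplets, labels):
    """Eq. (9): flatten triplets (a, p, n) into labeled points (index, label)."""
    return [(i, labels[i]) for (a, p, n) in triplets for i in (a, p, n)]
```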

Fig. 3. An example illustrating the conditional invariant representation. There is a distribution shift between the source domain and the target domain in the input space, i.e., \(P^{\mathcal {S}}(X|Y) \not = P^{\mathcal {T}}(X|Y)\). By learning a conditional invariant representation \(\varPhi (x)\), the source and target domains share a similar distribution in the embedding space, i.e., \(P^{\mathcal {S}}(\varPhi (X)|Y)=P^{\mathcal {T}}(\varPhi (X)|Y)\). That is, the embedding \(\varPhi (x)\) generalizes well to the target domain even though it is learned on the source domain. Intuitively, the source domain consists of the selected triplets \(D_{\mathcal {S}}\) while the target domain consists of all triplets \(D_{\mathcal {T}}\); that is, we learn a conditional invariant embedding using the selected triplets, and it generalizes well to all triplets.

3.3 Adapted Triplet Loss

To correct the triplet selection bias, we minimize the distribution shift between \(\hat{D}_{\mathcal {S}}\) and \(\hat{D}_{\mathcal {T}}\) while learning the representation. Let \(x \in X\) denote an input data point and \(\varPhi (x)\) denote the representation learned by a deep neural network, i.e., a fixed-dimensional feature vector. Inspired by [10, 47], we learn a shared conditional invariant representation between \(\hat{D}_{\mathcal {S}}\) and \(\hat{D}_\mathcal {T}\), i.e.,

$$\begin{aligned} P^{\mathcal {S}}(\varPhi (X)|Y) = P^{\mathcal {T}}(\varPhi (X)|Y). \end{aligned}$$
(11)

See Fig. 3 for an example of the conditional invariant representation. Maximum Mean Discrepancy (MMD) has been widely used to estimate the difference between two distributions [15]; we thus use conditional mean feature embeddings to estimate the difference between \(P^{\mathcal {S}}(\varPhi (X)|Y)\) and \(P^{\mathcal {T}}(\varPhi (X)|Y)\). As a result, the distribution matching loss can be defined as follows:

$$\begin{aligned} \mathcal {L}_{match} = \sum \limits _{y} \Vert \varPhi _y^\mathcal {S} - \varPhi _y^\mathcal {T} \Vert ^2_2, \end{aligned}$$
(12)

where \(\varPhi _y^\mathcal {S}\) and \(\varPhi _y^\mathcal {T}\) are class-specific mean feature embeddings on \(\hat{D}_{\mathcal {S}}\) and \(\hat{D}_{\mathcal {T}}\) respectively, i.e.,

$$\begin{aligned} \varPhi _y^\mathcal {S} = \sum \limits _{(X,Y=y) \in \hat{D}_{\mathcal {S}}} P^{\mathcal {S}}(\varPhi (X)|Y) * \varPhi (X) \end{aligned}$$
(13)

and

$$\begin{aligned} \varPhi _y^\mathcal {T} = \sum \limits _{(X,Y=y) \in \hat{D}_{\mathcal {T}}} P^{\mathcal {T}}(\varPhi (X)|Y) * \varPhi (X). \end{aligned}$$
(14)

To correct the distribution shift in learning with triplet loss, we thus learn a discriminative and conditional invariant feature embedding by jointly minimizing the triplet loss as well as the distribution matching loss, i.e.,

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{triplet} + \lambda * \mathcal {L}_{match}, \end{aligned}$$
(15)

where \(\lambda \) is a trade-off parameter. Considering that this new variant of triplet loss adaptively corrects the triplet selection bias, we refer to it as the adapted triplet loss.
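A minimal sketch of Eqs. (12)-(15), under the assumption that the conditional probabilities in Eqs. (13)-(14) are estimated by empirical averages over the induced data (index arrays with repetition, e.g., the indices produced by the `induced_data` sketch above):

```python
import numpy as np

def match_loss(emb, labels, sel_idx, all_idx):
    """Eq. (12): sum over classes of squared distances between class means."""
    sel_idx, all_idx = np.asarray(sel_idx), np.asarray(all_idx)
    loss = 0.0
    for y in np.unique(labels[all_idx]):
        s = sel_idx[labels[sel_idx] == y]
        t = all_idx[labels[all_idx] == y]
        if len(s) == 0:
            continue  # class absent from the selected triplets
        phi_s = emb[s].mean(axis=0)  # empirical Phi_y^S, Eq. (13)
        phi_t = emb[t].mean(axis=0)  # empirical Phi_y^T, Eq. (14)
        loss += np.sum((phi_s - phi_t) ** 2)
    return loss

# Eq. (15), with lam standing in for the trade-off parameter lambda:
# total = triplet_loss(emb, a, p, n) + lam * match_loss(emb, labels, sel_idx, all_idx)
```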

3.4 Semi-supervised Adapted Triplet Loss

Unlabeled data are usually very helpful for domain adaptation. We believe that unlabeled data can also help correct the triplet selection bias. To scale the adapted triplet loss to large amounts of unlabeled data, we extend it to the semi-supervised setting.

Given a set of labeled data \(D_1\) and a set of unlabeled data \(D_2\), let \(D_{\mathcal {T}_1}\) denote all triplets constructed from \(D_1\) and \(D_\mathcal {S}\) denote the subset of selected triplets, i.e., \(D_{\mathcal {S}} \subseteq D_{\mathcal {T}_1}\). Let \(D_{\mathcal {T}_2}\) be the latent triplets constructed from the unlabeled data \(D_2\), which are unavailable since we do not know the latent labels of \(D_2\). Different from the supervised setting, in which we learn a conditional invariant representation between \(D_{\mathcal {S}}\) and \(D_{\mathcal {T}_1}\), we consider how to learn a conditional invariant representation across \(D_{\mathcal {S}}\), \(D_{\mathcal {T}_1}\), and \(D_{\mathcal {T}_2}\), i.e.,

$$\begin{aligned} P^{\mathcal {S}}(\varPhi (X)|Y) = P^{\mathcal {T}_1}(\varPhi (X)|Y) = P^{\mathcal {T}_2}(\varPhi (X)|Y). \end{aligned}$$
(16)

Given the target \(P^{\mathcal {S}}(\varPhi (X)|Y) = P^{\mathcal {T}_2}(\varPhi (X)|Y)\), we then have

$$\begin{aligned} {\begin{matrix} \sum _{y} P^{\mathcal {T}_2}(\varPhi (X)|Y) P^{\mathcal {T}_2}(Y) = \sum _{y} P^{\mathcal {S}}(\varPhi (X)|Y) P^{\mathcal {T}_2}(Y). \end{matrix}} \end{aligned}$$
(17)

That is, if we know the class ratio \(P^{\mathcal {T}_2}(Y)\) of the triplet-induced data \(\hat{D}_{\mathcal {T}_2}\), we are able to estimate the difference between \(P^{\mathcal {S}}(\varPhi (X)|Y)\) and \(P^{\mathcal {T}_2}(\varPhi (X)|Y)\). Inspired by [17], we estimate the class ratio \(P^{\mathcal {T}_2}(Y)\) by solving an optimization problem, i.e.,

$$\begin{aligned} \theta ^{\mathcal {T}_2} = \underset{\theta }{\arg \min } \Vert \sum _y \theta _y^{\mathcal {T}_2}*\varPhi _y^\mathcal {S} - \varPhi ^{\mathcal {T}_2} \Vert _2^2,~s.t. \sum _y \theta _y = 1, \end{aligned}$$
(18)

where \(\varPhi ^{\mathcal {T}_2} = \mathbb {E}_{P_X^{\mathcal {T}_2}}\left[ \varPhi (X)\right] \) and \(\theta ^{\mathcal {T}_2}_y=P^{\mathcal {T}_2}(Y=y)\). This optimization problem has the closed-form solution

$$\begin{aligned} \theta ^{\mathcal {T}_2}_{1:|Y|-1} = (A^\top A)^{-1} A^\top B,~\text {and}~\theta ^{\mathcal {T}_2}_0 = 1 - \sum \limits _{y=1}^{|Y|-1} \theta ^{\mathcal {T}_2}_y, \end{aligned}$$
(19)

where \(A = [\varPhi ^\mathcal {S}_1 - \varPhi ^\mathcal {S}_0, \dots , \varPhi ^\mathcal {S}_{|Y|-1} - \varPhi ^\mathcal {S}_0]\) and \(B=\varPhi ^{\mathcal {T}_2} - \varPhi ^\mathcal {S}_0\), with \(\varPhi ^\mathcal {S}_0\) subtracted to absorb the sum-to-one constraint. We can then define the distribution matching loss in a semi-supervised manner, i.e.,

$$\begin{aligned} \mathcal {L}_{semi\text {-}match} = \Vert \sum _y \theta _y^{\mathcal {T}_2}*\varPhi _y^\mathcal {S} - \varPhi ^{\mathcal {T}_2} \Vert _2^2. \end{aligned}$$
(20)
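A sketch of Eqs. (18)-(20), assuming `phi_s` is a (|Y|, d) array of class mean embeddings \(\varPhi _y^\mathcal {S}\) on the selected triplets and `phi_t2` is the mean embedding \(\varPhi ^{\mathcal {T}_2}\) of the unlabeled data; as noted above, the sum-to-one constraint is absorbed by subtracting \(\varPhi ^\mathcal {S}_0\) before solving the least-squares problem.

```python
import numpy as np

def estimate_class_ratio(phi_s, phi_t2):
    """Eqs. (18)-(19): least-squares estimate of theta with sum-to-one constraint."""
    A = (phi_s[1:] - phi_s[0]).T            # d x (|Y|-1), columns Phi^S_y - Phi^S_0
    b = phi_t2 - phi_s[0]                   # absorb theta_0 = 1 - sum_y theta_y
    theta_rest, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.concatenate([[1.0 - theta_rest.sum()], theta_rest])

def semi_match_loss(phi_s, phi_t2):
    """Eq. (20): match theta-weighted class means to the unlabeled mean embedding."""
    theta = estimate_class_ratio(phi_s, phi_t2)
    return np.sum((theta @ phi_s - phi_t2) ** 2)
```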

4 Experiment

In this section, we evaluate the proposed adapted triplet loss on image classification and retrieval. For image classification, we use the MNIST [22] and Fashion-MNIST [44] datasets. The MNIST dataset contains 60,000 training examples and 10,000 test examples, in which all examples are \(28\times 28\) grayscale images of handwritten digits. The Fashion-MNIST dataset contains \(28\times 28\) grayscale article images and shares the same structure as the MNIST dataset, i.e., 60,000 training examples and 10,000 test examples. For image retrieval, we use three popular datasets: CARS196 [19], CUB200-2011 [37], and Stanford Online Products [28]. The CARS196 dataset contains 16,185 images of 196 different car models, the CUB200-2011 dataset contains 11,788 images of 200 different species of birds, and the Stanford Online Products dataset contains 120,053 images of 22,634 different products.

4.1 Implementation Details

We implement the proposed method using Caffe [18]. Following [33], we always use an L2 normalization layer before the triplet loss layer. We use the margin \(\alpha =0.2\) in all experiments. We train our models using the stochastic gradient descent (SGD) algorithm with momentum 0.9 and weight decay 2e−5. For experiments on the MNIST and Fashion-MNIST datasets, we learn 64-D feature embeddings using a modified LeNet [22]. More specifically, we use \(3\times 3\) filters in all convolutional layers and replace all activation layers with PReLU [12] layers. The batch size is set to 256, which is large enough for both MNIST and Fashion-MNIST, i.e., we are able to select enough triplets from each mini-batch. We use a learning rate of 0.001, and the maximum numbers of iterations are set to 20k and 50k for MNIST and Fashion-MNIST, respectively.

For experiments on CARS196, CUB200-2011, and Stanford Online Products, we use GoogLeNet [36] as our base network, and all layers except the last fully connected layer are initialized from the model trained on ImageNet [7]. The last fully connected layer is changed to learn 128-D feature embeddings and is initialized with random weights. All training images are resized to \(256\times 256\) and randomly cropped to \(224\times 224\). We use a learning rate of 0.0005 with a batch size of 120, and the maximum numbers of training iterations are set to 15k on CARS196, 20k on CUB200-2011, and 50k on Stanford Online Products. To ensure enough triplets in each mini-batch, we prepare the training data using a method similar to [33], i.e., each mini-batch is randomly sampled from 20 classes with 6 images per class.

4.2 Experiment on Image Classification

In this subsection, we describe the experimental results on the MNIST and Fashion-MNIST datasets. To demonstrate the effectiveness of the proposed method, we compare the classification accuracy of models trained using the original triplet loss function (baseline) and the adapted triplet loss function. The evaluation protocol is as follows: we train our models with the original triplet loss function and the adapted triplet loss function, respectively, to learn a fixed-dimensional feature embedding \(\varPhi (x)\).

Fig. 4. Results on the MNIST dataset. In figure (a), we use \(\lambda =2.0\) for the adapted triplet loss and compare its performance with the original triplet loss every 100 iterations. In figure (b), we compare the test accuracy for different values of \(\lambda \).

For testing, we first evaluate the conditional mean embedding \(\mathbb {E}\left[ \varPhi (x)|y\right] \), i.e., the mean point in the embedding space, for each class y using the training data. For each input x in the test set, we then assign it to the class \(\hat{y}\) with the nearest mean point, i.e.,

$$\begin{aligned} \hat{y} = \underset{y}{\arg \min }~\Vert \varPhi (x) - \mathbb {E}[\varPhi (x)|y] \Vert _2^2. \end{aligned}$$
(21)
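A sketch of this nearest-class-mean evaluation (Eq. (21)), assuming precomputed embeddings and labels for the training and test sets:

```python
import numpy as np

def nearest_mean_accuracy(train_emb, train_labels, test_emb, test_labels):
    """Classify each test point by the nearest class mean embedding, Eq. (21)."""
    classes = np.unique(train_labels)
    means = np.stack([train_emb[train_labels == y].mean(axis=0) for y in classes])
    d2 = np.sum((test_emb[:, None, :] - means[None, :, :]) ** 2, axis=2)
    pred = classes[np.argmin(d2, axis=1)]
    return np.mean(pred == test_labels)
```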

We demonstrate the results on the MNIST dataset in Fig. 4. In figure (a), the adapted triplet loss brings an improvement after 5k iterations. A possible explanation for this improvement is as follows: with the original triplet loss, the gradient might be dominated by noisy triplets or by hard triplets from a few specific classes, while the distribution matching loss adaptively corrects the triplet selection bias between the selected triplets and all possible triplets. That is, the adapted triplet loss generates more balanced gradients in each iteration. Another reason is that the distribution matching loss acts as a regularizer for the original triplet loss, reducing the risk of overfitting. In figure (b), we evaluate the performance of the adapted triplet loss for different loss weights, i.e., \(\lambda = 0,0.1,0.5,1.0,2.0,5.0\). Specifically, \(\lambda =0\) corresponds to the original triplet loss, which is a special case of the adapted triplet loss. We find that a trade-off on \(\lambda \) is required when using the adapted triplet loss to learn a discriminative and conditional invariant embedding. Furthermore, we demonstrate similar results on Fashion-MNIST in Fig. 5.

Fig. 5. Results on the Fashion-MNIST dataset. In figure (a), we use \(\lambda =2.0\). For the original triplet loss, the test accuracy drops after 40k iterations, while the adapted triplet loss does not suffer from this overfitting problem.

To visualize the feature embeddings learned by the original triplet loss and the adapted triplet loss, we use t-SNE [27], which has been widely used for visualizing high-dimensional data, to map the embeddings into 2D space. In Fig. 6, we show the embeddings learned by the adapted triplet loss. Compared with the embedding learned by the original triplet loss, the embedding learned by the adapted triplet loss forms uniform margins between different classes, while the embedding learned by the original triplet loss fails to keep clear margins between some classes.

Fig. 6. An example visualization of the feature embeddings learned by the adapted triplet loss and the baseline model, i.e., the original triplet loss. We use the model trained on the training set of MNIST and show the learned embeddings on the test set. For the embedding learned by the original triplet loss, the margin between the two classes in the dashed-rectangle area is not large enough.

Table 1. Recall rate on CARS196, CUB200-2011, and Stanford Online Products datasets. For the adapted triplet loss, we train multiple models on all datasets using different \(\lambda \), i.e., we use \(\lambda =0.001,0.005,0.01,0.05,0.1\) on CARS196, \(\lambda =0.005,0.01,0.1,0.5\) on CUB200-2011, and \(\lambda =0.01,0.05,0.1\) on Stanford Online Products. For the original triplet loss, we use the adapted triplet loss with \(\lambda =0\).
Fig. 7. Retrieval results on CARS196 and CUB200-2011. The first column shows the query image. For each query image, the first row contains the 10 nearest neighbors for the original triplet loss; the second row contains the 10 nearest neighbors for the adapted triplet loss. We highlight false positive examples with a white/black cross (best viewed in color).

4.3 Experiment on Image Retrieval

In this subsection, we evaluate the adapted triplet loss on image retrieval. For the CARS196, CUB200-2011, and Stanford Online Products datasets, we use train/test splits similar to [28]. More specifically, for the CARS196 dataset, we use all 8,054 images from the first 98 classes as training data and the rest as test data (8,131 images); for the CUB200-2011 dataset, we use the data from the first 100 classes as training data (5,864 images) and the remaining 5,924 images for testing; for the Stanford Online Products dataset, we use the standard train/test split provided with the dataset, i.e., 59,551 images of the first 11,318 classes for training and the remaining 60,502 images of the other 11,316 classes for testing.

For all experiments on image retrieval, we use the standard Recall@K metric, i.e., the same protocol used in [28]. More specifically, the Recall@K metric can be described as follows: given a query image, we retrieve its K nearest neighbors; if at least one neighbor has the same label as the query image, the recall for that query is 1, and otherwise it is 0. We then report the mean recall over all query images. For the CARS196, CUB200-2011, and Stanford Online Products datasets, we train all models using only the training split and use all test images as query images to evaluate the recall rate.
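A sketch of this Recall@K protocol, assuming every test image is queried in turn against the remaining test images:

```python
import numpy as np

def recall_at_k(emb, labels, k):
    """Recall@K: fraction of queries with >= 1 same-label image among K nearest."""
    d2 = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=2)
    np.fill_diagonal(d2, np.inf)             # exclude the query itself
    knn = np.argsort(d2, axis=1)[:, :k]      # K nearest neighbors per query
    hits = (labels[knn] == labels[:, None]).any(axis=1)
    return hits.mean()
```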

We report the recall rates on the CARS196, CUB200-2011, and Stanford Online Products datasets in Table 1. The adapted triplet loss outperforms the baseline for all values of K. The maximum improvement usually appears at \(K=1\), which is the most valuable setting for image retrieval. Another observation is that the adapted triplet loss usually recalls more positive neighbors. Furthermore, we show retrieval results on the CARS196 and CUB200-2011 datasets in Fig. 7. More specifically, we select four query images and show 10 retrieval results per query image for the adapted triplet loss and the original triplet loss, respectively.

5 Conclusion

In this paper, we address the problem of triplet selection bias for triplet loss using a domain adaptation method. We propose an adapted triplet loss, which adaptively corrects the selection bias of the original triplet loss. Considering that selection bias is common in deep metric learning, the proposed method can be extended to a variety of loss functions, e.g., pair-based [35], triplet-based [28], and quadruplet-based [4] loss functions, which will be the subject of future study.