1 Introduction

Image retrieval is an important computer vision task. It focuses on image similarity and is well suited to retrieving images from massive image databases. Content-based image retrieval (CBIR) [5,6,7] is currently the mainstream approach. The idea of CBIR is as follows. First, features are extracted from the query image. Next, these features are used to compute the similarity between the query image and the images in the database. The database images are then sorted in descending order of similarity. Finally, the result is used as feedback to further improve feature extraction. Feature extraction and similarity measurement are the two key steps in CBIR, and together they determine the accuracy and efficiency of an image retrieval method.
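The ranking step of this pipeline can be illustrated with a minimal sketch. It assumes that features are plain lists of floats and that similarity is measured by squared Euclidean distance; the function names `squared_l2` and `retrieve` and the toy features are hypothetical and only serve the illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve(query, database, top_k=5):
    """Return the ids of the top_k database images closest to the query.

    `database` maps an image id to its feature vector. A smaller distance
    means a higher similarity, so sorting by ascending distance is the same
    as sorting by descending similarity.
    """
    ranked = sorted(database, key=lambda image_id: squared_l2(query, database[image_id]))
    return ranked[:top_k]


# Toy 2-D features standing in for the output of a feature extractor.
database = {
    "a": [0.0, 0.0],
    "b": [1.0, 1.0],
    "c": [0.2, 0.1],
}
result = retrieve([0.1, 0.1], database, top_k=2)
```

In a real system the feature vectors would come from the trained network, and the sort would typically be replaced by an index structure for large databases.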

CBIR approaches are often trained by reducing image retrieval to an image classification problem and then taking an intermediate bottleneck layer as the feature representation used for retrieval. Although a classification loss increases the difference between classes, it cannot effectively reduce the intra-class distance. To address this, FaceNet [1] proposes a triplet loss that learns a Euclidean embedding, which brought great progress, especially in face recognition. The method uses the squared L2 distance as a measure of image similarity: the larger the distance between two images, the lower their similarity. The key is to enlarge the distance between images from different categories and to narrow the distance between images from the same category. However, a new problem arises: the training set has a large imbalance between easy examples and hard examples, where easy examples are images from different categories with clearly distinguishable deep features, while hard examples are images from confusable categories with similar deep features.

Many studies [2, 3] have shown that hard examples (samples that are difficult for a model to distinguish) are beneficial to network convergence: the network frequently misclassifies them, so they back-propagate a larger loss. Mining hard examples therefore plays an important role in model training. However, [1] selects training samples at random, which is inefficient for finding hard examples because their proportion is small. The straightforward way to choose hard examples is to traverse the entire dataset, but the complexity of this approach is too high for it to be applied directly. To tackle this issue, Hermans et al. [4] propose a variant of the triplet loss with a new hard example selection method that selects hard examples by traversing a batch of data. Although it reduces the randomness of sample selection, it only considers the images in each batch, and its complexity is still high. Mining hard examples therefore remains a challenge when training networks with triplet loss. In addition, triplet loss only uses the association information between images and does not make full use of the classification information.

To overcome these limitations of existing triplet loss networks, we propose a novel network for image retrieval called mixed triplet loss with hard examples feedback network (MHEF-TripNet). Unlike conventional triplet networks, our method introduces a sample selection probability matrix to select hard examples. The matrix is used to select, as the negative category, a category that differs from the known (anchor) category but is highly similar to it. After each iteration, we adjust the sample selection probability matrix according to the feedback from the test results, so that the matrix selects hard examples more accurately. We also propose a mixed loss function [17], which combines triplet loss with category loss to extract discriminative features. The two main contributions of the proposed method can be summarized as follows.

  • A sample selection probability matrix is introduced to choose hard example pairs. The probability matrix is updated after each iteration according to the test results of the model.

  • The proposed mixed triplet loss exploits the association information between images and the category information simultaneously, so that more distinctive features can be learned.

The remainder of this paper is organized as follows. Section 2 summarizes related work on image retrieval. Section 3 describes the formulation of the proposed MHEF-TripNet. Section 4 presents our experimental results and their analysis. Finally, Sect. 5 concludes this work.

2 Related Work

Image feature extraction is central to image retrieval. Feature extraction has advanced significantly with the rise of convolutional neural networks (CNNs), which have achieved great success in object recognition [8], object detection, image segmentation [9], natural language understanding [10] and other fields. Previous image retrieval approaches based on deep networks use a classification layer [11,12,13] trained over a set of known categories and then take an intermediate bottleneck layer as the representation used for retrieval. The downsides of this approach are its indirectness and its inefficiency: the bottleneck representation may not generalize well to new categories. Triplet loss was proposed to address this problem.

Triplet loss [1] is a metric learning method first introduced by Google along with FaceNet. The learning target of triplet loss is sketched in Fig. 1. A triplet consists of the following components: a sample randomly selected from the training set serves as the anchor, a second sample from the same class as the anchor is the positive, and a third sample from a different class is the negative. Before training, the relationship among the three may resemble the left part of the figure, where the negative is closer to the anchor than the positive. After training, the positive becomes much closer to the anchor, as in the right part of the figure. In short, triplet loss aims to narrow the distance between the anchor and the positive and to enlarge the distance between the anchor and the negative. The traditional triplet model randomly selects three samples from the training set, which is simple but too random. The key point in triplet training is to find hard triplets, that is, an anchor together with a distant positive (hard positive) and a close negative (hard negative). Since the proportion of hard triplets is low, random selection may fail to find them, resulting in poor performance.

Fig. 1. Sketch of the triplet loss learning target
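As an illustration of this learning target, the following minimal sketch computes the triplet loss of FaceNet [1] on squared L2 distances, in the form max(d(a, p) − d(a, n) + margin, 0). The margin value 0.2 and the toy 2-D embeddings are assumptions made for the example and are not taken from this paper.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (an assumed hyperparameter).
    """
    gap = squared_l2(anchor, positive) - squared_l2(anchor, negative) + margin
    return max(gap, 0.0)


anchor = [0.0, 0.0]
# "Before training": the negative is closer to the anchor than the positive.
before = triplet_loss(anchor, positive=[1.0, 0.0], negative=[0.5, 0.0])
# "After training": the positive is close and the negative is far away.
after = triplet_loss(anchor, positive=[0.1, 0.0], negative=[2.0, 0.0])
```

In the first case the loss is positive and pushes the embeddings apart; in the second case the constraint is already satisfied and the triplet contributes no gradient, which is why easy triplets are of little use for training.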

Hermans et al. [4] propose a variant of triplet loss with a new hard example selection based on batch training. The main idea is to select hard triplets from within batches. First, \( P \) classes are randomly selected, and then \( K \) images are randomly selected from each class, forming a batch of \( P \times K \) images. For each sample in the batch, a triplet is formed from the sample together with its hardest positive and hardest negative within the batch. This reduces the randomness of the traditional selection to a certain extent and makes training more effective. However, the approach only enlarges the sampling range locally and cannot guarantee that the selected hard sample pair is optimal.
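The batch-level selection described above can be sketched as follows. This is a simplified illustration, assuming that embeddings are lists of floats and that hardness is measured by squared L2 distance; the function name `batch_hard_triplets` and the toy batch are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def batch_hard_triplets(features, labels):
    """For every sample, pick its hardest positive and hardest negative in the batch.

    Hardest positive: the same-class sample that is farthest from the anchor.
    Hardest negative: the other-class sample that is closest to the anchor.
    Returns a list of (anchor_index, positive_index, negative_index).
    """
    triplets = []
    indices = range(len(features))
    for a in indices:
        positives = [i for i in indices if labels[i] == labels[a] and i != a]
        negatives = [i for i in indices if labels[i] != labels[a]]
        if not positives or not negatives:
            continue  # no valid triplet for this anchor
        hardest_pos = max(positives, key=lambda i: squared_l2(features[a], features[i]))
        hardest_neg = min(negatives, key=lambda i: squared_l2(features[a], features[i]))
        triplets.append((a, hardest_pos, hardest_neg))
    return triplets


# Toy batch with P = 2 classes and K = 3 one-dimensional embeddings per class.
features = [[0.0], [0.1], [0.9], [1.0], [1.2], [2.0]]
labels = [0, 0, 0, 1, 1, 1]
triplets = batch_hard_triplets(features, labels)
```

The search is quadratic in the batch size and never looks outside the batch, which is the locality limitation noted above.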

3 MHEF-TripNet

We propose MHEF-TripNet for effective image retrieval. We argue that triplet loss alone is inefficient and that the current way of sampling triplets is suboptimal. Our method introduces a sample selection probability matrix to select hard examples and a mixed loss function that combines triplet loss with category loss to extract discriminative features.

The framework of the proposed method, MHEF-TripNet, is shown in Fig. 2. In the training stage, as in traditional triplet training, three images are fed to the network at the same time, but they are selected differently. Specifically, an image is randomly selected from the training data as the anchor. The category of the positive is the same as that of the anchor, while the category of the negative is selected according to the probability matrix. The probability matrix is an \( N \times N \) matrix, where \( N \) denotes the number of categories and the element \( V_{ij} \) denotes the probability of choosing \( j \) as the category of the negative when the anchor category is \( i \). Each row sums to 1. The three selected images are fed simultaneously into the same convolutional neural network for feature extraction. The network parameters are then optimized with the mixed loss, which consists of triplet loss and classification loss.

Fig. 2. The framework of MHEF-TripNet

To make the experimental results easier to interpret, Table 1 presents the details of the simple network architecture that serves as the backbone of MHEF-TripNet.

Table 1. Network architecture: the backbone of MHEF-TripNet.

3.1 Sample Selection Probability Matrix

Although several hard example sampling methods have been proposed in recent years, most of them focus on expanding the sampling range locally, which only partly improves sampling performance. In this paper, instead, the selection is guided globally by a sample selection probability matrix, so that the hard examples obtained generalize well and are well targeted. The training set and validation set are the input of the procedure, and the network parameters for feature extraction are its output.

First, the images are preprocessed, including cropping to size and normalization. The sample selection probability matrix is then initialized as follows:

$$ V = \left[ \begin{array}{cccc} 0 & \frac{1}{N - 1} & \cdots & \frac{1}{N - 1} \\ \frac{1}{N - 1} & 0 & \cdots & \frac{1}{N - 1} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{N - 1} & \frac{1}{N - 1} & \cdots & 0 \end{array} \right] $$
(1)

where \( N \) denotes the number of categories of training images and \( V_{ij} \) denotes the probability of choosing \( j \) as the category of the negative when the anchor category is \( i \). Since the probability is initially uniform over the other categories, \( V_{ii} = 0 \) for \( i \in \left\{ {1, \ldots ,N} \right\} \) and \( V_{ij} = \frac{1}{N - 1} \) for \( i \ne j \), \( i,j \in \left\{ {1, \ldots ,N} \right\} \).
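The initialization in Eq. (1) can be written in a few lines. The sketch below stores the matrix as a list of rows with zero-based category indices; the function name `init_selection_matrix` is hypothetical.

```python
def init_selection_matrix(num_categories):
    """Build the N x N matrix of Eq. (1): zero diagonal, 1 / (N - 1) elsewhere."""
    n = num_categories
    return [[0.0 if i == j else 1.0 / (n - 1) for j in range(n)] for i in range(n)]


V = init_selection_matrix(4)
row_sums = [sum(row) for row in V]  # every row is a probability distribution
```

The zero diagonal guarantees that the negative is never drawn from the anchor's own category.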

The iterative training process follows. In each iteration, triplets are sampled according to the current probability matrix. More specifically, a sample is randomly selected from the training data as the anchor. If the category of the anchor is \( i \), a negative of category \( j \) is sampled with probability \( V_{ij} \). A positive is selected from the images of category \( i \). These three images form a triplet. The number of triplets selected in each iteration equals the training batch size.
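One way to realize this sampling step is sketched below. It is not the authors' implementation: the data layout (`images_by_category` as a list of image-id lists), the function name `sample_triplet`, and the requirement that every category holds at least two images are assumptions made for the illustration.

```python
import random


def sample_triplet(images_by_category, V, rng):
    """Sample (anchor, positive, negative) plus the anchor and negative categories.

    images_by_category[i] lists the image ids of category i (hypothetical layout);
    V[i][j] is the probability of drawing category j for the negative.
    """
    # Anchor: drawn uniformly from all training images.
    pool = [(c, img) for c, imgs in enumerate(images_by_category) for img in imgs]
    i, anchor = rng.choice(pool)
    # Positive: another image of the anchor category (assumes >= 2 images per category).
    positive = rng.choice([img for img in images_by_category[i] if img != anchor])
    # Negative category: drawn from row i of the probability matrix.
    j = rng.choices(range(len(V)), weights=V[i], k=1)[0]
    negative = rng.choice(images_by_category[j])
    return anchor, positive, negative, i, j


# Three toy categories with the uniform initial probabilities 1 / (N - 1).
data = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
V = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
rng = random.Random(0)
batch = [sample_triplet(data, V, rng) for _ in range(64)]  # one batch of 64 triplets
```

Because \( V_{ii} = 0 \), the weighted draw never returns the anchor category, so every sampled triplet is valid by construction.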

The triplets are fed into the feature extraction network to optimize its parameters, yielding the network parameters for the current iteration. The current network is then used to extract the features of the training set images, which form a temporary feature database. The same features are extracted for the validation set. Image retrieval tests are carried out against the feature database for the validation images one by one, and the feedback data are collected. Feedback data are the statistics of the misclassifications of each image in this round of retrieval tests. The misclassification results are analyzed and the probability matrix is updated. The update formula is as follows:

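$$ \begin{aligned} V_{ij} & = P\left( W \right) \times \frac{{Num_{j} }}{W}, \quad i \ne j \\ V_{io} & = \frac{P\left( R \right)}{N - 1 - M}, \quad i \ne o,\; j \ne o \\ \end{aligned} $$
(2)

where \( N \) denotes the number of categories of training images, \( i \) denotes the category of the current test sample, \( j \) ranges over the \( M \) wrong categories that appear in the retrieved result sequence, \( o \) ranges over the remaining categories, \( P\left( R \right) \) is the accuracy, \( P\left( W \right) \) is the error rate, \( Num_{j} \) is the number of retrieved images whose category is \( j \), and \( W = \sum\nolimits_{j} Num_{j} \) is the total number of misclassified images. Normalizing by \( W \) keeps the update consistent with the example in Eq. (3) and ensures that each row sums to 1, as shown in Eq. (4).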

Take a test sample belonging to category \( i \) as an example. Suppose that, among the \( K \) images in the retrieval result, \( W \) images are misclassified and come from two categories \( \left( {M = 2} \right) \): \( W_{1} \) images belong to category \( p \) and \( W_{2} \) images belong to category \( q \) \( \left( {W = W_{1} + W_{2} } \right) \). The probability of correct classification is \( P\left( R \right) = \frac{K - W}{K} \), and the probability of misclassification is \( P\left( W \right) = \frac{W}{K} \), of which \( p \) and \( q \) account for fractions \( \frac{{W_{1} }}{W} \) and \( \frac{{W_{2} }}{W} \), respectively. Consequently, row \( i \) of the probability matrix is updated as follows:

$$ \begin{array}{*{20}c} {V_{ip} = P\left( W \right) \times \frac{{W_{1} }}{W}} \\ {V_{iq} = P\left( W \right) \times \frac{{W_{2} }}{W}} \\ {V_{io} = \frac{P\left( R \right)}{N - 1 - 2}} \\ \end{array} $$
(3)

The total probability is:

$$ \begin{aligned} Total & = V_{ip} + V_{iq} + V_{io} \times \left( {N - 1 - 2} \right) \\ & = P\left( W \right) + P\left( R \right) \\ & = 1 \\ \end{aligned} $$
(4)
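The update for a single test sample can be sketched as follows, following the worked example in Eq. (3) and the normalization check in Eq. (4). The function name `update_row` is hypothetical, categories are zero-indexed, and two points are assumptions because the text does not specify them: the row is left unchanged when every other category appears among the wrong ones (\( N - 1 - M = 0 \)), and feedback from several test samples of the same category is not aggregated here.

```python
from collections import Counter


def update_row(V, i, retrieved_labels):
    """Update row i of V in place from the labels of the K retrieved images."""
    n = len(V)
    k = len(retrieved_labels)
    wrong = Counter(label for label in retrieved_labels if label != i)
    w = sum(wrong.values())  # W: number of misclassified images
    m = len(wrong)           # M: number of distinct wrong categories
    if n - 1 - m == 0:
        return  # assumption: case not covered by the text, keep the row as is
    p_wrong = w / k          # P(W)
    p_right = (k - w) / k    # P(R)
    for j in range(n):
        if j == i:
            V[i][j] = 0.0
        elif j in wrong:
            V[i][j] = p_wrong * wrong[j] / w  # share of the error mass
        else:
            V[i][j] = p_right / (n - 1 - m)   # remaining mass, spread evenly


# N = 5 categories, test sample of category 0, K = 10 retrieved images:
# 4 correct, 4 from category 1 and 2 from category 2 (so W = 6, M = 2).
V = [[0.0 if a == b else 0.25 for b in range(5)] for a in range(5)]
update_row(V, 0, [0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
total = sum(V[0])
```

With these numbers, \( P\left( W \right) = 0.6 \) and \( P\left( R \right) = 0.4 \), so category 1 receives 0.4, category 2 receives 0.2, and the two remaining categories receive 0.2 each.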

Each modification of the probability matrix is a positive feedback from the test results. In this way, MHEF-TripNet focuses more on enlarging the gap between confusable categories, which improves the distinctiveness of the features and ultimately the accuracy of image retrieval.

3.2 Mixed Triplet Loss

After extracting image features, the traditional triplet method iteratively updates the network parameters using a loss that compares distances between feature vectors. This approach suits situations where the number of image categories changes constantly, or where the training data carry no category labels and only indicate whether two images belong to the same category. For most image datasets, however, the number of categories is fixed and every image has a label. MHEF-TripNet therefore integrates the category loss of the images into training and combines it with the comparative loss to form a mixed loss for training the network.

The classification loss is defined as follows:

$$ C_{loss} = J\left( p \right) + J\left( n \right) $$
(5)

where \( J \) denotes the softmax loss, and \( p \) and \( n \) represent the features of the positive and the negative, respectively. The mixed loss is defined as follows:

$$ M_{loss} = \alpha T_{loss} + \rho C_{loss} $$
(6)

where \( T_{loss} \) and \( C_{loss} \) denote the triplet loss and the classification loss, respectively, and \( \alpha \) and \( \rho \) are the corresponding weights. To balance the two losses, we set \( \alpha = 2.0 \) and \( \rho = 1.0 \), since the classification loss consists of two softmax loss terms.
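Equations (5) and (6) can be illustrated with a small numerical sketch. It assumes that the softmax loss \( J \) is the cross-entropy of a softmax over classifier logits computed from the positive and negative features; how the classification head is attached to the backbone is not specified here, so the logits below are hypothetical inputs, and the weights default to 2.0 for the triplet term and 1.0 for the classification term.

```python
import math


def softmax_cross_entropy(logits, label):
    """Softmax loss J: negative log-probability of the true label."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def mixed_loss(t_loss, pos_logits, pos_label, neg_logits, neg_label,
               alpha=2.0, rho=1.0):
    """M_loss = alpha * T_loss + rho * C_loss, with C_loss = J(p) + J(n)."""
    c_loss = (softmax_cross_entropy(pos_logits, pos_label)
              + softmax_cross_entropy(neg_logits, neg_label))
    return alpha * t_loss + rho * c_loss


# Two-class toy case with uninformative logits: each J term equals log(2).
loss = mixed_loss(0.5, [0.0, 0.0], 0, [0.0, 0.0], 1)
```

Since \( C_{loss} \) sums two softmax terms while \( T_{loss} \) is a single term, weighting the triplet term by 2.0 keeps the two parts on a comparable scale.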

4 Experiments and Analysis

To evaluate the performance of MHEF-TripNet, we design a set of image retrieval experiments on two datasets: UC Merced Land Use and Kdelab Airplane. UC Merced Land Use is a land use image dataset intended for research purposes. It is a remote sensing dataset provided by [14] and includes 21 classes. The Kdelab Airplane dataset was created by our laboratory for airplane retrieval and contains 11 different airplane types. We choose mean average precision (mAP), top 5 precision (P@5), top 10 precision (P@10), top 50 precision (P@50) and top 100 precision (P@100) as evaluation criteria. In our retrieval result tables, TripNet denotes the image retrieval network based on triplet loss, M-TripNet denotes TripNet with the mixed triplet loss, HEF-TripNet denotes TripNet with hard example feedback, and MHEF-TripNet denotes the combination of the two. The training parameters are set as follows: the batch size is 64, the number of iterations is 100, and the learning rate is \( 10^{ - 4} \).
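For reference, the evaluation criteria can be computed as in the sketch below. Each query is represented by a ranked list of 0/1 relevance flags (1 if the retrieved image has the same category as the query). The function names are hypothetical, and the average precision is normalized by the number of relevant images found in the ranked list, which is one common convention and an assumption here.

```python
def precision_at_k(relevance, k):
    """P@k: fraction of the top k retrieved images that are relevant."""
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Mean of the precision values at each rank where a relevant image occurs."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """mAP: average precision averaged over all queries."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


# One query whose 1st, 3rd and 4th results are relevant.
ranked = [1, 0, 1, 1, 0]
p5 = precision_at_k(ranked, 5)
ap = average_precision(ranked)
m_ap = mean_average_precision([ranked, [1, 1, 1, 1, 1]])
```

P@k requires at least k relevant images to be reachable for a perfect score, which is why P@100 is only meaningful when a class holds enough images.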

4.1 Retrieval Results on UC Merced Land Use

UC Merced Land Use is one of the most widely used remote sensing image datasets. The dataset contains 2100 images covering 21 remote sensing scene categories, each with 100 images. The size of each image is \( 256 \times 256 \). Figure 3 shows sample images from the dataset. The experiment is divided into two parts. The first part is a comparison with deep-network-based methods that use a classification layer. The second part consists of ablation experiments on the method proposed in this paper.

Fig. 3. Sample images of UC Merced Land Use dataset

To further evaluate these methods, we fine-tuned the networks to the remote sensing domain using 80% of the AID [16] dataset as the training set. AID is a remote sensing aerial image dataset collected from Google Earth imagery. It contains 10000 images in 30 classes, with about 200 to 400 samples of size \( 600 \times 600 \) in each class. We then use 100% of the UC Merced Land Use dataset as the test set. The experimental results are shown in Table 2. The results for VGG16, VGG19, GoogleNet, ResNet-50, ResNet-101, and ResNet-152 are taken from [15], which is a survey of CBIR in remote sensing images. In this comparison, MHEF-TripNet is the best on most evaluation criteria, and its mAP is about 1.0% higher than that of the other methods. This indicates that MHEF-TripNet extracts discriminative features effectively on the UC Merced Land Use dataset.

Table 2. UC Merced Land Use dataset retrieval results of different approaches

The second part of the experiment used 80% of the UC Merced Land Use data as the training set and 20% of the UC Merced Land Use data as the test set. The experimental results are shown in Table 3.

Table 3. UC Merced Land Use test retrieval results of ablation experiments

The results of three modified algorithms are analyzed as follows:

  1. M-TripNet: compared with TripNet, the mAP of M-TripNet is about 3% higher, and P@5 and P@10 are each about 5% higher. This indicates that M-TripNet trains better and that the additional label information in M-TripNet helps to extract more robust features from remote sensing images.

  2. HEF-TripNet: the mAP of HEF-TripNet is about 2% higher than that of TripNet. P@5 and P@10 are about 2% and 1% higher than those of TripNet, respectively. This shows that the feedback guides the feature extraction network to train on hard examples by enlarging the distance to negative samples that are difficult to distinguish, which improves the distinctiveness of the features.

  3. MHEF-TripNet: compared with TripNet, the mAP of MHEF-TripNet is about 4.6% higher. P@5 and P@10 are about 6% and 5% higher, respectively. This means that MHEF-TripNet effectively integrates the two modifications above.

4.2 Retrieval Results on Kdelab Airplane

Because UC Merced Land Use is small, with only 100 images available per class, we cannot measure P@100 in the ablation experiments on that dataset. Ablation experiments on the Kdelab Airplane dataset are therefore added. The Kdelab Airplane dataset was created by the Kdelab Laboratory of the University of Science and Technology of China. It contains 2,200 images covering 11 different aircraft types, with 200 images of size \( 128 \times 128 \) in each class. Figure 4 shows sample images from the dataset.

Fig. 4. Sample images of Kdelab Airplane dataset

In this experiment, 80% of the Kdelab Airplane dataset is used as the training set and 20% as the test set. Since there are 160 training images in each class, overall retrieval performance is better than on the UC Merced Land Use dataset. The detailed retrieval results are shown in Table 4. Compared with TripNet, M-TripNet is 3% higher in mAP and about 1.5% higher in both P@5 and P@10. For HEF-TripNet, the mAP is about 1% higher than that of TripNet, and P@5 and P@10 are about 1.2% and 0.7% higher, respectively. MHEF-TripNet is 4.4% higher in mAP and about 2.2% higher in both P@5 and P@10. MHEF-TripNet thus outperforms the other algorithms on all evaluation criteria, which suggests its effectiveness for feature extraction on the Kdelab Airplane dataset.

Table 4. Kdelab Airplane test retrieval results of ablation experiments

In summary, MHEF-TripNet effectively improves the performance of TripNet in image retrieval. On the one hand, the mixed loss considers not only the triplet loss but also the label information of the images, which brings label information into triplet training. In effect, the label information provides cluster centers that make training easier. On the other hand, feedback-based triplet sampling is a process of continuous iteration and adjustment. After each round of training, an image retrieval validation is performed. MHEF-TripNet evaluates the generalization of the current model and identifies the current hard examples, which are then targeted in the next round. The model can therefore learn the most discriminative information from the triplets, which makes the features more distinctive and in turn improves retrieval performance.

5 Conclusion

In this paper, we propose a mixed triplet loss with hard example feedback network (MHEF-TripNet). The method extracts more discriminative features based on the mixed triplet loss, which uses both the correlation information and the category information of images. It also introduces a sample selection probability matrix from which hard triplets are selected. After each iteration, it adjusts the probability matrix according to the test results of the model, which improves hard sample selection from a global perspective. The experimental results show that this method is superior to the traditional triplet method and can effectively improve the accuracy of remote sensing image retrieval.