1 Introduction

Image similarity search in the era of big data has attracted wide attention in applications such as information retrieval, data mining and pattern recognition. To store millions of images efficiently and match them in real time, learning discriminative image representations for huge datasets has become an important research direction. Many existing methods represent images as binary hash codes via a hashing function, so that similarity search over high-dimensional images is replaced by calculating Hamming distances [18]. In addition, hashing functions are robust to various image transformations such as rotation, translation, scale and lighting, since they are carefully designed to extract distinctive patterns from images.

Many learning-to-hash methods [8, 22] have been proposed to achieve efficient image retrieval, but traditional learning-based approaches cannot effectively represent image features [16, 17, 30, 32]. Recently, deep-learning-based hashing methods [13, 27, 34] have shown that deep neural networks enable end-to-end representation learning and hash coding with nonlinear hash functions, achieving state-of-the-art performance. However, in the real world the query images do not necessarily have the same quality as the images used to train the model, and image resolution often degrades for various reasons. Existing deep hashing approaches cannot effectively represent low-resolution (LR) images with binary hash codes, which leads to poor retrieval results. To solve this problem, this paper presents an end-to-end deep hashing model (DSRHN) that generates efficient binary hash codes directly from LR images. To the best of our knowledge, our approach is the first attempt to use an end-to-end multi-task framework [19] for the LR image hashing task. We not only learn the intensity mapping between high-resolution (HR) and LR images, but also explore the mapping between images in Hamming space. DSRHN consists of two parts: a super-resolution network (SRNet) and a hash encoding network (HashNet). SRNet is trained to produce super-resolved (SR) images from LR images. HashNet is trained to generate binary hash codes from images, and is also used to constrain the SR images generated by SRNet to be consistent with the corresponding HR images in hash semantics. Because of this constraint, the retrieval results for LR images are close to those for HR images.

In short, our contributions are as follows. (1) We propose a novel end-to-end learning [28, 29, 38] framework for LR image retrieval. It allows LR images to gain semantic information via the restoration ability of a super-resolution network, thus achieving efficient LR image retrieval. (2) We conduct extensive experiments on two benchmark datasets and achieve state-of-the-art performance. The rest of the paper is organized as follows: Sect. 2 describes related work. Sections 3 and 4 introduce the proposed method and experimental results, respectively. Section 5 concludes the paper.

2 Related Work

2.1 Image Hashing Method

In general, hashing methods can be divided into data-independent and data-dependent methods. Locality Sensitive Hashing (LSH) [7] was one of the early data-independent methods. LSH uses random linear projections to map the original data to a low-dimensional feature space, and then obtains binary hash codes. LSH and several of its variants (e.g., kernel LSH [12] and p-norm LSH [4]) are widely used for large-scale image retrieval. However, data-independent methods suffer from low efficiency and the need for longer hash codes, so they are of limited use in practical applications. Due to these limitations, current research on hash functions mainly applies machine learning techniques to a given dataset. Data-dependent methods can be further divided into supervised, semi-supervised, and unsupervised methods. Unsupervised hashing methods learn the hash function directly from unlabeled data points and represent the data points as binary codes. Typical learning criteria include reconstruction error minimization [1, 33, 37] and graph structure learning [23, 26]. Iterative Quantization (ITQ) [8] is an unsupervised method that generates binary codes by seeking better quantization rather than random projection. Semi-supervised hashing methods improve the quality of the hash codes by leveraging supervisory information in the learning process. Semi-Supervised Hashing (SSH) [25] uses pairwise information on labeled samples to preserve semantic similarity. Compared to unsupervised and semi-supervised methods, supervised methods use semantic labels to improve performance. A representative example is Kernel-based Supervised Hashing (KSH) [22], which generates high-quality hash codes by using pairwise relationships between data points.

Recently, deep hashing methods [23, 31, 37] have achieved breakthrough results on image retrieval datasets thanks to the powerful learning ability of deep networks. The first deep hashing model, CNNH [27], decomposes hash learning into approximate hash code learning followed by hash function and image feature learning. DNNH [13] improved on CNNH by learning feature representations [15] and hash codes simultaneously via a triplet loss. DHN [34] further improved DNNH by simultaneously optimizing a pairwise cross-entropy loss and a quantization loss, preserving pairwise semantic similarity while controlling the quantization error. DPSH [20] learns feature representations and hash codes simultaneously from pairwise data points in an end-to-end manner. Deep Cauchy Hashing (DCH) [2] uses a pairwise cross-entropy loss based on the Cauchy distribution to learn binary hash codes. However, existing learning-based hashing methods assume that all images share the resolution of the given dataset. This is an idealized assumption: in the real world, many images are low-resolution, and when reduced-resolution images are processed by a hash model trained on high-resolution data, retrieval results degrade. This paper proposes a method for generating high-quality binary hash codes for LR images. We apply image super-resolution to image hashing and use a multi-task framework to address the hashing problem.

2.2 Image Super Resolution Method

Image super-resolution (ISR) aims to estimate HR images from LR images. With the rapid development of deep learning, ISR methods based on Convolutional Neural Networks (CNNs) [6, 35, 36] show excellent performance. Dong et al. [5] first proposed SRCNN, a deep-learning-based image super-resolution framework, which achieved excellent performance compared with previous methods. FSRCNN [6], a faster network framework, was proposed to accelerate the training and testing of SRCNN. Ledig et al. [14] introduced the deep ResNet [10] architecture into the super-resolution task and used a perceptual loss and a generative adversarial network (GAN) [9] for photo-realistic SR. Since image super-resolution can provide LR images with richer semantic information, we combine the super-resolution task with the hash retrieval problem for LR images.

Fig. 1. The framework of the proposed deep super-resolution hashing network (DSRHN).

3 Our Method

3.1 Overview

To solve the LR image hashing problem in an end-to-end learning framework, this paper proposes a multi-task deep learning framework (DSRHN), shown in Fig. 1. DSRHN consists of two parts: a super-resolution network (SRNet) and a hash encoding network (HashNet). Our framework follows an alternating learning process. First, we fix HashNet and train SRNet with two loss functions: a perceptual loss computed from the last convolutional-layer features of HashNet for \(x_{i}\) and \(x_{i}^{SR}\), and a pixel-wise mean squared error (MSE) loss between the output of SRNet and the HR images. Second, we fix SRNet and train HashNet with two losses: a hash semantic loss that preserves hash semantic similarity and a discriminative loss that maintains the robustness of HashNet. At test time, we input the LR images into SRNet, feed the resulting SR images into HashNet, and finally convert the output of HashNet into a hash code through the sign function. The sign function is defined as:

$$\begin{aligned} sign(x)=\left\{ \begin{array}{ll} 1, &amp; \text {if } x \ge 0 \\ -1, &amp; \text {otherwise} \end{array}\right. \end{aligned}$$
(1)
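
To make the test-time pipeline concrete, the following minimal PyTorch sketch shows how an LR query would be encoded. The names `srnet`, `hashnet`, and `encode_lr_image` are our own placeholders, not from the paper:

```python
import torch

def encode_lr_image(srnet, hashnet, x_lr):
    """Test-time encoding sketch: super-resolve the LR image with SRNet,
    then binarize HashNet's real-valued output with Eq. (1).
    `srnet` and `hashnet` are assumed to be trained nn.Module instances."""
    with torch.no_grad():
        x_sr = srnet(x_lr)   # LR -> SR image
        h = hashnet(x_sr)    # real-valued K-dimensional output
        # Eq. (1): sign(x), with sign(0) = 1
        b = torch.where(h >= 0, torch.ones_like(h), -torch.ones_like(h))
    return b
```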

3.2 SRNet

The top half of Fig. 1 shows our super-resolution network, SRNet. The main purpose of SRNet is to learn a mapping from LR images to HR images. The input and output of SRNet are LR images \(\{x_{i}^{LR},x_{j}^{LR}\}\) and SR images \(\{x_{i}^{SR},x_{j}^{SR}\}\), respectively. This process can be defined as:

$$\begin{aligned} X^{SR}=F_{SR}(X^{LR}) \end{aligned}$$
(2)

where \(F_{SR}(\cdot )\) denotes the super-resolution function. The main body of SRNet is 16 residual blocks. Each residual block uses two convolutional layers with small \(3\times 3\) kernels and 64 feature maps, combined with batch-normalization layers and the ParametricReLU activation function. To increase the resolution of the input images, we adopt two trained sub-pixel convolution layers proposed by Shi et al. [24]. In particular, unlike traditional image super-resolution methods, we super-resolve LR images with respect to hash semantics: we impose a content loss on the features of the SR and HR images at the final convolutional layer of HashNet. This ensures that the SR images have binary hash codes similar to those of the HR images.
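
As an illustration of this architecture, here is a minimal PyTorch sketch of one residual block and one sub-pixel upsampling stage. The class names are our own, and the paper's exact layer ordering may differ:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One SRNet residual block as described above: two 3x3 conv layers
    with 64 feature maps, batch normalization, and PReLU."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)  # residual skip connection

class SubPixelUpsample(nn.Module):
    """2x upsampling via sub-pixel convolution (PixelShuffle) [24];
    two such stages give the paper's 4x scale factor."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)  # rearranges 4*C channels into 2x spatial
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))
```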

To make the SR images faithful to the HR images at the pixel level, the following loss function is used:

$$\begin{aligned} L_{mse}=\left\| X^{HR}-X^{SR} \right\| _{2}^{2} \end{aligned}$$
(3)

where \(X^{HR}\) denotes the ground-truth HR images and \(X^{SR}\) denotes the SR images produced by SRNet. \(\left\| \cdot \right\| _{2}\) denotes the L2 norm. Since the SR images and the corresponding HR images should remain similar in hash semantics, we adopt a perceptual loss based on the features of the last convolutional layer of HashNet:

$$\begin{aligned} \begin{array}{ll} L_{per}&amp;=\left\| F_{cov}(X^{HR})-F_{cov}(X^{SR}) \right\| _{2}^{2}\\ &amp;=\left\| F_{cov}(X^{HR})-F_{cov}(F_{SR}(X^{LR})) \right\| _{2}^{2} \end{array} \end{aligned}$$
(4)

where \(F_{cov}(\cdot )\) denotes the feature map of HashNet's last convolutional layer. Overall, combining Eqs. (3) and (4), the loss of SRNet can be written as:

$$\begin{aligned} L_{SR}=L_{per}+\lambda L_{mse} \end{aligned}$$
(5)

where \(\lambda \) is a hyper-parameter that balances the two terms.
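
A hedged PyTorch sketch of this combined objective, assuming `hashnet_features` is HashNet truncated after its last convolutional layer (our placeholder for \(F_{cov}\)):

```python
import torch.nn.functional as F

def srnet_loss(x_hr, x_sr, hashnet_features, lam=0.1):
    """Eq. (5): perceptual loss on HashNet's last-conv features (Eq. 4)
    plus lambda-weighted pixel-wise MSE (Eq. 3). HashNet is frozen in
    this phase, so its HR features serve only as a fixed target."""
    l_mse = F.mse_loss(x_sr, x_hr)                       # Eq. (3)
    l_per = F.mse_loss(hashnet_features(x_sr),
                       hashnet_features(x_hr).detach())  # Eq. (4)
    return l_per + lam * l_mse                           # lambda = 0.1 (Sect. 4.4)
```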

3.3 HashNet

The bottom half of Fig. 1 shows our hash encoding network, HashNet. The purpose of HashNet is to learn a nonlinear hash function \(f:x \mapsto h\in \{-1,1\}^K\) from the input space \(R^D\) to the Hamming space \(\{-1,1\}^K\) via a deep neural network. AlexNet [11] is adopted as the backbone of HashNet. HashNet consists of five convolutional layers (c1-c5) and two fully connected layers (fc6, fc7), which are pre-trained on the ImageNet dataset. To obtain hash codes, we add a K-node hash layer, fch, where each node corresponds to one bit of the target hash code; the fch layer converts the preceding representation into a K-dimensional representation. HashNet can thus encode each data point into a K-bit binary hash code while preserving the similarity information S. The input to HashNet consists of SR image pairs, HR image pairs, and the pairwise similarity relation \(\{x_{i},x_{i}^{SR},x_{j},x_{j}^{SR},S_{ij} \}\). \(S=\{ s_{ij}\}\) is defined as:

$$\begin{aligned} S_{ij}=\left\{ \begin{array}{ll} 1, &amp; \text {if images } x_{i} \text { and } x_{j} \text { share the same class label} \\ 0, &amp; \text {otherwise} \end{array}\right. \end{aligned}$$
(6)
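
For the multi-label datasets used in Sect. 4, "share the same class label" means sharing at least one tag. A small sketch of how S could be built from multi-hot label vectors (the function name is our own):

```python
import torch

def pairwise_similarity(labels_a, labels_b):
    """Eq. (6) for multi-hot labels: s_ij = 1 if images i and j share
    at least one label, 0 otherwise. Inputs are (n, num_classes) and
    (m, num_classes) {0,1} tensors; output is an (n, m) {0,1} matrix."""
    shared = labels_a.float() @ labels_b.float().t()  # counts of shared labels
    return (shared > 0).float()
```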

After HashNet is trained, the binary hash codes B are calculated through the trained hashing network:

$$\begin{aligned} b_{i}=sign(F_{hash}(x_{i}|\theta )) \end{aligned}$$
(7)

where \(b_{i}\) is the hash code, \(F_{hash}(\cdot )\) denotes the hash function, \(\theta \) denotes the parameters of HashNet, and \(x_{i}\) is the input image. In addition, the learned hash function is trained to distinguish SR images from HR images. This objective is adversarial to SRNet's goal of keeping SR and HR images identical in hash semantics, and this adversarial interplay enables SRNet and HashNet to learn from each other and improve together.

The distance between binary hash codes is usually measured by the Hamming distance, which can be calculated as:

$$\begin{aligned} dist_{H}=\frac{1}{2}(K-\left\langle b_{i},b_{j} \right\rangle ) \end{aligned}$$
(8)

where K is the length of the hash code and \(\left\langle \cdot \right\rangle \) denotes the inner product between hash codes.
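
For \(\{-1,+1\}\) codes this identity is straightforward to implement; a minimal sketch (the function name is ours):

```python
import torch

def hamming_distance(b_i, b_j):
    """Eq. (8): dist_H = (K - <b_i, b_j>) / 2, with <.,.> the plain inner
    product of {-1,+1} codes of length K. Identical codes give 0;
    completely opposite codes give K."""
    k = b_i.shape[-1]
    return 0.5 * (k - (b_i * b_j).sum(dim=-1))
```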

From Eq. (8) we see that the Hamming distance is determined by the inner product: the larger the inner product of two hash codes, the smaller their Hamming distance, and vice versa. Therefore, we can learn discriminative hash codes by fitting the inner product between hash codes to the semantic label similarity. Based on this fact, we use the same loss function as DPSH [20]. First, the pairwise labels are modeled through the similarity (inner product) between the binary codes of the samples:

$$\begin{aligned} p(s_{ij}|b_{i},b_{j})=\left\{ \begin{array}{ll} \sigma (\left\langle b_{i},b_{j} \right\rangle ), &amp; s_{ij}=1 \\ 1-\sigma (\left\langle b_{i},b_{j} \right\rangle ), &amp; s_{ij}=0 \end{array}\right. \end{aligned}$$
(9)

where \(\sigma (x) = 1/(1 + e^{-x})\) is the sigmoid function and \(\left\langle b_{i},b_{j} \right\rangle =\frac{1}{2}b_{i}^Tb_{j}\). \(p(\cdot )\) is the conditional probability of \(s_{ij}\) given the corresponding pair of hash codes \([b_{i},b_{j}]\). Based on the above, the following loss function is used to preserve hash semantic similarity:

$$\begin{aligned} \begin{array}{ll} L_{hs} &amp;=-\log p(S|B)=-\sum _{s_{ij}\in S}\log p(s_{ij}|\left\langle b_{i},b_{j} \right\rangle )\\ &amp;=-\sum _{s_{ij}\in S}\left( s_{ij}\left\langle b_{i},b_{j} \right\rangle - \log (1+e^{\left\langle b_{i},b_{j} \right\rangle })\right) \end{array} \end{aligned}$$
(10)

Equation (10) is a negative log-likelihood loss, which makes the Hamming distance between two similar images as small as possible and the Hamming distance between two dissimilar images as large as possible. However, the hash codes \(b_{i}\in \{-1,+1\}^K\) are discrete, which makes the loss hard to optimize. To address this, the equality constraint can be relaxed by moving it into a regularization term, yielding the following loss function:

$$\begin{aligned} L_{hs} = -\sum _{s_{ij}\in S}\left( s_{ij}\left\langle v_{i},v_{j} \right\rangle - \log (1+e^{\left\langle v_{i},v_{j} \right\rangle })\right) +\gamma \sum _{i=1}^{n}\left\| b_{i}-v_{i} \right\| _{2}^{2} \end{aligned}$$
(11)

where \(v_{i}\) is the relaxed real-valued code. The last term (\(\gamma \sum _{i=1}^{n}\left\| b_{i}-v_{i} \right\| _{2}^{2}\)) is the regularization term, and \(\gamma \) is a hyper-parameter.
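
A sketch of Eq. (11) in PyTorch, using softplus for the numerically stable \(\log (1+e^{x})\). The names and the \(\gamma \) value are our placeholders; the paper does not report \(\gamma \):

```python
import torch
import torch.nn.functional as F

def hash_similarity_loss(v, s, gamma=0.1):
    """Eq. (11): relaxed negative log-likelihood over pairs plus a
    quantization regularizer. `v`: (n, K) real-valued relaxed codes;
    `s`: (n, n) {0,1} similarity matrix from Eq. (6)."""
    b = torch.sign(v).detach()           # b_i = sign(v_i), held fixed
    theta = 0.5 * (v @ v.t())            # <v_i, v_j> = 0.5 * v_i^T v_j
    nll = F.softplus(theta) - s * theta  # log(1 + e^theta) - s_ij * theta
    quant = (b - v).pow(2).sum()         # sum_i ||b_i - v_i||^2
    return nll.sum() + gamma * quant
```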

To ensure that the hash codes learned by the hash function can distinguish SR from HR images, and to make HashNet robust, we design a discriminative loss for optimizing the hashing network:

$$\begin{aligned} L_{dis}=\max (m-\left\| F_{hash}(x_{i}^{HR})- F_{hash}(x_{i}^{SR}) \right\| _{2}^{2},0) \end{aligned}$$
(12)

where \(F_{hash}(\cdot )\) denotes the hash network (HashNet) and \(m>0\) is a margin threshold. Overall, combining Eqs. (11) and (12), the loss of the hash network can be written as:

$$\begin{aligned} L_{hash}=L_{hs}+\alpha L_{dis} \end{aligned}$$
(13)

where \(\alpha \) is a hyper-parameter.
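
A sketch of the discriminative loss and the combined HashNet objective. The margin value m below is our placeholder; the paper does not report it:

```python
import torch

def discriminative_loss(h_hr, h_sr, m=1.0):
    """Eq. (12): hinge loss that pushes HashNet outputs for HR and SR
    versions of the same image at least a squared-L2 margin m apart."""
    d = (h_hr - h_sr).pow(2).sum(dim=-1)     # squared L2 distance per pair
    return torch.clamp(m - d, min=0.0).mean()

def hashnet_loss(l_hs, l_dis, alpha=0.01):
    """Eq. (13), with alpha = 0.01 as reported in Sect. 4.4."""
    return l_hs + alpha * l_dis
```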

4 Experiments

To validate the performance of our proposed method, we conduct extensive experiments on two widely used benchmark datasets (i.e., NUS-WIDE and MS-COCO). We first introduce the datasets and then present our experimental results.

4.1 Datasets and Evaluation Metrics

In the experiments, we conduct hashing experiments on two widely used datasets: NUS-WIDE [3] and MS-COCO [21].

  • NUS-WIDE is a multi-label dataset consisting of 269,648 web images associated with user tags, where each image can be annotated with multiple tags. Following DPSH [20], we selected the 195,834 images belonging to the 21 most common concepts. For NUS-WIDE, two images are defined as similar if they share at least one tag. We randomly sampled 100 images per class (i.e., 2,100 images in total) as the test set and 500 images per class (i.e., 10,500 images in total) as the training set. The remaining images serve as the gallery during the testing phase.

  • MS-COCO, as used here, consists of 82,783 training images and 40,504 validation images, each labeled with some of 80 semantic concepts. Following DCH [2], we randomly selected 5,000 images as query points and used the rest as the database; 10,000 images were randomly sampled from the database for training. For MS-COCO, two images are defined as ground-truth neighbors (a similar pair) if they share at least one label.

Evaluation Metrics: We use mean average precision (mAP), precision, and recall to evaluate the performance of DSRHN against the compared hashing methods for low-resolution image retrieval. The low-resolution images are obtained from the experimental datasets by downsampling: the original dataset images are high-quality and are interpolated down to low resolution. We report image retrieval results with mAP@5000, i.e., mAP calculated over the top 5,000 ranked images from the gallery set.
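For reference, a NumPy sketch of mAP@k under the convention used here, where AP is computed over the top-k ranked gallery images. The function name is ours, and tie-breaking and normalization conventions vary across papers:

```python
import numpy as np

def map_at_k(dist, relevant, k=5000):
    """mAP@k sketch: `dist` is an (n_query, n_gallery) matrix of Hamming
    distances, `relevant` an (n_query, n_gallery) {0,1} matrix marking
    ground-truth neighbors (images sharing at least one label)."""
    aps = []
    for d, rel in zip(dist, relevant):
        order = np.argsort(d)[:k]   # rank gallery by distance, keep top-k
        hits = rel[order]
        n_rel = hits.sum()
        if n_rel == 0:
            continue                # skip queries with no relevant image in top-k
        prec = np.cumsum(hits) / (np.arange(len(hits)) + 1)  # precision at each rank
        aps.append(float((prec * hits).sum() / n_rel))
    return float(np.mean(aps))
```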

4.2 Ablation Study

We perform an ablation experiment to examine the effectiveness of our framework for low-resolution image hashing. We evaluate DSRHN against HashNet-LR, HashNet-HR, and HashNet-SRGAN. HashNet-LR feeds low-resolution images directly into our hash network without a super-resolution network; HashNet-HR feeds high-resolution images into our hash network; HashNet-SRGAN uses SRGAN [14] instead of our super-resolution network, in a non-end-to-end manner, for LR image hashing.

The results are shown in Table 1. First, comparing HashNet-SRGAN with HashNet-LR shows that a super-resolution network does benefit LR image hash retrieval; in particular, our DSRHN works best. Second, the results of HashNet-HR retrieving HR images and DSRHN retrieving LR images are very close, which shows that our framework learns the mapping from LR to HR images in hash semantics very well. Finally, the comparison of DSRHN and HashNet-SRGAN shows that our end-to-end framework better enhances the semantics of LR images for hash retrieval.

Table 1. LR image retrieval results (mAP@5000) of DSRHN and its variants HashNet-LR, HashNet-HR, and HashNet-SRGAN on the two datasets.
Table 2. LR image retrieval results (mAP@5000) of DSRHN on the NUS-WIDE and MS-COCO datasets with 12, 24, 32, and 48 hash bits.
Fig. 2. The results of DSRHN and the comparison methods on the NUS-WIDE dataset.

4.3 Comparisons

For image retrieval, we compare our method with previous deep supervised hashing methods, including Deep Pairwise-Supervised Hashing (DPSH) [20], Deep Hashing Network (DHN) [34], and Deep Cauchy Hashing (DCH) [2].

Table 2 shows the mAP@5000 results of DSRHN and the compared models with different numbers of hash bits. The results show that our model consistently and significantly outperforms the other models across bit lengths, datasets, and metrics, mainly because our framework directly enhances the hash semantics of low-resolution images. Figure 2 shows the precision-recall curves with 12 bits and mAP@5000 with 48 bits w.r.t. different numbers of top returned images. Our method not only performs best but also exhibits excellent retrieval stability.

4.4 Implementation Details

The proposed network consists of two parts: SRNet and HashNet. SRNet is trained specifically for a 4\(\times \) scale factor. We randomly crop a \(224\times 224\) patch from each image as the ground truth and downsample it to \(56\times 56\) as the input LR patch for training. SRNet uses 16 residual blocks, each with two \(3\times 3\) convolutional layers and 64 feature maps, combined with batch-normalization layers and the ParametricReLU activation function, and is optimized by Adam with a learning rate of 0.001. For HashNet, we use the AlexNet architecture directly, optimized by SGD with a learning rate of 0.01. The \(\lambda \) of \(L_{SR}\) is set to 0.1, and the \(\alpha \) of \(L_{hash}\) is set to 0.01. Training is stopped after 150 epochs, and we select the best model for comparison. Experiments are performed on an NVIDIA Titan Xp GPU.
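
A sketch of the optimizer setup implied by these settings (`make_optimizers` is our own helper name):

```python
import torch

def make_optimizers(srnet: torch.nn.Module, hashnet: torch.nn.Module):
    """Optimizers matching the reported hyper-parameters: Adam (lr=0.001)
    for SRNet, SGD (lr=0.01) for HashNet; the two networks are updated
    in alternation (Sect. 3.1) for up to 150 epochs."""
    srnet_opt = torch.optim.Adam(srnet.parameters(), lr=1e-3)
    hashnet_opt = torch.optim.SGD(hashnet.parameters(), lr=1e-2)
    return srnet_opt, hashnet_opt
```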

5 Conclusion

This paper proposes a novel image hash retrieval framework (DSRHN) based on deep super-resolution. The framework consists of two main components: a super-resolution network and a hash encoding network. The super-resolution network, trained on large-scale image retrieval datasets, recovers high-resolution images from low-resolution inputs while providing richer semantic information; the hash encoding network then represents the recovered images as binary codes, which we use for image retrieval. Experimental results show that the proposed DSRHN achieves state-of-the-art performance.