1 Introduction

With the rapid growth of multimedia content on the Internet and the wide availability of high-speed connectivity, most websites now host large volumes of images [33, 34, 46]. This has triggered the development of hashing systems that enable accurate and efficient object recognition [9, 38, 39], image retrieval [6, 21, 47], image matching [7, 37], image tagging [48], and so on. Recently, considerable research [3, 8, 10, 15, 19, 27, 32, 35, 41, 43, 49] in the computer vision community has aimed at encoding high-dimensional image features into short, low-dimensional codes, which are efficient for both search and storage [31]. The basic idea of hashing is to build hash functions that project high-dimensional feature vectors into compact binary codes, so that similar data samples are mapped to similar hash codes.

In general, existing hashing methods can be divided into two categories: data-independent and data-dependent [42]. In the first category, randomized hash functions built from stochastic projections were the main tool in the early exploration of hashing. One of the most popular methods is Locality Sensitive Hashing (LSH) [5], and its derivatives [20] extend it with kernels or additional constraints. Theoretically, LSH-based methods focus on guaranteeing that long codes asymptotically preserve the original metric in the Hamming space, so longer codes are usually required to reach the desired performance. However, long codes not only lead to low recall but also increase the storage cost of the Hamming codes. Data-dependent approaches, in contrast, learn hash functions that generate short hash codes for efficient large-scale data search. With short hash codes, fast approximate nearest neighbor (ANN) search can be achieved in constant time as long as the neighbors of a data sample are well preserved in the Hamming space. Moreover, for a huge database, short hash codes are extremely effective at saving storage. Many classic hash functions, including PCA iterative quantization (PCA-ITQ) [10], K-means hashing (KMH) [12], spectral hashing (SH) [43], binary reconstructive embedding (BRE) [19], minimal loss hashing (MLH) [29], and sequential projection learning hashing (SPLH) [41], belong to this category. Nevertheless, most of these hashing methods cannot capture the nonlinear relationships among data samples well; kernel-based approaches [3, 11] have been proposed to deal with this issue, but the scalability problem remains.

In this work, to take full advantage of data relationships when learning effective short hash codes, we develop a novel deep learning framework: Multiple Hierarchical Deep Hashing (MHDH). The fundamental idea is shown in Fig. 1. Unlike most existing hashing methods, which look for a simple linear projection and then project each data point into a binary vector, we present a novel deep neural network that explores multiple hierarchical nonlinear projections to make full use of data relationships. The network is integrated with a quantization layer to obtain effective binary codes for efficient large-scale image search. Given the labels of data samples, hash codes are obtained by encoding the output of the last latent layer of the network; this output represents latent concepts that are connected to the class labels. Compared with conventional hashing approaches, our method learns binary hash codes in a point-wise manner and is scalable to large-scale datasets. Extensive experimental results on two popular datasets demonstrate the superiority and extensibility of MHDH over both supervised and unsupervised hashing methods.

Fig. 1

The overview of MHDH, which aims to learn efficient short hash codes to support large-scale image retrieval. The network consists of three types of layers: 1) conv layers (marked in yellow) that project an image into a rich description through multiple hierarchical nonlinear projections; 2) a latent (hash) layer (blue) that further encodes the generated description; and 3) a softmax layer (the final red layer)

2 Related work

2.1 Deep learning

There is increasing interest in deep learning for solving large-scale vision problems. Deep learning, a branch of machine learning, comprises a set of algorithms that model high-level abstractions of data using multiple layers with complex structures or multiple non-linear transformations [1, 16, 22, 24]. In recent years, various deep learning architectures such as deep neural networks [14], convolutional neural networks [16, 28], and recurrent neural networks [44] have been widely applied to computer vision [2] and natural language processing [4], where they have demonstrated superior performance on a wide range of applications.

2.2 Learning-based hashing

Numerous learning-based hashing methods have been proposed, and they can be divided into three categories: unsupervised, semi-supervised, and supervised. Unsupervised approaches do not need any label or class information about the training set. To date, projection and quantization are widely used to generate hash functions. The classic unsupervised method PCA Hashing (PCAH) [40] and its variants use quantization and show that it performs better than random projection. Later, researchers found that orthogonal projections benefit quantization, because the data variance decreases rapidly along the PCA directions. Weiss et al. [43] therefore propose spectral hashing (SH), which learns short binary codes through quantization with nonlinear functions along the PCA directions of the data. ITQ [10] simultaneously maximizes the variance of each hash code and minimizes the quantization loss so that the Hamming distance approximates the Euclidean distance. KMH [12] focuses on minimizing the Hamming distance loss between the quantized centers and the cluster centers. Compared with unsupervised hashing methods, supervised and semi-supervised methods make use of full or partial label information to generate better hash codes. For example, Semi-Supervised Hashing (SSH) [41] minimizes the empirical loss on the labeled training data while maximizing the variance over the labeled and unlabeled training sets. Kulis and Darrell [19] introduce a binary reconstructive embedding (BRE) method, which minimizes the reconstruction error between the original Euclidean distances and the learned Hamming distances. Norouzi and Fleet [29] present a minimal loss hashing (MLH) method that minimizes a loss between the learned Hamming distances and the quantization error among similar data samples. Strecha et al. [37] develop LDA hashing, which minimizes the within-class variations and maximizes the between-class variations of the binary codes.

2.3 Deep hashing

Several deep hashing methods [25, 26, 36, 45, 50] have already been proposed, and they have achieved great improvements in similar-data retrieval. The basic idea of conventional deep hashing [45] is to exploit CNNs to extract feature vectors that contain more representative information than traditional hand-crafted visual descriptors (e.g., GIST and HOG). Building on existing convolutional neural networks, most deep hashing methods adapt a CNN into their own architecture. For example, Lin et al. [25] insert a latent layer as a hash layer before the last fully-connected layer of the pretrained AlexNet [18]. However, this network is time-consuming to train since it contains multiple CNNs. Liong et al. [26] propose a deep hashing framework with both supervised and unsupervised variants: it uses a three-layer neural network whose last layer serves as the hash layer and, in the supervised setting, a pairwise label matrix as supervision. Different from the above deep hashing methods, we assume that class label information is available and present a novel deep framework for hashing that is not only easy to train but also achieves better performance.

3 Proposed method

In this section, we first present the preliminary concepts for image retrieval, and then propose our MHDH method.

3.1 Preliminary concepts

Let the training data be represented as \(\mathbf {X} = [\mathbf {x}_{1},\mathbf {x}_{2},\cdots ,\mathbf {x}_{N}] \in \mathbb {R}^{d \times N}\), which contains N data samples, where \(\mathbf {x}_{n} \in \mathbb {R}^{d}\,(1 \leq n \leq N)\) is the n-th sample in X. The purpose of hashing is to learn a series of hash functions \(f: \mathbb {R}^{d} \rightarrow \{-1,1\}\). Suppose we want to learn K hash functions; the learned functions map each \(\mathbf{x}_{n}\) into a compact binary vector \(\mathbf{b}_{n} = [b_{n1}, b_{n2}, \cdots, b_{nK}] \in \{-1,1\}^{K}\). Specifically, the k-th hash bit \(b_{nk}\) of \(\mathbf{x}_{n}\) is calculated as follows:

$$ b_{nk} = f_{k}(\mathbf{x}_{n}) = \mathrm{sgn}(\mathbf{w}_{k}^{T}\mathbf{x}_{n}) $$
(1)

where \(f_{k}\) is the k-th hash function, \(\mathbf{w}_{k} \in \mathbb {R}^{d}\) is the projection vector in \(f_{k}\), and \(\mathrm{sgn}(v)\) is the sign function, which returns 1 if v > 0 and −1 otherwise.

Let \(\mathbf {W} = [\mathbf{w}_{1},\mathbf{w}_{2},\cdots ,\mathbf{w}_{K}] \in \mathbb {R}^{d \times K} \) be the projection matrix. The projection of \(\mathbf{x}_{n}\) is then \(\mathbf{W}^{T}\mathbf{x}_{n}\), and the hash code is computed by quantization:

$$ \mathbf{b}_{n} = \mathrm{sgn}(\mathbf{W}^{T}\mathbf{x}_{n}) $$
(2)
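
To make the projection-and-quantization step concrete, the following minimal Python/NumPy sketch computes binary codes as in (1)-(2). The dimensions and the random W are purely illustrative (essentially an LSH-style stand-in), not the learned projections used by MHDH later.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, N = 512, 32, 1000                  # feature dimension, code length, number of samples (illustrative)
X = rng.standard_normal((d, N))          # columns are the data samples x_n
W = rng.standard_normal((d, K))          # projection matrix W = [w_1, ..., w_K]

# b_n = sgn(W^T x_n); sgn returns 1 for positive values and -1 otherwise
B = np.where(W.T @ X > 0, 1, -1)         # binary codes, shape (K, N)
```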

3.2 Our MHDH

Figure 1 shows the proposed MHDH framework. We present a novel deep neural network that seeks multiple hierarchical nonlinear transformations to learn binary codes for large-scale image search. In our model, the outputs of the classification layer depend on a set of hidden attributes, each of which is either on or off, and the learned hash codes should guarantee that similar binary codes correspond to the same labels. Based on this assumption, we design a latent layer implemented as a fully connected layer. The activation functions in the latent layer are hyperbolic tangent functions, so the activations approximate {−1,1}.

Next, given a data sample \(\mathbf{x}_{n}\), we learn a binary code \(\mathbf{b}_{n}\) by feeding the sample into the network and passing it through multiple nonlinear transformations. Assume that there are L+1 layers in the deep neural network and that the l-th layer has \(p^{l}\) units. Given a data sample \(\mathbf {x}_{n}\in \mathbb {R}^{d}\), the output of the first layer is \(\mathbf {h}_{n}^{1} = \varphi (\mathbf {W}^{1}\mathbf {x}_{n} + \mathbf {b}^{1}) \in \mathbb {R}^{p^{1}}\), where \(\mathbf {W}^{1} \in \mathbb {R}^{p^{1} \times d} \) is a projection matrix, \(\mathbf {b}^{1} \in \mathbb {R}^{p^{1}}\) is a bias vector, and φ(⋅) is a nonlinear activation function. The output of the first layer is then taken as the input of the second layer, \(\mathbf {h}_{n}^{2} = \varphi (\mathbf {W}^{2}\mathbf {h}_{n}^{1} + \mathbf {b}^{2}) \in \mathbb {R}^{p^{2}}\), where \(\mathbf {W}^{2} \in \mathbb {R}^{p^{2} \times p^{1}} \) is a projection matrix and \(\mathbf {b}^{2} \in \mathbb {R}^{p^{2}}\) is a bias vector of the second layer. Similarly, the output of the l-th layer is \(\mathbf {h}_{n}^{l} = \varphi (\mathbf {W}^{l}\mathbf {h}_{n}^{l-1} + \mathbf {b}^{l}) \in \mathbb {R}^{p^{l}}\). The final layer of the network is:

$$ \mathbf{h}_{n}^{L} = \mathrm{softmax}(\mathbf{W}^{L}\mathbf{h}_{n}^{L-1} + \mathbf{b}^{L})\in\mathbb{R}^{C} $$
(3)

where \(\mathrm{softmax}(\cdot)\) is the classification function and C is the number of image categories. The binary codes are then obtained by quantizing the latent layer, i.e., the layer immediately before the last layer:

$$ \mathbf{b}_{n} = \mathrm{sgn}(\mathbf{h}_{n}^{L-1}) $$
(4)
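
As a concrete illustration of the forward pass in (3)-(4), the following NumPy sketch uses small, made-up layer sizes and random placeholder weights (the actual configurations and learned parameters are described in Section 4.1):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, p1, p2, C = 8, 6, 4, 3                      # toy sizes: input, two hidden layers, number of classes
W1, b1 = rng.standard_normal((p1, d)), np.zeros(p1)
W2, b2 = rng.standard_normal((p2, p1)), np.zeros(p2)   # layer L-1 is the latent (hash) layer
WL, bL = rng.standard_normal((C, p2)), np.zeros(C)     # softmax layer

x = rng.standard_normal(d)
h1 = np.tanh(W1 @ x + b1)                      # h^1 = tanh(W^1 x + b^1)
h2 = np.tanh(W2 @ h1 + b2)                     # h^{L-1}, the latent layer
probs = softmax(WL @ h2 + bL)                  # Eq. (3): class probabilities
code = np.where(h2 > 0, 1, -1)                 # Eq. (4): b_n = sgn(h^{L-1})
```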

Let \(\mathbf{B} = [\mathbf{b}_{1},\mathbf{b}_{2},\cdots ,\mathbf{b}_{N}] \in \{-1,1\}^{K \times N}\) be the generated binary codes. Given an input vector \(\mathbf{x}_{i}\) that belongs to class c, the probability that the stochastic class variable Y takes the value c can be written as:

$$ P(Y = c \mid \mathbf{x}_{i}, \mathbf{W}, b) = \mathrm{softmax}_{c}(\mathbf{W}\mathbf{x}_{i} + b) $$
(5)

where \(\mathbf{x}_{i}\) denotes a data item belonging to class c. We then take the class that maximizes this probability as the prediction \(y_{pred}\), defined as:

$$ y_{pred} = \arg\max_{c} P(Y = c \mid \mathbf{x}_{i}, \mathbf{W}, b) $$
(6)

Because the final layer is a softmax layer, it can be seen as a case of multi-class logistic regression. It is therefore natural to use the negative log-likelihood as the loss function, and the optimization problem for learning the parameters is defined as follows:

$$ \ell(\theta = \{\mathbf{W},b\},\mathbb{D}) = - \sum_{i} \log P(Y = c \mid \mathbf{x}_{i}, \mathbf{W}, b) + \frac{\lambda}{2}\sum_{l}\left(\lVert\mathbf{W}^{l}\rVert_{F}^{2} + \lVert b^{l}\rVert_{F}^{2}\right) $$
(7)

where \(\mathbb {D}\) is the training set and θ is the set of all parameters. The second term is a regularization term designed to avoid over-fitting, and λ balances its impact. For training, we adopt stochastic gradient descent (SGD) to learn the parameters \(\{\mathbf {W}^{l},\mathbf {b}^{l}\}_{l = 1}^{{L}}\) and solve this optimization problem.
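
The paper does not specify an implementation framework; the sketch below uses PyTorch as one possible choice. It instantiates the 32-bit configuration reported later in Section 4.1 (hidden sizes 80 → 50 → 32, tanh activations, λ = 0.01, learning rate 0.01) and trains with the negative log-likelihood of (7), expressing the L2 term as SGD weight decay. The class name `MHDHNet`, the input dimension, and the random mini-batch are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MHDHNet(nn.Module):
    """Hypothetical MHDH-style network: stacked tanh layers, a latent (hash) layer, a softmax head."""
    def __init__(self, d, dims=(80, 50, 32), num_classes=10):
        super().__init__()
        layers, prev = [], d
        for p in dims:
            layers += [nn.Linear(prev, p), nn.Tanh()]     # h^l = tanh(W^l h^{l-1} + b^l)
            prev = p
        self.hidden = nn.Sequential(*layers)              # the last hidden output is h^{L-1}
        self.classifier = nn.Linear(prev, num_classes)    # softmax layer (softmax folded into the loss)

    def forward(self, x):
        h = self.hidden(x)
        return self.classifier(h), h

model = MHDHNet(d=512)                                    # e.g., 512-D GIST features on CIFAR-10
criterion = nn.CrossEntropyLoss()                         # negative log-likelihood of the softmax output
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.01)  # lambda as weight decay

x = torch.randn(128, 512)                                 # random mini-batch standing in for real features
y = torch.randint(0, 10, (128,))                          # random class labels

for _ in range(10):                                       # a few SGD steps on the toy batch
    logits, latent = model(x)
    loss = criterion(logits, y)                           # Eq. (7); the L2 term is handled by weight_decay
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

codes = torch.sign(model(x)[1]).detach()                  # Eq. (4): quantize the latent layer
```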

4 Experiments

In this section, we evaluate the performance of MHDH and compare it with state-of-the-art hashing methods on two widely used datasets.

4.1 Experimental settings

First, we introduce our experimental settings as follows:

CIFAR-10

The CIFAR-10 dataset [17] is composed of 10 object classes with 6,000 images per class; each image is 32 × 32 pixels. Following the setting in [41], the dataset is divided into a test set and a training set. The test set contains 1,000 randomly selected samples, 100 per class, and the remaining 59,000 images are used for training. We extract 512-D GIST feature vectors [30] to represent the images in CIFAR-10. Some sample images from the CIFAR-10 dataset are shown in Fig. 2.

Fig. 2

Some sample images from the CIFAR-10 dataset. There are 10 object classes and each class shows 10 sample images

MNIST

The MNIST dataset of LeCun and Cortes [23] consists of 10 categories of handwritten digits. There are 70,000 images in this dataset, each of size 28 × 28. We randomly sample 1,000 images as the test set, and the remaining 69,000 images are used for training. Each image is represented as a 784-D gray-scale feature vector.

Evaluation metrics

Three evaluation metrics are used to compare the different methods: 1) mean average precision (mAP), the mean of the average precision scores over all query images; 2) precision at N samples (P@N), the ratio of true similar images among the top N retrieved images; and 3) Hamming look-up precision, which measures the precision over all points whose Hamming distance to the query hash code is less than or equal to r, with r = 2 [26].
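
For clarity, a minimal NumPy sketch of these metrics for a single query is given below. The names `q_code`, `db_codes`, and `relevant` are illustrative, codes are assumed to lie in {−1, 1}, and mAP is obtained by averaging `average_precision` over all queries:

```python
import numpy as np

def hamming(q_code, db_codes):
    return (q_code != db_codes).sum(axis=1)             # Hamming distance to every database code

def precision_at_n(q_code, db_codes, relevant, n=1000):
    order = np.argsort(hamming(q_code, db_codes))       # rank database items by Hamming distance
    return relevant[order[:n]].mean()                   # fraction of true matches in the top n

def hamming_lookup_precision(q_code, db_codes, relevant, r=2):
    hits = hamming(q_code, db_codes) <= r               # all items within Hamming radius r
    return relevant[hits].mean() if hits.any() else 0.0

def average_precision(q_code, db_codes, relevant):
    rel = relevant[np.argsort(hamming(q_code, db_codes))]
    prec = np.cumsum(rel) / np.arange(1, len(rel) + 1)  # precision at every rank
    return (prec * rel).sum() / max(rel.sum(), 1)
```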

Baselines

In our experiments, we compare our approach with both unsupervised hashing methods: PCA-ITQ [10], KMH [12], SpH [13], SH [43], PCAH [40], LSH [5], and DH [26]; and supervised methods: MLH [29], BRE [19], and SDH [26].

Network settings

In the experiments, we design our deep framework with 4 layers. Empirically, the number of neurons in the final output layer is set to 10. The dimensions of the middle layers are set to [60 → 30 → 16], [80 → 50 → 32], and [100 → 80 → 64] for 16, 32, and 64 bits, respectively. The parameter λ is empirically set to 0.01. The weights of all layers are randomly initialized, and the learning rate of the SGD optimizer is set to 0.01. Furthermore, hyperbolic tangent functions are used as activation functions throughout the framework.
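
As a usage note, the three reported configurations could be instantiated with the hypothetical `MHDHNet` sketch from Section 3.2 roughly as follows; the input dimension 512 corresponds to the GIST features used for CIFAR-10 (MNIST would use d=784):

```python
# MHDHNet is the hypothetical class sketched in Section 3.2
configs = {16: (60, 30, 16), 32: (80, 50, 32), 64: (100, 80, 64)}   # code length -> hidden layer sizes
models = {bits: MHDHNet(d=512, dims=dims) for bits, dims in configs.items()}
```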

4.2 Experimental results

4.2.1 Results on CIFAR-10

Table 1 shows our results compared with other hashing methods on the CIFAR-10 dataset. From Table 1, we can make the following observations:

  • Compared with unsupervised approaches, our MHDH performs best on all three evaluation metrics. In terms of mAP, the best counterpart DH achieves 16.17 %, 16.62 % and 16.96 % with 16-bit, 32-bit and 64-bit codes, respectively, while our approach obtains increases of 22.26 %, 23.85 % and 25.45 % at 16, 32 and 64 bits, respectively. This is mainly because we use label information.

  • Compared with the state-of-the-art supervised deep hashing methods, MHDH also performs best on all three evaluation metrics. Specifically, in terms of mAP, MHDH nearly doubles the best counterpart SDH, with increases of 19.36 % at 16 bits, 19.64 % at 32 bits, and 19.9 % at 64 bits. In terms of P@N=1,000, MHDH performs best with 37.91 % at 16 bits and 41.63 % at 64 bits. For the Hamming look-up with r=2, MHDH also attains the best performance.

The precision and recall obtained by the compared methods on the CIFAR-10 dataset with the code length varying from 16 to 32 to 64 bits are shown in Fig. 3. We can see from Fig. 3 that our method achieves the best results even with short code lengths. In addition, as the code length increases, performance improves, and with 64-bit codes our method achieves the best results. Figure 4 shows the precision on the CIFAR-10 dataset for varying code lengths with respect to the number of retrieved samples. From Fig. 4, we can see that our algorithm performs best, and as the number of retrieved samples increases, MHDH remains relatively more stable than the other methods. Figure 5 exhibits some query examples on the CIFAR-10 dataset; for each query, each method returns the top-6 results using 64-bit hash codes.

Table 1 Results on the CIFAR-10 dataset
Fig. 3

The recall vs. precision curve on the CIFAR-10 dataset. The hash code length varies from 16, 32 to 64 bits

Fig. 4

The retrieval precision results with variable number of retrieved samples on the CIFAR-10 dataset. The hash code length varies from 16, 32 to 64 bits

Fig. 5

Top-6 retrieved images for 4 query images returned by different hashing methods on the CIFAR-10 dataset. The images in the first column are the queries, and 64-bit hash codes are used for retrieval

4.2.2 Results on MNIST

In this subsection, we compare our method with the state-of-the-art methods on the MNIST dataset. The results are reported in Table 2. From Table 2, we can see that the proposed MHDH achieves the best mAP at 16, 32 and 64 bits. With 1,000 retrieved samples, MHDH also attains the highest precision scores at 16, 32 and 64 bits. For the Hamming look-up with r=2, MHDH performs best with 93.56 % at 16 bits and 94.36 % at 32 bits, while the best compared method SDH only achieves 63.92 % at 16 bits and 77.07 % at 32 bits. This indicates the effectiveness of the proposed MHDH. Figure 6 shows the recall vs. precision curves generated by our method and the state-of-the-art methods; MHDH performs significantly better than the compared methods. In addition, Fig. 7 shows the precision with a varying number of retrieved samples, ranging from 0 to 1,000. At 16, 32 and 64 bits, MHDH performs best under all settings.

Table 2 Results of comparison on the MNIST dataset
Fig. 6

The recall vs. precision curve on the MNIST dataset. The hash code length varies from 16, 32 to 64 bits

Fig. 7

The retrieval precision results with variable number of retrieved samples on the MNIST dataset. The hash code length varies from 16, 32 to 64 bits

5 Conclusion

In this paper, we have proposed a novel deep hashing method named MHDH to support large-scale image retrieval. A latent layer is first integrated into the deep neural network, and a quantization layer is then applied to the latent layer to generate hash codes. MHDH focuses on learning multiple hierarchical non-linear projections that facilitate hash code generation. Experimental results on two popular datasets demonstrate the superiority of MHDH over state-of-the-art supervised and unsupervised hashing methods.