1 Introduction

With the explosive growth of online images in recent years, large-scale image retrieval, which tries to return images from a large database that are visually similar to a query, has attracted increasing attention [4, 16, 21]. However, retrieving the exact nearest neighbors is computationally impractical when the reference database becomes very large. To alleviate this problem, hashing [20] has been widely used to speed up the query process. The basic idea of hashing is to transform high-dimensional data into compact binary codes that preserve the semantic information, so that the distance between data points in the high-dimensional space can be approximated by the Hamming distance between their codes.

Existing hashing methods can be classified into two categories: unsupervised hashing methods and supervised hashing methods. Unsupervised hashing methods only exploit unlabeled data to learn hash functions, such as Locality Sensitive Hashing (LSH) [1], Spectral Hashing (SH) [22], Iterative Quantization (ITQ) [2], Spherical Hashing (SpH) [9], and Discrete Graph Hashing (DGH) [13]. Different from unsupervised methods, supervised methods utilize label information to learn the hash codes, which can preserve the similarity relationships among data points in Hamming space, like Binary Reconstructive Embedding (BRE) [7], Minimal Loss Hashing (MLH) [17], and Kernel-based Supervised Hashing (KSH) [14]. However, these traditional hashing methods learn hash functions using hand-crafted features, like GIST [18] or SIFT [15], which are not sufficient to represent the visual content, hence these methods result in suboptimal hashing codes.

In recent years, with the rapid development of deep learning [6], it has become a hot topic to combine deep learning and hashing methods [27]. Some deep hashing methods, like CNNH [23], DNNH [8], DHN [28] and DPSH [10], have shown better performance than the traditional hashing methods because the deep architectures generate more discriminative feature representations.

However, most existing deep hashing methods are designed in the supervised scenario, which relies on the label information of data points to preserve the semantic similarity. With the rapid growth of visual data on the web, most of the online data do not have label annotations. In addition, labeling data is also very expensive, which means it is difficult to acquire sufficient semantic labels in real applications. Thus it is more acceptable to develop unsupervised hashing methods which can exploit unlabeled data directly. But the existing unsupervised hashing methods depend on hand-crafted features which cannot effectively capture the semantic information of images.

To alleviate the above problems, in this paper, we propose a novel unsupervised deep hashing method, named Deep Graph Laplacian Hashing (DGLH), to simultaneously learn the deep feature representation and the compact binary codes. We design a graph-based objective function [24,25,26] at the top layer of the deep network to preserve the adjacency relation of images without using label information. We also minimize the quantization loss between the real-valued network outputs and the binary hash codes. Moreover, we enforce the bits to be balanced and uncorrelated, which makes the learned hash codes more effective. Finally, the model parameters are learned by the back-propagation algorithm. The proposed DGLH is an end-to-end learning method, which means that hash codes can be obtained directly from the input images. We compare the proposed DGLH method with state-of-the-art unsupervised hashing methods on several commonly used image retrieval benchmarks, and the experimental results demonstrate the effectiveness of the proposed method.

The rest of this paper is organized as follows: The proposed DGLH method is introduced in Sect. 2. Section 3 presents the experimental results and analysis, followed by a conclusion in Sect. 4.

Fig. 1. Deep Graph Laplacian Hashing (DGLH) model with a hash layer fch. The fc7 layer extracts the deep features of the input images, which are also used for graph construction. We enforce four objectives on the neurons at the top layer of the network to learn compact binary codes: (1) we use the graph Laplacian criterion to preserve the adjacency relation of images from the original feature space to the Hamming space, (2) the bits should be uncorrelated, (3) the bits should be balanced, and (4) the quantization loss should be minimized.

2 The Proposed Approach

In this paper, we use bold uppercase letters, like \(\mathbf {X}\), to denote matrices, and bold lowercase letters, like \(\mathbf {x}\), to denote vectors. \(\mathbf 1\) and \(\mathbf 0\) denote the vectors of all ones and all zeros, respectively, and \( \mathbf I _k \) denotes the \( k\times k \) identity matrix. Further, the Frobenius norm \( \Vert \cdot \Vert _F \) is written as \( \Vert \cdot \Vert \).

2.1 Deep Hashing Model

Given a set of n data points \( \mathbf {X}=[\mathbf {x}_1, \mathbf {x}_2, \dots ,\mathbf {x}_n] \) where \( \mathbf {x}_i \in R^D \) is the feature vector of the i-th data point, we aim to learn a set of nonlinear hash functions that map \( \mathbf {X} \) into compact k-bit hash codes \( \mathbf {H}=[\mathbf {h}_1, \mathbf {h}_2, \dots , \mathbf {h}_n] \) where \( \mathbf {h}_i\in \{-1,1\}^k \). The mapping should preserve the similarity relations among the data points in the Hamming space, which means that similar data points should have similar hash codes. We denote by \( \mathbf {S}_{ij} \) the similarity between \( \mathbf {x}_i \) and \(\mathbf {x}_j\). A less similar pair should then be farther apart in the Hamming space, that is, for any two pairs (i, j) and (p, q):

$$\begin{aligned} \mathbf {S}_{ij}<\mathbf {S}_{pq} \Rightarrow D(\mathbf {h}_i, \mathbf {h}_j)>D(\mathbf {h}_p,\mathbf {h}_q), \end{aligned}$$
(1)

where \( D(\mathbf {h}_i, \mathbf {h}_j) \) is the Hamming distance between \( \mathbf {h}_i \) and \( \mathbf {h}_j \).
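For codes with entries in \(\{-1,1\}\), the Hamming distance can be read off the inner product, \( D(\mathbf {h}_i, \mathbf {h}_j)=(k-\mathbf {h}_i^T\mathbf {h}_j)/2 \), because each agreeing bit contributes \(+1\) and each disagreeing bit contributes \(-1\). The following is a minimal NumPy sketch of this identity; the function name `hamming_distance` is illustrative and not part of the original method.

```python
import numpy as np

def hamming_distance(h_i, h_j):
    """Hamming distance between two codes with entries in {-1, +1}."""
    h_i = np.asarray(h_i)
    h_j = np.asarray(h_j)
    k = h_i.shape[0]
    # Agreeing bits add +1 to the dot product and disagreeing bits add -1,
    # so dot = k - 2 * (number of differing bits).
    return int((k - h_i @ h_j) // 2)

a = np.array([1, -1, 1, 1])
b = np.array([1, 1, -1, 1])
print(hamming_distance(a, b))  # 2
```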

In order to simultaneously learn robust feature representation and compact hash codes, we utilize a deep neural network, the AlexNet [6], in this work. Figure 1 shows the framework of the proposed method. AlexNet [6] consists of five convolutional layers (conv1–conv5) and three fully connected layers (fc6–fc8). Each fc layer learns a nonlinear mapping as follows:

$$\begin{aligned} \mathbf {f}_i^l=p^l(\mathbf {W}^l\mathbf {f}_i^{l-1}+\mathbf {b}^l), \end{aligned}$$
(2)

where \( \mathbf {f}_i^l \), \( \mathbf {W}^l \), \( \mathbf {b}^l \), and \( p^l \) are the output, the weight matrix, the bias vector, and the activation function of the l-th layer, respectively. To learn compact hash codes, we replace the fc8 layer with a new fully connected layer fch with k hidden units, which compresses the 4096-dimensional representation \( \mathbf {f}_i^7 \in R^{4096\times 1} \) of the fc7 layer into a k-dimensional representation \( \mathbf {y}_i \in R^{k} \):

$$\begin{aligned} \mathbf {y}_i=\mathbf {W}^{fch}\mathbf {f}_i^7+\mathbf {b}^{fch}. \end{aligned}$$
(3)

Finally, we obtain the binary code \( \mathbf {h}_i \in \{-1,1\}^{k} \) by the following function:

$$\begin{aligned} \mathbf {h}_i = sgn(\mathbf {y}_i), \end{aligned}$$
(4)

where \( sgn(\cdot ) \) is applied element-wise and is defined for a scalar y as

$$\begin{aligned} sgn(y)= \left\{ \begin{array}{rc} 1 ,&{} y>0 \\ -1 ,&{} y\le 0 \end{array}\right. . \end{aligned}$$
(5)
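The hash layer of Eqs. (3)–(5) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the weights are random rather than trained, the feature dimension is shrunk from 4096 to a toy size, and the names `sgn` and `hash_layer` are hypothetical. Note that `np.sign` returns 0 for a zero input, so the sketch thresholds explicitly to follow Eq. (5), which maps 0 to \(-1\).

```python
import numpy as np

def sgn(y):
    """Element-wise sign as in Eq. (5): +1 for y > 0, -1 otherwise."""
    # np.sign would return 0 at y == 0, so threshold explicitly instead.
    return np.where(np.asarray(y) > 0, 1, -1)

def hash_layer(f7, W_fch, b_fch):
    """Eq. (3) followed by Eq. (4): linear projection, then binarization."""
    y = W_fch @ f7 + b_fch
    return y, sgn(y)

rng = np.random.default_rng(0)        # illustrative random weights, not trained ones
d, k = 8, 4                           # toy sizes; the paper uses d = 4096
W_fch = rng.standard_normal((k, d))
b_fch = np.zeros(k)
f7 = rng.standard_normal(d)           # stand-in for an fc7 feature vector

y, h = hash_layer(f7, W_fch, b_fch)
print(y.shape, h)
```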

2.2 Objective Function

In this work, we learn the compact hash codes with a deep neural network under a graph Laplacian constraint. In our method, we follow the spectral hashing formulation [22] to map the deep feature extracted from each training sample \(\mathbf {x}_i\) to a binary hash code \(\mathbf {h}_i\). In addition, to obtain more robust codes, the learned hash codes should be balanced and uncorrelated. To this end, our objective function is defined as follows:

$$\begin{aligned} \arg \min _{\mathbf {W},\mathbf {b}}{\mathbf {C}}= \arg \min _{\mathbf {W},\mathbf {b}} \left\{ \mathbf {G} + \alpha \mathbf {B} + \beta \mathbf {U}\right\} . \end{aligned}$$
(6)

There are three terms in Eq. (6). The first term is used to preserve the similarity among the data points in Hamming space:

$$\begin{aligned} \mathbf {G}=\sum _{i,j=1}^n\Vert \mathbf {h}_i-\mathbf {h}_j\Vert ^2\mathbf {S}_{ij}, \end{aligned}$$
(7)

where \(\mathbf {S}_{ij}\) is the (i, j)-th entry of the similarity matrix \(\mathbf {S}\). For an image dataset, it can be computed by:

$$\begin{aligned} \mathbf {S}_{ij}=\exp \left( -\frac{\Vert \mathbf {f}_i^7-\mathbf {f}_j^7\Vert ^2}{\rho }\right) . \end{aligned}$$
(8)

In Eq. (8), \( \rho >0 \) is the bandwidth parameter, and \( \mathbf {f}_i^7 \) and \( \mathbf {f}_j^7 \) are the 4096-dimensional representations extracted from the \( fc7 \) layer for \( \mathbf {x}_i \) and \( \mathbf {x}_j \), respectively.
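As an illustration of Eq. (8), the sketch below builds the similarity matrix from a feature matrix whose columns are the feature vectors. The function name `similarity_matrix` and the toy 2-dimensional features are assumptions made for the example; the default \( \rho =4 \) is the value reported in Sect. 3.1.

```python
import numpy as np

def similarity_matrix(F, rho=4.0):
    """Gaussian similarity of Eq. (8); F holds one feature vector per column."""
    sq_norms = np.sum(F ** 2, axis=0)
    # ||f_i - f_j||^2 = ||f_i||^2 + ||f_j||^2 - 2 f_i^T f_j
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (F.T @ F)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / rho)

# Three toy 2-D "features" stored as columns (the paper uses 4096-D fc7 features).
F = np.array([[0.0, 2.0, 0.0],
              [0.0, 0.0, 2.0]])
S = similarity_matrix(F, rho=4.0)
print(np.round(S, 4))
```

Identical features get similarity 1, and the similarity decays toward 0 as the squared distance grows relative to \( \rho \).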

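For the similarity matrix \(\mathbf {S}\), we define the graph Laplacian matrix \( \mathbf {L}=\mathbf {D}-\mathbf {S} \), where \(\mathbf {D}\) is the diagonal degree matrix with \( \mathbf {D}_{ii}=\sum _j\mathbf {S}_{ij} \). Since \(\mathbf {S}\) is symmetric, Eq. (7) can be rewritten in trace form up to a constant factor, which can be absorbed into the weights \( \alpha \) and \( \beta \) of Eq. (6) and is dropped in what follows:

$$\begin{aligned} \frac{1}{2}\mathbf {G}=\frac{1}{2}\sum _{i,j=1}^n\Vert \mathbf {h}_i-\mathbf {h}_j\Vert ^2\mathbf {S}_{ij}=tr(\mathbf {HLH}^T). \end{aligned}$$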
(9)
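The relation between the pairwise form and the trace form can be checked numerically. For a symmetric \(\mathbf {S}\), the pairwise sum of Eq. (7) equals \( 2\,tr(\mathbf {HLH}^T) \); the factor 2 is a constant and does not change which codes minimize the term. The sketch below uses random codes and a random symmetric matrix purely as test data.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 6
H = rng.choice([-1, 1], size=(k, n))   # random codes, one per column
A = rng.random((n, n))
S = (A + A.T) / 2.0                    # any symmetric similarity matrix
L = np.diag(S.sum(axis=1)) - S         # graph Laplacian L = D - S

# Pairwise form of Eq. (7), summed over all ordered pairs (i, j).
pairwise = sum(S[i, j] * np.sum((H[:, i] - H[:, j]) ** 2)
               for i in range(n) for j in range(n))

# Trace form used in the objective.
trace_form = np.trace(H @ L @ H.T)

print(np.isclose(pairwise, 2.0 * trace_form))
```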

The second term in Eq. (6) enforces the hash codes to be balanced. According to information theory, to maximize the information capacity of the hash codes, each bit should have a 50% chance of being 1 or \(-1\). In other words, ideal hash codes satisfy \( \sum _{i=1}^n\mathbf {h}_i=\mathbf 0\). On the learned hash code matrix \(\mathbf {H}\), we define the unbalanced loss as follows:

$$\begin{aligned} \mathbf {B}=\Vert \sum _{i=1}^n\mathbf {h}_i\Vert ^2=\Vert \mathbf {H}{} \mathbf 1 \Vert ^2. \end{aligned}$$
(10)

Moreover, the bits of the hash codes should be uncorrelated, so that different bits are as independent of each other as possible and the redundancy among the bits is minimized.

We use the last term in Eq. (6) to measure the correlation among the bits, which is defined as:

$$\begin{aligned} \mathbf {U}=\Vert \sum _{i=1}^n\mathbf {h}_i\mathbf {h}_i^T-n\mathbf I _k\Vert ^2=\Vert \mathbf {HH}^T-n\mathbf I _k\Vert ^2. \end{aligned}$$
(11)
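To make Eqs. (10) and (11) concrete, the following sketch evaluates both losses on two small hand-made code matrices with \( k=2 \) bits and \( n=4 \) codes. The function names and the example matrices are illustrative only. The first matrix has balanced, mutually orthogonal bit rows, so both losses vanish; the second repeats the same constant bit and is penalized by both terms.

```python
import numpy as np

def balance_loss(H):
    """Eq. (10): squared norm of the per-bit sums over all n codes."""
    return float(np.sum(H.sum(axis=1) ** 2))

def uncorrelation_loss(H):
    """Eq. (11): squared Frobenius norm of H H^T - n I_k."""
    k, n = H.shape
    return float(np.sum((H @ H.T - n * np.eye(k)) ** 2))

# Columns are codes, rows are bits.
H_good = np.array([[1, 1, -1, -1],
                   [1, -1, 1, -1]])
H_bad = np.array([[1, 1, 1, 1],
                  [1, 1, 1, 1]])

print(balance_loss(H_good), uncorrelation_loss(H_good))  # 0.0 0.0
print(balance_loss(H_bad), uncorrelation_loss(H_bad))    # 32.0 32.0
```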

With \(\mathbf {G}\), \(\mathbf {B}\) and \(\mathbf {U}\), the objective function could be rewritten as:

$$\begin{aligned} \min _{\mathbf {W},\mathbf {b}} \mathbf {C}&=\mathbf {G}+\alpha \mathbf {B}+\beta \mathbf {U}=tr(\mathbf {HL}\mathbf {H}^T)+\alpha \Vert \mathbf {H}{} \mathbf 1 \Vert ^2+\beta \Vert \mathbf {H}\mathbf {H}^T-n\mathbf I _k\Vert ^2 \\ s{.}t{.}\quad&\mathbf {H}\in \{-1,1\}^{k\times n}, \end{aligned}$$
(12)

where \( \alpha >0 \) and \( \beta >0 \) are the balancing parameters of the regularization terms.

Note that the optimization problem defined in Eq. (12) is a discrete optimization problem, which is NP-hard and therefore difficult to solve directly. To tackle this problem, inspired by [10], we relax Eq. (4) to \( \mathbf {h}_i=\mathbf {y}_i \) in the three terms of Eq. (12) and add a penalty that keeps the real-valued outputs close to the binary codes. We reformulate the problem as:

$$\begin{aligned} \min _{\mathbf {W},\mathbf {b}} \mathbf {C}&=tr(\mathbf {YLY}^T)+\alpha \Vert \mathbf {Y}{} \mathbf 1 \Vert ^2 +\beta \Vert \mathbf {YY}^T-n\mathbf I _k\Vert ^2+\varphi \Vert \mathbf {Y}-\mathbf {H}\Vert ^2 \\ s.t.\quad&\mathbf {H}\in \{-1,1\}^{k\times n}, \end{aligned}$$
(13)

where \( \mathbf {Y}\in R^{k\times n} \) is the output matrix of the fch layer, \( \mathbf {Y}=[\mathbf {y}_1, \dots ,\mathbf {y}_n] \), with \( \mathbf {y}_i=\mathbf {W}^{fch}\mathbf {f}_i^7+\mathbf {b}^{fch} \). The last term in Eq. (13) penalizes the quantization loss, and \( \varphi >0 \) is its regularization parameter.

2.3 Learning

In the DGLH model, we need to learn the parameters of the deep neural network, which can be done with the stochastic gradient descent algorithm. In each iteration, we select a mini-batch of images from the training set to update the model parameters.

The derivative of the objective function in Eq. (13) with respect to \(\mathbf {Y}\), with \(\mathbf {H}\) held fixed, is computed as:

$$\begin{aligned} \frac{\partial \mathbf {C}}{\partial \mathbf {Y}}=2\mathbf {YL}+2\alpha \mathbf {Y}{} \mathbf 1 {} \mathbf 1 ^T+4\beta (\mathbf {YY}^T-n\mathbf {I}_k)\mathbf {Y} +2\varphi (\mathbf {Y}-\mathbf {H}). \end{aligned}$$
(14)
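The gradient in Eq. (14) can be sanity-checked against central finite differences of the objective in Eq. (13). The sketch below is a small numerical check on random data, assuming a symmetric Laplacian and treating \(\mathbf {H}\) as a constant; the function names are hypothetical, and the values of \( \alpha \), \( \beta \) and \( \varphi \) are those reported in Sect. 3.1.

```python
import numpy as np

def objective(Y, L, H, alpha, beta, phi):
    """Relaxed objective C of Eq. (13)."""
    k, n = Y.shape
    ones = np.ones(n)
    graph = np.trace(Y @ L @ Y.T)
    balance = np.sum((Y @ ones) ** 2)
    uncorr = np.sum((Y @ Y.T - n * np.eye(k)) ** 2)
    quant = np.sum((Y - H) ** 2)
    return graph + alpha * balance + beta * uncorr + phi * quant

def gradient(Y, L, H, alpha, beta, phi):
    """Analytic gradient of Eq. (14); assumes L is symmetric and H is fixed."""
    k, n = Y.shape
    ones = np.ones((n, 1))
    return (2.0 * Y @ L
            + 2.0 * alpha * (Y @ ones) @ ones.T
            + 4.0 * beta * (Y @ Y.T - n * np.eye(k)) @ Y
            + 2.0 * phi * (Y - H))

rng = np.random.default_rng(0)
k, n = 3, 5                                  # toy sizes for the check
Y = rng.standard_normal((k, n))
H = np.where(Y > 0, 1.0, -1.0)               # binary codes, held fixed
A = rng.random((n, n))
S = (A + A.T) / 2.0
L = np.diag(S.sum(axis=1)) - S
alpha, beta, phi = 0.006, 0.18, 100.0

analytic = gradient(Y, L, H, alpha, beta, phi)

# Central finite differences, one entry of Y at a time.
eps = 1e-6
numeric = np.zeros_like(Y)
for i in range(k):
    for j in range(n):
        E = np.zeros_like(Y)
        E[i, j] = eps
        numeric[i, j] = (objective(Y + E, L, H, alpha, beta, phi)
                         - objective(Y - E, L, H, alpha, beta, phi)) / (2 * eps)

print(np.allclose(analytic, numeric, rtol=1e-4, atol=1e-4))
```

In an actual network this gradient with respect to \(\mathbf {Y}\) would be passed backward through the fch layer by the framework's back-propagation; the check above covers only the loss itself.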

Then we can update the parameters \( \mathbf {W}^{fch} \) and \( \mathbf {b}^{fch} \), together with the parameters of the five convolutional layers (conv1–conv5) and the two fully connected layers (fc6–fc7), by the back-propagation method.

3 Experiments

To evaluate the proposed method, we conduct large-scale similarity search experiments on three benchmark datasets, CIFAR-10, CIFAR-100 and Flickr-25K, and compare the proposed DGLH method with five unsupervised hashing methods: Locality Sensitive Hashing (LSH) [1], Spectral Hashing (SH) [22], Spherical Hashing (SpH) [9], Density Sensitive Hashing (DSH) [11] and PCA-Iterative Quantization (PCA-ITQ) [2].

3.1 Datasets and Setting

The CIFAR-10 [5] dataset consists of 60000 color images in 10 classes with size of \(32\times 32\) (6000 images per class). We randomly select 1000 images (100 images per class) as the query set, and the remaining 59000 images as the database for image retrieval.

The CIFAR-100 [5] dataset consists of 60000 color images in 100 classes (600 images per class). The size of each image is also \( 32\times 32 \). We randomly select 10000 images (100 images per class) as the query set, and the remaining 50000 images as the database for image retrieval.

The Flickr-25K [3] dataset consists of 25000 images collected from Flickr. The dataset is annotated with 24 topics such as bird, sky, night, and people. We randomly select 2000 images as the query set, and use the remaining images as the database for image retrieval. We resize each image in all datasets to \( 224\times 224 \) to feed the proposed neural network.

For the DGLH method, we directly use the image pixels as input. For the other compared unsupervised hashing methods, we represent each image by the 4096-dimensional deep features extracted by AlexNet [6] with a model pre-trained on the ImageNet dataset.

We implement the DGLH model based on MatConvNet [19]. Specifically, we employ the AlexNet architecture [6], reuse the parameters of the convolutional layers conv1–conv5 and the fully connected layers fc6–fc7 from the model pre-trained on the ImageNet dataset, and train the hashing layer fch from scratch. The bandwidth parameter is set to \( \rho =4 \). The two hyper-parameters are set to \( \alpha =0.006 \) and \( \beta =0.18 \). We set \( \varphi =100 \) for the CIFAR datasets and \( \varphi =1000 \) for Flickr-25K.

3.2 Experimental Results and Analysis

To evaluate the performance of different hashing methods on image retrieval, we follow [12] to use the mean Average Precision (mAP), precision and recall at top-K positions (Precision@K, Recall@K) as evaluation metrics.
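For reference, these metrics can be computed from a ranked list of binary relevance flags (1 if the retrieved item is relevant to the query, 0 otherwise), ordered by increasing Hamming distance. The sketch below uses the common definitions over the full ranking; the exact protocol of [12] (for example, any truncation of the ranking) may differ, and the function names are illustrative.

```python
def precision_at_k(relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(relevant[:k]) / k

def recall_at_k(relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    total = sum(relevant)
    return sum(relevant[:k]) / total if total else 0.0

def average_precision(relevant):
    """Mean of Precision@i over the ranks i at which a relevant item occurs."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0

def mean_average_precision(rankings):
    """Average of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# One toy query: relevant items are retrieved at ranks 1 and 3.
ranking = [1, 0, 1, 0, 0]
print(precision_at_k(ranking, 3))   # 2 of the top 3 are relevant
print(recall_at_k(ranking, 3))      # both relevant items are in the top 3
print(average_precision(ranking))   # (1/1 + 2/3) / 2
```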

Table 1 shows the mAP results of DGLH and the other compared unsupervised hashing methods on CIFAR-10, CIFAR-100 and Flickr-25K when learning 32-, 48- and 64-bit hash codes. Figures 2, 3 and 4 show the precision and recall curves with 48 and 64 bits for DGLH and the other compared methods on CIFAR-10, CIFAR-100 and Flickr-25K, respectively. These experimental results show that the proposed DGLH approach outperforms the compared hashing methods.

Fig. 2. The results of precision curves and recall curves with 48 and 64 bits hashing on the CIFAR-10.

Fig. 3. The results of precision curves and recall curves with 48 and 64 bits hashing on the CIFAR-100.

Fig. 4. The results of precision curves and recall curves with 48 and 64 bits hashing on the Flickr-25K.

Table 1. mAP comparison results using Hamming ranking on three datasets with different hash bits.

4 Conclusion

In this paper, we propose a novel unsupervised deep hashing method for large-scale image retrieval. Unlike most previous unsupervised hashing methods, we design a graph-based deep hashing network that simultaneously learns a robust feature representation and compact hash codes, and that preserves the adjacency relation of images without using label information. When training the model, we relax the discrete optimization problem and reduce the quantization loss of the learned hash codes. We also use two additional constraints, bit balance and bit uncorrelation, to make the learned hash codes more effective. The proposed DGLH method is an end-to-end model, which can learn hash codes directly from image pixels. Experimental results on three image retrieval datasets demonstrate that the DGLH method outperforms other state-of-the-art unsupervised hashing methods.