
1 Introduction

With the explosive growth of data in practical applications such as image retrieval, approximate nearest neighbor (ANN) search has become a hot topic in recent years. Among existing ANN techniques, hashing has become one of the most popular and effective because of its fast query speed and low memory cost. Numerous studies have shown that hashing improves performance on image retrieval tasks [7, 23]. However, these methods are limited in feature representation and cannot be trained end-to-end.

Recently, convolutional neural networks (CNNs) have gradually been applied to hashing-based image retrieval and have achieved promising performance. Xia et al. [22] first adopted a CNN architecture in a hashing algorithm. Later, a series of CNN-based deep hashing methods [16, 17] were proposed in an end-to-end manner, showing the effectiveness of deep feature representations. Compared with traditional hashing methods, these deep hashing methods have greatly improved performance on many benchmarks. Moreover, it proves crucial to jointly learn similarity-preserving representations and control the quantization error of converting continuous representations into binary codes [3]. However, existing deep feature representations are generated while gradually discarding image information. Representative feature variability may be lost during feature transformation, so complete image information cannot be guaranteed. In addition, informative regions of the image are not well highlighted in existing algorithms, causing poor generalization ability.

To effectively address the above problems, we propose a novel image retrieval framework based on an invertible network with a spatial attention mechanism. First, a reversible network is proposed, which guarantees that representative features are transformed losslessly from the original image. In this way, all the information of the image is forwarded through the network. Then, we adopt a spatial attention architecture to tell the network where to focus, which also improves the representation of regions of interest. Spatial attention effectively learns which information to emphasize or suppress during information transmission. Our method yields state-of-the-art retrieval performance in most settings. To summarize, the main contributions of this paper are three-fold:

  • We propose an effective invertible network with lossless image information for image retrieval, where the whole framework can be trained end-to-end;

  • To excavate informative regions of features, we adopt a spatial attention module in our invertible block that learns to focus on objective information and suppress unnecessary information.

  • Extensive experiments on benchmark datasets show that our architecture is effective and achieves promising performance.

The rest of the paper is organized as follows: in Sect. 2, we introduce some related work about our algorithm. The proposed method is illustrated in Sect. 3, followed by the experimental results in Sect. 4. In Sect. 5, we conclude our work.

2 Related Work

2.1 Hashing Methods

Existing hashing methods [1, 25] can be roughly divided into two categories, namely unsupervised hashing and supervised hashing. Unsupervised hashing exploits unlabeled data to learn a set of functions that encode data into binary codes [5, 21]. Locality-Sensitive Hashing (LSH) [5] is the most representative unsupervised hashing algorithm, achieving promising performance compared with previous approaches. LSH guarantees that similar data points receive similar binary codes after the same hash mapping, and vice versa. Supervised hashing [18, 20] further exploits label information during learning to generate compact hash codes. Supervised Hashing with Kernels (KSH) [18] utilizes pairwise labels to generate effective hash functions, minimizing the Hamming distances of similar pairs while maximizing those of dissimilar pairs.

In recent years, CNNs have shown significant success in computer vision [13,14,15, 19, 26,27,28,29,30,31, 34,35,37]. In the domain of hashing-based retrieval, [22] was the first deep neural network approach, achieving promising performance compared with conventional methods. Deep Hashing Network (DHN) [33] not only preserves pairwise similarity but also controls the quantization error. Improving on DHN, HashNet [3] balances the training data of positive and negative pairs and reduces quantization error via a continuation technique, attaining state-of-the-art performance on several benchmark datasets. However, the high-dimensional features obtained by these methods are accompanied by a gradual loss of image information, and we cannot ensure whether the discarded variability is significant.

2.2 Attention Mechanism

The attention mechanism can be viewed as a strategy to bias the allocation of available processing resources towards the most informative components of an input [10]. Attention modules have been widely applied in natural language processing (NLP) tasks such as machine translation and sentence generation, with remarkable performance. Meanwhile, in computer vision, attention mechanisms also demonstrate powerful capabilities. For example, Hu et al. [9] utilize attention to build an object relation module, which models the relationships among a set of objects and improves object recognition. In [24], a self-attention module is introduced to better generate images. Channel-wise attention has been proposed for the image super-resolution task [32]. In our work, the attention-aware invertible hashing network utilizes spatial attention to enhance informative features in the spatial domain, which accurately tells which information to emphasize or suppress.

3 Our Method

3.1 Overview

The architecture of our attention-aware invertible hashing network (AIHN) is shown in Fig. 1. The pair-wise images are first fed into an invertible downsampling layer, which increases the number of output channels while decreasing the spatial resolution. The output is then split into two sublayers (\(x_1\), \(y_1\)) of equal channel dimension, which are fed into the invertible block. It is worth noting that the spatial attention module and the invertible downsampling module are incorporated into the invertible block: the spatial attention module attends to the most informative components of the input, and the invertible downsampling module reduces the amount of computation while maintaining good performance. More details about these two modules are given in Sects. 3.2 and 3.4 below. After a total of 100 such blocks, invertible high-dimensional features are obtained through a concatenation operation. The invertible features are sent to an average pooling and a linear layer after a ReLU nonlinearity, and the results are quantized by the sign function to obtain pair-wise binary hash codes. A pairwise similarity loss is adopted for similarity-preserving learning in the Hamming space, and a quantization loss controls both the binarization error and the hash code quality. The invertible downsampling, spatial attention module, and invertible block are introduced in detail in the following sections.

Fig. 1. The framework of the proposed invertible spatial-attention hashing network.

3.2 Invertible Downsampling

To facilitate computation while avoiding irreversible modules, we introduce an invertible downsampling module into our architecture instead of the max-pooling used in [6]. It not only reduces the spatial resolution of the input for the sake of simplicity but also increases the number of channels so that no information is lost. As shown in Fig. 2, with a scaling factor \(\theta \) = 4, the number of output channels is 4 times that of the input, while the size of each feature map is reduced by a factor of 4. Invertible downsampling also roughly preserves the spatial ordering, thus avoiding mixing different neighborhoods in the following convolution. The invertible downsampling operation can be written as:

Fig. 2. The illustration of invertible downsampling.

$$\begin{aligned} T(\theta ,Fe(c,w,h)) =Fe(\theta \times c,w/(\theta /2),h/(\theta /2)) \end{aligned}$$
(1)

where \(\theta \) denotes the scaling factor, which directly determines the downsampled size, T is the downsampling operation, and Fe(c, w, h) denotes a feature map with c channels, width w, and height h.

To reduce computational costs, invertible downsampling is tightly integrated into our architecture: an invertible downsampling operator is applied at the beginning of the network and at depths d = 6, 22, 94.
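Equation (1) can be realized as a space-to-depth rearrangement, which is exactly invertible. Below is a minimal PyTorch sketch (our assumption; the paper does not name a specific implementation) in which \(\theta = 4\) quadruples the channels and halves each spatial dimension, and the inverse is the corresponding depth-to-space operation.

```python
import torch
import torch.nn as nn


class InvertibleDownsampling(nn.Module):
    """Sketch of T in Eq. (1) as a space-to-depth rearrangement (assumption)."""

    def __init__(self, theta: int = 4):
        super().__init__()
        stride = theta // 2                    # per-dimension reduction: w / (theta / 2)
        self.down = nn.PixelUnshuffle(stride)  # T: (c, w, h) -> (theta*c, w/2, h/2)
        self.up = nn.PixelShuffle(stride)      # T^{-1}: exact inverse rearrangement

    def forward(self, fe: torch.Tensor) -> torch.Tensor:
        return self.down(fe)

    def inverse(self, fe: torch.Tensor) -> torch.Tensor:
        return self.up(fe)


# Quick losslessness check on an image-sized tensor (3 x 224 x 224 -> 12 x 112 x 112).
x = torch.randn(1, 3, 224, 224)
t = InvertibleDownsampling(theta=4)
assert t(x).shape == (1, 12, 112, 112)
assert torch.equal(t.inverse(t(x)), x)
```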

3.3 Invertible Block

Fig. 3. The structure of invertible block.

The invertible block is an important component of our invertible hashing network. It not only determines the reversibility of the information flow, but also generates attended features without information loss. The spatial attention module (Sect. 3.4) and the invertible downsampling module (Sect. 3.2) are adopted in the invertible block. In particular, the spatial attention module mines the objective information, and the invertible downsampling module allows us to reduce the amount of computation while maintaining good performance. The structure of the invertible block is illustrated in Fig. 3.

In detail, the sublayers \((x_i,y_i)\) obtained by the splitting operation undergo two different operations. \(x_i\) is fed directly to an invertible downsampling layer with scaling factor \(\theta \) = 4, giving \(T(4,x_i)\). \(y_i\) is sent to a bottleneck block F consisting of a succession of 3 convolutional operators with filter sizes 1 \(\times \) 1, 3 \(\times \) 3, and 1 \(\times \) 1, where the second convolutional layer has four times fewer channels than the other two. The first and second convolutions are preceded by the spatial attention module, batch normalization (BN) and a ReLU non-linearity, while the last convolution layer is followed by batch normalization and a ReLU non-linearity only. The output \(F(y_i)\) is added to \(T(4,x_i)\) to obtain \(y_{i+1}\). Meanwhile, \(y_i\) is also fed to an invertible downsampling layer for convenient calculation, and \(x_{i+1}\) is set to the output \(T(4,y_i)\). In summary, the detailed operations are:

$$\begin{aligned} x_{i+1} = T(4,y_i) \end{aligned}$$
(2)
$$\begin{aligned} y_{i+1} = F(y_i) + T(4,x_i) \end{aligned}$$
(3)

and reverse propagation can be computed by the following:

$$\begin{aligned} y_i = T^{-1}(4,x_{i+1}) \end{aligned}$$
(4)
$$\begin{aligned} x_i = T^{-1}(4,(y_{i+1}-F(y_i))) \end{aligned}$$
(5)

where \(T^{-1}\) denotes the inverse of the T operation.
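Equations (2)–(5) form a coupling structure whose inverse can be computed without storing intermediate activations. The following PyTorch sketch shows one invertible block under our reading of the description above: the bottleneck channel widths and the use of an identity T for blocks without downsampling are assumptions, and `make_attn` stands in for the spatial attention module sketched in Sect. 3.4 (an identity here so the snippet is self-contained).

```python
import torch
import torch.nn as nn


def bottleneck_F(in_ch: int, out_ch: int, stride: int, make_attn=nn.Identity) -> nn.Sequential:
    # Bottleneck F: 1x1, 3x3, 1x1 convolutions, the middle one 4x narrower.
    # The first two convolutions are preceded by spatial attention, BN and ReLU;
    # the last is followed by BN and ReLU. Channel widths are our assumption.
    mid = max(out_ch // 4, 1)
    return nn.Sequential(
        make_attn(), nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
        make_attn(), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, mid, 3, padding=1, bias=False),
        nn.Conv2d(mid, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )


class InvertibleBlock(nn.Module):
    # One invertible block implementing Eqs. (2)-(5). When `downsample` is True,
    # T(4, .) is a space-to-depth rearrangement (channels x4, spatial /2);
    # otherwise T is assumed to be the identity.
    def __init__(self, in_ch: int, downsample: bool = False, make_attn=nn.Identity):
        super().__init__()
        self.T = nn.PixelUnshuffle(2) if downsample else nn.Identity()
        self.T_inv = nn.PixelShuffle(2) if downsample else nn.Identity()
        out_ch = 4 * in_ch if downsample else in_ch
        self.F = bottleneck_F(in_ch, out_ch, 2 if downsample else 1, make_attn)

    def forward(self, x, y):
        return self.T(y), self.F(y) + self.T(x)      # Eqs. (2) and (3)

    def inverse(self, x_next, y_next):
        y = self.T_inv(x_next)                        # Eq. (4)
        x = self.T_inv(y_next - self.F(y))            # Eq. (5)
        return x, y
```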

3.4 Spatial Attention

Fig. 4. Diagram of spatial attention module.

The spatial attention module aims to highlight the expression of key objects for image retrieval. It first learns a set of weight maps from the feature maps, assigning a larger weight to the informative region of each feature map and a smaller weight to the background region. The learned weight maps are then multiplied with the feature maps, so that the resulting feature maps focus on key objects and suppress background regions. More specifically, spatial attention tells which information to emphasize or suppress during feature transmission. As shown in Fig. 4, the feature maps are sent to a max-pooling and an average-pooling operation respectively, both of which are effective in highlighting informative regions. The two outputs are concatenated to generate a combined feature descriptor. Next, we apply a convolution layer followed by a sigmoid operation on this descriptor to obtain a spatial attention map \(SA(Fe) \in R^ {H\times W}\), which tells the information flow which parts to emphasize or suppress. In short, the operation is:

$$\begin{aligned} SA(Fe) = \sigma (g^{7\times 7}(Cat(Ap(Fe),Mp(Fe)))) \end{aligned}$$
(6)

where \(\sigma \) denotes the sigmoid operation, \(g^{7\times 7}\) denotes a convolution with a kernel size of \(7\times 7\), Cat is the concatenation operation along the channel axis, Ap(Fe) and Mp(Fe) respectively represent the average-pooling and max-pooling operations, and Fe is a brief expression for the feature map.
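A minimal PyTorch sketch of Eq. (6) follows; pooling over the channel axis (so that the map lies in \(R^{H\times W}\)) and re-weighting the input by the learned map follow the description above, while the exact layer configuration is our assumption.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Sketch of Eq. (6): SA(Fe) = sigmoid(conv7x7(cat(Ap(Fe), Mp(Fe))))."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, fe: torch.Tensor) -> torch.Tensor:
        ap = torch.mean(fe, dim=1, keepdim=True)       # Ap(Fe): average over channels
        mp, _ = torch.max(fe, dim=1, keepdim=True)     # Mp(Fe): max over channels
        sa = self.sigmoid(self.conv(torch.cat([ap, mp], dim=1)))  # SA(Fe), shape (N, 1, H, W)
        return fe * sa                                 # weight map multiplied by features
```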

3.5 Loss Function

In this paper, we focus on the supervised setting, utilizing label information. We can easily obtain a set of image pairs, where each pair (\(a_i,a_j\)) consists of two images \(a_i\) and \(a_j\) (\(j \ne i\)). Using the category information, we obtain the similarity \(s_{ij}\) of the image pair (\(a_i,a_j\)). Following [3, 33], the similarity information is constructed directly from the image labels: if two images \(a_i\) and \(a_j\) share at least one label, they are similar and \(s_{ij}=1\); otherwise, they are dissimilar and \(s_{ij}=0\).
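For multi-hot label vectors, this rule can be computed for a whole batch at once; the helper below is a hypothetical sketch, not part of the original method description.

```python
import torch


def pairwise_similarity(labels: torch.Tensor) -> torch.Tensor:
    # labels: (n, num_classes) binary multi-hot matrix.
    # s_ij = 1 if images i and j share at least one label, else 0.
    shared = labels.float() @ labels.float().t()   # number of shared labels per pair
    return (shared > 0).float()
```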

Intuitively, the desired hash codes should preserve the relative similarities of the image pairs. The corresponding optimization goal is to make the Hamming distance between two similar points as small as possible, and simultaneously make the Hamming distance between two dissimilar points as large as possible. To this end, we define a pairwise loss that has also been successfully applied in prior research [16], over the output binary codes \((b_i,b_j)\) corresponding to the training image pair (\(a_i,a_j\)):

$$\begin{aligned} l_1 = min(-(s_{ij}\beta _{ij}-log(1+e^{\beta _{ij}}))) \end{aligned}$$
(7)

where \(\beta _{ij} = \frac{1}{2} b^T_i b_j\) and \(s_{ij}\) denotes the similarity of the image pair (\(a_i,a_j\)). To pursue representative hash codes, we learn our invertible hashing network by minimizing the pairwise loss, which drives the network to acquire a strong capability of distinguishing images. Since the hash codes are discrete, we additionally adopt the following quantization loss for each image \(a_i\):

$$\begin{aligned} l_2 = ||b_i-u_i||^2_2 \end{aligned}$$
(8)

where \(u_i \in \mathbb {R}^{c\times 1} \), \(b_i \in \{-1,1\}^c\), and c represents the hash code length. Based on these two types of loss, we train our invertible hashing network with the following objective:

$$\begin{aligned} l = l_1 + l_2 = - \sum _{s_{ij} \in S}(s_{ij}\beta _{ij}-log(1+e^{\beta _{ij}})) + \lambda \sum _{i=1}^{n}||b_i-u_i||^2_2 \end{aligned}$$
(9)

where \(\lambda \) is the hyper-parameter; \(||.||_2\) denotes the \(l_2\) norm.
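A minimal PyTorch sketch of Eq. (9) is given below. Since \(b_i = \mathrm{sgn}(u_i)\) is not differentiable, the pairwise term is computed on the continuous outputs \(u_i\) here; this relaxation is our assumption and is consistent with the quantization term in Eq. (8), which drives \(u_i\) towards its binary code.

```python
import torch
import torch.nn.functional as F


def hashing_loss(u: torch.Tensor, s: torch.Tensor, lam: float) -> torch.Tensor:
    # u:   (n, c) continuous network outputs, one row per image.
    # s:   (n, n) pairwise similarity matrix with entries in {0, 1}.
    # lam: trade-off hyper-parameter (lambda in Eq. (9)).
    b = torch.sign(u)                              # binary codes b_i = sgn(u_i)
    beta = 0.5 * u @ u.t()                         # beta_ij (continuous relaxation)
    pairwise = -(s * beta - F.softplus(beta))      # -(s_ij*beta_ij - log(1 + e^{beta_ij}))
    quantization = (b - u).pow(2).sum(dim=1)       # ||b_i - u_i||_2^2
    return pairwise.sum() + lam * quantization.sum()
```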

4 Experiments

4.1 Dataset and Evaluation

We evaluate the proposed AIHN against several state-of-the-art hashing methods on two benchmark datasets.

  • CIFAR-10 is a single-label dataset with 60,000 images divided into 10 categories (6,000 images per class). Following [2, 33], we randomly select 100 images per class for training and 500 images per class for testing, and use the remaining 54,000 images as the database.

  • NUS-WIDE is a multi-label dataset containing 269,648 images from 81 categories. Following the experimental protocols in [3, 33], we randomly sample 5,000 images as test images and 10,000 images for training, and use the remaining images as the database. The mean average precision (MAP) is used to measure the accuracy of our method and the baselines.

Fig. 5. The convergence curve of our network on CIFAR-10 with 16-bit codes.

4.2 Implementation Detail

Network Detail. The proposed network is trained specifically for image hashing retrieval. The input image size is 3 \(\times \) 224 \(\times \) 224. After the invertible downsampling with \(\theta =4\), the feature size becomes 12 \(\times \) 112 \(\times \) 112. The splitting operation then produces two sublayers with equal channel dimension, and both sublayers pass through a total of 100 similar invertible blocks. Invertible downsampling operators are applied at blocks 6, 22 and 94. The spatial resolution of these layers is reduced by a factor of 4 while the number of channels increases to 48, 192, 768 and 3072, respectively; the corresponding spatial resolutions are 56 \(\times \) 56, 28 \(\times \) 28, 14 \(\times \) 14 and 7 \(\times \) 7. Finally, the obtained representation is spatially averaged and projected onto a one-dimensional vector after a ReLU nonlinearity, and the binary hash code is obtained by applying the sign function to this vector.

Training Detail. We randomly crop 224 \(\times \) 224 patches for training. The training batch size is set to 64 for each back-propagation. The network is trained end-to-end. The pairwise similarity loss and quantization loss are adopted concurrently on CIFAR-10 and NUS-WIDE, with random horizontal flips for data augmentation. SGD is used to optimize the network, with an initial learning rate of 0.05 that decays by a factor of 0.1 every 50 epochs. The hyper-parameter \(\lambda \) is chosen on a validation set and set to 10 for CIFAR-10 and 100 for NUS-WIDE. At test time, we rescale the image to \(256\times 256\) and take a center crop of size \(224\times 224\). The convergence curve of our network on CIFAR-10 with 16-bit codes is shown in Fig. 5. Experiments are performed on two NVIDIA Titan XP GPUs for training and testing. A sketch of the optimization schedule is given below.
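The following is a minimal PyTorch sketch of the schedule just described (SGD at 0.05, decayed by 0.1 every 50 epochs); the stand-in model, epoch count and momentum-free SGD are assumptions made only to keep the snippet self-contained.

```python
import torch
import torch.nn as nn

model = nn.Linear(3072, 16)          # stand-in for the hashing network (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

for epoch in range(150):             # epoch count is illustrative
    # ... one epoch over pairwise batches of size 64 goes here ...
    scheduler.step()
```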

4.3 Compare with State-of-the-Arts

We use the MAP metric to compare the retrieval performance of AIHN with classical and state-of-the-art methods: the supervised shallow methods ITQ-CCA [8], BRE [11], KSH [18] and SDH [20], and the supervised deep methods CNNH [22], DNNH [12], DHN [33] and HashNet [3]. For a fair comparison, all methods use identical training and test sets. We adopt MAP@5000 for evaluation on NUS-WIDE. For the shallow hashing methods, we use the 4096-dimensional \(DeCAF_7\) features [4] as image features; for the deep hashing methods, we use raw images as input. We adopt the AlexNet architecture for all deep hashing methods.

Table 1. The best MAPs for each category are shown in boldface. Here, the MAP value is calculated based on the top 5000 returned neighbors for NUS-WIDE dataset.

Experimental results are shown in Table 1. Our method AIHN achieves the best performance among all methods. Specifically, compared to ITQ-CCA, the best shallow hashing method using deep features as input, we achieve absolute boosts of 33.45% and 48% in average MAP over different bit lengths on CIFAR-10 and NUS-WIDE, respectively. Compared to the state-of-the-art deep hashing method HashNet, we achieve absolute boosts of 10.21% and 7.9% in average MAP on the two datasets, respectively. An interesting phenomenon is that the performance boost of AIHN over HashNet differs noticeably across the two datasets: the boost on NUS-WIDE is generally much larger than that on CIFAR-10, but as the code length increases, the MAP on CIFAR-10 rises markedly.

4.4 Ablation Experiment

To investigate the effectiveness of the two proposed components, we study two AIHN variants: (1) AIHN-AI removes the spatial attention module and replaces the invertible network with AlexNet, which may cause gradual information loss; (2) AIHN-A uses the invertible network for feature extraction, but no spatial attention module is adopted in the invertible blocks.

AIHN-A outperforms AIHN-AI by large margins of 10.58\(\%\), 8.69\(\%\), 10.71\(\%\) and 9\(\%\) in MAP for 16-, 32-, 48- and 64-bit codes on CIFAR-10, respectively. The invertible network guarantees that the final hash codes can be obtained from the input images without any information loss. AIHN outperforms AIHN-A by 0.75\(\%\), 1.48\(\%\), 0.03\(\%\) and 1.77\(\%\) in MAP for 16-, 32-, 48- and 64-bit codes on CIFAR-10, respectively. These results validate that the spatial attention module is effective and improves the MAP results, because it better captures the objective information. As shown in Table 2, our proposed AIHN achieves the highest MAP on the CIFAR-10 dataset. Further analysis shows that the invertible network, which guarantees lossless feature generation, contributes the most to our performance. This can be explained as follows: when learning image features, progressively discarding variability of the input image may discard effective information.

Table 2. Results of ablation study on CIFAR-10

5 Conclusion

In this paper, we propose a novel attention-aware invertible hashing network (AIHN) for image retrieval. Through invertible feature representations, the final hash codes can be obtained from the input images without any information loss, producing accurate hash codes that preserve complete image information. To highlight informative regions, we present a novel attention-aware invertible block as the basic module of AIHN, which promotes generalization ability through a spatial attention mechanism. Extensive experiments on benchmark datasets demonstrate the state-of-the-art performance of our method.