
1 Introduction

With the rapid development of the Internet, different types of media data, e.g., texts, images, and videos, are growing rapidly. Cross-modal retrieval, which takes one type of data as the query and returns relevant data of another type, is receiving increasing attention since it is a natural way to search multi-modal data. Existing methods can be roughly divided into two categories [33]: real-valued representation learning and binary representation learning. Because of the low storage cost and fast retrieval speed of binary representations, we focus only on cross-modal binary representation learning (i.e., hashing [17, 31]) in this paper.

Fig. 1. Attention-aware deep adversarial hashing. To learn the attention masks, we train the attention module and the hashing module in an adversarial way (II): (1) the hashing module learns to preserve the similarities of multi-modal data, while (2) the attention module attempts to generate attention masks that make the hashing module unable to preserve the similarities of the unattended features.

To date, various cross-modal hashing algorithms [3, 8, 15, 19, 36, 40, 41] have been proposed for embedding the correlations among different modalities of data. In the cross-modal hashing procedure, feature extraction is the first step for representing all modalities of data; these multi-modal features are then projected into a common Hamming space for future searches. Many methods [8, 40] use a shallow architecture for feature extraction. For example, collective matrix factorization hashing (CMFH) [8] and semantic correlation maximization (SCM) [40] use hand-crafted features. Recently, deep learning has also been adopted for cross-modal hashing due to its powerful ability to learn good representations of data. Representative work on deep-network-based cross-modal hashing includes deep cross-modal hashing (DCMH) [15], deep visual-semantic hashing (DVSH) [3], and pairwise relationship guided deep hashing (PRDH) [36].

In parallel, the computational model of “attention” has drawn much interest due to its impressive results in various applications, e.g., image captioning [34]. It is also desirable for cross-modal retrieval. For example, as shown in Fig. 1, given the query “girl sits on donkey”, if we can locate the more informative regions in the image (e.g., the black regions), a higher degree of accuracy can be obtained. To the best of our knowledge, the attention mechanism has not been well explored for cross-modal hashing.

In this paper, we propose an attention mechanism for cross-modal hashing. The model first decides where (i.e., which regions of the multi-modal data) it should attend to; the attended regions should then be favoured for retrieval. Based on this, an attention module is proposed to find the attended regions, and a hashing module is proposed to learn the similarity-preserving hash functions. In the attention module, an adaptive attention mask is generated for each data item, which divides the data into attended and unattended regions. Ideally, well-learned attention masks should locate the discriminative regions, which means that the unattended regions of the data are uninformative and their similarities are difficult to preserve. Hence, the attention module is trained to make the hashing module unable to preserve the similarities of the unattended regions. However, the learned hash functions should preserve the similarities of both the attended regions (which can be viewed as easy examples) and the unattended regions (hard examples) to enhance robustness and performance. Thus, the hashing module is trained to preserve the similarities of both the unattended and attended regions. Note that the attention module and the hashing module are trained in an adversarial way: the attention module attempts to find the unattended regions in which the hashing module fails to maintain the similarities, whereas the hashing module aims to preserve the similarities of the multi-modal data.

The proposed deep adversarial hashing network for cross-modal retrieval is illustrated in Fig. 2. It consists of three major components: (1) a feature learning module that uses a CNN or an MLP to extract high-level semantic representations of the multi-modal data; (2) an attention module that generates adaptive attention masks and divides the feature representations into attended and unattended feature representations; and (3) a hashing module that learns the binary codes for the multi-modal data. An adversarial retrieval loss and a cross-modal retrieval loss are proposed to obtain good attention masks and powerful hash functions.

The main contributions of our work are three-fold. First, we propose an attention-aware method for the cross-modal hashing problem. It is able to detect the informative regions of multi-modal data, which helps to identify content similarities between different modalities of data. Second, we propose a deep adversarial hashing network for learning effective attention masks and compact binary codes simultaneously. Third, we quantitatively evaluate the usefulness of attention in cross-modal hashing, and our method yields better performance than several state-of-the-art methods.

2 Related Work

2.1 Cross-Modal Hashing

According to the information used for learning the common representations, cross-modal hashing methods can be categorized into three groups [33]: (1) unsupervised methods [29], (2) pairwise-based methods [21, 41] and (3) supervised methods [4, 39]. The unsupervised methods use only co-occurrence information to learn hash functions for multi-modal data. For instance, cross-view hashing (CVH) [27] extends spectral hashing from the uni-modal to the multi-modal scenario. The pairwise-based methods use both the co-occurrence information and similar/dissimilar pairs to learn the hash functions. Bronstein et al. [11] proposed cross-modal similarity sensitive hashing (CMSSH), which learns hash functions to ensure that if two samples (from different modalities) are relevant/irrelevant, their corresponding binary codes are similar/dissimilar. The supervised methods exploit label information to learn more discriminative common representations. Semantic correlation maximization (SCM) [40] uses a label vector to obtain the similarity matrix and reconstructs it through the binary codes. Xu et al. [35] proposed discrete cross-modal hashing (DCH), which directly learns discriminative binary codes under the discrete constraints. Most of these works are based on hand-crafted features.

Deep neural networks have been shown to effectively discover the correlations across different modalities. Deep cross-modal hashing (DCMH) [15] integrates feature learning and hash-code learning into the same framework. Cao et al. [3] proposed deep visual-semantic hashing (DVSH), which utilizes a convolutional neural network (CNN) and long short-term memory (LSTM) to separately learn the common representations for each modality. Pairwise relationship guided deep hashing (PRDH) [36] also adopts deep CNN models to learn feature representations and hash codes simultaneously.

2.2 Generative Adversarial Network

Recently, generative adversarial networks (GANs) [10] have received much attention and achieved impressive results in various applications, including image-to-image translation [42], image generation [1, 23] and representation learning [22, 24]. GANs have also been applied to the retrieval problem. IRGAN [32] is a recently proposed method for information retrieval, in which the generative retrieval model focuses on predicting relevant documents and the discriminative retrieval model focuses on predicting relevancy given a query-document pair. IRGAN is designed for uni-modal retrieval, whereas we focus on cross-modal retrieval in this paper.

Very recently, Wang et al. [28] presented an adversarial cross-modal retrieval (ACMR) method that seeks an effective common subspace based on adversarial learning: a modality classifier distinguishes the samples in terms of their modalities, and a feature projector generates modality-invariant representations that confuse the modality classifier. Both ACMR and the proposed method use adversarial learning; the main difference is that ACMR seeks to learn a common subspace for the multi-modal data, while the adversarial learning in the proposed method is tailored to the attention-aware networks for cross-modal hashing. In addition, ACMR falls into the category of real-valued approaches, while our method belongs to the binary approaches. Further, Li et al. [18] presented self-supervised adversarial hashing (SSAH) for cross-modal retrieval.

To the best of our knowledge, the attention mechanism has not been well explored for cross-modal hashing. Attention has proven to be very powerful in many applications, such as image classification [2], image captioning [34], image question answering [38] and video action recognition [25]. Inspired by these successes, we carefully design an attention-aware deep adversarial hashing network for cross-modal hashing in this paper.

Fig. 2. Overview of our method. Above is the image modality branch, and below is the text modality branch. Each branch is divided into three parts: the feature learning module (including \(E^I\) and \(E^T\)), the attention module (\(G^I\) and \(G^T\)) and the hashing module (\(D^I\) and \(D^T\)). The feature learning module maps the input multi-modal data into high-level feature representations. Then, the attention module learns the attention masks to divide the feature representations into attended and unattended features. Finally, the hashing module encodes all features into binary codes and learns similarity-preserving hash functions. We train the attention module and the hashing module alternately.

3 Deep Adversarial Hashing for Cross-Modal Retrieval

3.1 Problem Definition

Suppose there are \(n\) training samples, each of which is represented in several modalities, e.g., audio, video, image, and text. In this paper, we focus only on two modalities, text and image; note that our method can be easily extended to other modalities. We denote the training data as \(\{I_i,T_i \}_{i=1}^n\), where \(I_i\) is the i-th image and \(T_i\) is the corresponding text description of image \(I_i\). We also have a cross-modal similarity matrix \(S\), where \(S(i,j) = 1\) means that the i-th image and the j-th text are similar, while \(S(i,j) = 0\) means that they are dissimilar. The goal of cross-modal hashing is to learn two mapping functions that transform images and texts into a common binary code space in which the similarities between paired images and texts are preserved. For instance, if \(S(i,j) = 1\), the Hamming distance between the generated binary codes of the i-th image and the j-th text should be small; when \(S(i,j) = 0\), the Hamming distance between them should be large.
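To make the setup concrete, the following is a minimal NumPy sketch (not the authors' code) of how the similarity matrix \(S\) can be built under the common assumption that an image-text pair is similar when it shares at least one semantic label, together with the Hamming distance used to compare binary codes in \(\{-1,+1\}^q\).

```python
# Minimal sketch of the problem setup; the rule "similar iff at least one
# shared label" is an assumption commonly used for these benchmarks, the
# paper only requires that S(i, j) in {0, 1} be given.
import numpy as np

def build_similarity_matrix(image_labels: np.ndarray, text_labels: np.ndarray) -> np.ndarray:
    """image_labels, text_labels: (n, num_classes) binary multi-label matrices."""
    shared = image_labels @ text_labels.T          # number of shared labels per pair
    return (shared > 0).astype(np.float32)         # S(i, j) = 1 iff any label is shared

def hamming_distance(b1: np.ndarray, b2: np.ndarray) -> int:
    """b1, b2: binary codes in {-1, +1}^q."""
    return int((b1 != b2).sum())                   # number of differing bits
```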

3.2 Network Architecture

The proposed deep adversarial hashing network contains three components: (1) the feature learning module to obtain the high-level representations of the multi-modal data; (2) the attention module to generate the attention masks; and (3) the hashing module to learn the similarity-preserving hash functions.

Feature Learning Module: \(E^I\) and \(E^T\). For the image modality, a convolutional neural network is used to obtain the high-level representations of the images. Specifically, we use VGGNet [26] to extract the image feature maps, i.e., conv5_4 in VGGNet. To represent text instances, we use the well-known bag-of-words (BOW) representation. We then utilize a two-layer feed-forward neural network (BOW \(\rightarrow \) 8192 \(\rightarrow \) 1000) to obtain the semantic text features. Let \(f^{I}_i = E^I(I_i)\) and \(f^{T}_i = E^T(T_i)\) denote the image feature maps and the text feature vector, respectively.
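As a reference point, the sketch below shows one way to assemble the two encoders in PyTorch (the paper uses Caffe); the truncation point of VGG and the hidden dimensions follow the description above, while everything else is an assumption.

```python
# A minimal sketch of the feature learning modules E^I and E^T: VGG conv5_4
# feature maps for images and a BOW -> 8192 -> 1000 MLP for text.
import torch
import torch.nn as nn
import torchvision

class ImageEncoder(nn.Module):          # E^I
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1")  # pre-trained on ImageNet
        # Drop the final max-pool so the output is the conv5_4 feature map
        # (batch x 512 x 14 x 14 for 224 x 224 inputs).
        self.features = nn.Sequential(*list(vgg.features.children())[:-1])

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.features(images)

class TextEncoder(nn.Module):           # E^T
    def __init__(self, bow_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(bow_dim, 8192), nn.ReLU(inplace=True),
            nn.Linear(8192, 1000), nn.ReLU(inplace=True),
        )

    def forward(self, bow: torch.Tensor) -> torch.Tensor:
        return self.mlp(bow)
```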

Fig. 3. The attention module. It first generates the attention masks \(Z^I\) and \(Z^T\). Then, each feature is divided into two parts: the attended and the unattended features.

Attention Module: \(G^I\) and \(G^T\). With the powerful image feature maps \(f^{I}\) and the text feature vector \(f^{T}\), we first feed them into a one-layer neural network, i.e., a convolutional layer with a \(1 \times 1\) kernel size for image feature maps and a fully connected layer for the text feature vector, followed by softmax and threshold functions to generate the attention distribution over the regions of the multi-modal data. Then, the attention masks are used to divide the feature representations into the attended and unattended feature representations.

More specifically, the detailed pipeline for processing the image modality is shown on the left side of Fig. 3. Suppose \(f^I_i \in \mathbb {R}^{H \times W \times C}\) represents the feature maps of the i-th image, where H, W and C are the height, width and number of channels, respectively. In the first step, we use a convolutional layer to compress the feature maps \(f^I_i\) into a matrix \(M_i^I = Conv(f^I_i)\), where \(M_i^I \in \mathbb {R}^{H \times W}\). In the second step, the matrix \(M_i^I\) goes through a softmax layer, whose output is the probability matrix \(P_i^I\). In the third step, we add a threshold layer to obtain the attention mask, which is defined as

$$\begin{aligned} Z_i^I(h,w)= {\left\{ \begin{array}{ll} 1, &{} P_i^I(h,w) \ge \alpha \\ 0, &{} P_i^I(h,w) < \alpha , \end{array}\right. } \end{aligned}$$
(1)

where \(\alpha \) is a predefined threshold. We set \(\alpha =\frac{1}{H{\times }W}\) in our experiments. The output of the threshold layer is a binary mask. Based on this binary mask, we can calculate the attended and unattended feature maps of the i-th image by element-wise multiplication with the mask, which is formulated as

$$\begin{aligned} \begin{aligned}&\breve{f}^I_i (h,w,c) = Z^I_i(h,w) \times f^I_i (h,w,c), \ \ \mathbf {(attended)}&\\&\hat{f}^I_i (h,w,c) = \Big (1 - Z^I_i(h,w) \Big ) \times f^I_i (h,w,c), \ \ \mathbf {(unattended)}&\\ \end{aligned} \end{aligned}$$
(2)

for all h, w and c. For ease of presentation, we denote the whole procedure as \([\breve{f}^I_i, \hat{f}^I_i ]= G^I (f^I_i)\).
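The image branch of the attention module can be summarized by the following PyTorch sketch of Eqs. (1)-(2) (an illustration under our own naming, not the authors' code); note that the hard threshold blocks gradients, which the straight-through estimator described later in this section addresses.

```python
# Sketch of G^I: 1x1 convolution -> spatial softmax -> threshold at
# alpha = 1/(H*W) -> attended / unattended feature maps (Eqs. (1)-(2)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageAttention(nn.Module):
    def __init__(self, channels: int = 512):
        super().__init__()
        self.compress = nn.Conv2d(channels, 1, kernel_size=1)    # f^I -> M^I

    def forward(self, fmap: torch.Tensor):
        """fmap: (B, C, H, W) image feature maps f^I."""
        b, c, h, w = fmap.shape
        m = self.compress(fmap).view(b, -1)                      # (B, H*W)
        p = F.softmax(m, dim=1).view(b, 1, h, w)                 # probability map P^I
        alpha = 1.0 / (h * w)                                    # threshold of Eq. (1)
        z = (p >= alpha).float()                                 # binary mask Z^I
        attended = z * fmap                                      # Eq. (2), broadcast over channels
        unattended = (1.0 - z) * fmap
        return attended, unattended, z
```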

For the text modality, we imitate the pipeline of the image modality, as shown on the right side of Fig. 3:

$$\begin{aligned} \begin{aligned}&M^T_i = \text {fc}(f^{T}_i),&\\&P^T_i = \text {softmax}(M^T_i),&\\&Z^T_i = \text {threshold}(P^T_i),&\\&\breve{f}^T_i (j) = Z^T_i (j) \times f^T_i (j),\ \ \mathbf {(attended)}&\\&\hat{f}^T_i (j) = \Big (1 - Z^T_i (j) \Big ) \times f^T_i(j), \ \ \mathbf {(unattended)}&\end{aligned} \end{aligned}$$
(3)

where fc is a fully connected layer, and Z(j) is the j-th value of the vector Z. We denote \([\breve{f}^T_i,\hat{f}^T_i] = G^T(f^T_i)\) as the attended and unattended features for the i-th text.

Fig. 4. The hashing modules for the image modality (\(D^I\)) and the text modality (\(D^T\)).

The threshold function is non-differentiable, which is incompatible with back-propagation during training. To address this issue, we follow the idea proposed in [7] and use the straight-through estimator to propagate gradients through the threshold function.
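A minimal sketch of such a straight-through estimator is given below (an illustration of the idea in [7], not the exact implementation): the forward pass applies the hard threshold of Eq. (1), while the backward pass copies the incoming gradient through unchanged.

```python
# Straight-through estimator for the threshold layer: hard threshold in the
# forward pass, identity gradient in the backward pass.
import torch

class ThresholdSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, p: torch.Tensor, alpha: float) -> torch.Tensor:
        return (p >= alpha).float()              # binary mask Z

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        return grad_output, None                 # pass the gradient straight through

def threshold_ste(p: torch.Tensor, alpha: float) -> torch.Tensor:
    return ThresholdSTE.apply(p, alpha)
```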

Hashing Module: \(D^I\) and \(D^T\). For the image modality, since we adopt VGGNet as our basic architecture, we also use its last fully connected layers, i.e., fc6 and fc7. Then, we add a fully connected layer with q-dimensional output and a tanh layer that restricts the values to the range \([-1,1]\), as shown on the left side of Fig. 4. Let the outputs of the hashing module be (1) the attended features \(H^I_i = D^I(\breve{f}^I_i)\) and (2) the unattended features \(\hat{H}^I_i = D^I(\hat{f}^I_i)\).

For the text modality, we also add a fully connected layer and a tanh layer to encode the text features into q bits, as shown on the right side of Fig. 4. The outputs are (1) the attended features \(H^T_i = D^T(\breve{f}^T_i)\) and (2) the unattended features \(\hat{H}^T_i = D^T(\hat{f}^T_i)\).
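For concreteness, a hedged PyTorch sketch of the two hashing branches is shown below; the fc6/fc7 sizes (4096) and the flattened input dimension are assumptions taken from the standard VGG architecture and the conv5_4 feature-map size, and only the final q-dimensional tanh layer is dictated by the description above.

```python
# Sketch of D^I and D^T: fully connected layers followed by a q-dimensional
# tanh output so that the relaxed codes lie in [-1, 1].
import torch
import torch.nn as nn

class ImageHashing(nn.Module):                       # D^I
    def __init__(self, q: int, in_dim: int = 512 * 14 * 14):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, 4096), nn.ReLU(inplace=True),   # fc6
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),     # fc7
            nn.Linear(4096, q), nn.Tanh(),                    # q-bit relaxed code
        )

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        return self.net(fmap)

class TextHashing(nn.Module):                        # D^T
    def __init__(self, q: int, in_dim: int = 1000):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, q), nn.Tanh())

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.net(feat)
```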

3.3 Hashing Objectives

Our objective contains two terms: (1) the cross-modal retrieval loss, which preserves the similarities between different modalities of data, and (2) the adversarial retrieval loss, in which the hashing module aims to preserve the similarities of the unattended binary codes while the attention module tries to make the hashing module fail to maintain them.

Cross-modal Retrieval Loss. The aim of the cross-modal loss function is to preserve the similarities between images and texts. The inter-modal ranking loss and the intra-modal ranking loss are used to preserve the similarities: the hash codes from different modalities should preserve the semantic similarities, and the hash codes from the same modality should also preserve the semantic similarities. Hence, the cross-modal retrieval loss can be formulated as

$$\begin{aligned} \min \mathcal {F}_{T \rightarrow I} + \mathcal {F}_{I \rightarrow T} + \mathcal {F}_{I \rightarrow I} + \mathcal {F}_{T \rightarrow T}, \end{aligned}$$
(4)

where the first two terms preserve the semantic similarities between different modalities, and the last two terms preserve the similarities within each modality. The symbol \(A \rightarrow B\) denotes that the A modality is taken as the query to retrieve the relevant data of the B modality, where \(A \in \{T, I\}\) and \(B \in \{T, I\}\). \(\mathcal {F}_{A \rightarrow B}\) is the loss function with the A modality as the query and the B modality as the database, which is defined as

$$\begin{aligned} \begin{aligned} \mathcal {F}_{A \rightarrow B}&= \sum _{\langle i, j, k \rangle } \max \{0, \varepsilon + ||H^A_i - H^B_j|| - ||H^A_i - H^B_k||\} \\ s.t.~~~~~~~&\forall \langle i, j, k \rangle , \ S(i,j) > S(i,k), \end{aligned} \end{aligned}$$
(5)

where \(\langle i, j, k \rangle \) denotes a triplet and \(\varepsilon \) is the margin. The objective is the triplet ranking loss [16], which has been shown to be effective for retrieval problems.
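A minimal sketch of Eq. (5) for a mini-batch of sampled triplets is shown below (distances are Euclidean on the relaxed codes; the batching and sampling strategy are assumptions). The four terms of Eq. (4) are obtained by calling this function with the corresponding query/database codes.

```python
# Triplet ranking loss of Eq. (5): the query should be closer to the
# positive item than to the negative item by at least the margin.
import torch

def triplet_ranking_loss(h_query: torch.Tensor,
                         h_pos: torch.Tensor,
                         h_neg: torch.Tensor,
                         margin: float) -> torch.Tensor:
    """Each input has shape (num_triplets, q); rows are aligned triplets <i, j, k>."""
    d_pos = torch.norm(h_query - h_pos, dim=1)   # ||H^A_i - H^B_j||
    d_neg = torch.norm(h_query - h_neg, dim=1)   # ||H^A_i - H^B_k||
    return torch.clamp(margin + d_pos - d_neg, min=0.0).sum()
```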

Adversarial Retrieval Loss. Inspired by the impressive results of generative adversarial networks, we adopt adversarial training to generate the attention distributions and learn the binary codes. Take text \(\rightarrow \) image as an example, as shown in Fig. 1. Given a query \(H_i^T\), the hashing module and the attention module are trained in an adversarial way: (1) the hashing module preserves the semantic similarity between the query and the unattended features of the image modality, that is, \(H_i^T\) is closer to \(\hat{H}^I_j\) than to \(\hat{H}^I_k\) when \(S(i,j) > S(i,k)\); (2) the attention module tries to find the unattended regions of the images in which the hashing module fails to preserve the similarities, that is, \(H_i^T\) is closer to \(\hat{H}^I_k\) than to \(\hat{H}^I_j\). The objective can be defined as \(\mathcal {F}_{T \rightarrow \hat{I}} = \sum _{\langle i, j, k \rangle } \max \{0, \varepsilon + ||H^T_i - \hat{H}^I_j|| - ||H^T_i - \hat{H}^I_k||\}\). The hashing module tries to minimize this objective, while the attention module tries to maximize it. The same process applies to image \(\rightarrow \) text. Thus, the loss can be expressed as

$$\begin{aligned} \begin{aligned} \mathcal {F}_{T \rightarrow \hat{I}} + \mathcal {F}_{I \rightarrow \hat{T}} =&\sum _{\langle i, j, k \rangle } \max \{0, \varepsilon + ||H^T_i - \hat{H}^I_j|| - ||H^T_i - \hat{H}^I_k||\} \\&+\sum _{\langle i, j, k \rangle } \max \{0, \varepsilon + ||H^I_i - \hat{H}^T_j|| - ||H^I_i - \hat{H}^T_k||\}, \end{aligned} \end{aligned}$$
(6)

where \(\hat{H}^T\) and \(\hat{H}^I\) are the unattended features defined in Sect. 3.2. The first term corresponds to taking the text modality as the query to retrieve the unattended features of the image modality; the second term corresponds to taking the image modality as the query to retrieve the unattended features of the text modality. \(G^I\) and \(G^T\) attempt to maximize this loss, while \(D^I\) and \(D^T\) attempt to minimize it:

$$\begin{aligned} \min _{D^I,D^T} \max _{G^I,G^T} \mathcal {F}_{T \rightarrow \hat{I}} + \mathcal {F}_{I \rightarrow \hat{T}}. \end{aligned}$$
(7)
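Reusing the triplet loss sketched above, Eq. (6) can be computed as follows (a hedged sketch; queries use the attended codes \(H\) and the retrieved side uses the unattended codes \(\hat{H}\)):

```python
# Adversarial retrieval loss of Eq. (6): D^I, D^T minimize it (Eq. (7)),
# while G^I, G^T maximize it. triplet_ranking_loss is the function sketched above.
def adversarial_retrieval_loss(h_txt, h_img_hat_pos, h_img_hat_neg,
                               h_img, h_txt_hat_pos, h_txt_hat_neg,
                               margin: float):
    loss_t2i = triplet_ranking_loss(h_txt, h_img_hat_pos, h_img_hat_neg, margin)  # F_{T -> I_hat}
    loss_i2t = triplet_ranking_loss(h_img, h_txt_hat_pos, h_txt_hat_neg, margin)  # F_{I -> T_hat}
    return loss_t2i + loss_i2t
```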

Full Objective. Our full objective is

$$\begin{aligned} \mathcal {F}(E^I,E^T,G^I,G^T,D^I,D^T) = \ &\mathcal {F}_{T \rightarrow \hat{I}} + \mathcal {F}_{I \rightarrow \hat{T}} \\ &+ \mathcal {F}_{T \rightarrow I} + \mathcal {F}_{I \rightarrow T} + \mathcal {F}_{I \rightarrow I} + \mathcal {F}_{T \rightarrow T}. \end{aligned}$$

We train our model alternately. First, the parameters of \(G^I\) and \(G^T\) are fixed while the other parameters are updated:

$$\begin{aligned} \min _{E^I,E^T,D^I,D^T} \mathcal {F}(E^I,E^T,G^I,G^T,D^I,D^T). \end{aligned}$$
(8)

Then, \(E^I, E^T, D^I\) and \(D^T\) are fixed and the attention modules are updated:

$$\begin{aligned} \max _{G^I,G^T} \mathcal {F}_{T \rightarrow \hat{I}} + \mathcal {F}_{I \rightarrow \hat{T}}. \end{aligned}$$
(9)
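The alternating optimization of Eqs. (8)-(9) can be sketched as the following training step (hedged: compute_full_objective and compute_adversarial_loss are hypothetical helpers standing for Eqs. (4)+(6) and Eq. (6), respectively; in our experiments the encoders and hashing modules are updated for several steps per attention update, see Sect. 4.1).

```python
# One alternating training step: Eq. (8) updates E^I, E^T, D^I, D^T;
# Eq. (9) updates G^I, G^T by gradient ascent on the adversarial loss.
def train_step(batch, models, opt_ED, opt_G, margin):
    # --- Eq. (8): attention modules fixed, update encoders and hashing modules ---
    full_loss = compute_full_objective(batch, models, margin)     # hypothetical helper: Eqs. (4) + (6)
    opt_ED.zero_grad()
    full_loss.backward()
    opt_ED.step()                 # opt_ED only holds E/D parameters, so G stays fixed

    # --- Eq. (9): encoders and hashing modules fixed, update attention modules ---
    adv_loss = compute_adversarial_loss(batch, models, margin)    # hypothetical helper: Eq. (6)
    opt_G.zero_grad()
    (-adv_loss).backward()        # maximize by descending the negated loss
    opt_G.step()
```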

4 Experiments

In this section, we evaluate the performance of the proposed method on three datasets and compare it with several state-of-the-art algorithms.

4.1 Experimental Settings

Datasets. We choose three benchmark datasets: IAPR TC-12 [9], MIR-Flickr 25K [13] and NUS-WIDE [6].

  • IAPR TC-12 [9]: This dataset consists of 20,000 images taken from locations around the world. Each image is associated with a text caption, e.g., a sentence. The image-text pairs are annotated using 255 labels. For the text modality, each sentence is represented as a 2,912-dimensional bag-of-words vector.

  • MIR-Flickr 25K [13]: This dataset contains 25,000 multi-label images downloaded from the Flickr photo-sharing website. Each image is associated with several textual tags. For a fair comparison, we follow the settings in [15] and use the subset of image-text pairs with at least 20 textual tags. For the text modality, the textual tags are represented as a 1,386-dimensional bag-of-words vector.

  • NUS-WIDE [6]: This dataset consists of 269,648 images collected from Flickr. Each image is associated with one or multiple textual tags from 81 semantic concepts. We evaluate the performance on 195,834 image-text pairs belonging to the 21 most frequent labels, as suggested by [15]. The text is represented as a 1,000-dimensional bag-of-words vector.

We follow the settings of DCMH [15] to construct the query sets, training sets, and retrieval databases. We randomly sample 2,000 image-text pairs as the query set for IAPR TC-12 and MIR-Flickr 25K, and 2,100 image-text pairs as the query set for NUS-WIDE. For all datasets, the remaining image-text pairs are used as the retrieval databases. For all supervised methods, we sample 10,000 pairs from the retrieval set as the training set for IAPR TC-12 and MIR-Flickr 25K, and 10,500 pairs from the retrieval set as the training set for NUS-WIDE.

Note that the representations of text are not the focus of this paper. Since the most related works, e.g., DCMH [15], use bag-of-words, we also use bag-of-words for a fair comparison.

Implementation Details. We implement our code based on the open-source Caffe [14] framework. In training, the networks are updated alternately using the Adam stochastic gradient solver (\(\alpha =0.0002\), \(\beta _{1}=0.5\)). We alternate between four steps of optimizing E and D and one step of optimizing G. For the image modality, the weights of VGGNet are initialized with the model pre-trained on the ImageNet dataset. For the text modality, all parameters are randomly initialized from a Gaussian distribution with zero mean and standard deviation 0.01. The batch size is 64, and the total number of epochs is 100. The base learning rate is 0.005, and it is reduced to one-tenth of its current value after every 20 epochs. In testing, we use only the attended features of the data to construct the binary codes.
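For reference, a PyTorch-style sketch of the optimizer setup described above is given below (the paper uses Caffe; how the Adam parameters and the base learning-rate schedule map onto the two parameter groups is our assumption, and E_I, E_T, G_I, G_T, D_I, D_T refer to the hypothetical modules sketched earlier).

```python
# Two parameter groups, updated alternately (roughly four E/D steps per G step).
import itertools
import torch

params_ED = itertools.chain(E_I.parameters(), E_T.parameters(),
                            D_I.parameters(), D_T.parameters())
params_G = itertools.chain(G_I.parameters(), G_T.parameters())

opt_ED = torch.optim.Adam(params_ED, lr=2e-4, betas=(0.5, 0.999))
opt_G = torch.optim.Adam(params_G, lr=2e-4, betas=(0.5, 0.999))

# Decay the learning rate to one-tenth every 20 epochs.
sched_ED = torch.optim.lr_scheduler.StepLR(opt_ED, step_size=20, gamma=0.1)
sched_G = torch.optim.lr_scheduler.StepLR(opt_G, step_size=20, gamma=0.1)
```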

Evaluation Measures. To evaluate the performance of hashing models, we use two metrics: the mean average precision (MAP) [20] and precision-recall curves. The MAP is a standard evaluation metric for information retrieval.
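As a reference, the following is a standard way to compute MAP over a Hamming ranking for codes in \(\{-1,+1\}^q\) (a sketch of the conventional definition; whether the ranking is truncated to a top-R list follows the cited protocol and is not shown here).

```python
# Mean average precision (MAP) over Hamming ranking: for each query, rank
# database items by Hamming distance and average the precision at each
# relevant position.
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance):
    """query_codes: (nq, q) in {-1,+1}; db_codes: (nd, q); relevance: (nq, nd) in {0,1}."""
    q = db_codes.shape[1]
    aps = []
    for i in range(query_codes.shape[0]):
        dist = 0.5 * (q - db_codes @ query_codes[i])    # Hamming distance from inner products
        order = np.argsort(dist)                        # ascending distance = descending relevance
        rel = relevance[i][order]
        if rel.sum() == 0:
            continue
        precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((precision_at_k * rel).sum() / rel.sum())
    return float(np.mean(aps))
```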

Fig. 5. Precision-recall curves on the three datasets. The hash code length is 16 bits.

4.2 Comparison with State-of-the-Art Methods

The first set of experiments evaluates the performance of the proposed method and compares it with several state-of-the-art algorithms: CCA [12], CMFH [8], SCM [40], STMH [30], SePH [19], DCMH [15], and PRDH [37]. The results of CCA, CMFH, SCM, STMH, SePH and DCMH are directly cited from [15], published at CVPR 2017. Since the experimental settings of PRDH in [37] differ from those of the proposed method, we carefully re-implement PRDH using the same CNN network and the same settings for a fair comparison.

The comparison results in terms of search accuracy on all three datasets are shown in Table 1. Our method outperforms the other baselines and achieves excellent performance. For example, on IAPR TC-12 with 64 bits, the MAP of our method is 0.5439, compared with 0.5135 for the second-best algorithm (PRDH), when taking the image as the query to retrieve text. The precision-recall curves are shown in Fig. 5, where our method also shows comparable performance to the existing baselines.

Table 1. MAP of Hamming ranking w.r.t. different numbers of bits on three datasets.

Since the code of DVSH is not publicly available and it is difficult to re-implement this complex algorithm, we adopt the same experimental settings used in DVSH for our method, and the results of DVSH are directly cited from [3] for a fair comparison. The top-500 MAP results on IAPR TC-12 are shown in Table 2. Moreover, we compare with DCMH under the same settings. Please note that DVSH adopts an LSTM recurrent neural network for text representation, while DCMH and our method only use bag-of-words. From the table, we can see that our method achieves better performance than the baselines in most cases, even though we use the weaker text representation.

Table 2. The comparison results w.r.t. the top-500 MAP on the IAPR TC-12 dataset.

We also explore the effect of a smaller network architecture in the feature learning module for the image modality, since VGGNet is a large deep network. In this experiment, we select CNN-F [5] as the basic network for the image modality. The comparison results are shown in Table 3. VGGNet performs better than CNN-F, while our method with CNN-F still achieves good performance compared with the other state-of-the-art baselines.

Fig. 6. Some image and mask samples. The first row shows the original images, the second row shows the masks, and the last two rows show their combinations.

Fig. 7. Different attention mechanisms.

Table 3. MAP on IAPR TC-12 dataset with different networks.

The main reason for the good performance of our method is that it obtains the attended regions of the multi-modal data. Figure 6 shows some examples for the image modality. Note that it is difficult to visualize the text modality (the text network consists of fully connected layers rather than a CNN, so the order of the words in a sentence is not preserved after passing through these layers); thus, we do not show the masks learned by the text network.

4.3 Comparison with Different Attention Mechanisms

In this section, we present an ablation study to clarify the impact of each part of the attention modules on the final performance.

To provide an intuitive comparison, we compare our method with the following baselines. The first baseline does not use any attention mechanism, as shown on the left side of Fig. 7; this corresponds to traditional deep cross-modal hashing. The second baseline applies only the visual attention mechanism, as shown in the middle of Fig. 7. Similarly, the last baseline explores only the textual attention mechanism, as shown on the right side of Fig. 7. Note that all baselines, as well as our method, use the same network; the only differences lie in the attention mechanisms used. These comparisons show whether the proposed attention mechanism contributes to the accuracy.

Table 4 shows the comparison results with respect to MAP. The results show that our proposed attention mechanism achieves better performance than the baselines with no or only partial attention. The main reason is that our method can focus on the most discriminative regions of the data.

Table 4. The comparison results for different attention mechanisms.

5 Conclusion

In this paper, we proposed a novel approach called deep adversarial hashing for cross-modal retrieval. The proposed method contains three major components: a feature learning module, an attention module, and a hashing module. The feature learning module learns powerful representations of the multi-modal data. The attention module and the hashing module are trained in an adversarial way: the hashing module tries to minimize the similarity-preserving loss functions, while the attention module aims to find the unattended regions of the data that maximize the retrieval loss. We evaluated our method on three datasets, and the experimental results demonstrate its appealing performance.