
1 Introduction

With the rapid development of the Internet, different types of media data, e.g., texts, images, and videos, are growing rapidly. Cross-modal retrieval, which takes one type of data as the query and returns relevant data of another type, is receiving increasing attention since it is a natural way to search multi-modal data. Existing methods can be roughly divided into two categories [33]: real-valued representation learning and binary representation learning. Because of the low storage cost and fast retrieval speed of binary representations, we focus only on cross-modal binary representation learning (i.e., hashing [17, 31]) in this paper.

Fig. 1. Attention-aware deep adversarial hashing. To learn the attention masks, we train the attention module and the hashing module in an adversarial way (II): (1) the hashing module learns to preserve the similarities of multi-modal data, while (2) the attention module attempts to generate attention masks that make the hashing module unable to preserve the similarities of the unattended features.

To date, various cross-modal hashing algorithms [3, 8, 15, 19, 36, 40, 41] have been proposed for embedding the correlations among different modalities of data. In the cross-modal hashing procedure, feature extraction is the first step for representing all modalities of data; these multi-modal features are then projected into a common Hamming space for future searches. Many methods [8, 40] use a shallow architecture for feature extraction. For example, collective matrix factorization hashing (CMFH) [8] and semantic correlation maximization (SCM) [40] use hand-crafted features. Recently, deep learning has also been adopted for cross-modal hashing due to its powerful ability to learn good representations of data. Representative work on deep-network-based cross-modal hashing includes deep cross-modal hashing (DCMH) [15], deep visual-semantic hashing (DVSH) [3], and pairwise relationship guided deep hashing (PRDH) [36].

In parallel, the computational model of “attention” has drawn much interest due to its impressive results in various applications, e.g., image captioning [34]. It is also desirable for cross-modal retrieval. For example, as shown in Fig. 1, given the query “girl sits on donkey”, if we can locate the more informative regions in the image (e.g., the black regions), a higher degree of accuracy can be obtained. To the best of our knowledge, the attention mechanism has not been well explored for cross-modal hashing.

In this paper, we propose an attention mechanism for cross-modal hashing. The model first decides where (i.e., which regions of the multi-modal data) it should attend to; the attended regions should then be favoured for retrieval. Based on this, an attention module is proposed to find the attended regions, and a hashing module is proposed to learn the similarity-preserving hash functions. In the attention module, an adaptive attention mask is generated for each data item, which divides the data into attended and unattended regions. Ideally, well-learned attention masks should locate the discriminative regions, which means that the unattended regions of the data are uninformative and their similarities are difficult to preserve. Hence, the attention module is trained to make the hashing module unable to preserve the similarities of the unattended regions. However, the learned hash functions should preserve the similarities of both the attended regions (which can be viewed as easy examples) and the unattended regions (hard examples) to enhance robustness and performance. Thus, the hashing module is trained to preserve the similarities of both the unattended and attended regions. Note that the attention module and the hashing module are trained in an adversarial way: the attention module attempts to find the unattended regions in which the hashing module fails to maintain the similarities, whereas the hashing module aims to preserve the similarities of the multi-modal data.

The proposed deep adversarial hashing network for cross-modal retrieval is illustrated in Fig. 2. It consists of three major components: (1) a feature learning module that uses a CNN or an MLP to extract high-level semantic representations of the multi-modal data; (2) an attention module that generates adaptive attention masks and divides the feature representations into attended and unattended feature representations; and (3) a hashing module that learns the binary codes for the multi-modal data. An adversarial retrieval loss and a cross-modal retrieval loss are proposed to obtain good attention masks and powerful hash functions.

The main contributions of our work are three-fold. First, we propose an attention-aware method for the cross-modal hashing problem. It is able to detect the informative regions of multi-modal data, which helps to identify content similarities between different modalities of data. Second, we propose a deep adversarial hashing network for learning effective attention masks and compact binary codes simultaneously. Third, we quantitatively evaluate the usefulness of attention in cross-modal hashing, and our method yields better performance than several state-of-the-art methods.

2 Related Work

2.1 Cross-Modal Hashing

According to the information used for learning the common representations, cross-modal hashing methods can be categorized into three groups [33]: (1) unsupervised methods [29], (2) pairwise-based methods [21, 41] and (3) supervised methods [4, 39]. The unsupervised methods use only co-occurrence information to learn hash functions for multi-modal data. For instance, cross-view hashing (CVH) [27] extends spectral hashing from the uni-modal to the multi-modal scenario. The pairwise-based methods use both the co-occurrence information and similar/dissimilar pairs to learn the hash functions. Bronstein et al. [11] proposed cross-modal similarity sensitive hashing (CMSSH), which learns hash functions to ensure that if two samples (from different modalities) are relevant/irrelevant, their corresponding binary codes are similar/dissimilar. The supervised methods exploit label information to learn more discriminative common representations. Semantic correlation maximization (SCM) [40] uses a label vector to obtain the similarity matrix and reconstructs it through the binary codes. Xu et al. [35] proposed discrete cross-modal hashing (DCH), which directly learns discriminative binary codes under the discrete constraints. Most of these works are based on hand-crafted features.

Deep neural networks have been shown to effectively discover the correlations across different modalities. Deep cross-modal hashing (DCMH) [15] integrates feature learning and hash-code learning into the same framework. Cao et al. [3] proposed deep visual-semantic hashing (DVSH), which utilizes a convolutional neural network (CNN) and long short-term memory (LSTM) to separately learn the common representations for each modality. Pairwise relationship guided deep hashing (PRDH) [36] also adopts deep CNN models to learn feature representations and hash codes simultaneously.

2.2 Generative Adversarial Network

Recently, generative adversarial networks (GANs) [10] have received much attention and achieved impressive results in various applications, including image-to-image translation [42], image generation [1, 23] and representation learning [22, 24]. GANs have also been applied to the retrieval problem. IRGAN [32] is a recently proposed method for information retrieval, in which the generative retrieval model focuses on predicting relevant documents and the discriminative retrieval model focuses on predicting relevancy given a query-document pair. IRGAN is designed for uni-modal retrieval, whereas we focus on cross-modal retrieval in this paper.

Very recently, Wang et al. [28] presented an adversarial cross-modal retrieval (ACMR) method that seeks an effective common subspace based on adversarial learning: a modality classifier distinguishes the samples in terms of their modalities, and a feature projector generates modality-invariant representations that confuse the modality classifier. Both ACMR and the proposed method use adversarial learning; the main difference is that ACMR seeks to learn a common subspace for the multi-modal data, while the adversarial learning in the proposed method is tailored to the attention-aware networks for cross-modal hashing. In addition, ACMR falls into the category of real-valued approaches, while our method belongs to the binary approaches. Further, Li et al. [18] presented self-supervised adversarial hashing (SSAH) for cross-modal retrieval.

To the best of our knowledge, the attention mechanism has not been well explored for cross-modal hashing. Attention has proven to be very powerful in many applications, such as image classification [2], image captioning [34], image question answering [38] and video action recognition [25]. Inspired by these successes, we carefully design an attention-aware deep adversarial hashing network for cross-modal hashing in this paper.

Fig. 2. Overview of our method. Above is the image modality branch, and below is the text modality branch. Each branch is divided into three parts: the feature learning module (including \(E^I\) and \(E^T\)), the attention module (\(G^I\) and \(G^T\)) and the hashing module (\(D^I\) and \(D^T\)). The feature learning module maps the input multi-modal data into high-level feature representations. Then, the attention module learns the attention masks to divide the feature representations into attended and unattended features. Finally, the hashing module encodes all features into binary codes and learns similarity-preserving hash functions. We train the attention module and the hashing module alternately.

3 Deep Adversarial Hashing for Cross-Modal Retrieval

3.1 Problem Definition

Suppose there are \(n\) training samples, each of which is represented in several modalities, e.g., audio, video, image, and text. In this paper, we focus only on two modalities, text and image; note that our method can be easily extended to other modalities. We denote the training data as \(\{I_i,T_i \}_{i=1}^n\), where \(I_i\) is the i-th image and \(T_i\) is the corresponding text description of image \(I_i\). We also have a cross-modal similarity matrix \(S\), where \(S(i,j) = 1\) means that the i-th image and the j-th text are similar, while \(S(i,j) = 0\) means that they are dissimilar. The goal of cross-modal hashing is to learn two mapping functions that transform images and texts into a common binary code space in which the similarities between paired images and texts are preserved. For instance, if \(S(i,j) = 1\), the Hamming distance between the generated binary codes of the i-th image and the j-th text should be small; when \(S(i,j) = 0\), the Hamming distance between them should be large.
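To make the setup concrete, the following is a minimal NumPy sketch (not the authors' code) of how the similarity matrix \(S\) can be built under the common assumption that an image-text pair is similar when it shares at least one semantic label, together with the Hamming distance used to compare binary codes in \(\{-1,+1\}^q\).

```python
# Minimal sketch of the problem setup; the rule "similar iff at least one
# shared label" is an assumption commonly used for these benchmarks, the
# paper only requires that S(i, j) in {0, 1} be given.
import numpy as np

def build_similarity_matrix(image_labels: np.ndarray, text_labels: np.ndarray) -> np.ndarray:
    """image_labels, text_labels: (n, num_classes) binary multi-label matrices."""
    shared = image_labels @ text_labels.T          # number of shared labels per pair
    return (shared > 0).astype(np.float32)         # S(i, j) = 1 iff any label is shared

def hamming_distance(b1: np.ndarray, b2: np.ndarray) -> int:
    """b1, b2: binary codes in {-1, +1}^q."""
    return int((b1 != b2).sum())                   # number of differing bits
```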

3.2 Network Architecture

The proposed deep adversarial hashing network contains three components: (1) the feature learning module to obtain the high-level representations of the multi-modal data; (2) the attention module to generate the attention masks; and (3) the hashing module to learn the similarity-preserving hash functions.

Feature Learning Module: \(E^I\) and \(E^T\). For the image modality, a convolutional neural network is used to obtain the high-level representations of the images. Specifically, we use VGGNet [26] to extract the image feature maps, i.e., conv5_4 in VGGNet. To represent text instances, we use the well-known bag-of-words (BOW) representation. We then utilize a two-layer feed-forward neural network (BOW \(\rightarrow \) 8192 \(\rightarrow \) 1000) to obtain the semantic text features. Let \(f^{I}_i = E^I(I_i)\) and \(f^{T}_i = E^T(T_i)\) denote the image feature maps and the text feature vector, respectively.
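As a reference point, the sketch below shows one way to assemble the two encoders in PyTorch (the paper uses Caffe); the truncation point of VGG and the hidden dimensions follow the description above, while everything else is an assumption.

```python
# A minimal sketch of the feature learning modules E^I and E^T: VGG conv5_4
# feature maps for images and a BOW -> 8192 -> 1000 MLP for text.
import torch
import torch.nn as nn
import torchvision

class ImageEncoder(nn.Module):          # E^I
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1")  # pre-trained on ImageNet
        # Drop the final max-pool so the output is the conv5_4 feature map
        # (batch x 512 x 14 x 14 for 224 x 224 inputs).
        self.features = nn.Sequential(*list(vgg.features.children())[:-1])

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.features(images)

class TextEncoder(nn.Module):           # E^T
    def __init__(self, bow_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(bow_dim, 8192), nn.ReLU(inplace=True),
            nn.Linear(8192, 1000), nn.ReLU(inplace=True),
        )

    def forward(self, bow: torch.Tensor) -> torch.Tensor:
        return self.mlp(bow)
```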

Fig. 3. The attention module. It first generates the attention masks \(Z^I\) and \(Z^T\). Then, each feature is divided into two parts: the attended and the unattended features.

Attention Module: \(G^I\) and \(G^T\). With the powerful image feature maps \(f^{I}\) and the text feature vector \(f^{T}\), we first feed them into a one-layer neural network, i.e., a convolutional layer with a \(1 \times 1\) kernel size for image feature maps and a fully connected layer for the text feature vector, followed by softmax and threshold functions to generate the attention distribution over the regions of the multi-modal data. Then, the attention masks are used to divide the feature representations into the attended and unattended feature representations.

More specifically, the detailed pipeline for processing the image modality is shown on the left side of Fig. 3. Suppose \(f^I_i \in \mathbb {R}^{H \times W \times C}\) represents the feature maps of the i-th image, where H, W and C are the height, width and number of channels, respectively. In the first step, we use a convolutional layer to compress the feature maps \(f^I_i\) into a matrix \(M_i^I = Conv(f^I_i)\), where \(M_i^I \in \mathbb {R}^{H \times W}\). In the second step, the matrix \(M_i^I\) goes through a softmax layer, whose output is the probability matrix \(P_i^I\). In the third step, we add a threshold layer to obtain the attention mask, which is defined as

$$\begin{aligned} Z_i^I(h,w)= {\left\{ \begin{array}{ll} 1, &{} P_i^I(h,w) \ge \alpha \\ 0, &{} P_i^I(h,w) < \alpha , \end{array}\right. } \end{aligned}$$
(1)

where \(\alpha \) is a predefined threshold. We set \(\alpha =\frac{1}{H{\times }W}\) in our experiments. The output of the threshold layer is a binary mask. Based on this binary mask, we can calculate the attended and unattended feature maps of the i-th image by element-wise multiplication with the mask, which is formulated as

$$\begin{aligned} \begin{aligned}&\breve{f}^I_i (h,w,c) = Z^I_i(h,w) \times f^I_i (h,w,c), \ \ \mathbf {(attended)}&\\&\hat{f}^I_i (h,w,c) = \Big (1 - Z^I_i(h,w) \Big ) \times f^I_i (h,w,c), \ \ \mathbf {(unattended)}&\\ \end{aligned} \end{aligned}$$
(2)

for all h, w and c. For ease of presentation, we denote the whole procedure as \([\breve{f}^I_i, \hat{f}^I_i ]= G^I (f^I_i)\).
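The image branch of the attention module can be summarized by the following PyTorch sketch of Eqs. (1)-(2) (an illustration under our own naming, not the authors' code); note that the hard threshold blocks gradients, which the straight-through estimator described later in this section addresses.

```python
# Sketch of G^I: 1x1 convolution -> spatial softmax -> threshold at
# alpha = 1/(H*W) -> attended / unattended feature maps (Eqs. (1)-(2)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageAttention(nn.Module):
    def __init__(self, channels: int = 512):
        super().__init__()
        self.compress = nn.Conv2d(channels, 1, kernel_size=1)    # f^I -> M^I

    def forward(self, fmap: torch.Tensor):
        """fmap: (B, C, H, W) image feature maps f^I."""
        b, c, h, w = fmap.shape
        m = self.compress(fmap).view(b, -1)                      # (B, H*W)
        p = F.softmax(m, dim=1).view(b, 1, h, w)                 # probability map P^I
        alpha = 1.0 / (h * w)                                    # threshold of Eq. (1)
        z = (p >= alpha).float()                                 # binary mask Z^I
        attended = z * fmap                                      # Eq. (2), broadcast over channels
        unattended = (1.0 - z) * fmap
        return attended, unattended, z
```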

For the text modality, we imitate the pipeline of the image modality, as shown on the right side of Fig. 3:

$$\begin{aligned} \begin{aligned}&M^T_i = \text {fc}(f^{T}_i),&\\&P^T_i = \text {softmax}(M^T_i),&\\&Z^T_i = \text {threshold}(P^T_i),&\\&\breve{f}^T_i (j) = Z^T_i (j) \times f^T_i (j),\ \ \mathbf {(attended)}&\\&\hat{f}^T_i (j) = \Big (1 - Z^T_i (j) \Big ) \times f^T_i(j), \ \ \mathbf {(unattended)}&\end{aligned} \end{aligned}$$
(3)

where fc is a fully connected layer, and Z(j) is the j-th value of the vector Z. We denote \([\breve{f}^T_i,\hat{f}^T_i] = G^T(f^T_i)\) as the attended and unattended features for the i-th text.

Fig. 4. The hashing modules for the image modality (\(D^I\)) and the text modality (\(D^T\)).

The threshold function is non-differentiable, which is incompatible with back-propagation during training. To address this issue, we follow the idea proposed in [7] and use the straight-through estimator to propagate gradients through the threshold function.
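A minimal sketch of such a straight-through estimator is given below (an illustration of the idea in [7], not the exact implementation): the forward pass applies the hard threshold of Eq. (1), while the backward pass copies the incoming gradient through unchanged.

```python
# Straight-through estimator for the threshold layer: hard threshold in the
# forward pass, identity gradient in the backward pass.
import torch

class ThresholdSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, p: torch.Tensor, alpha: float) -> torch.Tensor:
        return (p >= alpha).float()              # binary mask Z

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        return grad_output, None                 # pass the gradient straight through

def threshold_ste(p: torch.Tensor, alpha: float) -> torch.Tensor:
    return ThresholdSTE.apply(p, alpha)
```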

Hashing Module: \(D^I\) and \(D^T\). For the image modality, since we adopt VGGNet as our basic architecture, we also use its last fully connected layers, i.e., fc6 and fc7. Then, we add a fully connected layer with q-dimensional output and a tanh layer that restricts the values to the range \([-1,1]\), as shown on the left side of Fig. 4. Let the outputs of the hashing module be (1) the attended features \(H^I_i = D^I(\breve{f}^I_i)\) and (2) the unattended features \(\hat{H}^I_i = D^I(\hat{f}^I_i)\).

For the text modality, we also add a fully connected layer and a tanh layer to encode the text features into q bits, as shown on the right side of Fig. 4. The outputs are (1) the attended features \(H^T_i = D^T(\breve{f}^T_i)\) and (2) the unattended features \(\hat{H}^T_i = D^T(\hat{f}^T_i)\).
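For concreteness, a hedged PyTorch sketch of the two hashing branches is shown below; the fc6/fc7 sizes (4096) and the flattened input dimension are assumptions taken from the standard VGG architecture and the conv5_4 feature-map size, and only the final q-dimensional tanh layer is dictated by the description above.

```python
# Sketch of D^I and D^T: fully connected layers followed by a q-dimensional
# tanh output so that the relaxed codes lie in [-1, 1].
import torch
import torch.nn as nn

class ImageHashing(nn.Module):                       # D^I
    def __init__(self, q: int, in_dim: int = 512 * 14 * 14):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, 4096), nn.ReLU(inplace=True),   # fc6
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),     # fc7
            nn.Linear(4096, q), nn.Tanh(),                    # q-bit relaxed code
        )

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        return self.net(fmap)

class TextHashing(nn.Module):                        # D^T
    def __init__(self, q: int, in_dim: int = 1000):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, q), nn.Tanh())

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.net(feat)
```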

3.3 Hashing Objectives

Our objective contains two terms: (1) the cross-modal retrieval loss, which preserves the similarities between different modalities of data, and (2) the adversarial retrieval loss, in which the hashing module aims to preserve the similarities of the unattended binary codes while the attention module tries to make the hashing module fail to maintain them.

Cross-modal Retrieval Loss. The aim of the cross-modal loss function is to preserve the similarities between images and texts. The inter-modal ranking loss and the intra-modal ranking loss are used to preserve the similarities: the hash codes from different modalities should preserve the semantic similarities, and the hash codes from the same modality should also preserve the semantic similarities. Hence, the cross-modal retrieval loss can be formulated as

$$\begin{aligned} \min \mathcal {F}_{T \rightarrow I} + \mathcal {F}_{I \rightarrow T} + \mathcal {F}_{I \rightarrow I} + \mathcal {F}_{T \rightarrow T}, \end{aligned}$$
(4)

where the first two terms preserve the semantic similarities between different modalities, and the last two terms preserve the similarities within each modality. The symbol \(A \rightarrow B\) denotes that the A modality is taken as the query to retrieve the relevant data of the B modality, where \(A \in \{T, I\}\) and \(B \in \{T, I\}\). \(\mathcal {F}_{A \rightarrow B}\) is the loss function with the A modality as the query and the B modality as the database, which is defined as

$$\begin{aligned} \begin{aligned} \mathcal {F}_{A \rightarrow B}&= \sum _{\langle i, j, k \rangle } \max \{0, \varepsilon + ||H^A_i - H^B_j|| - ||H^A_i - H^B_k||\} \\ s.t.~~~~~~~&\forall \langle i, j, k \rangle , \ S(i,j) > S(i,k), \end{aligned} \end{aligned}$$
(5)

where \(\langle i, j, k \rangle \) denotes a triplet and \(\varepsilon \) is the margin. The objective is the triplet ranking loss [16], which has been shown to be effective for retrieval problems.
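A minimal sketch of Eq. (5) for a mini-batch of sampled triplets is shown below (distances are Euclidean on the relaxed codes; the batching and sampling strategy are assumptions). The four terms of Eq. (4) are obtained by calling this function with the corresponding query/database codes.

```python
# Triplet ranking loss of Eq. (5): the query should be closer to the
# positive item than to the negative item by at least the margin.
import torch

def triplet_ranking_loss(h_query: torch.Tensor,
                         h_pos: torch.Tensor,
                         h_neg: torch.Tensor,
                         margin: float) -> torch.Tensor:
    """Each input has shape (num_triplets, q); rows are aligned triplets <i, j, k>."""
    d_pos = torch.norm(h_query - h_pos, dim=1)   # ||H^A_i - H^B_j||
    d_neg = torch.norm(h_query - h_neg, dim=1)   # ||H^A_i - H^B_k||
    return torch.clamp(margin + d_pos - d_neg, min=0.0).sum()
```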

Adversarial Retrieval Loss. Inspired by the impressive results of generative adversarial networks, we adopt adversarial training to generate the attention distributions and learn the binary codes. Take text \(\rightarrow \) image as an example, as shown in Fig. 1. Given a query \(H_i^T\), the hashing module and the attention module are trained in an adversarial way: (1) the hashing module preserves the semantic similarity between the query and the unattended features of the image modality, that is, \(H_i^T\) is closer to \(\hat{H}^I_j\) than to \(\hat{H}^I_k\) when \(S(i,j) > S(i,k)\); (2) the attention module tries to find the unattended regions of the images in which the hashing module fails to preserve the similarities, that is, \(H_i^T\) is closer to \(\hat{H}^I_k\) than to \(\hat{H}^I_j\). The objective can be defined as \(\mathcal {F}_{T \rightarrow \hat{I}} = \sum _{\langle i, j, k \rangle } \max \{0, \varepsilon + ||H^T_i - \hat{H}^I_j|| - ||H^T_i - \hat{H}^I_k||\}\). The hashing module tries to minimize this objective, while the attention module tries to maximize it. The same process applies to image \(\rightarrow \) text. Thus, the loss can be expressed as

$$\begin{aligned} \begin{aligned} \mathcal {F}_{T \rightarrow \hat{I}} + \mathcal {F}_{I \rightarrow \hat{T}} =&\sum _{\langle i, j, k \rangle } \max \{0, \varepsilon + ||H^T_i - \hat{H}^I_j|| - ||H^T_i - \hat{H}^I_k||\} \\&+\sum _{\langle i, j, k \rangle } \max \{0, \varepsilon + ||H^I_i - \hat{H}^T_j|| - ||H^I_i - \hat{H}^T_k||\}, \end{aligned} \end{aligned}$$
(6)

where \(\hat{H}^T\) and \(\hat{H}^I\) are the unattended features defined in Sect. 3.2. The first term corresponds to taking the text modality as the query to retrieve the unattended features of the image modality; the second term corresponds to taking the image modality as the query to retrieve the unattended features of the text modality. \(G^I\) and \(G^T\) attempt to maximize this loss, while \(D^I\) and \(D^T\) attempt to minimize it:

$$\begin{aligned} \min _{D^I,D^T} \max _{G^I,G^T} \mathcal {F}_{T \rightarrow \hat{I}} + \mathcal {F}_{I \rightarrow \hat{T}}. \end{aligned}$$
(7)
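Reusing the triplet loss sketched above, Eq. (6) can be computed as follows (a hedged sketch; queries use the attended codes \(H\) and the retrieved side uses the unattended codes \(\hat{H}\)):

```python
# Adversarial retrieval loss of Eq. (6): D^I, D^T minimize it (Eq. (7)),
# while G^I, G^T maximize it. triplet_ranking_loss is the function sketched above.
def adversarial_retrieval_loss(h_txt, h_img_hat_pos, h_img_hat_neg,
                               h_img, h_txt_hat_pos, h_txt_hat_neg,
                               margin: float):
    loss_t2i = triplet_ranking_loss(h_txt, h_img_hat_pos, h_img_hat_neg, margin)  # F_{T -> I_hat}
    loss_i2t = triplet_ranking_loss(h_img, h_txt_hat_pos, h_txt_hat_neg, margin)  # F_{I -> T_hat}
    return loss_t2i + loss_i2t
```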

Full Objective. Our full objective is

$$\begin{aligned} \mathcal {F}(E^I,E^T,G^I,G^T,D^I,D^T) = \ &\mathcal {F}_{T \rightarrow \hat{I}} + \mathcal {F}_{I \rightarrow \hat{T}} \\ &+ \mathcal {F}_{T \rightarrow I} + \mathcal {F}_{I \rightarrow T} + \mathcal {F}_{I \rightarrow I} + \mathcal {F}_{T \rightarrow T}. \end{aligned}$$

We train our model alternately. First, the parameters of \(G^I\) and \(G^T\) are fixed while the other parameters are updated:

$$\begin{aligned} \min _{E^I,E^T,D^I,D^T} \mathcal {F}(E^I,E^T,G^I,G^T,D^I,D^T). \end{aligned}$$
(8)

Then, \(E^I, E^T, D^I\) and \(D^T\) are fixed and the attention modules are updated:

$$\begin{aligned} \max _{G^I,G^T} \mathcal {F}_{T \rightarrow \hat{I}} + \mathcal {F}_{I \rightarrow \hat{T}}. \end{aligned}$$
(9)
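The alternating optimization of Eqs. (8)-(9) can be sketched as the following training step (hedged: compute_full_objective and compute_adversarial_loss are hypothetical helpers standing for Eqs. (4)+(6) and Eq. (6), respectively; in our experiments the encoders and hashing modules are updated for several steps per attention update, see Sect. 4.1).

```python
# One alternating training step: Eq. (8) updates E^I, E^T, D^I, D^T;
# Eq. (9) updates G^I, G^T by gradient ascent on the adversarial loss.
def train_step(batch, models, opt_ED, opt_G, margin):
    # --- Eq. (8): attention modules fixed, update encoders and hashing modules ---
    full_loss = compute_full_objective(batch, models, margin)     # hypothetical helper: Eqs. (4) + (6)
    opt_ED.zero_grad()
    full_loss.backward()
    opt_ED.step()                 # opt_ED only holds E/D parameters, so G stays fixed

    # --- Eq. (9): encoders and hashing modules fixed, update attention modules ---
    adv_loss = compute_adversarial_loss(batch, models, margin)    # hypothetical helper: Eq. (6)
    opt_G.zero_grad()
    (-adv_loss).backward()        # maximize by descending the negated loss
    opt_G.step()
```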

4 Experiments

In this section, we evaluate the performance of the proposed method on three datasets and compare it with several state-of-the-art algorithms.

4.1 Experimental Settings

Datasets. We choose three benchmark datasets: IAPR TC-12 [9], MIR-Flickr 25K [13] and NUS-WIDE [6].

  • IAPR TC-12 [9]: This dataset consists of 20,000 images taken from locations around the world. Each image is associated with a text caption, e.g., a sentence. The image-text pairs are annotated using 255 labels. For the text modality, each sentence is represented as a 2,912-dimensional bag-of-words vector.

  • MIR-Flickr 25K [13]: This dataset contains 25,000 multi-label images downloaded from the Flickr photo-sharing website. Each image is associated with several textual tags. For a fair comparison, we follow the settings in [15] and use the subset of image-text pairs with at least 20 textual tags. For the text modality, the textual tags are represented as a 1,386-dimensional bag-of-words vector.

  • NUS-WIDE [6]: This dataset consists of 269,648 images collected from Flickr. Each image is associated with one or multiple textual tags from 81 semantic concepts. We evaluate the performance on 195,834 image-text pairs belonging to the 21 most frequent labels, as suggested by [15]. The text is represented as a 1,000-dimensional bag-of-words vector.

We follow the settings of DCMH [15] to construct the query sets, training sets, and retrieval databases. We randomly sample 2,000 image-text pairs as the query set for IAPR TC-12 and MIR-Flickr 25K, and 2,100 image-text pairs as the query set for NUS-WIDE. For all datasets, the remaining image-text pairs are used as the retrieval databases. For all supervised methods, we sample 10,000 pairs from the retrieval set as the training set for IAPR TC-12 and MIR-Flickr 25K, and 10,500 pairs from the retrieval set as the training set for NUS-WIDE.

Note that the representations of text are not the focus of this paper. Since the most related works, e.g., DCMH [15], use bag-of-words, we also use bag-of-words for a fair comparison.

Implementation Details. We implement our code based on the open-source Caffe [14] framework. In training, the networks are updated alternately using the Adam stochastic gradient solver (\(\alpha =0.0002\), \(\beta _{1}=0.5\)). We alternate between four steps of optimizing E and D and one step of optimizing G. For the image modality, the weights of VGGNet are initialized with the model pre-trained on the ImageNet dataset. For the text modality, all parameters are randomly initialized from a Gaussian distribution with zero mean and standard deviation 0.01. The batch size is 64, and the total number of epochs is 100. The base learning rate is 0.005, and it is reduced to one-tenth of its current value after every 20 epochs. In testing, we use only the attended features of the data to construct the binary codes.
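For reference, a PyTorch-style sketch of the optimizer setup described above is given below (the paper uses Caffe; how the Adam parameters and the base learning-rate schedule map onto the two parameter groups is our assumption, and E_I, E_T, G_I, G_T, D_I, D_T refer to the hypothetical modules sketched earlier).

```python
# Two parameter groups, updated alternately (roughly four E/D steps per G step).
import itertools
import torch

params_ED = itertools.chain(E_I.parameters(), E_T.parameters(),
                            D_I.parameters(), D_T.parameters())
params_G = itertools.chain(G_I.parameters(), G_T.parameters())

opt_ED = torch.optim.Adam(params_ED, lr=2e-4, betas=(0.5, 0.999))
opt_G = torch.optim.Adam(params_G, lr=2e-4, betas=(0.5, 0.999))

# Decay the learning rate to one-tenth every 20 epochs.
sched_ED = torch.optim.lr_scheduler.StepLR(opt_ED, step_size=20, gamma=0.1)
sched_G = torch.optim.lr_scheduler.StepLR(opt_G, step_size=20, gamma=0.1)
```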

Evaluation Measures. To evaluate the performance of hashing models, we use two metrics: the mean average precision (MAP) [20] and precision-recall curves. The MAP is a standard evaluation metric for information retrieval.
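As a reference, the following is a standard way to compute MAP over a Hamming ranking for codes in \(\{-1,+1\}^q\) (a sketch of the conventional definition; whether the ranking is truncated to a top-R list follows the cited protocol and is not shown here).

```python
# Mean average precision (MAP) over Hamming ranking: for each query, rank
# database items by Hamming distance and average the precision at each
# relevant position.
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance):
    """query_codes: (nq, q) in {-1,+1}; db_codes: (nd, q); relevance: (nq, nd) in {0,1}."""
    q = db_codes.shape[1]
    aps = []
    for i in range(query_codes.shape[0]):
        dist = 0.5 * (q - db_codes @ query_codes[i])    # Hamming distance from inner products
        order = np.argsort(dist)                        # ascending distance = descending relevance
        rel = relevance[i][order]
        if rel.sum() == 0:
            continue
        precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((precision_at_k * rel).sum() / rel.sum())
    return float(np.mean(aps))
```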

Fig. 5. Precision-recall curves on the three datasets. The hash code length is 16 bits.

4.2 Comparison with State-of-the-Art Methods

The first set of experiments evaluates the performance of the proposed method and compares it with several state-of-the-art algorithms: CCA [12], CMFH [8], SCM [40], STMH [30], SePH [19], DCMH [15], and PRDH [37]. The results of CCA, CMFH, SCM, STMH, SePH and DCMH are directly cited from [15], published at CVPR 2017. Since the experimental settings of PRDH in [37] differ from those of the proposed method, we carefully re-implement PRDH using the same CNN network and the same settings for a fair comparison.

The comparison results in terms of search accuracy on all three datasets are shown in Table 1. Our method outperforms the other baselines and achieves excellent performance. For example, on IAPR TC-12 with 64 bits, the MAP of our method is 0.5439, compared with 0.5135 for the second-best algorithm (PRDH), when taking the image as the query to retrieve text. The precision-recall curves are shown in Fig. 5, where our method also shows comparable performance to the existing baselines.

Table 1. MAP of Hamming ranking w.r.t. different numbers of bits on three datasets.

Since the code of DVSH is not publicly available and it is difficult to re-implement this complex algorithm, we adopt the same experimental settings used in DVSH for our method, and the results of DVSH are directly cited from [3] for a fair comparison. The top-500 MAP results on IAPR TC-12 are shown in Table 2. Moreover, we compare with DCMH under the same settings. Please note that DVSH adopts an LSTM recurrent neural network for text representation, while DCMH and our method only use bag-of-words. From the table, we can see that our method achieves better performance than the baselines in most cases, even though we use the weaker text representation.

Table 2. The comparison results w.r.t. the top-500 MAP on the IAPR TC-12 dataset.

We also explore the effect of a smaller network architecture in the feature learning module for the image modality, since VGGNet is a large deep network. In this experiment, we select CNN-F [5] as the basic network for the image modality. The comparison results are shown in Table 3. VGGNet performs better than CNN-F, while our method with CNN-F still achieves good performance compared with the other state-of-the-art baselines.

Fig. 6. Some image and mask samples. The first row shows the original images, the second row shows the masks, and the last two rows show their combinations.

Fig. 7. Different attention mechanisms.

Table 3. MAP on IAPR TC-12 dataset with different networks.

The main reason for the good performance of our method is that it obtains the attended regions of the multi-modal data. Figure 6 shows some examples for the image modality. Note that it is difficult to visualize the text modality (the text network consists of fully connected layers rather than a CNN, so the order of the words in a sentence is not preserved after passing through these layers); thus, we do not show the masks learned by the text network.

4.3 Comparison with Different Attention Mechanisms

In this section, we present an ablation study to clarify the impact of each part of the attention modules on the final performance.

To provide an intuitive comparison, we compare our method with the following baselines. The first baseline does not use any attention mechanism, as shown on the left side of Fig. 7; this corresponds to traditional deep cross-modal hashing. The second baseline applies only the visual attention mechanism, as shown in the middle of Fig. 7. Similarly, the last baseline explores only the textual attention mechanism, as shown on the right side of Fig. 7. Note that all baselines, as well as our method, use the same network; the only differences lie in the attention mechanisms used. These comparisons show whether the proposed attention mechanism contributes to the accuracy.

Table 4 shows the comparison results with respect to MAP. The results show that our proposed attention mechanism achieves better performance than the baselines with no or only partial attention. The main reason is that our method can focus on the most discriminative regions of the data.

Table 4. The comparison results for different attention mechanisms.

5 Conclusion

In this paper, we proposed a novel approach called deep adversarial hashing for cross-modal retrieval. The proposed method contains three major components: a feature learning module, an attention module, and a hashing module. The feature learning module learns powerful representations of the multi-modal data. The attention module and the hashing module are trained in an adversarial way: the hashing module tries to minimize the similarity-preserving loss functions, while the attention module aims to find the unattended regions of the data that maximize the retrieval loss. We evaluated our method on three datasets, and the experimental results demonstrate its appealing performance.