1 Introduction

Person re-identification (re-ID) is a critical task in intelligent video surveillance, aiming to associate images of the same person across different cameras. Encouraged by the remarkable success of deep Convolutional Neural Networks (CNNs) in image classification [23], the re-ID community has made great progress by developing various networks, yielding quite effective visual representations [1, 12, 24, 28, 34, 36, 45, 50, 52, 67]. To further boost identification accuracy, diverse auxiliary information has been incorporated into deep neural networks, such as camera ID information [32], human poses [59], person attributes [33, 48], depth maps [7], and infrared person images [54]. These data are utilized either as augmented information for enhanced inter-image similarity estimation [32, 54, 59] or as training supervisions that regularize the feature learning process [33, 48]. Our work belongs to the latter category and proposes to use language descriptions as training supervisions to improve person visual features.

Compared with other types of auxiliary information, natural language provides a flexible and compact way of describing the salient visual aspects that distinguish different persons. Previous effort on language-based person re-ID [26] focuses on cross-modal image-text retrieval, aiming to search for the target image in a gallery set with a text query. Instead, we are interested in how language can help the image-to-image search when it is only utilized in the training stage. This task is non-trivial because it requires a detailed understanding of the content of images, language, and their cross-modal correspondences.

Fig. 1. Illustration of global and local image-language association in our framework. The global association is applied to the whole image and the language description, aiming to discriminate the matched image-language pairs from the unmatched ones. The local association aims to model the correspondences between noun phrases and image regions. The global and local image-language associations are utilized to supervise the learning of person visual features.

To exploit the semantic information conveyed in the language descriptions, we not only impose identification supervision on the final image representation but also propose to optimize the global and local associations between the intermediate visual features and the linguistic features. The global image-language association is learned from their ID labels. That is, the overall image feature and text feature should have high relevance when they belong to the same person, and low relevance when they come from different persons (Fig. 1, left). The local image-language association is based on the implicit correspondences between image regions and noun phrases (Fig. 1, right). In a coupled image-text pair, a noun phrase usually describes a specific region in the image, so the phrase feature is more closely related to certain local visual features. We design a deep neural network to automatically associate related phrases and local visual features via the attention mechanism, and then aggregate these visual features to reconstruct the phrase. Reasoning about such latent inter-modal correspondences makes the feature embedding interpretable and can be employed as a regularization scheme for feature learning.

In summary, our contributions are three-fold: (1) We propose to use language descriptions as training supervisions for learning more discriminative visual representations for person re-ID. This is different from existing text-image embedding methods that aim at cross-modal retrieval. (2) We provide two effective and complementary image-language association schemes, which utilize semantic linguistic information to guide the learning of visual features at different granularities. (3) Extensive ablation studies validate the effectiveness and complementarity of the two association schemes. Our method achieves state-of-the-art performance on person re-ID and outperforms conventional cross-modal embedding methods.

2 Related Work

Early works on person re-ID concentrated on either feature extraction [17, 37, 53] or metric learning [9,10,11, 22, 38]. Recent methods mainly benefit from the advances of CNN architectures [26], which combine the above two aspects to produce robust and ID-discriminative image representations [1, 8, 28, 46, 50, 52]. Our work aims to further improve the deep visual representation by making use of language descriptions as training supervisions.

Diverse auxiliary information has been introduced to improve the visual feature representations for person re-ID. Several works [47, 59, 61] detected person pose landmarks to obtain human body regions. They first decomposed the feature maps according to these regions, then fused them to create well-aligned feature maps. Lin et al. utilized camera ID information to assist inter-image similarity estimation [32] by keeping consistencies in a camera network. Different types of sensors, such as depth cameras [7] or infrared cameras [54], have also been employed in person re-ID to generate more reliable visual representations. For these methods, the auxiliary information is used in both the training and testing stages, requiring an additional model or data acquisition device for algorithm deployment. Differently, human attributes usually serve as a kind of training supervision. For example, Lin et al. [33] improved the interpretability of the intermediate feature maps by jointly optimizing the identification loss and the attribute classification loss. Although attributes prove helpful for feature learning, they are quite difficult to obtain, as annotators need to remember tens of attribute labels. They are also less flexible for describing the diverse variations in human appearance.

Associating image and language helps establish correspondences for their inter-relations. It has attracted great attention in recent years because of its wide applications in image captioning [13, 20, 35, 51, 57], visual QA [4, 19, 30], and text-image retrieval [18, 41]. These cross-modal associations can be modeled by either generative or discriminative methods. Generative models utilize probabilistic models to capture the temporal or spatial dependencies within the image or text [39, 51], with popular applications such as caption generation [3, 35, 43, 51, 57] and image generation [41, 42]. On the other hand, discriminative models have also been developed for image-text association. Karpathy and Fei-Fei [21] formulated a bidirectional ranking loss to associate text and image fragments. Reed et al. [41] proposed deep symmetric structured joint embeddings, enforcing the compatibility score of a matched image-text pair to be higher than those of unmatched pairs. Our method combines the merits of both discriminative and generative methods to build image-text associations at different granularities, where the language descriptions act as training supervisions to improve the visual representation.

Fig. 2. Overall framework of our proposed approach. We employ ResNet-50 as the backbone architecture. The produced intermediate feature \(\varPsi (I)\) is associated with the description feature \(\theta ^{g}(T)\) and the phrase feature \(\theta ^{l}(P)\) by the global discriminative association and the local reconstructive association, respectively.

3 Our Approach

We aim to exploit language descriptions of person images as training supervisions, in addition to the original ID labels, for better visual representations. The visual representations are not only required to be discriminative for different persons but also need to keep consistent with the linguistic representations. We therefore propose the global and local image-language association schemes. The global visual feature of one person should be more relevant to the language description features of the same person than to those of a different person. Unlike existing cross-modal joint embedding methods, we do not require the visual and linguistic features to be mapped to a unified embedding space. Furthermore, based on the assumption that the image is spatially decomposable and the language is temporally decomposable, we also try to find the mutual correspondences between the features of image regions and noun phrases. The overall framework is illustrated in Fig. 2.

3.1 Visual and Linguistic Representation

Given a dataset \(\mathcal {D}=\{(I_{n}, T_{n}, l_{n})\}_{n=1}^{N}\) containing N tuples, each tuple has an image I, a text description T, and an ID label l. To improve the learned visual feature \(\phi (I)\), we build global and local correspondences between the intermediate visual feature map \(\varPsi (I)\) and the linguistic representation \(\varTheta (T)\).

The visual representation. The visual feature \(\phi (I)\) and the intermediate feature map \(\varPsi (I)\) are obtained from a standard convolutional neural network (CNN) with ResNet-50 as the backbone. \(\varPsi (I)\) is the feature map obtained by applying a \(1\!\times \!1\) convolution over the last residual block. Suppose \(\varPsi (I)\) has K bins and the feature vector at the kth bin is denoted by \(\psi _{k}(I)\); then \(\varPsi (I)\) can be represented as \(\varPsi (I) =\{ \psi _{k}(I) \}_{k=1}^{K}\). The objective visual feature vector \(\phi (I)\) is a linear projection of the average feature \(\bar{\psi }(I)=\frac{1}{K}\sum _{k=1}^{K}\psi _{k}(I)\):

$$\begin{aligned} \small \phi (I)=f_{\phi }(\varPsi (I)) = \mathbf{{W}}_{\phi }\bar{\psi }(I) + \mathbf{{b}}_{\phi }. \end{aligned}$$
(1)

We employ the ID loss over \(\phi (I)\). Specifically, given N images belonging to I persons, the ID loss is the average negative log-likelihood of the features being correctly classified to their IDs:

$$\begin{aligned} \small \mathcal {L}_{I} = -\frac{1}{N} \sum _{n=1}^{N}\sum _{i=1}^{I}y_{i,n} \log \left( \frac{\exp (\mathbf{{w}}_{i}^{\top } \phi (I_{n}) )}{\sum _{j=1}^{I}\exp (\mathbf{{w}}_{j}^{\top } \phi (I_{n}))} \right) , \end{aligned}$$
(2)

where \(y_{i,n}\) is the indicator label with \(y_{i,n}=1\) if the nth image \(I_{n}\) belongs to the ith person and \(y_{i,n} = 0\) otherwise, and \(\mathbf{{w}}_{i}\) are the classifier parameters associated with the ith person over the visual feature vectors.
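For concreteness, a minimal PyTorch sketch of the visual branch and the ID loss of Eqs. (1)-(2) is given below. It reflects our reading of the described architecture; the module name VisualBranch, the number of identities, and the batch contents are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torchvision

class VisualBranch(nn.Module):
    """Sketch of the visual branch: ResNet-50 feature maps, a 1x1 convolution
    giving Psi(I), global average pooling, and the linear projection of Eq. (1)."""
    def __init__(self, feat_dim=256, num_ids=751):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)  # load ImageNet weights in practice
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # keep spatial feature maps
        self.conv1x1 = nn.Conv2d(2048, feat_dim, kernel_size=1)       # produces Psi(I)
        self.fc_phi = nn.Linear(feat_dim, feat_dim)                   # Eq. (1)
        self.id_classifier = nn.Linear(feat_dim, num_ids)             # w_i in Eq. (2)

    def forward(self, images):
        fmap = self.conv1x1(self.backbone(images))    # Psi(I): B x C x H x W (K = H*W bins)
        psi_bar = fmap.mean(dim=(2, 3))               # average over the K bins
        phi = self.fc_phi(psi_bar)                    # phi(I), Eq. (1)
        return fmap, psi_bar, phi

# Eq. (2) is the standard softmax cross-entropy over person IDs.
model = VisualBranch()
images = torch.randn(4, 3, 256, 128)                  # resized person crops
labels = torch.randint(0, 751, (4,))                  # person ID labels
fmap, psi_bar, phi = model(images)
loss_id = nn.functional.cross_entropy(model.id_classifier(phi), labels)
```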

The linguistic representation. \(\varTheta (T)\) contains two types of feature vectors, as shown in Fig. 2. One is the global description feature \(\theta ^{g}(T)\) mapped from the whole text; the other is the local phrase feature \(\theta ^{l}(P)\), which encodes a distinctive noun phrase P extracted from the text T. The noun-phrase extraction procedure is illustrated in Fig. 3, and the obtained phrases in T form the set \(\mathcal {P}(T)\). Each word in the text T or the phrase P is first represented as a D-dimensional one-hot vector, denoted by \(\mathbf{{o}}_{m}\in \mathbb {R}^{D}\) for the mth word, where D is the vocabulary size. The one-hot vector is then projected to a word embedding: \(\mathbf{{e}}_{m} = \mathbf{{W}}_{e} \mathbf{{o}}_{m}\).

Based on the embeddings, we feed either a whole description or a short phrase to a long short-term memory network (LSTM) word by word, which has the following updating procedure: \(\mathbf{{h}}_{m+1} = \text {LSTM}(\mathbf{{e}}_{m}, \mathbf{{h}}_{m})\). The LSTM unit takes the current word embedding \(\mathbf{{e}}_{m}\) and hidden state \(\mathbf{{h}}_{m}\) as inputs, and outputs the hidden state of the next step \(\mathbf{{h}}_{m+1}\). The hidden state at the final time step is an effective summarization of the description T or the phrase P, yielding the description feature \(\theta ^{g}(T) = \mathbf{{W}}_{g} \mathbf{{h}}_{F}(T) + \mathbf{{b}}_{g} \) or the phrase feature \(\theta ^{l}(P) = \mathbf{{W}}_{l}\mathbf{{h}}_{F}(P) + \mathbf{{b}}_{l}\), where \(\mathbf{{h}}_{F}(T)\) and \(\mathbf{{h}}_{F}(P)\) are the final hidden states for the text T and the phrase P, respectively. Because T describes abundant person characteristics over the whole body, \(\theta ^{g}(T)\) can characterize a specific person. We therefore impose another ID loss to make \(\theta ^{g}(T)\) separable for different persons,

$$\begin{aligned} \small \mathcal {L}_{T} = - \frac{1}{N}\sum _{n=1}^{N} \sum _{i=1}^{I} y_{i,n}\log \left( \frac{\exp (\mathbf{{v}}_{i}^{\top } \theta ^{g}(T_{n}))}{\sum _{j=1}^{I}\exp (\mathbf{{v}}_{j}^{\top } \theta ^{g}(T_{n}))} \right) , \end{aligned}$$
(3)

where \(y_{i,n}\) is the label defined similarly to the one in Eq. (2), and \(\mathbf{{v}}_{i}\) indicates the classifier parameters associated with the ith person over the text feature.
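A corresponding sketch of the linguistic branch is shown below: word indices are embedded, encoded by an LSTM, and the final hidden state is projected to \(\theta ^{g}(T)\) (or \(\theta ^{l}(P)\)), followed by the ID loss of Eq. (3). The vocabulary size, padding handling, and layer names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the linguistic branch: e_m = W_e o_m, h_{m+1} = LSTM(e_m, h_m),
    and a linear projection of the final hidden state."""
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=256, out_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, out_dim)     # W_g, b_g (or W_l, b_l for phrases)

    def forward(self, word_ids):                       # word_ids: B x M token indices
        _, (h_final, _) = self.lstm(self.embed(word_ids))
        return self.proj(h_final[-1])                  # feature of the final time step

encoder = TextEncoder()
word_ids = torch.randint(0, 5000, (4, 20))             # a batch of tokenized descriptions
labels = torch.randint(0, 751, (4,))                   # person ID labels
theta_g = encoder(word_ids)                            # global description feature
text_classifier = nn.Linear(256, 751)                  # v_i in Eq. (3)
loss_text_id = nn.functional.cross_entropy(text_classifier(theta_g), labels)
# In practice, variable-length texts should be packed so the final state
# corresponds to the last real word rather than padding.
```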

Fig. 3. The flowchart of extracting noun phrases of interest from the text. We perform word-level tokenization and part-of-speech tagging, then extract noun phrases by chunking. As not all phrases carry discriminative information, we focus on two kinds of phrases: (1) noun phrases with adjectives (JJ), defined as JNP; (2) noun phrases consisting of multiple nouns joined by a preposition (IN).
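The extraction pipeline in Fig. 3 can be approximated with NLTK as sketched below. The chunk grammar is our own rough rendering of the JNP/INP rules and the tokenizer/tagger resources are assumed to be installed; the authors' exact chunking rules may differ.

```python
import nltk  # assumes 'punkt' and 'averaged_perceptron_tagger' resources are available

# Rough grammar: INP = nouns joined by a preposition (IN), JNP = noun phrase
# containing an adjective (JJ). An approximation of the rules in Fig. 3.
grammar = r"""
  INP: {<DT>?<NN.*>+<IN><DT>?<JJ.*>*<NN.*>+}
  JNP: {<DT>?<JJ.*>+<NN.*>+}
"""
chunker = nltk.RegexpParser(grammar)

def extract_phrases(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() in ("JNP", "INP")]

print(extract_phrases("The man wears a blue shirt and a pair of black shoes."))
# expected (tagging permitting): ['a blue shirt', 'a pair of black shoes']
```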

3.2 Global Discriminative Image-Language Association

The ID losses in the previous section only enforce the visual and linguistic features to be discriminative within each modality, but do not establish image-language correspondences to enhance the visual feature. As the global description is usually related to multiple and diverse regions in the image, \(\theta ^{g}(T)\) can be associated with \(\bar{\psi }(I)\) (Eq. (1)) in a discriminative fashion. Specifically, \(\bar{\psi }(I)\) and \(\theta ^{g}(T)\) first form a joint representation \(\varphi (I, T)\): \(\varphi (I, T) = \big (\bar{\psi }(I) - \theta ^{g}(T) \big ) \circ \big (\bar{\psi }(I) - \theta ^{g}(T) \big )\), where \(\circ \) denotes the Hadamard product. The joint representation is then projected into a scalar value within the range (0, 1) by:

$$\begin{aligned} \small s(I, T) = \frac{ \exp (\mathbf{{w}}_{s}^{\top }\varphi (I, T)+ b_{s})}{1+ \exp (\mathbf{{w}}_{s}^{\top }\varphi (I, T)+ b_{s})}. \end{aligned}$$
(4)

To build the relevance between \(\bar{\psi }(I)\) and \(\theta ^{g}(T)\), we expect s(I, T) to be 1 when I and T belong to the same person and 0 when they belong to different persons. We thus impose the binary cross-entropy loss over the scores:

$$\begin{aligned} \small \mathcal {L}_{dis} = - \frac{1}{\hat{N}}\sum _{i,j} \Big [ l_{i,j}\log \big ( s(I_{i}, T_{j}) \big ) + (1-l_{i,j})\log \big (1-s(I_{i}, T_{j})\big ) \Big ], \end{aligned}$$
(5)

where \(\hat{N}\) is the number of sampled image-text pairs, and \(l_{i,j} =1\) if \(I_{i}\) and \(T_{j}\) describe the same person and \(l_{i,j}=0\) otherwise.
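The global discriminative association of Eqs. (4)-(5) can be sketched as follows; the module name and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalAssociation(nn.Module):
    """Sketch of Eq. (4): squared element-wise difference of the image and text
    features, followed by a linear projection and a sigmoid."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)                    # w_s, b_s

    def forward(self, psi_bar, theta_g):
        diff = psi_bar - theta_g
        joint = diff * diff                               # Hadamard product, phi(I, T)
        return torch.sigmoid(self.score(joint)).squeeze(-1)   # s(I, T) in (0, 1)

assoc = GlobalAssociation()
psi_bar = torch.randn(8, 256)                             # averaged image features of sampled pairs
theta_g = torch.randn(8, 256)                             # description features of the paired texts
same_id = torch.tensor([1., 1., 0., 0., 1., 0., 0., 0.])  # l_{i,j}: same person or not
s = assoc(psi_bar, theta_g)
loss_dis = nn.functional.binary_cross_entropy(s, same_id)  # Eq. (5)
```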

Discussion. Here, we draw a distinction between the proposed discriminative scheme and the bi-directional ranking [21, 41, 63], which is formulated by:

$$\begin{aligned} \small \mathcal {L}_{rank}= \frac{1}{\hat{N}}\sum _{i,j}\Big [\max (0, k_{i,j}- k_{i,i}+ \alpha )+ \max (0, k_{j,i}-k_{i,i} + \alpha )\Big ], \end{aligned}$$
(6)

where \(k_{i,j} = \bar{\psi }(I_{i})^{\top }\theta ^{g}(T_{j})\). The loss stipulates that the cosine similarity \(k_{i,i}\) for one image-text tuple should be higher than \(k_{i,j}\) or \(k_{j,i}\) for any \(i \!\ne \! j\) by at least a margin of \(\alpha \). We highlight two main differences between the proposed \(\mathcal {L}_{dis}\) (Eq. (5)) and \(\mathcal {L}_{rank}\): (1) As \(\mathcal {L}_{rank}\) was originally applied to the image-text retrieval task, it associates the image and text features by simply checking whether they come from the same tuple. Differently, \(\mathcal {L}_{dis}\) is based on the person ID, which is more reasonable as one description can well correspond to different images of the same person. (2) \(\mathcal {L}_{rank}\) estimates the image-text relevance by cosine similarity, requiring \(\bar{\psi }(I_{i})\) and \(\theta ^{g}(T_{j})\) to lie in the same feature space, whereas the proposed method does not have such a restriction.
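For reference, a simplified batch version of \(\mathcal {L}_{rank}\) in Eq. (6) looks like the following; it assumes the i-th image and i-th text form the matched tuple and that the features are L2-normalized so that the dot product acts as a cosine similarity.

```python
import torch

def ranking_loss(img_feats, txt_feats, margin=0.2):
    """Simplified bi-directional ranking loss (Eq. (6)) over a batch of N tuples."""
    sims = img_feats @ txt_feats.t()                           # k_{i,j}
    pos = sims.diag().unsqueeze(1)                             # k_{i,i}
    mask = 1.0 - torch.eye(sims.size(0))                       # exclude the matched pair itself
    cost_i2t = (margin + sims - pos).clamp(min=0) * mask       # image-to-text direction
    cost_t2i = (margin + sims.t() - pos).clamp(min=0) * mask   # text-to-image direction
    return (cost_i2t + cost_t2i).sum() / sims.size(0)

img = torch.nn.functional.normalize(torch.randn(8, 256), dim=1)
txt = torch.nn.functional.normalize(torch.randn(8, 256), dim=1)
print(ranking_loss(img, txt))
```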

3.3 Local Reconstructive Image-Language Association

A phrase usually describes only one part of an image and could be contained in the descriptions of different persons. For this reason, a phrase is not tied to a specific person ID, but it can still build correspondences with the particular region of the image it describes. We therefore propose a reconstruction scheme: the phrase feature \(\theta ^{l}(P)\) can select relevant feature vectors in the visual feature map \(\varPsi (I_{n})\) if \(P \in \mathcal {P}(T_{n})\), and the selected feature vectors should be able to reconstruct the phrase P in turn.

Image feature aggregation. Suppose P is a phrase that describes a specific region in image \(I_{n}\); we aim to estimate a vector \(\hat{\psi }_{P}(I_{n})\) that reflects the features in that region. For this purpose, we compute \(\hat{\psi }_{P}(I_{n})\) by weighted aggregation of the feature vectors \(\{\psi _{k}(I_{n})\}_{k=1}^{K}\) in the feature map \(\varPsi (I_{n})\): \(\hat{\psi }_{P}(I_{n}) = \sum _{k=1}^{K} r_{k}(P,I_{n})\psi _{k}(I_{n})\), where \(r_{k}(P, I_{n})\) is the attention weight reflecting the relevance between the phrase P and the feature vector \(\psi _{k}(I_{n})\). It is estimated by an attention function \(f_{att}\big (\psi _{k}(I_{n}), \theta ^{l}(P) \big )\), which first computes the unnormalized weight \( \bar{r}_{k}(P, I_{n})\) with a linear projection over the joint representation of \(\psi _{k}(I_{n})\) and \(\theta ^{l}(P)\): \( \bar{r}_{k}(P,I_{n}) = {\mathbf{{w}}}_{\bar{r}}^{\top } \big ((\psi _{k}(I_{n})- \theta ^{l}(P)) \circ (\psi _{k}(I_{n})- \theta ^{l}(P))\big ) + b_{\bar{r}}\), and then normalizes the values with a softmax operation over all the K bins: \(r_{k}(P,I_{n})= \textstyle {\exp (\bar{r}_{k}(P, I_{n})) /\sum _{k=1}^{K} \exp (\bar{r}_{k}(P, I_{n}))}\). In practice, the attention model easily overfits with limited training data. Besides, spatially adjacent feature vectors possibly describe the same phrase, so it is reasonable to merge them. For these reasons, we reduce the training burden by average pooling the neighboring feature vectors in \(\varPsi (I_{n})\) before the weighted aggregation, as illustrated in Fig. 4.
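A sketch of this phrase-guided attention and aggregation is given below; the 2x2 average pooling of neighboring bins, the layer names, and the feature-map size are assumptions.

```python
import torch
import torch.nn as nn

class PhraseAttention(nn.Module):
    """Sketch of the attention function f_att and the weighted aggregation
    producing psi_hat_P(I_n)."""
    def __init__(self, dim=256):
        super().__init__()
        self.w_r = nn.Linear(dim, 1)                           # w_r_bar, b_r_bar

    def forward(self, fmap, theta_l):
        # fmap: B x C x H x W (Psi(I_n)), theta_l: B x C (phrase feature)
        fmap = nn.functional.avg_pool2d(fmap, kernel_size=2)   # merge neighboring bins
        bins = fmap.flatten(2).transpose(1, 2)                 # B x K x C
        diff = bins - theta_l.unsqueeze(1)
        r_bar = self.w_r(diff * diff).squeeze(-1)              # unnormalized weights, B x K
        r = torch.softmax(r_bar, dim=1)                        # attention over the K bins
        psi_hat = (r.unsqueeze(-1) * bins).sum(dim=1)          # aggregated feature, B x C
        return psi_hat, r

attention = PhraseAttention()
fmap = torch.randn(4, 256, 8, 4)     # Psi(I_n); the spatial size is an assumption
theta_l = torch.randn(4, 256)        # theta_l(P)
psi_hat, weights = attention(fmap, theta_l)
```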

Fig. 4. The network structure for the local reconstructive image-language association. We first use the feature map \(\varPsi (I_{n})\) and the phrase feature \(\theta ^{l}(P)\) to compute attention weights for the intermediate features at different locations, then perform weighted aggregation to obtain the visual feature \(\hat{\psi }_{P}(I_{n})\), and finally employ an LSTM to reconstruct P from \(\hat{\psi }_{P}(I_{n})\).

Phrase reconstruction. To enforce the consistency between the aggregated feature \( \hat{\psi }_{P}(I_{n})\) and the input phrase P, we build the conditional probability \(p(P| \hat{\psi }_{P}(I_{n}))\) to reconstruct P from \( \hat{\psi }_{P}(I_{n})\). Since a phrase has an unbounded length M, it is common to apply the chain rule to model the probability over \(\{{\mathbf{{o}}}_{1}, {\mathbf{{o}}}_{2}, ..., {\mathbf{{o}}}_{M+1} \}\): \(\log p(P|\hat{\psi }_{P}(I_{n})) = \sum _{m=0}^{M} \log p\big (\mathbf{{o}}_{m+1}| \hat{\psi }_{P}(I_{n}), \hat{\mathbf{{o}}}_{0},..., \hat{\mathbf{{o}}}_{m}\big )\). More specifically, \(\mathbf{{o}}_{m+1}\,(m=0,...,M)\) is the random variable over the one-hot vectors of the (m+1)th word, and \(\{\hat{\mathbf{{o}}}_{0},...,\hat{\mathbf{{o}}}_{M+1}\}\) are the one-hot vectors of the ground-truth words, where \(\hat{\mathbf{{o}}}_{0}\) and \(\hat{\mathbf{{o}}}_{M+1}\) designate the start and end of the phrase. Inspired by the task of image caption generation [51, 57], an LSTM is employed to model \( p \big (\mathbf{{o}}_{m+1}| \hat{\psi }_{P}(I_{n}), \hat{\mathbf{{o}}}_{0},..., \hat{\mathbf{{o}}}_{m} \big )\). Specifically, we first feed \(\hat{\psi }_{P}(I_{n})\) to the LSTM, then feed the embedding of the current word to obtain the hidden state of the next step. The next-word probability is computed from the hidden state \(\mathbf{{h}}_{m\!+\!1}\) and the word embedding \(\mathbf{{e}}_{m}\): \(p\big ( \mathbf{{o}}_{m+1}| \hat{\psi }_{P}(I_{n}), \hat{\mathbf{{o}}}_{0},..., \hat{\mathbf{{o}}}_{m} \big ) \propto \exp (\mathbf{{W}}_{oh}\mathbf{{h}}_{m+1} + \mathbf{{W}}_{oe}\mathbf{{e}}_{m})\). The reconstruction loss is the sum of the negative log-likelihoods of the correct words at each step:

$$\begin{aligned} \small \mathcal {L}_{rec} = -\frac{1}{N}\sum _{n=1}^{N} \frac{1}{|\mathcal {P}(T_{n})|}\sum _{P\in \mathcal {P}(T_{n})} \log p\big (P| \hat{\psi }_{P}(I_{n})\big ). \end{aligned}$$
(7)
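A minimal sketch of this reconstruction decoder is given below. The handling of the start/end tokens, the mapping of \(\hat{\psi }_{P}(I_{n})\) into the LSTM input, and the vocabulary size are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PhraseDecoder(nn.Module):
    """Sketch of Eq. (7): the aggregated visual feature initializes an LSTM that
    predicts the phrase word by word; the per-step loss is the negative
    log-likelihood of the ground-truth next word."""
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.w_oh = nn.Linear(hidden_dim, vocab_size)
        self.w_oe = nn.Linear(embed_dim, vocab_size)
        self.vis_in = nn.Linear(hidden_dim, embed_dim)   # feed psi_hat as the first LSTM input

    def forward(self, psi_hat, word_ids):
        # word_ids: B x (M+2) ground-truth phrase with <start> and <end> tokens
        h = psi_hat.new_zeros(psi_hat.size(0), 256)
        c = torch.zeros_like(h)
        h, c = self.lstm(self.vis_in(psi_hat), (h, c))   # step 0: the visual feature
        loss = 0.0
        for m in range(word_ids.size(1) - 1):
            e_m = self.embed(word_ids[:, m])
            h, c = self.lstm(e_m, (h, c))
            logits = self.w_oh(h) + self.w_oe(e_m)       # proportional to p(o_{m+1} | ...)
            loss = loss + nn.functional.cross_entropy(logits, word_ids[:, m + 1])
        return loss   # summed over steps; cross_entropy already averages over the batch

decoder = PhraseDecoder()
psi_hat = torch.randn(4, 256)                            # aggregated features psi_hat_P(I_n)
phrase_ids = torch.randint(0, 5000, (4, 6))              # toy tokenized phrases
loss_rec = decoder(psi_hat, phrase_ids)
```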

3.4 Training and Testing

The final loss function is a combination of the image ID loss, the text ID loss as well as the discriminative and reconstructive image-language association losses:

$$\begin{aligned} \small \mathcal {L} = \mathcal {L}_{I}+ \lambda _{T}\mathcal {L}_{T}+ \lambda _{dis} \mathcal {L}_{dis}+\lambda _{rec} \mathcal {L}_{rec}, \end{aligned}$$
(8)

where \(\lambda _{T}, \lambda _{dis}\) and \(\lambda _{rec}\) are balancing parameters. For network training, we adopt stochastic gradient descent (SGD) with an initial learning rate of \(10^{-2}\), which is decayed to \(10^{-3}\) after the 20th epoch. We organize the training batches as follows. The data tuple \((I_{n}, T_{n}, l_{n})\) is first transformed to \((I_{n}, T_{n}, \mathcal {P}(T_{n}), l_{n})\). Each batch contains samples from 32 randomly selected persons, and each person has two randomly sampled tuples. For the global discriminative association, we form \(32 \times 4\) positive image-description pairs by exploiting all the intra-tuple and inter-tuple image-description compositions, and sample 6 negative pairs for each image, yielding \(64\times 6\) negative pairs and keeping the pos/neg ratio at 1:3. Meanwhile, the local reconstructive association is performed within each tuple.
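A sketch of the combined objective of Eq. (8) and the optimization schedule described above follows; the momentum value and the placeholder model are assumptions.

```python
import torch

def total_loss(loss_id, loss_text_id, loss_dis, loss_rec,
               lambda_t=0.1, lambda_dis=1.0, lambda_rec=1.0):
    """Eq. (8): weighted sum of the four losses, with the weights of Sect. 4.1."""
    return loss_id + lambda_t * loss_text_id + lambda_dis * loss_dis + lambda_rec * loss_rec

model = torch.nn.Linear(10, 10)                    # placeholder for the full network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # momentum is an assumption
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)
for epoch in range(30):
    # ... iterate over batches of 32 persons x 2 tuples, compute the four losses,
    # call total_loss(...).backward() and optimizer.step() ...
    scheduler.step()                               # 1e-2 -> 1e-3 after the 20th epoch
```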

In testing, only image features are extracted, and no language descriptions are used. The distance between two image features is simply the Euclidean distance, i.e., \(d_{i,j} = \Vert \phi (I_{i}) - \phi (I_{j})\Vert _{2}\). Person re-ID is performed by ranking the distances between the probe image and the gallery images in ascending order.
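The testing procedure thus amounts to a nearest-neighbor ranking in Euclidean space; a minimal sketch:

```python
import torch

def rank_gallery(probe_feat, gallery_feats):
    """Rank gallery images for one probe by Euclidean distance (ascending)."""
    dists = torch.cdist(probe_feat.unsqueeze(0), gallery_feats).squeeze(0)   # d_{i,j}
    return torch.argsort(dists)                     # nearest gallery indices first

probe = torch.randn(256)                            # phi(I) of the probe image
gallery = torch.randn(100, 256)                     # phi(I) of the gallery images
print(rank_gallery(probe, gallery)[:10])            # top-10 ranked gallery indices
```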

4 Experiments

We evaluate the proposed approach on four standard person re-ID datasets, whose language annotations can be fully or partially obtained from the CUHK-PEDES dataset [26]. Ablation studies are mainly conducted on Market-1501 [62] and CUHK-SYSU [56], which are convenient for extensive evaluation as they have fixed training/testing splits. We also report the overall results on Market-1501, CUHK03 [28] and CUHK01 [27] to compare with the state-of-the-art approaches.

4.1 Experimental Setup

Datasets and Metrics. To verify the utility of language descriptions in person re-ID, we augment four standard person re-ID datasets (Market-1501, CUHK03, CUHK01, and CUHK-SYSU) with language descriptions. The language descriptions are obtained from the CUHK-PEDES dataset, which was originally developed for cross-modal text-based person search and contains 40,206 images of 13,003 persons from five existing person re-ID datasets. Since persons in Market-1501 and CUHK03 have many similar samples, only four images of each person in these two datasets have language descriptions.

Among the four datasets, Market-1501, CUHK03 and CUHK01 follow the standard training and testing partitions. CUHK-SYSU is a newer dataset used for joint detection and identification. Following the partition in CUHK-PEDES, 15,080 images from 5,532 identities are used for training, and 8,341 images from 2,900 persons are used for testing, with 2,900 query images and 5,441 gallery images. Mean average precision (mAP) and CMC top-1, top-5, and top-10 accuracies are adopted as the evaluation metrics.

Implementation details. All person images are resized to 256\(\times \)128. For data augmentation, random horizontal flipping and random cropping are adopted. We empirically set the dimensions of the feature embeddings \(\phi (I)\), \(\theta ^{l}(P)\) and \(\theta ^{g}(T)\) to 256, and set the balancing parameters \(\lambda _{T}=0.1\), \( \lambda _{dis}=1\), \(\lambda _{rec}=1\). As some images in Market-1501 and CUHK03 do not have language descriptions, we employ the description of the same person (from the same camera if possible) to compose the data tuple \((I_{n}, T_{n}, l_{n})\) for them. The ResNet-50 backbone is initialized with the parameters pre-trained on ImageNet [16].
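The preprocessing described above might look as follows in torchvision; the padding size before random cropping and the normalization statistics are assumptions, as they are not specified in the paper.

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize((256, 128)),
    T.RandomHorizontalFlip(),
    T.Pad(10),                      # padding size is an assumption
    T.RandomCrop((256, 128)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet statistics
])

test_transform = T.Compose([
    T.Resize((256, 128)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```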

Table 1. The loss configurations for the baseline and other variants.

Baseline and variants. The baseline is just the visual CNN that produces the feature vector \(\phi (I)\), indicated by the red lines in Fig. 2. We additionally build 4 variants on top of the baseline for the ablation study. Their loss configurations are displayed in Table 1. Among them, basel. only imposes the ID loss to make \(\phi (I)\) separable for different persons. Both basel.+rank and basel.+GDA additionally impose the ID loss over the global description feature \(\theta ^{g}(T)\) but have different global image-language association schemes: basel.+rank employs \(\mathcal {L}_{rank}\) in Eq. (6), while basel.+GDA utilizes the proposed \(\mathcal {L}_{dis}\) in Eq. (5). The variant basel.+LRA employs the reconstruction loss \(\mathcal {L}_{rec}\) in Eq. (7) to build the local association between the aggregated feature vector \(\hat{\psi }_{P}(I_{n})\) and the phrase feature \(\theta ^{l}(P)\). Our proposed method takes advantage of both the global and local image-language association schemes.

Table 2. Comparison of different association schemes upon our baseline method. Top-1,-5,-10 accuracies (%) and mAP(%) are reported.

4.2 The Effect of Global Discriminative Association (GDA)

Comparison with non-discriminative variants. We evaluate the effect of the global discriminative image-language association by comparing the variants with and without the description feature \(\theta ^{g}(T)\). Among them, basel.+GDA improves basel. by 5.6% and 4.4% in terms of mAP on Market-1501 and CUHK-SYSU, respectively (Table 2), which shows that GDA can benefit the learning of the visual representation. Furthermore, our proposed method yields better performance than \( basel.+LRA \), indicating that the effect of the global discriminative association is complementary to that of the local reconstructive association.

Comparison with bi-directional ranking loss [21, 63]. \(\mathcal {L}_{dis}\) in GDA aims to discriminate the matched image-text pairs from the unmatched ones. It serves a similar function to the bidirectional ranking loss \(\mathcal {L}_{rank}\) (Eq. (6)) used for image-language cross-modal retrieval. We implement two types of ranking losses for comparison. The first is similar to the loss in [21], where a positive image-text pair is composed of the image and text from the same tuple. The other adopts the loss in [63], where the positive image-text pairs are obtained by arbitrary image-text combinations from the same person. We modify basel.+GDA by replacing \(\mathcal {L}_{dis}\) with the two loss functions, and denote the variants by \(basel.+rank ^{1} \) and \(basel.+rank ^{2}\), respectively. The results in Table 2 show that both ranking losses can boost the baseline. Besides, \(basel.\!+\! rank ^{2} \) is better than \(basel.\!+\! rank ^{1} \) by incorporating more abundant positive samples for discrimination. The proposed basel.+GDA further improves the mAP by 2.3% and 1.4% on Market-1501 and CUHK-SYSU, verifying the effectiveness of our relevance estimation strategy (Eq. (4)).

The importance of \(\mathcal {L}_{T}\). To preserve the separability of the visual feature, the associated linguistic feature \(\theta ^{g}(T)\) is supposed to be discriminative for different persons; thus \(\mathcal {L}_{T}\) is employed along with \(\mathcal {L}_{dis}\). We investigate the importance of \(\mathcal {L}_{T}\) based on basel.+GDA and observe how the performance changes with \(\lambda _{T}\) in Table 3. Slightly worse results are observed when \(\lambda _{T}=0\), indicating that \(\mathcal {L}_{T}\) is indispensable. On the other hand, the optimal results are achieved when \(\lambda _{T}\) is around 0.1. One possible reason is that a language description is sometimes too ambiguous to describe a specific person, making \(\mathcal {L}_{I}\) and \(\mathcal {L}_{T}\) not equally important. For example, “The man wears a blue shirt” can simultaneously describe different persons wearing a dark blue shirt and a light blue shirt.

4.3 The Effect of Local Reconstructive Association (LRA)

Comparison with non-reconstructive variants. We evaluate the effect of the local reconstructive association by comparing the variants with and without the local phrase feature \(\theta ^{l}(P)\). The performance gap between basel. and basel.+LRA proves the effectiveness of LRA for visual feature learning. Employing LRA brings \(5.2\%\) and \(3.9\%\) mAP gains on the two datasets, which is close to the gain of employing GDA. Besides, the fact that the proposed method is better than basel.+GDA also indicates the effectiveness of LRA.

Visualization of phrase-guided attention weights. We compute the attention weights for a specific phrase (Sect. 3.3), align the weights to the corresponding image, and obtain a heat map for the phrase. The heat maps are displayed in Fig. 5, showing that the attention weights can roughly capture the local regions described by the phrases.
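Such a heat map can be produced by upsampling the K attention weights of a phrase to the image resolution and overlaying them on the image; a minimal sketch follows, where the grid and image sizes are assumptions.

```python
import torch
import torch.nn.functional as F

def attention_heatmap(weights, grid_hw, image_hw):
    """Upsample the K = H*W attention weights of a phrase to the image size."""
    wmap = weights.view(1, 1, *grid_hw)                      # 1 x 1 x H x W
    heat = F.interpolate(wmap, size=image_hw, mode="bilinear", align_corners=False)
    return heat[0, 0]                                        # image_h x image_w map

weights = torch.softmax(torch.randn(16), dim=0)              # attention over a 4x4 grid
heat = attention_heatmap(weights, (4, 4), (256, 128))
```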

4.4 Results on Text-to-Image Retrieval

As a by-product, our method can also be utilized for text-to-image retrieval, which is fulfilled by ranking the cross-modal relevance scores (Eq. (4)). We report the retrieval results on CUHK-PEDES following the standard protocol, where there are 3,074 test images with 6,156 captions, 3,078 validation images with 6,158 captions, and 34,054 training images with 68,126 captions. The quantitative and qualitative results are reported in Table 4 and Fig. 6, respectively. Although our method is not specifically designed for this task, it achieves results competitive with the current state-of-the-art methods.
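Concretely, retrieval reduces to scoring every gallery image against the query description with Eq. (4) and sorting; a minimal sketch is given below with a stand-in scoring function (the GlobalAssociation module sketched in Sect. 3.2 would be plugged in instead).

```python
import torch

def text_to_image_search(theta_g_query, psi_bar_gallery, score_fn):
    """Rank gallery images for one text query by the relevance s(I, T) of Eq. (4)."""
    query = theta_g_query.unsqueeze(0).expand(psi_bar_gallery.size(0), -1)
    scores = score_fn(psi_bar_gallery, query)        # s(I, T) for every gallery image
    return torch.argsort(scores, descending=True)

# Toy usage with a stand-in scoring function (negative squared distance).
dummy_score = lambda im, tx: -((im - tx) ** 2).sum(dim=1)
ranking = text_to_image_search(torch.randn(256), torch.randn(50, 256), dummy_score)
print(ranking[:10])
```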

Table 3. Importance analysis of \(\mathcal {L}_{T}\) in basel.+GDA. We fix \(\lambda _{dis}=1\) and adjust \(\lambda _{T}\) over 0, 0.05, 0.1, 0.5, 1. Top-1,-5,-10 accuracies (%) and mAP(%) are reported.
Table 4. Results on CUHK-PEDES.

4.5 Comparison with the State-of-the-Art Approaches

We compare our method with the current state-of-the-art approaches on the Market-1501, CUHK03, and CUHK01 datasets. The results on Market-1501 are reported in Table 6 (left). Our method outperforms all the other approaches in terms of mAP and top-1 accuracy under both single-query and multi-query protocols. Note that the baseline of our method is quite competitive with most of the previous methods, partly because of the well-initialized ResNet-50 backbone and proper data augmentation strategies. The proposed image-language association schemes can largely boost this well-performing baseline, making our method better than the recent state-of-the-art methods [2, 6]. CUHK03 has two types of person bounding boxes: one is manually labelled, and the other is obtained by a pedestrian detector. We compare our method with the others on both types, and report the top-1 and top-5 accuracies in Table 6 (right). It can be seen that our method has significant advantages in top-1 accuracy, but is 0.2% lower than D-person [6] in top-5 accuracy for the labelled bounding boxes. As D-person only utilizes image data, it is promising to apply our language association schemes to D-person for better performance. Compared with Market-1501 and CUHK03, CUHK01 has fewer images for training. As shown in Table 5, the proposed association schemes bring a 7.8% top-1 accuracy gain over the baseline on CUHK01. The results confirm the effectiveness of language descriptions, and indicate that the schemes may be even more useful when image data are insufficient.

Table 5. Results on CUHK01. Top-1,-5,-10 accuracies(%) are reported.
Fig. 5. Heat maps of the attention weights. The phrases are placed on the left of the corresponding heat maps. Zoom in for a better view of the phrases.

Fig. 6. Examples of text-to-image search. The 24 most relevant images are displayed. Red boxes indicate the ground truth.

Among the compared approaches, Spindle [59] and PDC [47] utilize pose landmarks, CADL [32] employs camera ID labels, and ACN [44] makes use of attributes for training. We achieve better results than all of them on the three datasets (Tables 5 and 6). The results indicate that language descriptions are also a kind of useful auxiliary information for person re-ID; with the proposed schemes, superior performance can be achieved with a standard CNN architecture.

Table 6. Comparison with the state-of-the-art methods on the Market-1501 and CUHK03 datasets. The results on Market-1501 are under single-query and multi-query protocols; mAP (%) and top-1 accuracy (%) are reported. The performances on CUHK03 are evaluated with labeled and detected bounding boxes; top-1 and top-5 accuracies (%) are reported.

5 Conclusions

We utilized language descriptions as additional training supervisions to improve the visual features for person re-identification. Global and local image-language association schemes have been proposed: the former learns better global visual features under the discriminative supervision of the overall language descriptions, while the latter enforces semantic consistencies between local visual features and noun phrases via phrase reconstruction. Our ablation studies show that the proposed image-language association schemes can remarkably improve the learning of visual features and are more effective than existing image-text joint embedding methods. The proposed method achieves state-of-the-art performance on three public person re-ID datasets.