1 Introduction

Person re-identification (Re-ID) is an active research problem in computer vision and visual surveillance. The aim of Re-ID is to match a probe person image against a set of gallery images captured by cameras with disjoint views. Many existing methods [12, 15, 28, 30] first extract a global feature representation for each person image and then employ metric learning to perform a holistic comparison between test images.

One main challenge for Re-ID is the spatial misalignment between image pairs caused by large variations in camera view or human pose. Traditional global-feature methods generally ignore this misalignment. To alleviate the issue, a popular strategy is to use part (or patch) based metric learning [16, 20, 23, 29, 33, 36]. These methods first partition each person image into a set of local patches, then perform online patch-wise matching to obtain the spatial correspondences between the patches of different images, and finally combine the computed patch-wise matching with local patch features to build a robust metric for person Re-ID. However, one main issue of this online patch-wise matching is that it may produce mismatches among patches because (1) it lacks spatial and visual context information among local patches and (2) patches with similar appearance or occlusions are common.

To overcome this issue, recent works [16, 23, 36] develop matching learning strategies for Re-ID. These methods first obtain reliable patch-wise matchings between training images in the training phase. The learned matchings are then utilized or transferred to guide robust patch-wise matching between test images in the testing phase.

However, the graph matching methods they use generally do not explicitly enforce the one-to-one matching constraint or account for outlier patches, which may lead to inaccurate correspondence relationships. In this paper, we propose a novel patch graph matching model for the Re-ID problem, whose aim is to obtain a robust one-to-one matching solution for the patches of two images. In the training phase, we use the proposed matching model to learn an optimal correspondence relationship for each positive sample pair. In the testing phase, we select the top R reference pairs by pose-pair similarity and apply their correspondence relationships to the new test image pair. Finally, we adopt a local-global distance metric for the Re-ID problem. Overall, this paper makes the following contributions.

  • To reduce the impact of outliers and obtain robust patch-wise correspondence relationships, we propose a novel graph matching model that better resolves the spatial misalignment problem.

  • We propose a novel person Re-ID approach that exploits visual context information and learns spatial correspondence relationships simultaneously, avoiding the limitations caused by purely local information and misalignment.

  • Experimental results demonstrate that the proposed Re-ID method outperforms other state-of-the-art approaches, validating its effectiveness.

2 Related Work

Here, we briefly review related works devoted to spatial misalignment in person Re-ID. Oreifej et al. [20] utilize the Earth Mover's Distance (EMD) to obtain the overall similarity from the similarities between extracted patches. However, this ignores the spatial context of patches, which may lead to mismatches for patches with similar appearance or occlusions. Cheng et al. [9] alleviate the influence of misalignment using body part detection; the effectiveness of this approach relies on the body part detection result, which may degrade in the presence of occlusion. Some recent works [2, 17, 33] also explore saliency or body priors to guide patch-wise matching between image pairs.

One main limitation of the above online matching is that it may produce mismatches among patches due to (1) not using spatial context information among patches and (2) similar patch appearances or occlusions. To alleviate this limitation, Zhou et al. [36] recently propose to use a graph matching technique to obtain the optimal correspondence for each image pair during the training phase, and then transfer the learned patch-wise correspondence directly to the test image pair based on the pose-pair configuration. This approach relies on the image-level matching results obtained in the training phase. Lin et al. [16, 23] propose to learn a correspondence structure via a boosting-based approach for each camera (pose) pair in the training phase; the learned correspondence structure is then utilized to guide robust patch matching between test images. However, this method does not consider the spatial context information of patches in the matching process, which may be less effective in the presence of similar patch appearances.

Fig. 1. Framework of the proposed approach.

3 The Proposed Model

In this section, we present our patch matching model, followed by an effective update algorithm to compute it. The complete Re-ID approach is presented in Sect. 4.

3.1 Model Formulation

Given a positive image pair I and \(I'\), we first divide them into overlapping patches \(P=(p_1,p_2\cdots p_n)\) and \(P' = (p'_1,p'_2\cdots p'_m)\), respectively. Then, we extract a feature descriptor for each patch. Our aim is to find the correspondence relationship between the patches of the two images. To do so, we construct an attributed relation graph \(G=(V,E,A,R)\) for image I, where the nodes V represent the patches P and the edges E denote the relationships among patches. Each node \(v_i\in V\) has an associated attribute vector \(\mathbf a _i \in A\) and each edge \(e_{ih} \in E\) has a weight \(\mathbf r _{ih} \in R\). Similarly, we construct a graph \(G'=(V',E',A',R')\) for \(I'\). Based on this graph representation, the patch matching problem can be reformulated as finding the correspondences between the nodes of the two graphs. Let \(\mathbf{Z }\in \{0,1\}^{n \times m}\) denote the correspondence solution between the two graphs, in which \(\mathbf{Z }_{ij}=1\) implies that node \(v_i \in G\) corresponds to node \(v^{'}_j \in G^{'}\), and \(\mathbf{Z }_{ij}=0\) otherwise. To obtain the optimal \(\mathbf{Z }\), we define an affinity matrix \(\mathbf K \). The diagonal term \(\mathbf K _{ij,ij}\) of \(\mathbf K \) represents the unary affinity \(f_a(\mathbf a _i, \mathbf a _j )\) that measures how well node \(v_i \in V\) matches node \(v^{'}_j \in V^{'}\). The off-diagonal element \(\mathbf K _{ij,hk}\) contains the pair-wise affinity \(f_r(\mathbf r _{ih}, \mathbf r _{jk} )\) that measures how compatible the nodes \((v_i,v_h)\) in G are with the nodes \((v^{'}_j,v^{'}_k)\) in \(G^{'}\). We obtain the optimal \(\mathbf{Z }\) by optimizing the following objective function,

$$\begin{aligned} \begin{aligned}&\max _{\mathbf{Z }} \,\,\, \mathcal {Q}_{\mathrm {IQP}}=\text { vec}{(\mathbf{Z })}^\mathrm {T}{} \mathbf K \text { vec}{(\mathbf{Z })},\\&s.t. \,\, \forall {i} \sum ^{m}_{j=1}{} \mathbf Z _{ij}=1,\,\, \forall {j} \sum ^{n}_{i=1}{} \mathbf Z _{ij}\le 1,\,\, \mathbf Z _{ij}\in \{0,1\} \end{aligned} \end{aligned}$$
(1)
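To make the construction of \(\mathbf K \) concrete, the following minimal sketch (not part of the original method) builds the \(nm \times nm\) affinity matrix from patch descriptors and patch center coordinates. The Gaussian kernels and the bandwidths sigma_a and sigma_r are illustrative assumptions; any reasonable unary and pairwise affinity functions could be substituted.

```python
import numpy as np

def build_affinity_matrix(feats_p, feats_g, coords_p, coords_g, sigma_a=1.0, sigma_r=1.0):
    """Build the (n*m) x (n*m) affinity matrix K for two patch graphs.

    feats_p : (n, d) patch descriptors of image I;   feats_g : (m, d) of I'.
    coords_p: (n, 2) patch centers of I;             coords_g: (m, 2) of I'.
    The unary affinity f_a goes on the diagonal K[ij, ij]; the pairwise
    affinity f_r compares edge attributes (here: relative patch offsets).
    """
    n, m = feats_p.shape[0], feats_g.shape[0]
    K = np.zeros((n * m, n * m))
    for i in range(n):
        for j in range(m):
            ij = i * m + j
            # unary affinity between patch i of I and patch j of I'
            K[ij, ij] = np.exp(-np.linalg.norm(feats_p[i] - feats_g[j]) ** 2 / sigma_a)
            for h in range(n):
                for k in range(m):
                    if i == h or j == k:
                        continue
                    hk = h * m + k
                    # pairwise affinity: how compatible edge (i,h) in G is with (j,k) in G'
                    r_ih = coords_p[i] - coords_p[h]
                    r_jk = coords_g[j] - coords_g[k]
                    K[ij, hk] = np.exp(-np.linalg.norm(r_ih - r_jk) ** 2 / sigma_r)
    return K
```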

It is known that the above problem is a Quadratic Assignment Problem (QAP), which is NP-hard. Therefore, relaxation models are required to find approximate solutions. For the person image matching problem, an ideal matching relaxation model should satisfy the following two requirements. (1) A one-to-one matching constraint should be imposed on the final matching results, i.e., each patch in image I should correspond to at most one patch in \(I'\). (2) There may exist outlier patches in both I and \(I'\), so the matching process should be robust to them. To address these issues, we propose to obtain the optimal matching \(\mathbf{Z }\) by solving the following novel sparse relaxation matching problem,

$$\begin{aligned} \max _{\mathbf{Z }}\,\, \text { vec}{(\mathbf{Z })}^\mathrm {T}{} \mathbf K \text { vec}{(\mathbf{Z })}-\beta \left\| {\mathbf{Z }}\right\| _2^2 \quad \text {s. t.}\quad \left\| \mathbf Z \right\| _{1,2}=1,\mathbf Z \ge 0 \end{aligned}$$
(2)

where \(\left\| \mathbf Z \right\| _{1,2}=(\sum _i(\sum _j|\mathbf Z _{ij}|)^2)^{1/2}\) is used to encourage local sparsity and thus the one-to-one matching constraint [3]. The \(\ell _2\)-norm regularization term controls the compactness of the inliers, which makes the model robust to outlier patches, as discussed in [26].
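As a small worked illustration of the quantities in Eq. (2), the sketch below evaluates the \(\ell _{1,2}\) norm and the relaxed objective; the row-major vectorization (index \(ij = i\cdot m + j\)) is an assumption chosen for consistency with the affinity-matrix sketch above.

```python
import numpy as np

def l12_norm(Z):
    # ||Z||_{1,2} = (sum_i (sum_j |Z_ij|)^2)^(1/2): l1 over each row, l2 across rows
    return np.sqrt(np.sum(np.sum(np.abs(Z), axis=1) ** 2))

def relaxed_objective(Z, K, beta):
    # vec(Z)^T K vec(Z) - beta * ||Z||_2^2, the objective of Eq. (2)
    z = Z.reshape(-1)          # row-major vec(Z), index ij = i*m + j
    return z @ K @ z - beta * np.sum(Z ** 2)
```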

3.2 Computational Algorithm

The proposed patch matching model can be solved effectively via a simple multiplicative update algorithm. Starting from \(\mathbf{Z }^{(0)}\), the proposed algorithm conducts the following update until convergence.

$$\begin{aligned} \mathbf{Z }_{ij}^{(t+1)}=\mathbf{Z }_{ij}^{(t)} \sqrt{ \frac{\mathbf{M }_{ij}^{(t)}}{\lambda {\sum }_j \mathbf{Z }_{ij}^{(t)} +\beta \mathbf{Z }_{ij}^{(t)}} } \end{aligned}$$
(3)

where matrix \(\mathbf M ^{(t)} \in \mathbb {R}^{n\times m}\) is the matrix form of the vector \([\mathbf K \, \mathrm {vec} (\mathbf{Z }^{(t)})]\), and \(\lambda \) is computed as,

$$\begin{aligned} \lambda =\mathrm {vec}(\mathbf{Z }^{(t)})^\mathrm {T}{} \mathbf K \, \mathrm {vec}(\mathbf{Z }^{(t)})-\beta \, \mathrm {vec}(\mathbf{Z }^{(t)})^\mathrm {T}\mathrm {vec}(\mathbf{Z }^{(t)}) \end{aligned}$$
(4)
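The following sketch implements one reading of the update in Eqs. (3) and (4), with the \(\beta \mathbf{Z }_{ij}\) term in the denominator as obtained from the KKT condition of the Lagrangian in Eq. (5). The uniform initialization, the fixed iteration count, and the small constant added for numerical stability are our assumptions, not details specified in the paper.

```python
import numpy as np

def sparse_graph_matching(K, n, m, beta=0.1, n_iter=200, eps=1e-12):
    """Multiplicative updates for the relaxed matching problem of Eq. (2) -- a sketch."""
    Z = np.ones((n, m))
    Z /= np.sqrt(np.sum(Z.sum(axis=1) ** 2))          # enforce ||Z||_{1,2} = 1
    for _ in range(n_iter):
        z = Z.reshape(-1)                             # row-major vec(Z)
        M = (K @ z).reshape(n, m)                     # matrix form of K vec(Z)
        lam = z @ K @ z - beta * (z @ z)              # Lagrange multiplier, Eq. (4)
        row_sums = Z.sum(axis=1, keepdims=True)
        Z = Z * np.sqrt(M / (lam * row_sums + beta * Z + eps))   # update, Eq. (3)
    return Z                                          # relaxed (continuous) correspondences
```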

Theoretical Analysis. The optimality and convergence of the algorithm are guaranteed by Theorems 1 and 2, respectively.

Theorem 1

Update rule of Eq. (3) satisfies the first-order Karush-Kuhn-Tucker (KKT) optimality condition.

Theorem 2

Under the update rule of Eq. (3), the Lagrangian function \(\mathcal {L}(\mathbf{Z })\) in Eq. (5) is monotonically increasing.

$$\begin{aligned} \mathcal {L}(\mathbf{Z }) = \text { vec}{(\mathbf{Z })}^\mathrm{T}\mathbf{K }\text { vec}{(\mathbf{Z })} -\beta \left\| \mathbf{Z }\right\| _2^2 -\lambda (\sum _i(\sum _j \mathbf{Z }_{ij})^2 -1) \end{aligned}$$
(5)

The proofs can be derived similarly to [3] and are omitted here due to limited space.

4 Person Re-ID

In this section, we describe our Re-ID approach based on the proposed patch matching model. The complete process is shown in Fig. 1.

4.1 Training Stage

Given a positive image pair \(I_p\) and \(I_g\), we first decompose them into many overlapping patches. Then, we construct a graph from these patch features and learn their patch-wise correspondence \(\mathbf Z \) via the proposed matching model. The graph matching component can also be replaced by classical graph matching methods such as [4, 11].

In Re-ID, \(\mathbf Z _{ij}=1\) means that the \(i^{th}\) patch in \(I_p\) semantically corresponds to the \(j^{th}\) patch in \(I_g\); the graph matching model is detailed in the previous section. In this stage, we obtain satisfactory patch-wise matching results for Re-ID, and these correspondence relationships are later used in the distance measure.
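For illustration, a relaxed correspondence matrix can be converted into the stored patch-index pairs of a template as in the hypothetical helper below; the row-wise argmax binarization and the score threshold are our assumptions, since the discretization step is not specified here.

```python
import numpy as np

def extract_template(Z, min_score=0.0):
    """Convert a (relaxed) correspondence matrix Z into a list of (i, j) patch pairs.

    Hypothetical helper: each probe patch i is assigned its best-scoring gallery
    patch j, and weak assignments with score <= min_score are discarded.
    """
    pairs = []
    for i in range(Z.shape[0]):
        j = int(np.argmax(Z[i]))
        if Z[i, j] > min_score:
            pairs.append((i, j))
    return pairs   # the S_i correspondences z_ij stored in one template
```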

4.2 Testing Stage

We use a local-global scheme for the final distance metric because local information alone is one-sided and global information alone easily leads to misalignment. We compute the final distance D as follows,

$$\begin{aligned} D({I_p^{'},I_g^{'}})=\alpha D_l+(1-\alpha ) D_g \end{aligned}$$
(6)

where \(D_g\) and \(D_l\) represent the global and local distances between the test images \(I_g^{'}\) and \(I_p^{'}\), respectively, and \(\alpha \) is a balance parameter.

Local Distance Metric. Having learned the patch-wise correspondence relationships of all positive sample pairs in the training stage, we select the R positive reference pairs most similar to the test pair by comparing pose similarity. The local distance is computed as follows,

$$\begin{aligned} D_l(I_p^{'},I_g^{'})=\sum _{i=1}^Rw_{i}\sum _{j=1}^{S_i}\zeta (f_p^{'},f_g^{'}) \end{aligned}$$
(7)

where \(f_p^{'}\) and \(f_g^{'}\) represent features of matched patches in the probe image \(I_p^{'}\) and the gallery image \(I_g^{'}\), and \(\zeta (\cdot )\) denotes the KISSME metric [12]. We use \(\varPhi =\{\phi _i\}_{i=1}^R\) to represent the R selected templates, where each template \(\phi _i=\{z_{ij}\}_{j=1}^{S_i}\) contains a total of \(S_i\) patch-wise correspondences, and each correspondence \(z_{ij}\) denotes the positions of matched patches calculated from \(\mathbf Z \), which is consistent with the original method [36]. The difference is that we normalize, weight, and sum over the selected R reference pairs, so that the more similar the pose is, the higher the weight \(w_i\) is.
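A minimal sketch of Eqs. (6) and (7) is given below. It assumes the patch features of the test pair, the R selected templates, the normalized pose-similarity weights \(w_i\), and a KISSME distance function are already available; all function and variable names are illustrative, not part of the original method.

```python
import numpy as np

def local_distance(patches_p, patches_g, templates, weights, kissme_dist):
    """Weighted local distance D_l of Eq. (7).

    patches_p, patches_g : (num_patches, d) features of the probe/gallery test images.
    templates            : list of R templates, each a list of (i, j) patch-index pairs.
    weights              : (R,) normalized pose-similarity weights w_i.
    kissme_dist          : callable implementing the KISSME patch metric (assumed given).
    """
    D_l = 0.0
    for w_i, template in zip(weights, templates):
        for i, j in template:
            D_l += w_i * kissme_dist(patches_p[i], patches_g[j])
    return D_l

def final_distance(D_l, D_g, alpha=0.5):
    # Eq. (6): fuse local and global distances with the balance parameter alpha
    return alpha * D_l + (1 - alpha) * D_g
```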

Global Distance Metric. For the test pair \(I_p^{'}\) and \(I_g^{'}\), we adopt LOMO+XQDA [15] to compute the global distance. We combine this global information with the patch-wise feature distances over the correspondences of the selected references, so that both local and global distances between the test image pair can be calculated. In all our experiments, we use Local Maximal Occurrence (LOMO) features [15] as the patch feature representation, both locally and globally.

5 Experiments

5.1 Datasets

VIPeR Dataset: The VIPeR dataset [10] includes 1264 images of 632 pedestrians, each with two images collected from cameras A and B. Each image is resized to \(128\times 48\). The dataset is characterized by a large diversity of viewpoints and illumination.

Road Dataset: This dataset [23] is captured from a crowd road scene by two cameras and consists of 416 image pairs. It is very challenging due to the large variation of human pose and camera view.

PRID450S Dataset: This dataset [22] consists of 450 image pairs from two camera views. The low image qualities and camera viewpoint changes make it very challenging for person re-identification.

CUHK01 Dataset: This dataset [13] consists of 971 individuals captured from two disjoint camera views. The images in this dataset are of higher resolution. We adopt the commonly used 485/486 split for person re-identification evaluation.

Table 1. Comparisons of top r matching rate using CMC on VIPeR dataset. The best results are marked in bold.
Table 2. Comparisons of top r matching rate using CMC on Road dataset. The best results are marked in bold.

5.2 Evaluation Settings

Parameter Setup. We follow previous methods and perform experiments under the half-training and half-testing setting. All images are scaled to \(128 \times 48\). The patch size is set to \(32 \times 24\), and the stride between neighboring patches is 6 pixels horizontally and 8 pixels vertically for both probe and gallery images. For each stripe in the probe image, patch-wise correspondences are established within a search range in the corresponding gallery stripe.
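The patch decomposition described above can be sketched as follows. The dense grid of overlapping \(32 \times 24\) patches with strides of 6 and 8 pixels follows the setup stated in this section, while the function itself is an illustrative assumption.

```python
import numpy as np

def extract_patches(image, patch_h=32, patch_w=24, stride_x=6, stride_y=8):
    """Decompose a 128 x 48 person image into overlapping patches.

    Returns the stacked patches and the top-left coordinates of each patch.
    """
    H, W = image.shape[:2]
    patches, coords = [], []
    for y in range(0, H - patch_h + 1, stride_y):
        for x in range(0, W - patch_w + 1, stride_x):
            patches.append(image[y:y + patch_h, x:x + patch_w])
            coords.append((y, x))
    return np.stack(patches), np.array(coords)
```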

Evaluation. On all datasets, both the training/testing set partition and probe/gallery set partition are performed 10 times and average performance is reported. The performance is evaluated by using the Cumulative Matching Characteristic (CMC) curve, which represents the expected probability of finding the correct match for a probe image in the top r matches in the gallery list [36]. Tables 1, 2, 3 and 4 show the CMC results of different methods on four datasets and Fig. 2 shows the CMC curves on three datasets.
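As an illustration of how the CMC scores are computed, a simple single-shot sketch is given below; it assumes that probe i's true match sits at gallery index i, which is a simplification of the actual evaluation protocol.

```python
import numpy as np

def cmc_curve(dist, max_rank=20):
    """CMC from a (num_probe, num_gallery) distance matrix (single-shot setting)."""
    num_probe = dist.shape[0]
    hits = np.zeros(max_rank)
    for i in range(num_probe):
        order = np.argsort(dist[i])                  # gallery ranked by distance
        rank = int(np.where(order == i)[0][0])       # position of the true match
        if rank < max_rank:
            hits[rank] += 1
    return np.cumsum(hits) / num_probe               # matching rate at ranks 1..max_rank
```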

Table 3. Comparisons of top r matching rate using CMC on PRID450S dataset. The best results are marked in bold.
Table 4. Comparisons of top r matching rate using CMC on CUHK01 dataset. The best results are marked in bold.

5.3 Results

To evaluate the effectiveness of the proposed method, we compare it with other methods, including KISSME [12], SVMML [14], SalMatch [35], ELS [6], MFA [28], kLFDA [28], IDLA [1], JLR [27], LOMO + XQDA [15], Semantic [25], LMF + LADF [34], DCSL [32], deCPPs + MER [24], TCP [9], TMA [18], LOMO-fusing [31], SCNCD [30], Mirror-KMFA [8], eSDC-knn [33], CSL [23], single-KMFA [16], multi-manu-KMFA [16], DeepRanking [7], ME [21], GOG [19], CSBT [5], and GCT [36].

Fig. 2. CMC scores comparison on different datasets.

Tables 1, 2, 3 and 4 summarize the comparison results. We note the following: (1) Our approach performs better than graph correspondence transfer (GCT) [36], which demonstrates the effectiveness and robustness of the proposed patch matching model in reducing the impact of outliers. (2) Our method also outperforms many other Re-ID methods and obtains the best performance on all datasets, which indicates the effectiveness of the proposed Re-ID approach.

6 Conclusion

In this paper, we propose a new model to address cross-view spatial misalignment in person Re-ID. We first propose a novel local sparse matching model to learn the correspondence relationship between the patches of an image pair in the training stage. Then, in the testing stage, we adopt a local-global distance measure to make the matching of person images more accurate. Extensive experimental results on several benchmarks demonstrate the effectiveness of the proposed Re-ID method.