
1 Introduction

Person re-identification is a fundamental problem in automated video surveillance and has attracted significant attention in recent years [7, 23, 78]. When a person is captured by cameras with non-overlapping views, or by the same camera but over many days, the objective is to recognize them across views among a large number of imposters. This is a difficult problem because of the visual ambiguity in a person’s appearance due to large variations in illumination, human pose, camera settings and viewpoint. Additionally, re-identification systems have to be robust to partial occlusions and cluttered background. Multi-person association has wide applicability and utility in areas such as robotics, multimedia, forensics, autonomous driving and cashier-free shopping.

Fig. 1.

Filter responses from “conv1” (upper right), “conv2” (bottom left) and “conv3” (bottom right) layers for a given frame from the TUM GAID data using (a) a framework for person re-identification from RGB [82] and (b) the feature embedding \(f_{CNN}\) of our framework, which is drawn in Fig. 3 and exclusively utilizes depth data.

1.1 Related Work

Existing methods of person re-identification typically focus on designing invariant and discriminant features [10, 22, 24, 38, 43, 46, 50, 87, 100], which can enable identification despite nuisance factors such as scale, location, partial occlusion and changing lighting conditions. In an effort to improve their robustness, the current trend is to deploy higher-dimensional descriptors [43, 47] and deep convolutional architectures [1, 17, 40, 45, 65, 73, 79, 82, 83, 89, 101, 109].

In spite of the ongoing quest for effective representations, it is still challenging to deal with very large variations such as ultra wide-baseline matching and dramatic changes in illumination and resolution, especially with limited training data. As such, there is a vast literature on learning discriminative distance metrics [5, 19, 35, 42, 43, 48, 51, 53, 55, 61, 77, 91, 105, 108] and discriminant subspaces [15, 43, 46, 47, 63, 64, 84, 94, 107]. Other approaches handle the problem of pose variability by explicitly accounting for spatial constraints of the human body parts [12, 39, 96, 97] or by predicting the pose from video [16, 72].

However, a key challenge to tackle within both distance learning and deep learning pipelines in practical applications is the small sample size problem [14, 94]. This issue is exacerbated by the lack of large-scale person re-identification datasets. Some new ones have been released recently, such as CUHK03 [40] and MARS [102], a video extension of the Market-1501 dataset [103]. However, their training sets are on the order of 20,000 positive samples, i.e. two orders of magnitude smaller than Imagenet [66], which has been successfully used for object recognition [37, 69, 75].

The small sample size problem is especially acute in person re-identification from temporal sequences [9, 26, 54, 86, 110], as the feature dimensionality grows linearly with the number of accumulated frames compared to single-shot representations. On the other hand, explicitly modeling temporal dynamics and using multiple frames helps algorithms deal with noisy measurements, occlusions, adverse poses and lighting.

Regularization techniques, such as Batch Normalization [30] and Dropout [27], help learn models with better generalization capability. Xiao et al. [82] achieved top accuracy on several benchmarks by leveraging their proposed “domain-guided dropout” principle. After their model is trained on a union of datasets, it is further enhanced on individual datasets by adaptively setting the dropout rate for each neuron as a function of its activation rate on the training data.

Haque et al. [26] designed a glimpse layer and used a 4D convolutional autoencoder in order to compress the 4D spatiotemporal input video representation, while the next spatial location (glimpse) is inferred within a recurrent attention framework using reinforcement learning [56]. However, for small patches (at the glimpse location), the model loses sight of the overall body shape, while for large patches, it loses depth resolution. Achieving a good trade-off between visibility and resolution, under the objective of compressing the input space to tractable levels, is hard with limited data. Our algorithm has several key differences from this work. First, observing that a large amount of RGB data is available for training frame-level person ReID models, we transfer parameters from pre-trained RGB models with an improved transfer scheme. Second, since the input to our frame-level model is the entire body region, we do not have any visibility constraints at the cost of resolution. Third, in order to better utilize the temporal information from video, we propose a novel Reinforced Temporal Attention unit on top of the frame-level features, guided by the task reward, to predict the weight of each individual frame in the final prediction.

Our method for transferring an RGB person ReID model to the depth domain is based on the key observation that the model parameters at the bottom layers of a deep convolutional neural network can be directly shared between RGB and depth data, while the remaining upper layers need to be fine-tuned. At first glance, our observation is inconsistent with what was reported in the RGB-D object recognition approach by Song et al. [71]. They reported that the bottom layers cannot be shared between RGB and depth models and that it is better to retrain them from scratch. Our conjecture is that this behavior is in part specific to the HHA depth encoding [25], which is not used in our representation.

Some recent works in natural language processing [11, 49] explore temporal attention in order to keep track of long-range structural dependencies. For video captioning, Yao et al. [88] use a soft attention gate inside their Long Short-Term Memory decoder to estimate the relevance of the current input-video features given all previously generated words. One key difference of our approach is that our attention unit depends exclusively on the frame-level feature embedding, not on the hidden state, which likely makes it less prone to error drift. Additionally, our temporal attention is not differentiable, so we resort to reinforcement learning techniques [80] for its binary outcome. Inspired by the work of Likas [44] in online clustering and Kontoravdis et al. [36] in exploration of binary domains, we model the weight of each frame prediction as a Bernoulli-sigmoid unit. We review our model in detail in Sect. 2.2.

Depth-based methods that use measurements from 3D skeleton data have emerged in order to infer anthropometric and human gait criteria [2, 3, 21, 57, 60]. In an effort to leverage the full power of depth data, recent methods use 3D point clouds to estimate motion trajectories and the length of specific body parts [29, 95]. It is worthwhile to point out that skeleton information is not always available. For example, the skeleton tracking in Kinect SDK can be ineffective when a person is in side view or the legs are not visible.

In addition to the above-mentioned challenges, RGB-based methods struggle in scenarios with significant lighting changes and when the individuals change clothes. These factors can have a big impact on the effectiveness of a system that, for instance, is meant to track people across different areas of a building over several days: the areas may have drastically different lighting conditions, the cameras may differ in color balance, and a person may wear clothes of different patterns. This is our key motivation for using depth silhouettes in our scenario, as they are insensitive to these factors.

Our contributions can be summarized as follows:

  (i) We propose a novel Reinforced Temporal Attention (RTA) unit on top of the frame-level features to better leverage the temporal information in video sequences by learning to adaptively weight the predictions of individual frames based on a task-based reward. We define the model in Sect. 2.2, describe its end-to-end training in Sect. 2.3 and compare it with baselines in Sect. 3.5.

  (ii) We tackle the data scarcity problem in depth-based person re-identification by leveraging the large amount of available RGB data to obtain stronger frame-level features. Our split-rate RGB-to-Depth transfer scheme is illustrated in Fig. 4. We show in Fig. 5 that it outperforms a popular fine-tuning method by utilizing pre-trained RGB models more effectively.

  (iii) Extensive experiments in Sect. 3.5 not only show the superiority of our method over the state of the art in depth-based person re-identification from video, but also address a challenging application scenario in which the persons wear clothes that were unseen during training. In Table 2 we demonstrate the robustness of our method compared to its RGB-based counterpart and the mutual gains from jointly using the person’s head information.

2 Our Method

2.1 Input Representation

The input to our system is raw depth measurements from the Kinect V2 [68]. The input data are depth images \(\mathbf{D}\in \mathbb {Z}^{512\times 424}\), where each pixel \(D[i,j]\), \(i\in [1,512]\), \(j\in [1,424]\), contains the Cartesian distance, in millimeters, from the image plane to the nearest object at the particular coordinate (i, j). In the “default range” setting, the intervals \([0, 0.4)\,m\) and \((8.0, \infty )\,m\) are classified as unknown measurements, \([0.4, 0.8)\,m\) as “too near”, \((4.0, 8.0]\,m\) as “too far” and \([0.8, 4.0]\,m\) as “normal” values. When skeleton tracking is effective, the body index \(\mathbf{B}\in \mathbb {Z}^{512\times 424}\) is provided by the Kinect SDK, where 0 corresponds to the background and a positive integer i to each pixel belonging to person i.

After extracting the person region \(\mathbf{D_p} \subset \mathbf{D}\), the measurements within the “normal” region are normalized to the range [1, 256], while the values in the “too far” and “unknown” ranges are set to 256, and the values in the “too near” range to 1. In practice, in order to avoid a concentration of values near 256 while other values, e.g. on the floor in front of the subject, span the remaining range, we introduce an offset \(t_o=56\) and normalize in \([1, 256-t_o]\). This results in the “grayscale” person representation \(\mathbf{D_p^g}\). When the body index is available, we deploy \(\mathbf{B_p} \subset \mathbf{B}\) as a mask on the depth region \(\mathbf{D_p}\) in order to achieve background subtraction before applying range normalization (see Fig. 2).
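For concreteness, a minimal NumPy sketch of this normalization is given below. The range boundaries follow the “default range” setting above; treating raw unknown measurements as 0 and using linear interpolation inside the “normal” range are illustrative assumptions.

```python
import numpy as np

def depth_to_grayscale(D_p, t_o=56):
    """Map raw Kinect V2 depth (millimeters) of a cropped person region D_p to
    the "grayscale" representation D_p^g described above.  Range boundaries
    follow the "default range" setting; unknown pixels are assumed to be 0."""
    near, far = 800, 4000                      # "normal" range: [0.8 m, 4.0 m]
    hi = 256 - t_o                             # top of the normalized range
    out = np.empty(D_p.shape, dtype=np.float32)

    normal = (D_p >= near) & (D_p <= far)
    too_near = (D_p >= 400) & (D_p < near)     # [0.4 m, 0.8 m)
    other = ~(normal | too_near)               # "too far" and unknown values

    out[normal] = 1 + (D_p[normal] - near) * (hi - 1) / (far - near)
    out[too_near] = 1
    out[other] = 256
    return out
```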

Fig. 2.

The cropped color image (left), the grayscale depth representation \(\mathbf{D_p^g}\) (center) and the result after background subtraction (right) using the body index information \(\mathbf{B_p}\) from skeleton tracking.

2.2 Model Structure

The problem is formulated as a sequential decision process of an agent that performs human recognition from a partially observed environment via video sequences. At each time step, the agent observes the environment through a depth camera, calculates a feature vector with a deep Convolutional Neural Network (CNN) and actively infers the importance of the current frame for the re-identification task using a novel Reinforced Temporal Attention (RTA) unit. On top of the CNN features, a Long Short-Term Memory (LSTM) unit models short-range temporal dynamics. At each time step the agent receives a reward based on the success or failure of its classification task. Its objective is to maximize the sum of rewards over time. The agent and its components are detailed next, while the training process is described in Sect. 2.3. The model is outlined in Fig. 3.

Fig. 3.

Our model architecture consists of a frame-level feature embedding \(f_{CNN}\), which provides input to both a recurrent layer \(f_{LSTM}\) and the Reinforced Temporal Attention (RTA) unit \(f_w\) (highlighted in red). The classifier is attached to the hidden state \(h_t\) and its video prediction is the weighted sum of single-frame predictions, where the weights \(w_t\) for each frame t are predicted by the RTA unit. (Color figure online)

Agent: Formally, the problem setup is a Partially Observable Markov Decision Process (POMDP). The true state of the environment is unknown. The agent learns a stochastic policy \(\pi ((w_t, c_t)|s_{1:t}; \theta )\) with parameters \(\theta =\{\theta _g, \theta _w, \theta _h, \theta _c \}\) that, at each step t, maps the past history \(s_{1:t}=I_1, w_1, c_1, \ldots , I_{t-1}, w_{t-1}, c_{t-1}, I_t\) to two distributions over discrete actions: the frame weight \(w_t\) (sub-policy \(\pi _1\)) and the class posterior \(c_t\) (sub-policy \(\pi _2\)). The weight \(w_t\) is sampled stochastically from a binary distribution parameterized by the RTA unit \(f_w(g_t;\theta _w)\) at time t: \(w_t \sim \pi _1(\cdot | f_w(g_t;\theta _w))\). The class posterior distribution is conditioned on the classifier module, which is attached to the LSTM output \(h_t\): \(c_t \sim \pi _2(\cdot | f_c(h_t;\theta _c))\). The vector \(h_t\) maintains an internal state of the environment as a summary of past observations. Note that, for simplicity of notation, the input image at time t is denoted as \(I_t\), but the actual input is the person region \(D_{p,t}^g\) (see Sect. 2.1).
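The per-step decision process can be sketched as follows; here f_cnn, f_lstm, f_w and f_c stand for the modules defined below, and their exact signatures are assumptions made only for illustration.

```python
import numpy as np

def run_episode(frames, f_cnn, f_lstm, f_w, f_c, rng=None, train=True):
    """Schematic roll-out of the agent over a depth sequence.  Returns the
    per-frame weights w_t (sampled during training) and class posteriors c_t."""
    rng = rng or np.random.default_rng()
    h = c = None                           # initial LSTM state (handled by f_lstm)
    weights, posteriors = [], []
    for I_t in frames:
        g_t = f_cnn(I_t)                   # frame-level feature embedding
        p_t = f_w(g_t)                     # Bernoulli parameter of the RTA unit
        w_t = rng.binomial(1, p_t) if train else p_t
        h, c = f_lstm(g_t, h, c)           # temporal summary of past observations
        posteriors.append(f_c(h))          # class posterior from the hidden state
        weights.append(w_t)
    return weights, posteriors
```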

Frame-Level Feature Embedding \(\varvec{f_{CNN}(\theta _{g})}\): Given that there is little depth data but a large amount of RGB data available for person re-identification, we would like to leverage the RGB data to train depth models for frame-level feature extraction. We discovered that the parameters at the bottom convolutional layers of a deep neural network can be directly shared between RGB and depth data (cf. Sect. 2.3) through a simple depth encoding, that is, each pixel with depth D is replicated to three channels and encoded as (D, D, D), which corresponds to the three RGB channels. This motivates us to select a pre-trained RGB model.
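In code, this encoding amounts to a simple channel replication; a minimal NumPy sketch:

```python
import numpy as np

def encode_depth_as_rgb(D_p_g):
    """Replicate the single-channel grayscale depth map to (D, D, D) so that a
    network pre-trained on three-channel RGB input can process it directly."""
    return np.repeat(D_p_g[..., np.newaxis], 3, axis=-1)
```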

RGB-based person re-identification has progressed rapidly in recent years [1, 40, 73, 79, 82, 89]. The deep convolutional network introduced by Xiao et al. [82] outperformed other approaches on several public datasets, so we adopt their model for frame-level feature extraction. This network is similar in nature to GoogLeNet [75]; it uses batch normalization [30] and includes \(3\times 3\) convolutional layers [69], followed by 6 Inception modules [75] and 2 fully connected layers. In order to make this network applicable to our scenario, we introduce two small modifications. First, we replace the top classification layer with a \(256\times N\) fully connected layer, where N is the number of subjects in the target dataset, and initialize its weights at random from a zero-mean Gaussian distribution with standard deviation 0.01. Second, we add dropout regularization between the fully connected layers. In Sect. 2.3 we demonstrate an effective way to transfer the model parameters from RGB to depth.

Recurrent Module \(\varvec{f_{LSTM}(\theta _{h})}\): We use the Long Short-Term Memory (LSTM) units described in [92], which have been shown by Donahue et al. [20] to be effective in modeling temporal dynamics for video recognition and captioning. Specifically, assuming that \(\sigma ()\) is the sigmoid function, g[t] is the input at time frame t, \(h[t-1]\) is the previous output of the module and \(c[t-1]\) is the previous cell, the implementation corresponds to the following updates:

$$\begin{aligned} i[t]&= \sigma (W_{gi}g[t] + W_{hi}h[t-1] + b_i)\end{aligned}$$
(1)
$$\begin{aligned} f[t]&= \sigma (W_{gf}g[t] + W_{hf}h[t-1] + b_f)\end{aligned}$$
(2)
$$\begin{aligned} z[t]&= tanh(W_{gc}g[t] + W_{hc}h[t-1] + b_c)\end{aligned}$$
(3)
$$\begin{aligned} c[t]&= f[t] \odot c[t-1] + i[t] \odot z[t]\end{aligned}$$
(4)
$$\begin{aligned} o[t]&= \sigma (W_{go}g[t] + W_{ho}h[t-1] + b_o)\end{aligned}$$
(5)
$$\begin{aligned} h[t]&= o[t] \odot tanh(c[t]) \end{aligned}$$
(6)

where \(W_{sq}\) is the weight matrix from source s to target q for each gate q, \(b_q\) are the biases leading into q, i[t] is the input gate, f[t] is the forget gate, z[t] is the input to the cell, c[t] is the cell, o[t] is the output gate, and h[t] is the output of this module. Finally, \(x \odot y\) denotes the element-wise product of vectors x and y.
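A direct NumPy transcription of Eqs. (1)–(6) is given below; the parameter container P is an illustrative assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(g, h_prev, c_prev, P):
    """One LSTM update following Eqs. (1)-(6); P holds the weight matrices
    W_gq, W_hq and biases b_q for each gate q in {i, f, c, o}."""
    i = sigmoid(P["W_gi"] @ g + P["W_hi"] @ h_prev + P["b_i"])   # input gate,  Eq. (1)
    f = sigmoid(P["W_gf"] @ g + P["W_hf"] @ h_prev + P["b_f"])   # forget gate, Eq. (2)
    z = np.tanh(P["W_gc"] @ g + P["W_hc"] @ h_prev + P["b_c"])   # cell input,  Eq. (3)
    c = f * c_prev + i * z                                       # cell state,  Eq. (4)
    o = sigmoid(P["W_go"] @ g + P["W_ho"] @ h_prev + P["b_o"])   # output gate, Eq. (5)
    h = o * np.tanh(c)                                           # output,      Eq. (6)
    return h, c
```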

Reinforced Temporal Attention \(\varvec{f_w(\theta _w)}\): At each time step t the RTA unit infers the importance \(w_t\) of the image frame \(I_t\), as the latter is represented by the feature encoding \(g_t\). This module consists of a linear layer which maps the \(256\times 1\) vector \(g_t\) to a scalar, followed by a sigmoid non-linearity which squashes real-valued inputs to the [0, 1] range. Next, the output \(w_t\) is defined by a Bernoulli random variable with probability mass function:

$$\begin{aligned} f(w_t; f_w(g_t;\theta _w)) = {\left\{ \begin{array}{ll} f_w(g_t;\theta _w), &{} w_t=1 \\ 1-f_w(g_t;\theta _w), &{} w_t=0 \end{array}\right. } \end{aligned}$$
(7)

The Bernoulli parameter is conditioned on the sigmoid output \(f_w(g_t;\theta _w)\), shaping a Bernoulli-Sigmoid unit [80]. During training, the output \(w_t\) is sampled stochastically to be a binary value in \(\{0,1\}\). During evaluation, instead of sampling from the distribution, the output is deterministically set equal to the Bernoulli parameter and, therefore, \(w_t=f_w(g_t;\theta _w)\).
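A minimal sketch of this behavior follows; the linear parameters theta_w and b_w of the RTA unit are illustrative.

```python
import numpy as np

def rta_unit(g_t, theta_w, b_w, rng=None, train=True):
    """Map the 256-d embedding g_t to a Bernoulli parameter via a linear layer
    and a sigmoid; sample w_t during training, return the parameter itself at
    test time."""
    rng = rng or np.random.default_rng()
    p_t = 1.0 / (1.0 + np.exp(-(theta_w @ g_t + b_w)))   # f_w(g_t; theta_w) in [0, 1]
    w_t = rng.binomial(1, p_t) if train else p_t
    return w_t, p_t
```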

Classifier \(\varvec{f_c(\theta _c)}\) and Reward: The classifier consists of a sequence of a rectified linear unit, dropout with rate \(r=0.4\), a fully connected layer and Softmax. The parametric layer maps the \(256\times 1\) hidden vector \(h_t\) to the \(N\times 1\) class posterior vector \(c_t\), which has length equal to the number of classes N. The multi-shot prediction with RTA attention is the weighted sum of the frame-level predictions \(c_t\), weighted by the normalized RTA weights \(w_t' = \frac{f_w(g_t;\theta _w)}{\sum _{t=1}^T f_w(g_t;\theta _w)} \).
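The video-level prediction can then be computed as follows (a minimal sketch under the definitions above):

```python
import numpy as np

def multishot_prediction(posteriors, bernoulli_params, eps=1e-8):
    """Weighted sum of per-frame class posteriors c_t with normalized RTA
    weights w'_t = f_w(g_t) / sum_t f_w(g_t), as defined above."""
    C = np.stack(posteriors)                 # T x N per-frame posteriors
    p = np.asarray(bernoulli_params, float)  # T Bernoulli parameters
    w = p / (p.sum() + eps)                  # normalized RTA weights
    return w @ C                             # N-dim video-level prediction
```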

The Bernoulli-Sigmoid unit is stochastic during training and therefore we resort to the REINFORCE algorithm in order to obtain the gradient for the backward pass. We describe the details of the training process in Sect. 2.3, but here we define the required reward function. A straightforward definition is:

$$\begin{aligned} r_t = \mathcal {I} (\arg \max (c_t) = c_t^*) \end{aligned}$$
(8)

where \(r_t\) is the raw reward, \(\mathcal {I}\) is the indicator function and \(c_t^*\) is the ground-truth class for frame t. Thus, at each time step t, the agent receives a reward \(r_t\), which equals 1 when the frame is correctly classified and 0 otherwise.

2.3 Model Training

In our experiments we first pre-train the parameters of the frame-level feature embedding, and afterwards we attach the LSTM, the RTA unit and the new classifier in order to train the whole model (cf. Fig. 3). In the second step the weights of the embedding are frozen while the added layers are initialized at random. We adopt this modular training so that we can provide both single-shot and multi-shot evaluation, though the entire architecture can also be trained end to end from scratch if processing video sequences is the sole objective. Next, we first describe our transfer learning scheme for the frame-level embedding, followed by the hybrid supervised training algorithm for the recurrent model with temporal attention.

Split-Rate Transfer Learning for Feature Embedding \(\varvec{f_{CNN}(\theta _g)}\): In order to leverage vast RGB data, our approach relies on transferring parameters \(\theta _g\) from an RGB pre-trained model for initialization. As it is unclear whether and which subset of the RGB parameters is beneficial for the depth embedding, we first gain insight from the work by Yosinski et al. [90] on CNN feature transferability. They showed that, between two almost equal-sized splits of Imagenet [66], the most effective model adaptation is to transfer and slowly fine-tune the weights of the bottom convolutional layers, while re-training the top layers. Other works that tackle model transfer from a large to a small-sized dataset (e.g. [33]) copy and slowly fine-tune the weights of the whole hierarchy except for the classifier, which is re-trained using a higher learning rate.

Inspired by both approaches, we investigate the model transferability between RGB and depth. Our method has three differences compared to [90]. First, we found that even though RGB and depth are quite different modalities (cf. Fig. 1), the bottom layers of the RGB models can be shared with the depth data (without fine-tuning). Second, fine-tuning parameters transferred from RGB works better than training the top layers from scratch. Third, using a slower (or zero) learning rate for the bottom layers and a higher one for the top layers is more effective than using a uniform rate across the hierarchy. Thus, we term our method split-rate transfer. The first and third remarks also constitute key differences from [33], as they fine-tune all layers and deploy a higher learning rate only for the classifier. Our approach is visualized in Fig. 4, and ablation studies supporting the above observations are shown in Sect. 3.4 and Fig. 5.
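Schematically, the split-rate configuration amounts to assigning per-layer relative learning rates; the layer names and multiplier values below are illustrative, following the “R3D” instance of Fig. 4.

```python
def split_rate_multipliers(layers, n_bottom=3, bottom_mult=0.0, top_mult=1.0):
    """Per-layer relative learning rates for split-rate RGB-to-Depth transfer:
    the bottom layers copied from the RGB model are frozen (or slowly tuned),
    while the upper layers are fine-tuned at a higher rate."""
    return {name: (bottom_mult if k < n_bottom else top_mult)
            for k, name in enumerate(layers)}

# e.g. freeze the 3 bottom convolutional layers and fine-tune the rest
lr_mult = split_rate_multipliers(
    ["conv1", "conv2", "conv3", "inception4", "inception5", "inception6", "fc7"])
```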

Fig. 4.

Our split-rate RGB-to-Depth transfer compared with Yosinski et al. [90]. At the top, the two models are trained from scratch with RGB and Depth data. Next we show the “R3D” instances (i.e. the bottom 3 layers’ weights from RGB remain frozen or slowly changing) for both methods, following the notation of [90]. The color of each layer refers to the initialization and the number below is the relative learning rate (the best performing one in bold). The key differences are summarized in the text.

Hybrid Learning for CNN-LSTM and Reinforced Temporal Attention: The parameters \(\{\theta _g, \theta _h, \theta _c \}\) of the CNN-LSTM are learned by minimizing the classification loss attached to the LSTM unit via backpropagation through the whole network. We minimize the cross-entropy loss, as is customary in recognition tasks such as face identification [74]. Thus, the objective is to maximize the conditional probability of the true label given the observations, i.e. we maximize \(\log \pi _2 (c_t^* | s_{1:t}; \theta _g, \theta _h, \theta _c)\), where \(c_t^*\) is the true class at step t.

The parameters \(\{\theta _g, \theta _w\}\) of the CNN and RTA unit are learned so that the agent maximizes its total reward \(R=\sum _{t=1}^T r_t\), where \(r_t\) is defined in Eq. 8. This involves calculating the expectation \(J(\theta _g, \theta _w)=\mathbb {E}_{p(s_{1:T}; \theta _g, \theta _w)}[R]\) over the distribution of all possible sequences \(p(s_{1:T}; \theta _g, \theta _w)\), which is intractable. Thus, a sample approximation, known as the REINFORCE rule [80], can be applied to the Bernoulli-Sigmoid unit [36, 44], which models the sub-policy \(\pi _1(w_t | f_w(g_t;\theta _w))\). Given the log-probability mass function \(\log \pi _1(w_t; p_t)=w_t\log p_t+(1-w_t)\log (1-p_t)\) with Bernoulli parameter \(p_t=f_w(g_t;\theta _w)\), the gradient approximation is:

$$\begin{aligned} \nabla _{\theta _g, \theta _w}J&= \sum _{t=1}^T \mathop {\mathbb {E}}\nolimits _{p(s_{1:T}; \theta _g, \theta _w)} [\nabla _{\theta _g, \theta _w}\log \pi _1(w_t | s_{1:t};\theta _g, \theta _w)(R_t-b_t)]\end{aligned}$$
(9)
$$\begin{aligned}&\approx \frac{1}{M} \sum _{i=1}^M \sum _{t=1}^T \frac{w_t^i-p_t^i}{p_t^i(1-p_t^i)} (R_t^i-b_t) \end{aligned}$$
(10)

where the sequences i, \(i\in \{1,\ldots ,M\}\), are obtained by running the agent for M episodes, and \(R_t^i=\sum _{\tau =1}^t r_{\tau }^i\) is the cumulative reward acquired in episode i after collecting the sample \(w_t^i\). A baseline reward \(b_t\) is subtracted in order to reduce the variance of the gradient estimate without changing its expectation. Similarly to [26, 56], we set \(b_t = \mathbb {E}_{\pi }[R_t]\), and the mean squared error between \(R_t^i\) and \(b_t\) is also minimized by backpropagation.
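A sample-based sketch of the estimate in Eq. (10) is shown below; arrays are indexed as (episode, time step), and the resulting per-step factors would multiply \(\partial p_t / \partial \theta\) when backpropagating through the RTA unit.

```python
import numpy as np

def reinforce_factors(w, p, r, b, eps=1e-8):
    """w, p, r: M x T arrays of sampled weights, Bernoulli parameters and raw
    rewards; b: length-T baseline.  Returns the length-T REINFORCE factors of
    Eq. (10), already averaged over the M episodes."""
    R = np.cumsum(r, axis=1)                     # cumulative reward R_t^i
    score = (w - p) / (p * (1.0 - p) + eps)      # d log pi_1 / d p_t
    return (score * (R - b)).mean(axis=0)
```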

At each step t, the agent makes a prediction \(w_t\) and the reward signal \(R_t^i\) evaluates the effectiveness of the agent for the classification task. The REINFORCE update increases the log-probability of an action that results in a higher than expected accumulated reward (i.e. by increasing the Bernoulli parameter \(f_w(g_t;\theta _w)\)). Otherwise, the log-probability decreases for sequences of frames that lead to low reward. All in all, the agent jointly optimizes the accumulated reward and the classification loss, which constitute a hybrid supervised objective.

3 Experiments

3.1 Depth-Based Datasets

DPI-T (Depth-Based Person Identification from Top). Recently introduced by Haque et al. [26], it contains 12 persons appearing in a total of 25 sequences across many days, wearing 5 different sets of clothes on average. Unlike most publicly available datasets, the subjects are captured from the top, which is a common scenario in automated video surveillance. The individuals appear in daily life situations where they hold objects such as handbags, laptops and coffee.

BIWI. In order to explore sequences with varying human pose and scale, we use BIWI [58], where 50 individuals appear in a living room. 28 of them are re-recorded in a different room with new clothes and walking patterns. We use the full training set, while for testing we use the Walking set. From both sets we remove the frames with no person, or with a person heavily occluded by the image boundaries or too far from the sensor, as they provide no skeleton information.

IIT PAVIS. To evaluate our method when shorter video sequences are available, we use IIT PAVIS [6]. This dataset includes 79 persons that are recorded in 5-frame walking sequences twice. We use Walking1 and Walking2 sequences as the training and testing set, respectively.

TUM-GAID. To evaluate on a large pool of identities, we use the TUM-GAID database [28], which contains RGB and depth video for 305 people in three variations. A subset of 32 people is recorded a second time after three months with different clothes, which makes it ideal for our application scenario in Sect. 3.6. In our experiments we use the “normal” sequences (n) from each recording.

3.2 Evaluation Metrics

Top-k accuracy equals the percentage of test images or sequences for which the ground-truth label is contained within the first k model predictions. Plotting the top-k accuracy as a function of k gives the Cumulative Matching Curve (CMC). Integrating the area under the CMC curve and normalizing over the number of IDs produces the normalized Area Under the Curve (nAUC).
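For reference, one common way to compute these quantities from a matrix of class scores is sketched below; tie-breaking and averaging conventions vary across implementations.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """scores: Q x N array of class scores per query, labels: Q true IDs."""
    ranked = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([labels[q] in ranked[q] for q in range(len(labels))]))

def cmc_and_nauc(scores, labels):
    """Cumulative Matching Curve and its normalized area under the curve."""
    n_ids = scores.shape[1]
    cmc = np.array([top_k_accuracy(scores, labels, k) for k in range(1, n_ids + 1)])
    return cmc, cmc.mean()      # nAUC: area under the CMC normalized by #IDs
```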

In single-shot mode the model consists only of the \(f_{CNN}\) branch with an attached classifier (see Fig. 3). In multi-shot mode, where the model processes sequences, we evaluate our CNN-LSTM model with (or without) RTA attention.

3.3 Experimental Setting

The feature embedding \(f_{CNN}\) is trained in Caffe [31]. Consistent with [82], the input depth images are resized to \(144\times 56\). SGD mini-batches of 50 images are used for training and testing. Momentum \(\mu =0.5\) yielded more stable training. The momentum effectively multiplies the size of the updates by a factor of \(\frac{1}{1-\mu }\) after several iterations, so lower values result in smaller updates. The weight decay is set to \(2\times 10^{-4}\), as is common in Inception architectures [75]. We deploy a modest base learning rate \(\gamma _0=3\times 10^{-4}\). The learning rate is reduced by a factor of 10 throughout training every time the loss reaches a “plateau”.

The whole model with the LSTM and RTA layers in Fig. 3 is implemented in Torch/Lua [18]. We implemented customized Caffe-to-Torch conversion scripts for the pre-trained embedding, as the architecture is not standard. For end-to-end training, we use momentum \(\mu =0.9\), a batch size of 50 and a learning rate that decreases linearly from 0.01 to 0.0001 over 200 epochs, with a maximum duration of 250 epochs. The LSTM history consists of \(\rho =3\) frames.
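For instance, the linearly decaying schedule can be written as follows (a minimal sketch):

```python
def learning_rate(epoch, lr_start=0.01, lr_end=0.0001, decay_epochs=200):
    """Learning rate for end-to-end training: decreases linearly from 0.01 to
    0.0001 over the first 200 epochs, then stays constant until epoch 250."""
    if epoch >= decay_epochs:
        return lr_end
    return lr_start + (lr_end - lr_start) * epoch / decay_epochs
```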

Fig. 5.

Comparison of our RGB-to-Depth transfer with Yosinski et al. [90] in terms of top-1 accuracy on DPI-T. In this ablation study the x axis represents the number of layers whose weights are frozen (left) or fine-tuned (right) starting from the bottom.

3.4 Evaluation of the Split-Rate RGB-to-Depth Transfer

In Fig. 5 we show results of our split-rate RGB-to-Depth transfer (which is described in Sect. 2.3) compared to [90]. We show the top-1 re-identification accuracy on DPI-T when the bottom CNN layers are frozen (left) and slowly fine-tuned (right). The top layers are transferred from RGB and rapidly fine-tuned in our approach, while they were re-trained in [90]. Given that the CNN architecture has 7 main layers before the classifier, the x axis is the number of layers that are frozen or fine-tuned counting from the bottom.

Evidently, transferring and freezing the three bottom layers, while rapidly fine-tuning the subsequent “inception” and fully connected layers, yields the best performance on DPI-T. Attempting to freeze too many layers leads to a performance drop for both approaches, which can be attributed to feature specificity. Slowly fine-tuning the bottom layers helps alleviate fragile co-adaptation, as pointed out by Yosinski et al. [90], and improves generalization, especially while moving towards the right of the x axis. Overall, our approach is more accurate in our setting across the x axis for both treatments.

Table 1. Single-shot and multi-shot person re-identification performance on the test set of DPI-T, BIWI and IIT PAVIS. Dashes indicate that no published result is available

3.5 Evaluation of the End-to-End Framework

In Table 1 we compare our framework with depth-based baseline algorithms. First, we show the performance of guessing uniformly at random. Next, we report results from [6, 59], who use hand-crafted features based on biometrics, such as distances between skeleton joints. A 3D CNN with average pooling over time [8] and the gait energy volume [70] are evaluated in multi-shot mode. Finally, we provide the comparisons with 3D and 4D RAM models [26].

In order to evaluate our model in multi-shot mode without temporal attention, we simply average the output of the classifier attached to the CNN-LSTM output across the sequence (cf. Fig. 3). In the last two rows we show results that leverage temporal attention. We compare our RTA attention with the soft attention in [88], which is a function of both the hidden state \(h_t\) and the embedding \(g_t\), whose projections are added and passed through a tanh non-linearity.

We observe that methods that learn end-to-end re-identification features perform significantly better than those relying on hand-crafted biometrics on all datasets. Our algorithm is the top performer in multi-shot mode, as our RTA unit effectively learns to up-weight the most informative frames based on a classification-specific reward. The split-rate RGB-to-Depth transfer enables our method to leverage RGB data effectively and provides discriminative depth-based ReID features. This is especially reflected in the single-shot accuracy on DPI-T, where we report \(19.3\%\) better top-1 accuracy compared to 3D RAM. However, it is worth noting that 3D RAM performs better on BIWI. Our conjecture is that a spatial attention mechanism is important in datasets with significant variation in human pose and partial body occlusions. On the other hand, spatial attention is evidently less critical on DPI-T, which contains views from the top where the visible region is mostly uniform across frames.

Next, in Fig. 6 we show a test sequence with the predicted Bernoulli parameter \(f_w(g_t;\theta _w)\) printed on each frame. Inspecting the Bernoulli parameter values on test sequences, we observe large variations even among neighboring frames. Smaller values are typically associated with noisy frames, or frames with unusual pose (e.g. the person turning) and partial occlusions.

Fig. 6.

Example sequence with the predicted Bernoulli parameter printed.

Fig. 7.

Cumulative Matching Curves (CMC) on TUM-GAID for the scenario that the individuals wear clothes which are not provided during training.

Table 2. Top-1 re-identification accuracy (top-1, \(\%\)) and normalized Area Under the Curve (nAUC, \(\%\)) on TUM-GAID in new-clothes scenario with single-shot (ss) and multi-shot (ms) evaluation

3.6 Application in Scenario with Unseen Clothes

Towards tackling our key motivation, we compare our system with a state-of-the-art RGB method in a scenario where the individuals change clothes between the recordings of the training and test set. We use the TUM-GAID database, in which 305 persons appear in sequences n01–n06 from session 1, and 32 of them appear with new clothes in sequences n07–n12 from session 2.

Following the official protocol, we use the Training IDs to perform RGB-to-Depth transfer for our CNN embedding: we use sequences n01–n04 and n07–n10 for training, and sequences n05–n06 and n11–n12 for validation. Next, we deploy the Testing IDs and use sequences n01–n04 for training, n05–n06 for validation and n11–n12 for testing. Thus, our framework has no access to data from session 2 during training. However, we assume that the 32 subjects participating in the second recording are known to all competing methods.

In Table 2 we show that re-identification from body depth is more robust than from body RGB [82], with \(6.2\%\) higher top-1 accuracy and \(10.7\%\) larger nAUC in single-shot mode. Next, we explore the benefit of using head information, which is less sensitive than clothing to day-by-day changes. To that end, we transfer the RGB-based pre-trained model from [82] and fine-tune it on the upper body part, which we call “Head RGB”. This results in increased accuracy, both individually and jointly with body depth. Finally, we show the mutual benefits in multi-shot performance for body depth, head RGB and their linear combination in the class posterior. In Fig. 7 we visualize the CMC curves for the single-shot setting. We observe that ReID from body depth scales better than its counterparts, which is validated by the nAUC scores.

4 Conclusion

In this paper, we present a novel approach for depth-based person re-identification. To address the data scarcity problem, we propose a split-rate RGB-to-Depth transfer scheme that effectively leverages pre-trained models from large RGB datasets to learn strong frame-level features. To enhance re-identification from video sequences, we propose the Reinforced Temporal Attention unit, which lies on top of the frame-level features and is not dependent on the network architecture. Extensive experiments show that our approach outperforms the state of the art in depth-based person re-identification, and that it is more effective than its RGB-based counterpart in a scenario where the persons change clothes.