
1 Introduction

Multi-object tracking (MOT) has attracted increasing interest in computer vision over the past few years, with practical applications in surveillance, human-computer interaction, robotics and advanced driving assistance systems. The goal of MOT is to estimate the trajectories of different objects and to track those objects across the video. While a variety of MOT methods have been proposed in recent years [7, 8, 14, 27, 34, 36, 40, 45,46,47, 52], it remains challenging to track multiple objects in unconstrained environments, especially in crowded scenes, because occlusions between objects and large intra-class variations frequently occur in such scenarios.

Fig. 1. The key idea of our proposed C-DRL method for multi-object tracking. Given a video and the detection results of different objects for the tth frame, we model each object as an agent and predict the location of each object in the following frames, where we seek the optimal tracking results by considering the interactions between different agents and the environment via a collaborative deep reinforcement learning method. Lastly, we take actions to update the agents at frame \(t+1\) according to the outputs of the decision network

Existing MOT approaches can be mainly divided into two categories: (1) offline (batch or semi-batch) [7, 27, 40, 45, 46, 52] and (2) online [8, 14, 34, 36, 47]. The key idea of offline methods is to group detections into short trajectory segments or tracklets, and then use more reliable features to connect those tracklets into full trajectories. Representative offline methods use min-cost network flow [5, 54], energy minimization [28] or generalized minimum clique graphs [52] to address the data association problem. Online MOT methods estimate the object trajectories using only the detections of the current and past frames, which makes them applicable to real-time applications such as advanced driving assistance systems or robotics. Conventional online methods usually employ Kalman filtering [19], particle filtering [32] or Markov decision processes [47]. However, the tracking accuracy of these methods is sensitive to occlusions and noisy detection results, such as missing detections, false detections and inaccurate bounding boxes, which makes them difficult to apply to videos of crowded scenes.

In this paper, we propose a collaborative deep reinforcement learning (C-DRL) method for multi-object tracking. Figure 1 illustrates the basic idea of our proposed approach. Given a video and the detection results of different objects for the tth frame, we model each object as an agent and predict the locations of objects in the following frames by using the history trajectories and the appearance information of the \((t+1)\)th image frame. We exploit the collaborative interactions between each agent, its neighboring agents and the environment, and make decisions for each agent to update, track or delete the target object via a decision network, so that the influence of occlusions between objects and noisy detection results can be well alleviated by maximizing their shared utility. Experimental results on the challenging MOT15 and MOT16 benchmarks are presented to demonstrate the effectiveness of our approach.

2 Related Work

Multi-object Tracking: Most existing MOT methods can be categorized into two classes: (1) offline [7, 27, 40, 45, 46, 52] and (2) online [8, 14, 34, 36, 47]. Methods in the first class group all detection results into short trajectory segments or tracklets, and connect those tracklets into full trajectories. For example, Zamir et al. [52] associated all detection results in a global manner by using generalized minimum clique graphs, incorporating both appearance and motion information. Tang et al. [40] introduced a graph-based method that links and clusters object hypotheses over time by solving a subgraph multicut problem. Maksai et al. [27] proposed an approach to track multiple objects with non-Markovian behavioral constraints. Methods in the second class estimate object trajectories with the detection results of the current and past frames. For example, Yang et al. [48, 49] introduced an online learned CRF model by solving an energy minimization problem with nonlinear motion patterns and robust appearance constraints for multi-object tracking. Xiang et al. [47] formulated MOT as a decision-making problem via a Markov decision process. Choi et al. [7] presented an aggregated local flow descriptor to accurately measure the affinity between different detection results. Hong et al. [14] proposed a data-association method to exploit structural motion constraints in the presence of large camera motion. Sadeghian et al. [34] encoded dependencies across multiple cues over a temporal window and learned a multi-cue representation to compute the similarity scores in a tracking framework. To overcome the influence of noisy detections, several methods have also been proposed. For example, Shu et al. [36] introduced a part-based representation under the tracking-by-detection framework to handle partial occlusions. Chu et al. [8] focused on learning a robust appearance model for each target by using a single object tracker. To address the occlusion and noisy detection problem, our approach uses a prediction-decision network to make decisions for online multi-object tracking.

Deep Reinforcement Learning: Deep reinforcement learning has achieved significant success in various vision applications in recent years, such as object detection [25], face recognition [33], image super-resolution [6] and object search [20]. Current deep reinforcement learning methods can be divided into two classes: deep Q-learning [12, 29, 30, 42] and policy gradient [1, 37, 50]. In the first class, Q-values are fitted to capture the expected return for taking a particular action at a particular state. For example, Cao et al. [6] proposed an attention-aware face hallucination framework with deep reinforcement learning to sequentially discover attended patches and perform facial part enhancement by fully exploiting the global interdependency of the image. Rao et al. [33] proposed an attention-aware deep reinforcement learning method to select key frames for video face recognition. Kong et al. [20] presented a collaborative deep reinforcement learning method to localize objects jointly in a few iterations. In the second class, the distribution of policies is represented explicitly and the policy is improved by updating the parameters along the gradient direction. Liu et al. [26] applied a policy gradient method to optimize a variety of captioning metrics. Yu et al. [50] proposed a sequence generative adversarial network trained with policy gradient. More recently, deep reinforcement learning [15, 16, 39, 51, 53] has also been employed in visual tracking. For example, Yun et al. [51] proposed an action-decision network that generates actions to seek the locations and sizes of the objects in a newly arriving frame. Supancic et al. [39] proposed a decision policy tracker that uses reinforcement learning to decide where to look in the upcoming frames, and when to re-initialize and update its appearance model for the tracked object. However, these methods cannot be applied to multi-object tracking directly since they ignore the communication between different objects. In this work, we propose a collaborative deep reinforcement learning method to exploit the interactions of different objects for multi-object tracking.

Fig. 2. The framework of the proposed C-DRL for multi-object tracking. In this figure, there are three objects at frame t. We first predict the locations of these three objects at frame \(t+1\). Then we use a decision network to combine the prediction and detection results and make decisions for each agent to maximize their shared utility. For example, Agent 2 is blocked by its neighbor (Agent 1), Agent 1 updates itself by using the nearest detection result, and Agent 3 ignores the noisy detection. We initialize Agent 4 by using the remaining detection result in the environment. Lastly, we use the locations of each agent as the tracking results at frame \(t+1\)

3 Approach

Figure 2 shows the framework of the proposed C-DRL method for multi-object tracking, which contains two parts: (1) a prediction network and (2) a decision network. Given a video and the detection results of different objects at frame t, we model each object as an agent and predict the locations of objects in the following frames, and seek the optimal tracking results by considering the interactions between different agents and the environment via the decision network. Lastly, we take actions to update, delete or initialize agents at frame \(t+1\) according to the decisions. In the following subsections, we detail the prediction network and the decision network, respectively.

Fig. 3. The framework of the prediction network. Given an initial location of the target object, the prediction network learns the movement of the object; it contains three convolutional layers and three fully connected layers

3.1 Learning of the Prediction Network

Given initial locations of objects, the prediction network aims to learn the movement of objects in order to predict the locations of the target object. As shown in Fig. 3, the inputs to the prediction network are the raw image of the next frame cropped by the initial bounding box, and the history trajectories. We randomly sample bounding boxes \(b \in B_{i,t} \) around the location of the object \(b^{*}_{i,t} \) in each frame of the training videos to form the training set for the prediction network. The prediction network takes the \((t+1)\)th frame cropped at the initial location b and the history trajectory H of the last K frames as the input for position prediction, where K is set to 10 in our work. We formulate location prediction as the following regression problem:

$$\begin{aligned} \arg \max _{\phi } J(\phi ) = \sum _{i,t}\sum _{b\in B_{i,t}} g(b_{i,t+1}^*,b+\phi (I_t,b,H_t)), \end{aligned}$$
(1)

where J is the objective function at the top layer of the prediction network, \(\phi \) is the parameter set of the network, \(b_{i,t+1}^{*}\) is the ground truth of the object \(p_{i}\) at frame \(t+1\), and \(g(\cdot )\) denotes the intersection-over-union (IoU) of two bounding boxes, which is computed as:

$$\begin{aligned} g(b_i,b_j)=\frac{b_i \cap b_j}{b_i \cup b_{j}}. \end{aligned}$$
(2)
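For reference, the IoU of Eq. (2) can be computed with the following minimal Python sketch, where we assume boxes are given as (x, y, w, h) tuples with (x, y) the top-left corner (the box convention is not specified in the text):

```python
# Minimal IoU sketch for Eq. (2); boxes are (x, y, w, h) with (x, y) the
# top-left corner (an assumed convention).
def iou(box_a, box_b):
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # intersection rectangle
    x1, y1 = max(xa, xb), max(ya, yb)
    x2, y2 = min(xa + wa, xb + wb), min(ya + ha, yb + hb)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0
```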

3.2 Collaborative Deep Reinforcement Learning

As shown in Fig. 2, the decision network is a collaborative system which contains multiple agents and the environment. Each agent takes actions with the information from itself, the neighborhoods and the environment, where the interactions between agents and the environment are exploited by maximizing their shared utility. To make better use of such contextual information, we formulate multi-object tracking as a collaborative optimization problem.

We consider each object as an agent. Each agent p contains the trajectory \(\{(x_0,y_0), (x_1,y_1),\cdots , (x_t,y_t)\}\), the appearance feature f, and the current location \(\{x,y,w,h\}\). Hence, the distance between two objects \(p_{i}\) and \(p_{j}\) can be computed as follows:

$$\begin{aligned} d(p_i,p_j)=\alpha (1-g(p_{i},p_{j}))+ \left( 1-\frac{f_{i}^T f_{j}}{\Vert f_i\Vert _2 \Vert f_j\Vert _2}\right) , \end{aligned}$$
(3)

where \(g(p_{i},p_{j})\) is the IoU of two bounding boxes, and \(\alpha \ge 0\).

The environment contains the object detection results: \(\mathcal {P}^{*}_{t}=\{p_{1}^{*}, p^{*}_{2},\cdots , p^{*}_{N_t}\} \). The distance between the object \(p_{i}\) and the detection results can be computed as follows:

$$\begin{aligned} d(p_i,p_j^{*})=\alpha (1-g(p_{i},p_{j}^{*}))+ \left( 1-\frac{f_{i}^T f_{j}^*}{\Vert f_i\Vert _2 \Vert f_j^*\Vert _2}\right) . \end{aligned}$$
(4)
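A minimal sketch of the distances in Eqs. (3) and (4), reusing the iou helper above; the value of the balance weight \(\alpha \) is not given in the text, so the default below is only illustrative:

```python
import numpy as np

# Distance between an agent and another agent or a detection (Eqs. (3)-(4)):
# a weighted bounding-box dissimilarity plus a cosine appearance dissimilarity.
def appearance_dissimilarity(f_i, f_j):
    return 1.0 - float(np.dot(f_i, f_j) / (np.linalg.norm(f_i) * np.linalg.norm(f_j) + 1e-12))

def pair_distance(box_i, feat_i, box_j, feat_j, alpha=1.0):  # alpha is illustrative
    return alpha * (1.0 - iou(box_i, box_j)) + appearance_dissimilarity(feat_i, feat_j)
```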

Let \(I_t\) be the tth frame of the selected video, which contains \(n_{t}\) objects, \(\mathcal {P}_t=\{p_1,p_2,\cdots ,p_{n_{t}} \}\). The state at frame t, \(s_{t}=\{\mathcal {P}_t,\mathcal {P}^{*}_{t}\}\), contains the current agents and the detection results. For the object \(p_i\), we first use the prediction network to generate its position at frame \(t+1\). Then, we select the nearest neighboring agent \(p_j \in \mathcal {P}_t -\{p_i\}\) and the nearest detection result \(p^*_k\in \mathcal {P}_{t+1}^*\). We take these three images as the input to the decision network if \(d(p_j,p_{i})<\tau \) and \(d(p^*_k,p_{i})<\tau \); if \(d(p_j,p_{i})\ge \tau \) or \(d(p^*_k,p_{i})\ge \tau \), the corresponding input is replaced with a zero image.
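The following Python sketch illustrates how the three input images for one agent could be assembled under the thresholding rule above; the Agent fields (.box, .feat, .predicted_box) and the crop helper are hypothetical names introduced only for illustration:

```python
import numpy as np

def build_decision_input(frame, p_i, agents, detections, tau, size=(107, 107)):
    """Assemble (self, nearest neighbor, nearest detection) crops for agent p_i.
    `crop(frame, box, size)` is a hypothetical helper returning a 3 x H x W patch."""
    zero_img = np.zeros((3,) + size, dtype=np.float32)
    self_img = crop(frame, p_i.predicted_box, size)

    others = [p for p in agents if p is not p_i]
    p_j = min(others, key=lambda p: pair_distance(p_i.box, p_i.feat, p.box, p.feat), default=None)
    p_k = min(detections, key=lambda d: pair_distance(p_i.box, p_i.feat, d.box, d.feat), default=None)

    nb_img = crop(frame, p_j.box, size) if p_j and pair_distance(p_i.box, p_i.feat, p_j.box, p_j.feat) < tau else zero_img
    det_img = crop(frame, p_k.box, size) if p_k and pair_distance(p_i.box, p_i.feat, p_k.box, p_k.feat) < tau else zero_img
    return self_img, nb_img, det_img
```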

The object has two possible statuses in each frame: visible or invisible. If the object is visible, we update the agent with the prediction and/or the detection result: if the detection result is reliable, we use both the detection and the prediction results; if the detection result is not reliable, we only use the prediction result. If the object is invisible, it may be blocked by other objects or may have disappeared. If the object is blocked, we keep its appearance feature and only use the movement model to predict its location in the next frame. If the object has disappeared, we delete it directly. Hence, for each agent, the action set is defined as \(\mathcal {A} =\{update, ignore, block, delete\}\).

For the action update, we use both the prediction and detection results to update the position of \(p_{i}\), and the appearance feature is updated as follows:

$$\begin{aligned} f_i=(1-\rho )f_i+\rho f_i^*, \end{aligned}$$
(5)

where \(\rho \) is the learning rate of appearance features.

We delete the detection results which have been used to update agents' features. For each remaining detection result in the environment, we initialize a new agent. For a false detection, the agent is also initialized, but the reward of the actions \(\{ update, ignore, block \}\) is set to \(-1\) while the reward of the action delete is set to 1; the agent is then deleted in the next iteration.

For the action ignore, the detection result is unreliable or missing while the prediction result is more reliable, so we use the prediction result to update the position of \(p_{i}\).

For the action block, we keep the feature of \(p_{i}\) as the object has been blocked by other objects, and the location is updated according to the prediction result.

For the action delete, the object disappears, and we delete the object \(p_{i}\) directly.
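As a rough Python sketch of the four action semantics above (the Agent fields, the way prediction and detection are fused for update, and the value of \(\rho \) are illustrative assumptions; the text only states that both results are used):

```python
def apply_action(action, agent, predicted_box, nearest_det, agents, rho=0.1):
    """Apply one of {update, ignore, block, delete} to `agent`.
    `agent` is assumed to carry .box, .feat and .trajectory attributes."""
    if action == "update":
        # use both detection and prediction; here we simply adopt the detection box
        agent.box = nearest_det.box
        agent.feat = (1.0 - rho) * agent.feat + rho * nearest_det.feat  # Eq. (5)
        agent.trajectory.append(agent.box)
    elif action == "ignore":
        # detection unreliable or missing: trust the prediction
        agent.box = predicted_box
        agent.trajectory.append(agent.box)
    elif action == "block":
        # occluded: keep the appearance feature, move by the prediction only
        agent.box = predicted_box
        agent.trajectory.append(agent.box)
    elif action == "delete":
        # the object disappears: remove the agent
        agents.remove(agent)
```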

Therefore, the reward \(r_{i,t}^{*} \) of each action contains two terms, \(r_{i,t} \) and \(r_{j,t+1}\), where \(r_{i,t} \) describes the agent's own state in the next frame and \(r_{j,t+1}\) refers to the state of its nearest neighbor in the next frame. The final reward is computed as follows:

$$\begin{aligned} r^*_{i,t}=r_{i,t}+\beta r_{j,t+1}, \end{aligned}$$
(6)

where \(\beta \ge 0\) is the balance parameter.

For the actions \(\{{update,\ ignore,\ block}\}\), \(r_{i,t}\) is defined by the IoU of the predicted location with the ground truth in the next frame. If the IoU is too small or the object disappears, \(r_{i,t}\) is set to \(-1\):

$$\begin{aligned} r_{i,t} =\left\{ \begin{array}{cl} 1 &{} \text {if}\ IoU \ge 0.7 \\ 0 &{} \text {if}\ 0.5 \le IoU < 0.7 \\ -1 &{} \text {else} \end{array} \right. . \end{aligned}$$
(7)

For the action delete, \(r_{i,t}\) is defined by the state of the object: if the object disappears in the next frame, \(r_{i,t}\) is 1, and otherwise \(-1\):

$$\begin{aligned} r_{\text {delete}} =\left\{ \begin{array}{cl} 1 &{} \text {if object disappeared} \\ -1 &{} \text {else} \end{array} \right. . \end{aligned}$$
(8)
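The rewards of Eqs. (6)-(8) can be sketched as follows; the ground-truth IoU is only available during training, and \(\beta = 0.4\) below is the value found by the grid search in Sect. 4.3:

```python
def step_reward(action, iou_with_gt, disappeared):
    """Per-agent reward r_{i,t} for one action (Eqs. (7)-(8))."""
    if action == "delete":                      # Eq. (8)
        return 1.0 if disappeared else -1.0
    if disappeared:                             # update / ignore / block, Eq. (7)
        return -1.0
    if iou_with_gt >= 0.7:
        return 1.0
    if iou_with_gt >= 0.5:
        return 0.0
    return -1.0

def shared_reward(r_i, r_nearest_neighbor_next, beta=0.4):
    """Final reward r*_{i,t} shared with the nearest neighbor (Eq. (6))."""
    return r_i + beta * r_nearest_neighbor_next
```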

We compute the Q value of \(\{s_{i,t},a_{i,t}\}\) as follows:

$$\begin{aligned} Q(s_{i,t},a_{i,t})=r^*_{i,t} + \gamma r^*_{i,t+1}+\gamma ^2 r^*_{i,t+2}+\cdots , \end{aligned}$$
(9)

where \(\gamma \) is the discount parameter.
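A minimal sketch of the discounted return in Eq. (9), using \(\gamma = 0.8\) from the grid search in Sect. 4.3:

```python
def discounted_returns(rewards, gamma=0.8):
    """Q values of Eq. (9), computed backwards over a sequence of shared rewards r*_{i,t}."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]
```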

The optimization problem of the decision network is formulated as follows:

$$\begin{aligned} \arg \max _{\theta } L(\theta ) = \mathbb {E}_{s,a}\log (\pi (a\vert s,\theta ))Q(s,a), \end{aligned}$$
(10)

where \(\theta \) is the parameter set of the decision network, and the policy gradient can be computed as follows:

$$\begin{aligned} \begin{aligned} \nabla _{\theta }L(\theta )&= \mathbb {E}_{s,a}\nabla _{\theta }\log (\pi (a\vert s,\theta ))Q(s,a) \\&= \mathbb {E}_{s,a}\frac{Q(s,a)}{\pi (a\vert s,\theta )}\nabla _{\theta }\pi (a\vert s,\theta ) . \end{aligned} \end{aligned}$$
(11)

The gradient shows that we can increase the probability of actions with positive Q values and decrease the probability of actions with negative Q values. However, in some easy scenes the Q values of most actions are positive, while in some challenging cases or at the beginning of the training stage the Q values of all actions are negative. Hence, it is difficult for the policy network to converge. Therefore, we replace the Q value with the advantage value of each action, where we first compute the value of the state s as follows:

$$\begin{aligned} V(s)=\frac{\sum _{a}p(a\vert s)Q(s,a)}{\sum _{a}p(a\vert s)}. \end{aligned}$$
(12)

Then, the advantage value is computed as follows:

$$\begin{aligned} A(s,a) = Q(s,a ) - V(s). \end{aligned}$$
(13)

The final formulation of the policy gradient is defined as:

$$\begin{aligned} L(\theta ) = \mathbb {E}_{s,a}\log (\pi (a\vert s,\theta ))A(s,a). \end{aligned}$$
(14)

The parameter \(\theta \) can be updated as follows:

$$\begin{aligned} \theta= & {} \theta + \rho \frac{\partial L(\theta )}{\partial \theta } \nonumber \\= & {} \theta + \rho \mathbb {E}_{s,a}\frac{A(s,a)}{\pi (a\vert s,\theta )}\frac{\partial \pi (a\vert s,\theta )}{\partial \theta }. \end{aligned}$$
(15)
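A minimal PyTorch sketch of the advantage-weighted policy-gradient update of Eqs. (12)-(15) is given below. We assume the logits of the decision network for a batch of agents and a Q estimate for every action (available during training since the rewards are defined against the ground truth); this is a generic sketch, not the authors' MATLAB/MatConvNet implementation:

```python
import torch

def policy_gradient_step(logits, actions, q_values, optimizer):
    """logits, q_values: (N, 4) tensors; actions: (N,) long tensor of taken actions."""
    probs = torch.softmax(logits, dim=1)                      # pi(a | s, theta)
    log_probs = torch.log_softmax(logits, dim=1)
    value = (probs * q_values).sum(dim=1)                     # V(s), Eq. (12); probs sum to 1
    q_taken = q_values.gather(1, actions.view(-1, 1)).squeeze(1)
    advantage = (q_taken - value).detach()                    # A(s, a), Eq. (13), used as a weight
    loss = -(log_probs.gather(1, actions.view(-1, 1)).squeeze(1) * advantage).mean()  # Eq. (14)
    optimizer.zero_grad()
    loss.backward()                                           # gradient ascent on L(theta), Eq. (15)
    optimizer.step()
    return loss.item()
```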

Algorithm 1 summarizes the detailed learning procedure of our decision network.

Algorithm 1. The detailed learning procedure of our decision network

4 Experiments

4.1 Datasets

MOT15 [22]: It contains 11 training sequences and 11 testing sequences. For each testing sequence, there is a training sequence captured under similar conditions so that we can learn our model parameters accordingly. The most challenging testing sequence in MOT15 is AVG-TownCentre, because its frame rate is very low and there is no corresponding training sequence.

MOT16: It contains 7 training sequences and 7 testing sequences. Generally, MOT16 is more challenging than MOT15 because the ground-truth annotations are more accurate (some hard examples are taken into account), the background settings are more complex (e.g., with moving cars or a fast-moving camera), and the pedestrians are more crowded so that occlusions are more likely. The camera motion, camera angle and imaging conditions vary largely among different sequences in both datasets.

4.2 Evaluation Metrics

We adopted the widely used CLEAR MOT metrics [4], including multiple object tracking precision (MOTP) and multiple object tracking accuracy (MOTA), which combines false positives (FP), false negatives (FN) and identity switches (ID Sw), to evaluate the effectiveness of different MOT methods. We also used the metrics defined in [24]: the percentage of mostly tracked targets (MT, the ratio of ground-truth trajectories that are covered by a track hypothesis for at least 80% of their respective life span), the percentage of mostly lost targets (ML, the ratio of ground-truth trajectories that are covered by a track hypothesis for at most 20% of their respective life span), and the number of times a trajectory is fragmented (Frag, i.e., interrupted during tracking).

4.3 Implementation Details

Decision Network: Our decision network consists of a feature extraction part and a decision-making part. We used the feature extraction part of MDNet [31], which was pretrained on ImageNet [9], to extract the feature of each object. The input size of the network is \(3\times 107\times 107\). It consists of three consecutive convolution and max-pooling combos (\(7\times 7\times 96\), \(5\times 5\times 256\), \(3\times 3\times 512\), including batch normalization layers), followed by a fully connected layer that flattens the feature to a column vector D of size \(512\times 1\). We then compute the position feature P (of size \(4\times 1\)) and concatenate D and P into a mixed feature vector W. For the predicted \(agent_{1}\) with feature \(W_{1}\), we also extract the feature \(W_{2}\) of \(agent_{2}\), the agent closest to \(agent_{1}\) among the predictions, the feature \(W_{1}^{det}\) of its counterpart \(agent_{1}^{det}\) in the detections of the next frame, and the feature \(D_{1}^{pre}\) of \(agent_{1}\) in the previous frame. Concatenating all these features yields the input to the decision-making part (input size: \(2060\times 1\)). The structure of this part is relatively simple: we use 3 fully connected layers (with dropout during training) to reduce the dimension to \(4\times 1\), which corresponds to the four actions.
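A PyTorch sketch of this architecture is given below for concreteness; the kernel sizes, channel counts and the \(2060\times 1\) input dimension follow the text, while the strides, pooling sizes, hidden widths and dropout rate are assumptions (the original implementation used MatConvNet):

```python
import torch
import torch.nn as nn

class AppearanceNet(nn.Module):
    """Feature extractor: three conv + max-pool combos with batch norm,
    then a fully connected layer producing the 512-d appearance feature D."""
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, 96, 7, stride=2), nn.BatchNorm2d(96), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, 5, stride=2), nn.BatchNorm2d(256), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 512, 3, stride=1), nn.BatchNorm2d(512), nn.ReLU(), nn.MaxPool2d(3, 2),
        )
        self.fc = nn.Linear(512, 512)  # a 107x107 input reduces to 1x1x512 with these (assumed) strides

    def forward(self, x):              # x: (N, 3, 107, 107)
        return self.fc(self.convs(x).flatten(1))

class DecisionNet(nn.Module):
    """Decision head: three fully connected layers mapping the 2060-d
    concatenated feature [W1, W2, W1_det, D1_pre] to the four action logits."""
    def __init__(self, in_dim=2060, hidden=512, p_drop=0.5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 4),
        )

    def forward(self, w):              # w: (N, 2060)
        return self.mlp(w)
```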

In order to show that our network can learn to make decisions under various scenarios, we trained the decision network on all training sequences (from both MOT15 and MOT16) and then evaluated it on all the testing sequences without further processing. We trained the decision network for 10 epochs (one epoch loops through all training sequences of MOT15 and MOT16). We optimized the network with stochastic gradient descent with a weight decay of 0.0005 and a momentum of 0.9. We set the learning rate to 0.0002 for the first 5 epochs and to 0.0001 for the next 5 epochs. We applied a dynamic batch-size strategy: for each frame, all objects in that frame are fed to the network as one batch. This best mimics the real tracking process and thus helps the network transfer to real tracking scenarios.
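The optimization schedule above translates to the following sketch, reusing the DecisionNet class from the previous sketch (the scheduler merely halves the learning rate from 0.0002 to 0.0001 after epoch 5; the data loop is a placeholder):

```python
import torch

model = DecisionNet()  # decision head from the previous sketch; the feature extractor is trained jointly
optimizer = torch.optim.SGD(model.parameters(), lr=2e-4, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5], gamma=0.5)

for epoch in range(10):
    # one "dynamic batch" per frame: all agents of the current frame are fed together,
    # and policy_gradient_step(...) is called once per frame
    scheduler.step()
```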

For the reinforcement learning hyper-parameters, we first set the balance parameter \(\beta \) and the discount parameter \(\gamma \) to zero to simplify the training phase and let the decision network converge to a certain reward (0.637 here). We then performed a grid search while fine-tuning the network. As shown in Fig. 4, we obtain the maximum normalized reward (normalized to [0, 1]) when \(\gamma = 0.8\) and \(\beta = 0.4\), so we set the hyper-parameters to these values.

Prediction Network: We extracted all positive examples from all training sequences of the datasets. In order to simulate noisy situations, we merged the information of detections and ground-truth annotations and computed the IoU of the detection bounding boxes and the ground-truth bounding boxes. If \(IoU > 0.5\), the detection is valid and we put it into our dataset; otherwise, we treated the detection as a false positive and discarded it. Therefore, we combined the detection and ground-truth information when training the prediction network. Our prediction network shares the same feature extraction part with the C-DRL network. Having obtained the feature vector \(D \), we concatenated it with \(H^{10}(x, y, h, w)\), the trajectory of the past 10 frames of the target. We trained the network for 20 epochs with a batch size of 20. We used stochastic gradient descent with a learning rate of 0.002, a weight decay of 0.0005 and a momentum of 0.95, and halved the learning rate every 5 epochs. Our tracking system was implemented on the MATLAB 2015b platform with the MatConvNet [43] toolbox.
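The detection filtering described above can be sketched as follows; the pairing between each detection and its ground-truth box is assumed to be given, and iou is the helper sketched in Sect. 3.1:

```python
def filter_valid_detections(detections, gt_boxes):
    """Keep a detection only if its IoU with the matching ground-truth box of the
    same frame exceeds 0.5; otherwise treat it as a false positive and discard it.
    `detections` and `gt_boxes` are assumed to be matched lists of (x, y, w, h) boxes."""
    return [det for det, gt in zip(detections, gt_boxes) if iou(det, gt) > 0.5]
```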

Fig. 4. The average normalized rewards versus different \(\beta \) and \(\gamma \) on the MOT15 training dataset

Table 1. Performance of our method under different inter-frame relation thresholds

4.4 Ablation Studies

We conducted ablation studies on the MOT15 training set with the SubCNN detections provided in [47].

Influences of the Inter-frame Relation: We varied the amount of consecutive-frame information used in our network to investigate how it affects the performance. Our method automatically wipes out agents with relatively short continuous appearance, which is also exploited in the training stage of our C-DRL network (e.g., when an agent is lost for a certain number of frames, our method pops it out and updates the weights accordingly). We set the threshold from 1 to 9. From Table 1, we see that when more inter-frame information is utilized, more constraints on our agents can be included, so that noisy detection results can be better eliminated by our model. We also notice that as FP goes up, FN falls, which reflects the trade-off between precision and recall. Since MOTA saturates for THRESH \(\ge 8\), setting THRESH to 8 is a good choice for optimizing MOTA.

Influences of the Decision Network: We set the inter-frame threshold to 8 based on the conclusion of the previous part. Our original baseline (OB) is our full pipeline without modification. We replaced our decision network with the vanilla Hungarian algorithm and fixed all other parameters (DN \(\rightarrow \) HA). According to Table 2, the overall performance of the whole system falls drastically: in particular, the FP almost doubles and the ID switches increase by an order of magnitude. Our decision network effectively removes false positives and ID switches by taking appropriate actions.

Influences of the Prediction Network: We replaced our prediction network with a velocity-model method (PN \(\rightarrow \) VM), which predicts the position of each agent from its trace; in other words, we model the instant velocity of each agent by using its previous movement. According to the results shown in Table 2, the performance also degrades. As the movement of pedestrians in the MOT15 training set is relatively smooth and slow, edge cases such as turning or running are rare, so the velocity model does not perform badly. However, our original pipeline still gives more precise position predictions.

Influences of the MDNet Feature: We replaced the MDNet part of our decision and prediction networks with a simple color-histogram feature and then fed it to the fully connected layers. This time, the performance drop is slight, which means our reinforcement learning method is robust to different feature representations; nevertheless, a more delicate and informative feature still provides a boost.

From these results we can clearly see the advantage of our decision network and the effectiveness of the prediction network. As the decision network enhances the performance by a large margin, it is the core part of our whole system.

Table 2. Ablation studies of different settings
Fig. 5. Some tracking results on the MOT15 and MOT16 public detections, where the trajectory of each object has been painted from the first frame in the same color as its bounding box

4.5 Evaluations on MOT15

Comparison with State-of-the-Arts: For a fair comparison, we used the public detection results on MOT15 and MOT16. Sample results are shown in Fig. 5. As shown in Table 3, our method outperforms most state-of-the-art trackers on MOT15 under the MOTA metric, which is one of the most important and persuasive metrics in multi-object tracking, and is comparable with AMIR15 [34]. Moreover, we obtained the best FN among all online methods, which indicates that our method is able to recover missed detections effectively. We noticed that some methods such as LINF1 [10] obtain relatively good FP and ID Sw performance; however, they sacrifice many hard examples, which leads to a poor FN performance. Our method also outperforms all offline methods (which have access to all frames regardless of the time order and thus use far more information than an online method), which indicates that our network can learn contextual information well via the deep reinforcement learning framework.

Table 3. The performance of different methods on MOT15
Table 4. The performance of different methods on MOT16

4.6 Evaluations on MOT16

Comparison with State-of-the-Arts: As shown in Table 4, our method achieves the best MOTA result among all online MOT methods and is comparable to the best offline methods such as LMP [41] and FWT [13]. In terms of MT and ML, our method also achieves the best performance among all online methods, which indicates that our method can keep track of relatively more objects than other methods under complex environments. Since the detection results of MOT16 are more accurate, our decision network and prediction network can learn more correct behaviors and decrease the possibility of losing objects. Another observation is that our method obtains the best FN performance among all online methods, because it recovers some objects that were missed by the detector via the decision network. Since the public detector does not cover all positive samples in MOT16, the FN rates are naturally high for all methods; however, our method addresses this decently and outperforms the offline methods by a large margin in this respect, which shows the effectiveness of the decision network, where collaborative interactions maximize the use of contextual information and enhance the generalization ability of our network. Also, our FP is the second best among both online and offline methods, which means our method has a strong ability to eliminate the false positives that exist in the detection results. In Fig. 6(a), the top image shows the provided public detection results, which contain multiple detections of the same people; in the tracking result below, our method successfully eliminates those redundant detections.

Fig. 6. (a) Eliminating false positives; (b) ID switch problems

Failure Cases: Figure 6(b) shows some failure examples of our method. In the first row, we see that when people walk past each other, it is easy for them to switch their IDs. For example, the woman in white is initially in the blue box, but the blue box moves to the man in blue in the following frames. In the second row, we see that when an occlusion lasts for a long time, the reappeared person is assigned a new ID (i.e., a bounding box with a new color in our picture). For instance, the man in white is initially in a yellow box and is occluded by another person in the second frame; when he reappears in the third frame, he is given a newly assigned yellow box. Our method has relatively high ID switch and Frag values (these two metrics are closely correlated) on both the MOT15 and MOT16 datasets, which indicates that our decision network is sometimes over-cautious when conditions change; in such scenarios, our method assigns the object a new ID label. For memory optimization, we keep an object in our model for only a few frames (2 in our experiments) once it gets lost. For videos with high frame rates, objects may be lost for relatively more frames due to occlusions, which also causes ID switches. This can be relieved by keeping the features of possibly disappeared objects in our model for more frames and by training the network with more similar sequences so that it can better utilize dynamic information. Another reason is that when two or more objects move towards each other, both their positions and appearance features become extremely similar, which poses a large challenge for MOT trackers.

5 Conclusion

In this paper, we have proposed a collaborative deep reinforcement learning method for multi-object tracking. Specifically, we have employed a prediction network to estimate the locations of objects in the next frame, and used deep reinforcement learning to combine the prediction and detection results and to make decisions on state updates, so as to overcome occlusions and missed or false detections. Experimental results on the challenging MOT15 and MOT16 benchmarks have been presented to show the effectiveness of our approach. Applying our method to multi-object tracking across camera networks is an interesting direction for future work.