1 Introduction

Short-term visual object tracking has been an active research topic in computer vision due to its widespread application areas. In recent years, the community has witnessed rapid development and seen many successful trackers emerge thanks to standardized evaluation protocols, publicly available benchmarks and competitions [1,2,3,4,5,6]. To adapt to target appearance changes, short-term trackers update their tracking models over time. However, that makes them prone to model corruption and drifting in the case of persistent occlusions, when the tracker adapts to the occluding object and starts to track it.

To avoid corruption, a tracker should be able to discriminate between the target object and the rest of the scene so that it can stop model updating if the target is occluded. However, this is a challenging task in the RGB space if there are occluders with similar visual appearance. To alleviate this issue, adding the depth cue to short-term trackers is intuitive: even if a tracked object is occluded by another object with similar appearance, the difference in their depth levels remains distinctive and helps to detect the occlusion. The availability of affordable depth sensors makes adoption of the depth cue even more attractive.

Since the depth channel lacks texture, depth alone may not provide useful information for visual tracking. On the other hand, RGB trackers perform competitively as long as no occlusions occur (see Table 1). Therefore, our work aims at benefiting from the huge amount of effort that has been put into generic short-term RGB trackers and adopts depth as a means for occlusion detection. As a novel solution, we propose a generic framework that can be used with any VOT-compliant [6] short-term tracker to convert it into an RGBD tracker with depth-augmented occlusion detection. Because the framework is applied through a clear interface and does not change the internal structure of the short-term tracker, integration is fast and the framework will benefit from ever-improving short-term tracker performance in the future.

The proposed framework contains two main components: short-term failure detection and recovery from occlusion. The short-term failure detector continuously evaluates the target region to decide whether to allow the short-term tracker to update its model or to switch to the recovery-from-occlusion mode. The framework also contains an optional, powerful third component which can be used with RGB trackers that accept foreground masks explicitly indicating occluded regions that do not belong to the target (e.g. CSR-DCF [7]).

Fig. 1.

Overview of the proposed framework. The short-term RGB tracker provides bounding box coordinates to the framework, which are used for segmenting the visible target region with the help of depth. Using the ratio of the visible region to the bounding box, occlusions are detected; the short-term tracker update is then stopped and the recovery mode is started. If the object is re-detected during recovery, the RGB tracker is resumed.

The main contributions of this paper are:

  • A generic framework to convert an arbitrary RGB short-term tracker into an RGBD tracker.

  • Formulation of the framework’s core component, non-occluded foreground segmentation, as an energy function with three data terms (depth, color and a spatial prior), which is optimized using graph cuts.

  • RGBD versions of one baseline and three state-of-the-art short-term RGB trackers: DCF [8], ECO (ECOhc and ECOgpu variants) [9] and CSR-DCF [7].

The rest of the paper is organized as follows: Sect. 2 summarizes existing literature on generic short-term tracking and RGBD trackers, Sect. 3 explains the proposed framework in detail, Sect. 4 presents the experiments, and finally Sect. 5 concludes the paper.

2 Related Work

The aim of the proposed generic framework is to convert any existing short-term RGB tracker into an RGBD tracker. We are motivated by the facts that the field of short-term RGB tracking progresses steadily and that RGB provides a strong cue for tracking. On the other hand, we also believe that depth can be used as a complementary cue to indicate when a short-term tracker should be stopped and switched to recovery mode. In this sense, the proposed framework benefits from state-of-the-art short-term trackers, which are briefly surveyed below in addition to the recent RGBD trackers (Fig. 1).

RGB Trackers – generic, short-term visual object tracking on RGB videos is a well-established research topic in computer vision and the main approaches can be grouped under two main categories. Generative Trackers store a target model and aim to find the best matching region in the next frame. A few descriptive examples for this category are Incremental Visual Tracking (IVT) [10], Structural Sparse Tracking (SST) [11] and kernel-based object tracking [12]. Discriminative Trackers, on the other hand, continuously train a classifier using positive and negative samples acquired during the tracking process. Prominent examples of this category are Tracking-Learning-Detection (TLD) [13], Continuous Convolutional Operators Tracking (CCOT) [14], Multi-Domain Convolutional Neural Networks (MDNet) [15], Efficient Convolution Operators for Tracking (ECO) [9], and Discriminative Correlation Filter with Channel and Spatial Reliability (CSR-DCF) [7]. Due to their success in the last few years, discriminative trackers have been widely adopted in recent works; for example, in the VOT 2017 challenge, 67% of the submissions were from this category [6]. However, training a classifier can be computationally expensive, which has prompted the adoption of simple yet powerful methods for the training stage. Starting with the seminal work of Bolme et al. [16], Discriminative Correlation Filter (DCF) based trackers have gained momentum due to their performance, fast model update (training) and mathematical elegance. Henriques et al. [17] proposed a method for efficient training on multiple samples that improves performance while providing a very high frame rate. To suppress the border artefacts resulting from circular correlation, Galoogahi et al. [8] posed DCF learning as a more complex optimization problem which can still be solved efficiently with the Augmented Lagrangian Method (ALM). Lukezic et al. [7] further improved this idea by introducing spatial reliability maps to extract unpolluted foreground masks. In VOT 2017, DCF-based algorithms constituted almost 50% of the submitted trackers [6], with ECO [9] and CSR-DCF [7] among the top performers, while the CSR-DCF C++ implementation won the best real-time tracker award.

RGBD Trackers – compared to generic, short-term tracking on RGB, RGBD tracking is a relatively unexplored area. This can be partly attributed to the lack of datasets with ground truth until recently. Song et al. [18] captured and annotated a dataset consisting of 100 videos with an online evaluation system, and their benchmark is still the largest available. They also provided multiple baseline algorithms under two main categories: depth as an additional cue and point cloud tracking. The former treats depth as an extra channel for HOG features [19], whereas point cloud tracking methods use 3D point clouds to generate 3D bounding boxes. Among the ten proposed variants, the one with RGBD HOG features, boosted by optical flow and an occlusion detector, achieved the best performance.

The seminal work of Song et al. inspired many follow-ups. Meshgi et al. [20] proposed an occlusion-aware particle filter based tracker that can handle persistent occlusions in a probabilistic manner. Bibi et al. [21] also used a particle filter framework with sparse parts for appearance modeling. In their model, each particle is a sparse, linear combination of 3D cuboids which stays fixed during tracking. When there is no occlusion, they first make a coarse estimate of the target location using 2D optical flow and then sample particles over the rotation R and translation T spaces. Occlusion is detected by counting the number of points in the 3D cuboid representation. The success of the DCF approach naturally caught attention in the RGBD community as well. To the best of our knowledge, the first DCF-based RGBD tracker was proposed by Camplani et al. [22]. They first cluster the depth histogram and then fit a single Gaussian distribution to model the tracked object in the depth space; to extract the foreground object, they assume that the cluster with the smallest mean is the object. The second DCF-based method was proposed by An et al. [23], where Kernelized Correlation Filters (KCF) are used in conjunction with depth-based segmentation for target localization, and heuristic approaches are adopted to detect whether the object is occluded.

Recently, Kart et al. [24] proposed an algorithm that uses depth as a means to generate masks for DCF updates. Although this work is in a similar spirit, our method differs from theirs in multiple fundamental aspects. First of all, the authors incorporate neither color nor spatial cues for the mask creation, which discards vital sources of information; especially in sequences where the target object and the occluding object have similar depth levels, the algorithm is likely unable to discriminate between them even if they have different colors. Secondly, their foreground segmentation consists of a simple thresholding of depth probabilities, an ad-hoc approach that requires careful fine-tuning. Finally, the authors propose a brute-force, full-frame grid search for recovering from the occlusion state, whereas we propose to use the motion history of the target object to adaptively generate significantly smaller search areas and avoid redundant computation.

3 RGBD Converter Framework

The proposed framework offers two levels of integration, with level two being optional for trackers that can use a foreground mask in their model update. In the level-one integration, the framework continuously estimates the visibility state of the target object by casting the visibility problem as a pixel-wise foreground-background segmentation from multiple information sources: color, spatial proximity and depth. The segmentation result is the foreground mask. Without interfering with the internal structure of the RGB tracker, the framework uses the tracker's output bounding box to obtain a region of interest (ROI) for the segmentation step. If the ratio between the visible and occluded pixels is below a threshold, model updating of the RGB tracker is stopped and the framework goes into occlusion recovery mode. In this mode, the search region is continuously expanded around the last known location of the target. The search is performed by running the RGB tracker in a coarse-to-fine manner to find its maximum response r in the search region, and this score is compared to the mean of the last N valid responses (Sect. 3.4, Algorithm 1). Once the target is re-detected, RGB tracker updating resumes. The level-two integration is available for trackers that use foreground masks in their model update (Sect. 3.5).
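As a minimal illustration of the level-one decision logic (not the authors' code; the function name and the default threshold value are placeholders):

```python
import numpy as np

def tracking_state(fg_mask, bbox_area, tau=0.1):
    """Decide whether the RGB tracker may update its model.

    fg_mask   -- binary foreground mask inside the tracker bounding box
    bbox_area -- number of pixels in the bounding box
    tau       -- visibility threshold (placeholder value)
    """
    visibility = np.count_nonzero(fg_mask) / float(bbox_area)
    if visibility < tau:
        return "occlusion_recovery"   # stop model updates, start re-detection
    return "tracking"                 # target visible, allow model update
```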

Fig. 2.

The workflow diagram of the proposed framework. The framework uses the bounding box provided by the RGB tracker and the depth frame to decide whether the target object is visible. If it is visible, the RGB tracker is allowed to update its model and continue tracking. If the target disappears, the framework runs the occlusion recovery module, where the target object is searched for using the last valid target model of the RGB tracker.

For foreground segmentation, we adopt the energy minimization formulation in [25]:

$$\begin{aligned} E(f) = E_{smooth}(f)+E_{data}(f) \end{aligned}$$
(1)

The goal is to find a pixel-wise labeling f (foreground/background) that minimizes the energy. \(E_{smooth}\) represents the smoothness prior that penalizes neighboring pixels being labeled differently and \(E_{data}\) represents the energy based on the observed data. For \(E_{smooth}\), we adopt the efficient computation of smoothed priors in [26], and \(E_{data}\) we formulate as

$$\begin{aligned} E_{data}(f) = E_{color}(f)+E_{spatial}(f)+E_{depth}(f) \end{aligned}$$
(2)

where \(E_{color}\) measures the likelihood of the observed pixel color given the target color model, \(E_{depth}\) models the target region’s depth and \(E_{spatial}\) is the spatial prior driven by the tracker location in the current frame. At the core of our approach are proper formulations of \(E_{color}\) (Sect. 3.1), \(E_{depth}\) (Sect. 3.2) and \(E_{spatial}\) (Sect. 3.3) so that the global optimum can be computed efficiently using graph cut algorithms [25, 27] (Fig. 2).
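For illustration, a minimal sketch of how the three data terms could be combined and minimized with a graph cut is given below. It assumes the PyMaxflow library, pre-computed per-pixel energies (non-negative, e.g. negative log-probabilities) and a constant Potts smoothness weight in place of the smoothed prior of [26]; the variable names and the pairwise weight are placeholders rather than the actual implementation.

```python
import maxflow   # PyMaxflow; assumed third-party dependency
import numpy as np

def segment_foreground(e_color, e_spatial, e_depth, pairwise_weight=1.0):
    """Minimize E(f) = E_smooth(f) + E_data(f) over fg/bg labels with a graph cut.

    e_color, e_spatial, e_depth -- HxWx2 arrays of non-negative unary energies
    for labels {0: background, 1: foreground} (illustrative layout).
    """
    e_data = e_color + e_spatial + e_depth                  # Eq. (2)
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(e_data.shape[:2])
    # simple Potts smoothness between 4-connected neighbours (stand-in for [26])
    g.add_grid_edges(nodes, pairwise_weight)
    # t-links: source capacities carry the foreground cost, sink capacities the
    # background cost, so nodes cut to the sink side receive the foreground label
    g.add_grid_tedges(nodes, e_data[..., 1], e_data[..., 0])
    g.maxflow()
    return g.get_grid_segments(nodes)                       # True = foreground
```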

An example of the segmentation process is given in Fig. 3. As can be seen, the color-based term assigns high confidence to both the target and the occluding object. However, the depth component is able to discriminate between the two, while the spatial component ensures high probability for the pixels that are close to the center of the tracking window.

Fig. 3.

Energy components in (2) and the segmentation output. Depth provides a strong cue even if the tracked and the occluding objects have very similar appearance.

3.1 Color-Based Target-Background Model \(E_{color}\)

In our formulation, \(E_{color}\) represents conditional dependencies between random variables (pixel fg/bg labels), for which we adopt a conditional random field formulation. The formulation uses the foreground/background probabilities as

$$\begin{aligned} E_{color} = \sum _{i \in \mathcal {V}}\psi _i(x_i) \end{aligned}$$
(3)

where i is a graph vertex index (pixel) and \(x_i\) its corresponding label. \(\psi _i\) is encoded as a probability term

$$\begin{aligned} \begin{aligned} \psi _i(x_i = 0) = -\log \left( p(x_i \notin fg)\right) \\ \psi _i(x_i = 1) = -\log \left( p(x_i \in fg)\right) \end{aligned} \end{aligned}$$
(4)

Since tracking is a temporal process, we need to add the frame number indicator to our notation \(x_i \Rightarrow x^{(t)}_i\) where (t) is the current and (\(t-1\)) the previous frame.

The probabilities \(p(\cdot )\) can be efficiently computed using the color histograms of the foreground and background, \(h_{f}\) and \(h_{b}\), respectively. Note that these histograms are updated after every processed frame to adapt to appearance changes; therefore, while processing frame t, the most recent color histograms are \(h^{(t-1)}_{f}\) and \(h^{(t-1)}_{b}\). The color probability term is then

$$\begin{aligned} p\left( x^{(t)}_i \in fg\right) = p\left( x^{(t)}_i = 1 \mid \hbox {hsv}(x^{(t)}_i),h^{(t-1)}_{f},h^{(t-1)}_{b}\right) \end{aligned}$$
(5)

where the \(hsv(\cdot )\) function returns the HSV color space value of the pixel corresponding to the label \(x_i\) in the current frame. The histograms are computed in 3D using \(8\times 8\times 8=512\) uniformly distributed bins.
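As a rough illustration of (5) (not the exact implementation; the bin layout and the way the two histograms are combined follow the same normalization as (7) and are our assumption):

```python
import numpy as np

N_BINS = 8   # 8 x 8 x 8 = 512 uniformly distributed bins, as in the paper

def hsv_bin_index(hsv_pixel):
    """Map an HSV pixel with channels in [0, 1] to a flat histogram bin index."""
    h, s, v = np.minimum((np.asarray(hsv_pixel) * N_BINS).astype(int), N_BINS - 1)
    return (h * N_BINS + s) * N_BINS + v

def color_fg_probability(hsv_pixel, hist_fg, hist_bg, eps=1e-8):
    """p(x_i in fg | hsv(x_i), h_f^(t-1), h_b^(t-1)) from normalized histograms."""
    b = hsv_bin_index(hsv_pixel)
    pf, pb = hist_fg[b], hist_bg[b]
    return (pf + eps) / (pf + pb + 2 * eps)
```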

3.2 Depth-Based Target-Background Model \(E_{depth}\)

We model the depth-induced energy \(E_{depth}\) similarly to color, using depth histograms \(\hat{h}_{f}\) and \(\hat{h}_{b}\):

$$\begin{aligned} p\left( x^{(t)}_i \in fg\right) = p\left( x^{(t)}_i = 1 \mid \hbox {depth}(x^{(t)}_i),\hat{h}^{(t-1)}_{f},\hat{h}^{(t-1)}_{b}\right) \end{aligned}$$
(6)

where the depth probability is defined via Bayes' rule (we use d to denote \(\hbox {depth}(x^{(t)}_i)\) for a more compact representation)

$$\begin{aligned} p\left( x^{(t)}_i = 1 \mid d,\hat{h}^{(t-1)}_{f}, \hat{h}^{(t-1)}_{b}\right) = \frac{p\left( x^{(t)}_i = 1 \mid d,\hat{h}^{(t-1)}_{f}\right) }{p\left( x^{(t)}_i = 1 \mid d,\hat{h}^{(t-1)}_{f}\right) + p\left( x^{(t)}_i = 0 \mid d,\hat{h}^{(t-1)}_{b}\right) } \end{aligned}$$
(7)

The above depth histograms are computationally efficient, but strongly biased against unseen depth levels. More precisely, since probabilities for previously seen depth levels are high, pixels in the current frame (t) with the same depth levels are more likely to be assigned to the foreground, and the model easily fails to introduce new depth levels. To tackle this problem, we add foreground and background distribution priors in the spirit of Bayesian estimation. For the foreground histogram estimation prior, we adopt the triangle function which has its maximum at the foreground depth mode (d denotes \(\hbox {depth}(x_i)\) and \(||\cdot ||\) is the length of the histogram)

$$\begin{aligned} \varPsi _{f}(d) = \text {tri}(d) = \left( 1-\frac{|d-\hbox {mode}(\hat{h}^{(t-1)}_{f})|}{||\hat{h}^{(t-1)}_{f}||}\right) \cdot \gamma \end{aligned}$$
(8)

and for the background histogram estimation, we adopt the uniform distribution as a non-informative prior

$$\begin{aligned} \varPsi _{b}(d) = \text {unif}(d) =\frac{1}{||\hat{h}^{(t-1)}_{b}||} \cdot \theta \end{aligned}$$
(9)

\(\gamma \) and \(\theta \) are constants that control the prior gains. The choice of a triangle distribution for the foreground and a uniform distribution for the background stems from the following: for the foreground, newly seen depth levels are expected to be similar to the current depth (e.g. a rotating object) and depth values in general are concentrated around the mode/mean. However, we cannot make any assumptions about the background and therefore adopt the non-informative prior in (9).
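A minimal sketch of the prior-augmented depth probability, following (7)-(9); the exact way the priors enter the likelihoods is our reading of the text, and the prior gains are placeholder values:

```python
import numpy as np

def depth_fg_probability(d, hist_fg, hist_bg, gamma=1.0, theta=1.0):
    """Foreground probability of depth bin d with the priors of Eqs. (8)-(9).

    hist_fg, hist_bg -- normalized depth histograms from frame t-1
    gamma, theta     -- prior gains (placeholder values)
    """
    n_f, n_b = len(hist_fg), len(hist_bg)
    mode_f = int(np.argmax(hist_fg))
    tri = (1.0 - abs(d - mode_f) / float(n_f)) * gamma    # Eq. (8), foreground prior
    unif = (1.0 / n_b) * theta                            # Eq. (9), background prior
    p_f = hist_fg[d] + tri                                # likelihood + prior
    p_b = hist_bg[d] + unif
    return p_f / (p_f + p_b)                              # normalization as in Eq. (7)
```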

To ensure continuity across depth levels without compromising quick updates, we propose to apply a smoothing filter \(g^t(d)\) to the observed histogram in the update stage, where \(g^t(d)\) is a single Gaussian function centered at the histogram mode. By suppressing depth values that are highly unlikely to belong to the current observation, it provides a safety mechanism against wrong detections and drifting. Thus, the depth histogram update takes the following online form:

$$\begin{aligned} \begin{aligned}&\hat{h}^{(t)}_{f} = \alpha \hat{h}^{(t-1)}_{f} + \left( (1-\alpha ) \hat{h}^{(t)}_{f}\right) \odot g^t(d) \\&\hat{h}^{(t)}_{b} = \alpha \hat{h}^{(t-1)}_{b} + \left( (1-\alpha ) \hat{h}^{(t)}_{b}\right) \end{aligned} \end{aligned}$$
(10)
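The update in (10) can be sketched as follows (the Gaussian width and the choice of centring g^t(d) at the mode of the observed foreground histogram are assumptions):

```python
import numpy as np

def update_depth_histograms(h_fg_prev, h_bg_prev, h_fg_obs, h_bg_obs,
                            alpha=0.9, sigma=5.0):
    """Online depth-histogram update of Eq. (10).

    The Gaussian g^t(d) suppresses depth levels far from the currently
    observed mode before they are blended into the foreground model.
    """
    bins = np.arange(len(h_fg_obs))
    mode = np.argmax(h_fg_obs)
    g = np.exp(-0.5 * ((bins - mode) / sigma) ** 2)       # smoothing filter g^t(d)
    h_fg = alpha * h_fg_prev + (1 - alpha) * h_fg_obs * g
    h_bg = alpha * h_bg_prev + (1 - alpha) * h_bg_obs
    return h_fg, h_bg
```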

3.3 Spatial Prior \(E_{spatial}\)

The third energy term in our model is a spatial prior that gives preference to foreground labels near the object center suggested by the short-term tracker:

$$\begin{aligned} p\left( x^{(t)}_i \in fg\right) = p\left( x^{(t)}_i = 1 \mid \mathbf {x}(x^{(t)}_i)\right) = k\left( \mathbf {x}(x^{(t)}_i); \sigma \right) \end{aligned}$$
(11)

where \(\mathbf {x}(\cdot )\) provides the spatial location (x, y) of the label \(x_i\) and \(k(x;\sigma )\) is a clipped Epanechnikov kernel commonly used in kernel density estimation.
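A sketch of the spatial prior in (11), assuming the kernel is centred on the tracker window and that its bandwidth scales with the window size (the scale factor is a placeholder):

```python
import numpy as np

def spatial_prior(height, width, sigma_scale=0.6):
    """Clipped Epanechnikov prior over the tracker window, cf. Eq. (11).

    Returns an HxW map with the highest foreground probability at the window
    centre, decaying quadratically towards the borders and clipped at zero.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    r2 = ((ys - cy) / (sigma_scale * height)) ** 2 + \
         ((xs - cx) / (sigma_scale * width)) ** 2          # normalized squared distance
    return np.clip(1.0 - r2, 0.0, 1.0)                     # Epanechnikov profile
```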

3.4 Occlusion Recovery

Given the energy terms \(E_{color}\), \(E_{depth}\) and \(E_{spatial}\), graph cut [25] provides a labeling of each pixel in the tracker window by minimizing the energy. If the number of foreground pixels falls below a threshold \(\tau \), the tracker is stopped and the recovery process is started. To this end, we propose to use the trained RGB model \(M^{t}\) as an object detector, since the depth information is no longer reliable, especially when the occlusion is persistent.

The proposed recovery strategy is based on three principles: (i) the target object will be found again near the spatial location where it was previously seen, (ii) the tracker response of a recovered object must be similar (proportional by \(\varOmega \)) to the responses of the previously tracked frames (\(N=30\) in our experiments), and (iii) the search region must be expanded at a speed proportional to the object's average speed before it was lost. By expanding the search region adaptively, the computational redundancy of processing irrelevant spatial regions is avoided. Algorithm 1 summarizes the occlusion recovery process.

Algorithm 1. Occlusion recovery.
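Since the pseudocode is not reproduced here, the sketch below outlines one recovery iteration under our reading of the three principles above; the detector interface, the expansion rule and the acceptance test are illustrative, not Algorithm 1 verbatim:

```python
def recovery_step(tracker, frame, search_region, avg_speed,
                  mean_valid_response, omega=0.7):
    """One occlusion-recovery iteration (illustrative sketch).

    tracker             -- frozen RGB tracker model M^t used as a detector
    search_region       -- (x, y, w, h) around the last known target location
    avg_speed           -- average target speed before the occlusion (px/frame)
    mean_valid_response -- mean of the last N valid tracker responses
    omega               -- acceptance ratio for a re-detection
    """
    x, y, w, h = search_region
    # principle (iii): expand the region proportionally to the former speed
    expanded = (x - avg_speed, y - avg_speed, w + 2 * avg_speed, h + 2 * avg_speed)
    # principle (i): search only near the last known location
    bbox, response = tracker.detect(frame, region=expanded)   # hypothetical interface
    # principle (ii): accept only responses comparable to past valid ones
    if response >= omega * mean_valid_response:
        return bbox, expanded        # re-detected: resume normal tracking
    return None, expanded            # still occluded: keep expanding next frame
```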

3.5 Target-Background Mask Extension for CSR-DCF

This section relates to the level-two integration explained at the beginning of Sect. 3; as the example case we use the CSR-DCF tracker [7]. Since the original idea of using Discriminative Correlation Filters (DCF) for tracking [16, 28], many improvements have been proposed. An efficient solution in the Fourier domain was proposed by Henriques et al. [29], and their work was followed by an important extension by Galoogahi et al. [8], who relaxed the assumption of circularly symmetric filters. These extensions were adopted in CSR-DCF [7], which constructs a reliability mask used to mask out background regions during tracker updates. Intuitively, the CSR mask can be replaced with the proposed foreground mask, which is the output of the graph cut optimization (see Fig. 3 for an example mask). In our experiments, this significantly improves the performance of CSR-DCF since the proposed depth-based mask avoids model pollution more effectively. The level-two integration of our framework into CSR-DCF is simple: the CSR mask is replaced with the mask produced by minimizing (1).

4 Experiments

In this section, we present the results for various trackers augmented with the proposed framework. Four generic, short-term trackers were chosen due to their proven success and efficiency: DCF [8], ECO [9] and CSR-DCF [7]. Since ECO has two variants, we applied the proposed framework to both ECO-gpu (deep features) and ECO-hc (hand-crafted features).

4.1 Experimental Setup

Implementation Details – All experiments were run on a single laptop (Intel Core i7 3.6 GHz, Ubuntu 16.04) using non-optimized Matlab code. The parameters of the proposed algorithm were set empirically and kept constant during the experiments. Tracking parameters were as in the original works, except that the DCF and CSR-DCF filter learning rates were set to 0.03 and the number of color histogram bins to 512. The rest of the parameters can be found in the publicly available code of our framework [32].

Dataset – To validate the proposed framework we conducted experiments on the Princeton Tracking Benchmark (PTB) [18]. The dataset consists of 95 evaluation sequences and 5 validation sequences from 11 tracking categories, namely human, animal, rigid, large, small, slow, fast, occlusion, no occlusion, passive motion and active motion. The videos were recorded with a standard Kinect 1.0 device and all frames are annotated manually.

Evaluation Metrics – We use the metrics as they are provided by PTB [18].

However, the evaluation sequences do not contain publicly available ground truth except for the initial frame. To facilitate a fair comparison, Song et al. [18] also provide an online system to which the resulting coordinates are uploaded to obtain the final scores and ranking. The results of other methods in our paper were taken from the online system's website, with the exception of DLST [23], whose authors have not registered their method; the DLST scores were obtained from its paper. The depth images provided by Bibi et al. [21] are adopted in the experiments.

4.2 Comparison to State-of-the-Art

The results of the converted short-term trackers and the other top performing trackers on the PTB dataset are given in Table 1. Since the evaluation server did not allow multiple simultaneous submissions, we submitted each method separately and generated the leaderboard using the official protocol: methods are first ranked in each category and then the average rank is calculated.
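For clarity, the ranking protocol can be written out as a few lines (a simplified sketch: ties are not averaged and the input is a toy dictionary of per-category scores, not benchmark data):

```python
import numpy as np

def average_rank(scores):
    """PTB-style leaderboard: rank trackers per category, then average the ranks.

    scores -- dict {tracker_name: list of per-category success scores},
              all lists covering the same categories.
    """
    names = list(scores)
    score_matrix = np.array([scores[n] for n in names])          # trackers x categories
    ranks = (-score_matrix).argsort(axis=0).argsort(axis=0) + 1  # 1 = best per category
    return dict(zip(names, ranks.mean(axis=1)))
```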

Table 1. Comparison of short-term RGB and RGBD tracking methods on the Princeton Tracking Benchmark (PTB) [18]. DCF [8] and three state-of-the-art trackers were used within the framework – ECOgpu [9], ECOhc [9] and CSR-DCF [7]; their level-one RGBD extensions are denoted DCF-rgbd, ECO-rgbd and CSR-DCF-rgbd, while the level-two CSR-DCF integration, where the original RGB-based mask is replaced by the proposed foreground mask (Sect. 3.5), is denoted CSR-DCF-rgbd++. (The table shows results for the Princeton Benchmark as of June 15, 2018.)

The symbols \(\bullet \), \(\star \), \(\diamond \), and \(\circ \) in Table 1 mark the trackers that our framework was applied to. As can be observed, the proposed method clearly has a large impact on the overall rankings for all three trackers, and this impact becomes even more visible in sequences with occlusions. CSR-DCF improves by 8 ranks, ECOgpu by 7 ranks and ECOhc by 8 ranks. In terms of accuracy, the improvement is as strong as in the rankings: when the proposed framework (without foreground-masked updates) is applied to CSR-DCF, its performance on occlusion sequences increases by 18%, while ECOhc gains 6% and ECOgpu 11%. The level-two integration further boosts the occlusion-sequence accuracy of CSR-DCF to a total improvement of 26%.

Fig. 4.

Short-term and long-term occlusion examples comparing the original methods (red) and their RGBD versions (green). Top row – DCF, second row – ECOgpu, third row – ECOhc, bottom row – CSR-DCF. (Color figure online)

Unlike other top performing methods, CSR-DCF-rgbd++ also maintains a well-balanced performance over all the categories by staying among the top performers in every one. This suggests that it does not overfit to specific categories but provides consistent performance across different scenarios, which makes it a very suitable candidate for real-life applications.

Figure 4 shows that the proposed framework improves the trackers' resilience to both short-term and long-term occlusions. For example, ECOhc-rgbd was able to detect the occlusion and then re-detect the target object when it reappeared in the scene instead of drifting due to model pollution. As an example of long-term recovery, CSR-DCF-rgbd++ was able to recover even after approximately 35 frames of occlusion, since it avoided model corruption and expanded the search region gradually.

The better performance of CSR-DCF-rgbd++ compared to the other RGBD methods can likely be explained by its masked DCF update mechanism, which uses the foreground mask provided by the framework. In the discriminative tracking paradigm, the tracker's target model is updated over time to cope with visual changes. However, when a rectangular bounding box is used for this purpose, it is likely to include background and occluding-object pixels as well, which results in learning irrelevant information that may cause drifting. In the masked update approach, on the other hand, the updates use only the pixels that confidently belong to the target object. Thus, the target model stays uncorrupted, which results in better performance.

5 Conclusions

A generic framework was proposed for converting existing short-term RGB trackers into RGBD trackers. The framework is easy to adopt as it only requires control of model updating (stop/resume) and a tracking bounding box, which are both provided by any VOT-compliant tracker [6]. At the core of the framework is a foreground model which uses depth, color and spatial cues to efficiently detect occluded regions; these are utilized at two levels: occlusion detection and, optionally, masked tracker updates. In all experiments, the existing RGB trackers improved their ranks on the publicly available Princeton Tracking Benchmark [18]. The CSR-DCF tracker, which allows level-two integration of the proposed foreground model, achieved state-of-the-art accuracy and was ranked best on the day of submission. The full source code of the framework is publicly available [32].