
1 Introduction

Recent years have witnessed growing interest in developing visual object tracking algorithms for various vision applications. Existing tracking-by-detection approaches mainly consist of two stages. The first stage draws a large number of samples around the target object in the previous frame, and the second stage classifies each sample as the target object or as the background. In contrast, one-stage regression trackers [1,2,3,4,5,6,7,8] directly learn a mapping from a regular, dense sampling of target objects to soft labels generated by a Gaussian function to estimate target positions. One-stage regression trackers have recently received increasing attention due to their potential to be much faster and simpler than two-stage trackers. State-of-the-art one-stage trackers [1,2,3,4,5] are predominantly built on discriminative correlation filters (DCFs) rather than deep regression networks. Despite their top performance on recent benchmarks [9, 10], DCFs trackers take little advantage of end-to-end training, as learning and updating DCFs are independent of deep feature extraction. In this paper, we investigate the performance bottleneck of deep regression trackers [6,7,8], whose regression networks are fully differentiable and can be trained end-to-end. As regression networks have greater potential to exploit large-scale training data than DCFs, we believe that deep regression trackers can perform at least as well as DCFs trackers.

Fig. 1. Tracking results in comparison with state-of-the-art trackers. The proposed algorithm surpasses existing deep regression based trackers (CREST [8]), and performs well against the DCFs trackers (ECO [5], C-COT [4] and HCFT [3]).

We identify the main bottleneck impeding deep regression trackers from achieving state-of-the-art accuracy as the data imbalance [11] issue in regression learning. For two-stage trackers built upon binary classifiers, data imbalance has been extensively studied: positive samples are far fewer than negative samples, and the majority of negative samples are easy training data that contribute little to classifier learning. Although data imbalance is equally pertinent to regression learning, we note that current one-stage regression trackers [6,7,8] pay little attention to this issue. As evidence of its importance, state-of-the-art DCFs trackers improve tracking accuracy by re-weighting sample locations using Gaussian-like maps [12], spatial reliability maps [13] or binary maps [14]. In this work, to break the bottleneck, we revisit the shrinkage estimator [15] in regression learning. We propose a novel shrinkage loss to handle data imbalance when learning regression networks. Specifically, we use a Sigmoid-like function to penalize the importance of easy samples coming from the background (e.g., samples close to the boundary). This not only improves tracking accuracy but also accelerates network convergence. The proposed shrinkage loss differs from the recently proposed focal loss [16] in that our method penalizes the importance of easy samples only, whereas focal loss partially decreases the loss from valuable hard samples as well (see Sect. 3.2).

We observe that deep regression networks can be further improved by fully exploiting multi-level semantic abstraction across multiple convolutional layers. For instance, the FCNT [6] fuses two regression networks independently learned on the conv4-3 and conv5-3 layers of VGG-16 [17] to improve tracking accuracy. However, independently learning regression networks on multiple convolutional layers cannot make full use of multi-level semantics across these layers. In this work, we propose to apply residual connections to fuse both multiple convolutional layers and their output response maps. All the connections are fully differentiable, allowing our regression network to be trained end-to-end. For fair comparison, we evaluate the proposed deep regression tracker using the standard benchmark setting, where only the ground truth in the first frame is available for training. The proposed algorithm performs well against state-of-the-art methods, especially in comparison with DCFs trackers. Figure 1 shows such examples on two challenging sequences.

The main contributions of this work are summarized below:

  • We propose the novel shrinkage loss to handle the data imbalance issue in learning deep regression networks. The shrinkage loss helps accelerate network convergence as well.

  • We apply residual connections to fuse both multiple convolutional layers and their output response maps. Our scheme fully exploits multi-level semantic abstraction across multiple convolutional layers.

  • We extensively evaluate the proposed method on five benchmark datasets. Our method performs well against state-of-the-art trackers. We succeed in narrowing the gap between deep regression trackers and DCFs trackers.

2 Related Work

Visual tracking has been an active research topic with comprehensive surveys [18, 19]. In this section, we first discuss the representative tracking frameworks using the two-stage classification model and the one-stage regression model. We then briefly review the data imbalance issue in classification and regression learning.

Two-Stage Tracking. This framework mainly consists of two stages to perform tracking. The first stage generates a set of candidate target samples around the previously estimated location using random sampling, regularly dense sampling [20], or region proposals [21, 22]. The second stage classifies each candidate sample as the target object or as the background. Numerous efforts have been made to learn a discriminative boundary between positive and negative samples. Examples include the multiple instance learning (MIL) [23] and Struck [24, 25] methods. Recent deep trackers, such as MDNet [26], DeepTrack [27] and CNN-SVM [28], all belong to the two-stage classification framework. Despite the favorable performance on the challenging object tracking benchmarks [9, 10], we note that two-stage deep trackers suffer from a heavy computational load as they directly feed image-level samples into classification networks. Different from object detection, visual tracking puts more emphasis on the slight displacement between samples for precise localization. Two-stage deep trackers thus benefit little from the recent advance of ROI pooling [29], which cannot highlight the difference between highly spatially correlated samples.

One-Stage Tracking. The one-stage tracking framework takes the whole search area as input and directly outputs a response map through a learned regressor, which maps input features to soft labels generated by a Gaussian function. One representative category of one-stage trackers is based on discriminative correlation filters [30], which regress all the circularly shifted versions of the input image to soft labels. By computing the correlation as an element-wise product in the Fourier domain, DCFs trackers achieve the fastest speed thus far. Numerous extensions include KCF [31], LCT [32, 33], MCF [34], MCPF [35] and BACF [14]. With the use of deep features, DCFs trackers, such as DeepSRDCF [1], HDT [2], HCFT [3], C-COT [4] and ECO [5], have shown superior performance on benchmark datasets. In [3], Ma et al. propose to learn multiple DCFs over different convolutional layers and empirically fuse the output correlation maps to locate target objects. A similar idea is exploited in [4] to combine multiple response maps. In [5], Danelljan et al. reduce feature channels to accelerate the learning of correlation filters. Despite their top performance, DCFs trackers independently extract deep features to learn and update correlation filters; in the deep learning era, they can hardly benefit from end-to-end training. The other representative category of one-stage trackers is based on convolutional regression networks. The recent FCNT [6], STCT [7], and CREST [8] trackers belong to this category. The FCNT makes the first effort to learn regression networks over two CNN layers. The output response maps from different layers are switched according to their confidence to locate target objects. Ensemble learning is exploited in the STCT to select CNN feature channels. CREST [8] learns a base network as well as a residual network on a single convolutional layer. The output maps of the base and residual networks are fused to infer target positions. We note that current deep regression trackers do not perform as well as DCFs trackers. We identify the main bottleneck as the data imbalance issue in regression learning. By balancing the importance of training data, the performance of one-stage deep regression trackers can be significantly improved to compete with state-of-the-art DCFs trackers.

Data Imbalance. The data imbalance issue has been extensively studied in the learning community [11, 36, 37]. Helpful solutions involve data re-sampling [38,39,40] and cost-sensitive losses [16, 41,42,43]. For visual tracking, Li et al. [44] use a temporal sampling scheme to balance positive and negative samples to facilitate CNN training. Bertinetto et al. [45] balance the loss of positive and negative examples in the score map for pre-training the Siamese fully convolutional network. The MDNet [26] tracker shows that it is crucial to mine hard negative samples when training classification networks. The recent work [16] on dense object detection proposes the focal loss to decrease the loss from easy samples. Despite its importance, current deep regression trackers [6,7,8] pay little attention to data imbalance. In this work, we propose the shrinkage loss to penalize easy samples, which contribute little to learning regression networks. The proposed shrinkage loss significantly differs from the focal loss [16] in that we penalize the loss only from easy samples while keeping the loss of hard samples unchanged, whereas focal loss partially decreases the loss of hard samples as well.

Fig. 2. Overview of the proposed deep regression network for tracking. Left: fixed feature extractor (VGG-16). Right: regression network trained in the first frame and updated frame-by-frame. We apply residual connections to both convolutional layers and output response maps. The proposed network effectively exploits multi-level semantic abstraction across convolutional layers. With the use of shrinkage loss, our network breaks the bottleneck of data imbalance in regression learning and converges quickly.

3 Proposed Algorithm

We develop our tracker within the one-stage regression framework. Figure 2 shows an overview of the proposed regression network. To facilitate regression learning, we propose a novel shrinkage loss to handle data imbalance. We further apply residual connections to fuse both convolutional layers and their output response maps, fully exploiting multi-level semantics across convolutional layers. In the following, we first briefly revisit learning deep regression networks. We then present the proposed shrinkage loss in detail. Last, we discuss the residual connection scheme.

3.1 Convolutional Regression

Convolutional regression networks regress a dense sampling of inputs to soft labels which are usually generated by a Gaussian function. Here, we formulate the regression network as one convolutional layer. Formally, learning the weights of the regression network is to solve the following minimization problem:

$$\begin{aligned} \arg \min _{\mathbf {W}}\Vert \mathbf {W}*\mathbf {X}-\mathbf {Y}\Vert ^2+\lambda \Vert \mathbf {W}\Vert ^2, \end{aligned}$$
(1)

where \(*\) denotes the convolution operation and \(\mathbf {W}\) denotes the kernel weight of the convolutional layer. Note that there is no bias term in Eq. (1) as we set the bias parameters to 0. \(\mathbf {X}\) denotes the input features. \(\mathbf {Y}\) is the matrix of soft labels, and each label \(y\in \mathbf {Y}\) ranges from 0 to 1. \(\lambda \) is the regularization weight. We estimate the target translation by searching for the location of the maximum value of the output response map. The size of the convolution kernel \(\mathbf {W}\) is either fixed (e.g., \(5\times 5\)) or proportional to the size of the input features \(\mathbf {X}\). Let \(\eta \) be the learning rate. We iteratively optimize \(\mathbf {W}\) by minimizing the square loss:

$$\begin{aligned} \begin{aligned} L(\mathbf {W})&= \Vert \mathbf {W}*\mathbf {X}-\mathbf {Y}\Vert ^2+\lambda \Vert \mathbf {W}\Vert ^2 \\ \mathbf {W}_t&= \mathbf {W}_{t-1} - \eta \frac{\partial {L}}{\partial {\mathbf {W}}}, \end{aligned} \end{aligned}$$
(2)
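To make Eqs. (1) and (2) concrete, the following is a minimal PyTorch sketch of convolutional regression: a single bias-free convolutional layer regresses input features to a Gaussian soft-label map via gradient descent, with weight decay playing the role of \(\lambda \). All shapes, the kernel size, and the learning rate are illustrative assumptions, not the released Matlab/Caffe implementation.

```python
import torch
import torch.nn as nn

# Illustrative shapes: 128-channel features on a 31x31 response grid.
feat_ch, size = 128, 31
X = torch.randn(1, feat_ch, size, size)  # input features X

# Gaussian soft labels Y centered on the target, values in (0, 1].
yy, xx = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
sigma = 2.0
Y = torch.exp(-((yy - size // 2) ** 2 + (xx - size // 2) ** 2) / (2 * sigma ** 2))
Y = Y.view(1, 1, size, size).float()

# One convolutional layer as the regressor W (no bias, as in Eq. (1)).
W = nn.Conv2d(feat_ch, 1, kernel_size=5, padding=2, bias=False)
opt = torch.optim.SGD(W.parameters(), lr=1e-6, weight_decay=1e-4)  # decay acts as lambda

for _ in range(100):  # the iterative update of Eq. (2)
    P = W(X)  # response map P = W * X
    loss = ((P - Y) ** 2).sum()  # square loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```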
Fig. 3. (a) Input patch. (b) The corresponding soft labels \(\mathbf {Y}\) generated by a Gaussian function for training. (c) The output regression map \(\mathbf {P}\). (d) The histogram of the absolute difference \(|\mathbf {P}-\mathbf {Y}|\). Note that easy samples with small absolute difference scores dominate the training data.

3.2 Shrinkage Loss

For learning convolutional regression networks, the input search area has to contain a large body of background surrounding the target object (Fig. 3(a)). As the surrounding background contains valuable context, a large background area helps strengthen the discrimination of the target object from the background. However, it also increases the number of easy samples from the background. In total, these easy samples produce a large loss that overwhelms the valuable samples close to the target during learning. Formally, we denote the response map in every iteration by \(\mathbf {P}\), which is a matrix of size \(m\times n\). \(p_{i,j} \in \mathbf {P}\) indicates the probability of the position \(i\in [1,m],j\in [1,n]\) being the target object. Let l be the absolute difference between the estimated probability p and its corresponding soft label y, i.e., \(l=|p-y|\). Note that the larger the absolute difference l, the more likely the sample at location (i, j) is a hard sample, and vice versa. Figure 3(d) shows the histogram of the absolute differences; easy samples with small absolute difference scores dominate the training data.

In terms of the absolute difference l, the square loss in regression learning can be formulated as:

$$\begin{aligned} L_2= |p-y|^2 = l^2. \end{aligned}$$
(3)

The recent work [16] on dense object detection shows that adding a modulating factor to the cross-entropy loss helps alleviate the data imbalance issue. The modulating factor is a function of the output probability with the goal of decreasing the loss from easy samples. In regression learning, this amounts to re-weighting the square loss by a power of the absolute difference l as follows:

$$\begin{aligned} L_{F}=l^\gamma \cdot L_2 = l^{2+\gamma }. \end{aligned}$$
(4)

For simplicity, we set the parameter \(\gamma \) to 1 as we observe that the performance is not sensitive to this parameter. Hence, the focal loss for regression learning is equal to the \(L_3\) loss, i.e., \(L_{F}=l^3\). Note that, as a weight, the absolute difference l, \(l\in [0,1]\), penalizes not only easy samples (i.e., \(l<0.5\)) but also hard samples (i.e., \(l>0.5\)). Revisiting the shrinkage estimator [15] and the cost-sensitive weighting strategy [37] in learning regression networks, instead of using the absolute difference l as the weight, we propose a modulating factor with respect to l that re-weights the square loss so as to penalize easy samples only. The modulating function has a Sigmoid-like shape:

$$\begin{aligned} f(l) = \frac{1}{ 1 + \exp \left( a\cdot (c-l) \right) }, \end{aligned}$$
(5)

where a and c are hyper-parameters controlling the shrinkage speed and the localization of the shrinkage, respectively. Figure 4(a) shows the shapes of the modulating function with different hyper-parameters. Applying the modulating factor to weight the square loss yields the proposed shrinkage loss:

$$\begin{aligned} L_{S} = \frac{l^2}{ 1 + \exp \left( a\cdot (c-l) \right) }. \end{aligned}$$
(6)

As shown in Fig. 4(b), the proposed shrinkage loss only penalizes the importance of easy samples (when \(l<0.5\)) and keeps the loss of hard samples unchanged (when \(l>0.5\)) when compared to the square loss (\(L_2\)). The focal loss (\(L_3\)) penalizes both the easy and hard samples.
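As a quick numerical check of this comparison, the following minimal NumPy sketch tabulates the three losses as functions of l; the values a = 10 and c = 0.2 follow the settings reported for Eq. (7) below, and the script itself is illustrative.

```python
import numpy as np

# Square loss (Eq. (3)), focal loss with gamma = 1 (Eq. (4)),
# and shrinkage loss (Eq. (6)) as functions of l = |p - y|.
a, c = 10.0, 0.2
l = np.linspace(0.0, 1.0, 11)

L2 = l ** 2                                 # square loss
L3 = l ** 3                                 # focal loss (gamma = 1)
LS = l ** 2 / (1.0 + np.exp(a * (c - l)))   # shrinkage loss

for li, v2, v3, vs in zip(l, L2, L3, LS):
    print(f"l={li:.1f}  L2={v2:.4f}  L3={v3:.4f}  LS={vs:.4f}")
```

For easy samples (\(l<0.5\)) the shrinkage loss is shrunk well below the square loss, while for hard samples (\(l>0.5\)) the modulating factor approaches 1 and the shrinkage loss stays close to the square loss; the focal loss, in contrast, still scales the loss of hard samples down by the factor l.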

Fig. 4. (a) Modulating factors in Eq. (5) with different hyper-parameters. (b) Comparison between the square loss (\(L_2\)), focal loss (\(L_3\)) and the proposed shrinkage loss for regression learning. The proposed shrinkage loss only decreases the loss from easy samples (\(l<0.5\)) and keeps the loss from hard samples (\(l>0.5\)) unchanged.

When applying the shrinkage loss to Eq. (1), we take the cost-sensitive weighting strategy [37] and utilize the values of soft labels as an importance factor, e.g., \(\exp (\mathbf {Y})\), to highlight the valuable rare samples. In summary, we rewrite Eq. (1) with the shrinkage loss for learning regression networks as:

$$\begin{aligned} L_S(\mathbf {W}) = \frac{{\exp (\mathbf {Y})}\cdot \Vert \mathbf {W}*\mathbf {X}-\mathbf {Y}\Vert ^2}{1+\exp \left( a\cdot \left( c-\left| \mathbf {W}*\mathbf {X}-\mathbf {Y}\right| \right) \right) } + \lambda \Vert \mathbf {W}\Vert ^2. \end{aligned}$$
(7)

We set the value of a to 10 to shrink the weight function quickly, and the value of c to 0.2 to suit the distribution of l, which ranges from 0 to 1. Extensive comparison with the other losses shows that the proposed shrinkage loss not only improves the tracking accuracy but also accelerates the training speed (see Sect. 5.3 and Fig. 11).
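For illustration, Eq. (7) can be sketched as an element-wise loss on the response map (assuming PyTorch; the \(\lambda \Vert \mathbf {W}\Vert ^2\) term is left to the optimizer's weight decay, as in the earlier sketch):

```python
import torch

def shrinkage_loss(P, Y, a=10.0, c=0.2):
    """Element-wise sketch of Eq. (7), without the regularization term."""
    l = torch.abs(P - Y)                               # absolute difference l = |p - y|
    weight = torch.exp(Y)                              # cost-sensitive importance exp(Y)
    modulating = 1.0 / (1.0 + torch.exp(a * (c - l)))  # Sigmoid-like factor of Eq. (5)
    return (weight * modulating * l ** 2).sum()

# Usage on a dummy response map and its soft labels.
P, Y = torch.rand(1, 1, 31, 31), torch.rand(1, 1, 31, 31)
print(shrinkage_loss(P, Y))
```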

3.3 Convolutional Layer Connection

CNN models consist of multiple convolutional layers that emphasize different levels of semantic abstraction. For visual tracking, early layers with fine-grained spatial details are helpful in precisely locating target objects, while later layers maintain semantic abstraction that is robust to significant appearance changes. To exploit both merits, existing deep trackers [3, 5, 6] develop independent models over multiple convolutional layers and integrate the corresponding output response maps with empirical weights. For learning regression networks, we observe that semantic abstraction plays a more important role than spatial detail in dealing with appearance changes. The FCNT exploits both the conv4 and conv5 layers, and CREST [8] merely uses the conv4 layer. Our studies in Sect. 5.3 also suggest that regression trackers perform well when using the conv4 and conv5 layers as the feature backbone. For integrating the response maps generated over convolutional layers, we use a residual connection block to make full use of multi-level semantic abstraction of target objects. In Fig. 5, we compare our scheme with the ECO [5] and CREST [8] methods. The DCFs tracker ECO [5] independently learns correlation filters over the conv1 and conv5 layers. CREST [8] learns a base and a residual regression network over the conv4 layer. The proposed method in Fig. 5(c) fuses the conv4 and conv5 layers before learning the regression networks. Here we use the deconvolution operation to upsample the conv5 layer before connection. We reduce feature channels to ease the computational load as in [46, 47]. Our connection scheme resembles Option C of constructing the residual network [46]. Ablation studies affirm the effectiveness of this scheme in facilitating regression learning (see Sect. 5.3).
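A minimal sketch of this connection scheme follows (assuming PyTorch; channel numbers follow Sect. 5.1, where \(1\times 1\) convolutions reduce 512 channels to 128, and a stride-2 deconvolution upsamples conv5_3 to the conv4_3 resolution; the module is illustrative, not the released code):

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Fuse conv4_3 and conv5_3 features via a residual connection."""
    def __init__(self, in_ch=512, reduced=128):
        super().__init__()
        self.reduce4 = nn.Conv2d(in_ch, reduced, kernel_size=1)  # 1x1 channel reduction
        self.reduce5 = nn.Conv2d(in_ch, reduced, kernel_size=1)
        # Deconvolution upsamples conv5_3 by 2x to match conv4_3.
        self.up5 = nn.ConvTranspose2d(reduced, reduced, kernel_size=4, stride=2, padding=1)

    def forward(self, conv4_3, conv5_3):
        f4 = self.reduce4(conv4_3)
        f5 = self.up5(self.reduce5(conv5_3))
        return f4 + f5  # residual connection before regression learning

# Illustrative spatial sizes: conv4_3 at 46x46, conv5_3 at 23x23.
fused = FusionBlock()(torch.randn(1, 512, 46, 46), torch.randn(1, 512, 23, 23))
print(fused.shape)  # torch.Size([1, 128, 46, 46])
```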

Fig. 5. Different schemes to fuse convolutional layers. ECO [5] independently learns correlation filters over multiple convolutional layers. CREST [8] learns a base and a residual regression network over a single convolutional layer. We first fuse multiple convolutional layers via residual connection and then perform regression learning. Our regression network makes full use of multi-level semantics across multiple convolutional layers rather than merely integrating response maps as ECO and CREST do.

4 Tracking Framework

We detail the pipeline of the proposed regression tracker. In Fig. 2, we show an overview of the proposed deep regression network, which consists of model initialization, target object localization, scale estimation and model update. For training, we crop a patch centered at the estimated location in the previous frame. We use the VGG-16 [17] model as the backbone feature extractor. Specifically, we take the output responses of the \(conv4\_3\) and \(conv5\_3\) layers as features to represent each patch. The fused features via residual connection are fed into the proposed regression network. During tracking, given a new frame, we crop a search patch centered at the estimated position in the last frame. The regression network takes this search patch as input and outputs a response map, where the location of the maximum value indicates the target position. After obtaining the estimated position, we carry out scale estimation using the scale pyramid strategy as in [48]. To make the model adaptive to appearance variations, we incrementally update our regression network frame-by-frame. To alleviate noisy updates, the tracked results and soft labels of the last T frames are used for the model update.

5 Experiments

In this section, we first introduce the implementation details. Then, we evaluate the proposed method on five benchmark datasets including OTB-2013 [49], OTB-2015 [9], Temple128 [50], UAV123 [51] and VOT-2016 [10] in comparison with state-of-the-art trackers. Last, we present extensive ablation studies on different types of losses as well as their effect on the convergence speed.

5.1 Implementation Details

We implement the proposed Deep Shrinkage Loss Tracker (DSLT) in Matlab using the Caffe toolbox [52]. All experiments are performed on a PC with an Intel i7 4.0 GHz CPU and an NVIDIA TITAN X GPU. We use VGG-16 as the backbone feature extractor. We apply a \(1 \times 1\) convolution layer to reduce the channels of \(conv4\_3\) and \(conv5\_3\) from 512 to 128. We train the regression networks with the Adam [53] algorithm. Considering the large gap between the maximum values of the output regression maps over different layers, we set the learning rate \(\eta \) to 8e-7 for \(conv5\_3\) and 2e-8 for \(conv4\_3\). During online updates, we decrease the learning rates to 2e-7 and 5e-9, respectively. The number of frames T for model update is set to 7. The soft labels are generated by a two-dimensional Gaussian function with a kernel width proportional (0.1) to the target size. For scale estimation, we set the ratio of scale changes to 1.03 and the number of scale pyramid levels to 3. The average tracking speed, including all training processes, is 5.7 frames per second. The source code is available at https://github.com/chaoma99/DSLT.
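As an illustration of the soft-label generation described above, here is a minimal NumPy sketch; tying the Gaussian width to the square root of the target area is our assumption of how "proportional (0.1) to the target size" is realized:

```python
import numpy as np

def gaussian_soft_labels(map_h, map_w, center, target_h, target_w, ratio=0.1):
    """2-D Gaussian soft labels; sigma is proportional to the target size."""
    sigma = ratio * np.sqrt(target_h * target_w)  # assumed reading of "proportional"
    yy, xx = np.mgrid[0:map_h, 0:map_w]
    dist2 = (yy - center[0]) ** 2 + (xx - center[1]) ** 2
    return np.exp(-dist2 / (2.0 * sigma ** 2))    # values in (0, 1], peak at the center

Y = gaussian_soft_labels(31, 31, center=(15, 15), target_h=8, target_w=8)
```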

Fig. 6. Overall performance on the OTB-2013 [49] and OTB-2015 [9] datasets using one-pass evaluation (OPE). Our tracker performs well against state-of-the-art methods.

5.2 Overall Performance

We extensively evaluate our approach on five challenging tracking benchmarks. We follow the protocol of the benchmarks for fair comparison with state-of-the-art trackers. For the OTB [9, 49] and Temple128 [50] datasets, we report the results of one-pass evaluation (OPE) with distance precision (DP) and overlap success (OS) plots. The legend of distance precision plots contains the thresholded scores at 20 pixels, while the legend of overlap success plots contains area-under-the-curve (AUC) scores for each tracker. See the complete results on all benchmark datasets in the supplementary document.

OTB Dataset. There are two versions of this dataset. The OTB-2013 [49] dataset contains 50 challenging sequences, and the OTB-2015 [9] dataset extends OTB-2013 with 50 additional video sequences. The sequences cover a wide range of challenges including occlusion, illumination variation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out-of-view, background clutter and low resolution. We fairly compare the proposed DSLT with state-of-the-art trackers, which mainly fall into three categories: (i) one-stage regression trackers including CREST [8], FCNT [6], GOTURN [54] and SiameseFC [45]; (ii) one-stage DCFs trackers including ECO [5], C-COT [4], BACF [14], DeepSRDCF [1], HCFT [3], HDT [2], SRDCF [12], KCF [31] and MUSTer [55]; and (iii) two-stage trackers including MEEM [56], TGPR [57], SINT [58] and CNN-SVM [28]. As shown in Fig. 6, the proposed DSLT achieves the best distance precision (93.4%) and the second best overlap success (68.3%) on OTB-2013. Our DSLT outperforms the state-of-the-art deep regression trackers (CREST [8] and FCNT [6]) by a large margin. We attribute the favorable performance of our DSLT to two reasons. First, the proposed shrinkage loss effectively alleviates the data imbalance issue in regression learning; as a result, DSLT can automatically mine the most discriminative samples and eliminate the distraction caused by easy samples. Second, we exploit the residual connection scheme to fuse multiple convolutional layers, which further facilitates regression learning as multi-level semantics across convolutional layers are fully exploited. Moreover, our DSLT performs favorably against all DCFs trackers such as C-COT, HCFT and DeepSRDCF. Note that ECO achieves the best results by exploiting both deep features and hand-crafted features. On OTB-2015, our DSLT ranks second in both distance precision and overlap success.

Fig. 7. Overall performance on the Temple Color 128 [50] dataset using one-pass evaluation. Our method ranks first in distance precision and second in overlap success.

Fig. 8. Overall performance on the UAV-123 [51] dataset using one-pass evaluation (OPE). The proposed DSLT method ranks first.

Temple Color 128 Dataset. This dataset [50] consists of 128 color video sequences. The evaluation setting of Temple 128 is the same as that of the OTB dataset. In addition to the aforementioned baseline methods, we fairly compare with all the trackers evaluated by the authors of Temple 128, including Struck [24], Frag [59], KCF [31], MEEM [56], MIL [23] and CN2 [47]. Figure 7 shows that the proposed method achieves the best distance precision by a large margin over the ECO, C-COT and CREST trackers. Our method ranks second in terms of overlap success. It is worth mentioning that our regression tracker performs well in tracking small targets, of which Temple-128 contains a large number. Our method achieves the best precision of 80.73%, far better than the state of the art.

UAV123 Dataset. This dataset [51] contains 123 video sequences captured by unmanned aerial vehicles (UAVs). We evaluate the proposed DSLT against several representative methods including ECO [5], SRDCF [12], KCF [31], MUSTer [55], MEEM [56], TGPR [57], SAMF [60], DSST [58], CSK [61], Struck [24], and TLD [62]. Figure 8 shows that the proposed DSLT slightly outperforms ECO in both distance precision and overlap success rate.

Table 1. Overall performance on VOT-2016 in comparison to the top 7 trackers. EAO: Expected average overlap. AR: Accuracy rank. RR: Robustness rank.

VOT-2016 Dataset. The VOT-2016 [10] dataset contains 60 challenging videos, annotated with the following attributes: occlusion, illumination change, motion change, size change, and camera motion. The overall performance is measured by the expected average overlap (EAO), accuracy rank (AR) and robustness rank (RR). The main criterion, EAO, takes into account both the per-frame accuracy and the number of failures. We compare our method with state-of-the-art trackers including ECO [5], C-COT [4], CREST [8], Staple [63], SRDCF [12], DeepSRDCF [1], and MDNet [26]. Table 1 shows that our method performs slightly worse than the top-performing ECO tracker but significantly better than the others, such as the recent C-COT and CREST trackers. The VOT-2016 report [10] suggests a strict state-of-the-art bound of 0.251 under the EAO metric. The proposed DSLT achieves a much higher EAO of 0.3321.

5.3 Ablation Studies

We first analyze the contributions of the loss function and the effectiveness of the residual connection scheme. We then discuss the convergence speed of different losses in regression learning.

Loss Function Analysis. First, we replace the proposed shrinkage loss with the square loss (\(L_2\)) or the focal loss (\(L_3\)) and evaluate the alternative implementations on the OTB-2015 [9] dataset. Overall, the proposed DSLT with shrinkage loss outperforms both the square loss (\(L_2\)) and the focal loss (\(L_3\)) by a large margin. We present qualitative results on two sequences in Fig. 9, where the trackers with the \(L_2\) or \(L_3\) loss both fail to track targets undergoing large appearance changes, whereas the proposed DSLT locates the targets robustly. Figure 10 presents the quantitative results on the OTB-2015 dataset. Note that the baseline tracker with the \(L_2\) loss performs much better than CREST [8] in both distance precision (87.0% vs. 83.8%) and overlap success (64.2% vs. 63.2%). This clearly demonstrates the effectiveness of the convolutional layer connection scheme, which applies residual connections to both convolutional layers and output regression maps rather than only to the output regression maps as CREST does. In addition, we implement an alternative approach using online hard negative mining (OHNM) [26] to completely exclude the loss from easy samples, with the mining threshold empirically set to 0.01. Our DSLT outperforms the OHNM method significantly. Our observation is thus well aligned with [16]: easy samples still contribute to regression learning, but they should not dominate the whole gradient. Moreover, the OHNM method requires a manually set threshold, which hardly applies to all videos.

Fig. 9. Qualitative results on the Biker and Skating1 sequences. The proposed DSLT with shrinkage loss locates the targets more robustly than the \(L_2\) and \(L_3\) losses.

Fig. 10. Ablation studies with different losses and different layer connections on the OTB-2015 [9] dataset.

Fig. 11. Training loss plot (left) and average training iterations per sequence on the OTB-2015 dataset (right). The shrinkage loss converges the fastest and requires the fewest iterations to converge.

Feature Analysis. We further evaluate the effectiveness of convolutional layers. We first remove the connections between convolutional layers. The resulting DSLT_m algorithm resembles CREST. Figure 10 shows that DSLT_m suffers performance drops of around 0.3% (DP) and 0.1% (OS) compared to DSLT. This affirms the importance of fusing features before regression learning. In addition, we fuse \(conv3\_3\) with \(conv4\_3\) or \(conv5\_3\). The inferior performance of DSLT_34 and DSLT_35 shows that semantic abstraction is more important than spatial detail for learning regression networks. As the kernel size of the convolutional regression layer is proportional to the input feature size, we do not evaluate earlier layers for computational efficiency.

Convergence Speed. Figure 11 compares the convergence speed and the required training iterations of different losses on the OTB-2015 dataset [9]. Overall, the training loss with the shrinkage loss descends quickly and stably, so the shrinkage loss requires the fewest iterations to converge during tracking.

6 Conclusion

We revisit one-stage trackers based on deep regression networks and identify data imbalance in learning regression networks as the main bottleneck that impedes one-stage regression trackers from achieving state-of-the-art results, especially compared to DCFs trackers. We propose the novel shrinkage loss to facilitate learning regression networks with better accuracy and faster convergence. To further improve regression learning, we exploit multi-level semantic abstraction of target objects across multiple convolutional layers as features and apply residual connections to both convolutional layers and their output response maps. Our network is fully differentiable and can be trained end-to-end. We succeed in narrowing the performance gap between one-stage deep regression trackers and DCFs trackers. Extensive experiments on five benchmark datasets demonstrate the effectiveness and efficiency of the proposed tracker against state-of-the-art algorithms.