
1 Introduction

Visual object tracking, which automatically locates a specified target in a changing video sequence, is a fundamental problem in many computer vision topics such as visual analysis, autonomous driving and pose estimation. A core problem of tracking is how to detect and locate the object accurately and efficiently in challenging scenarios with occlusion, out-of-view, deformation, background clutter and other variations [38].

Recently, Siamese networks, which follow a tracking-by-similarity-comparison strategy, have drawn great attention in the visual tracking community because of their favorable performance [2, 7, 8, 16, 31, 33, 36, 37]. SINT [31], GOTURN [8], SiamFC [2] and RASNet [36] learn a deep Siamese similarity function a priori and use it in a run-time fixed way. CFNet [33] and DSiam [7] can update the tracking model online via a running average template and a fast transformation learning module, respectively. SiamRPN [16] introduces a region proposal network after the Siamese network, thus formulating tracking as a one-shot local detection task.

Although these tracking approaches obtain a good balance of accuracy and speed, three problems remain to be addressed. Firstly, the features used in most Siamese tracking approaches can only discriminate the foreground from non-semantic background. Semantic backgrounds are always considered as distractors, and the performance cannot be guaranteed when the background is cluttered. Secondly, most Siamese trackers cannot update the model [2, 8, 16, 31, 36]. Although their simplicity and fixed-model nature lead to high speed, these methods lose the ability to update the appearance model online, which is often critical to account for drastic appearance changes in tracking scenarios. Thirdly, recent Siamese trackers employ a local search strategy, which cannot handle full occlusion and out-of-view challenges.

In this paper, we propose to learn Distractor-aware Siamese Region Proposal Networks (DaSiamRPN) for accurate and long-term tracking. SiamFC uses a weighted loss function to alleviate the class imbalance between positive and negative examples. However, it is inefficient, as the training procedure is still dominated by easily classified background examples. In this paper, we identify that the imbalance between non-semantic background and semantic distractors in the training data is the main obstacle to representation learning. As shown in Fig. 1, the response maps of SiamFC cannot distinguish between the people; even the athlete in the white dress gets a high similarity with the target person. High-quality training data is crucial for the success of an end-to-end learned tracker. We conclude that the quality of the representation network heavily depends on the distribution of the training data. In addition to introducing positive pairs from existing large-scale detection datasets, we explicitly generate diverse semantic negative pairs in the training process. To further encourage discrimination, an effective data augmentation strategy customized for visual tracking is developed.

After offline training, the representation network generalizes well to most categories of objects, which makes it possible to track general targets. During inference, classic Siamese trackers only use nearest neighbour search to match the positive templates, which might perform poorly when the target undergoes significant appearance changes or background clutter. In particular, the presence of similar-looking objects (distractors) in the context makes the tracking task more arduous. To address this problem, the surrounding contextual and temporal information can provide additional cues about the target and help to maximize the discriminative ability. In this paper, a novel distractor-aware module is designed, which can effectively transfer the general embedding to the current video domain and incrementally capture the target appearance variations during inference.

Besides, most recent trackers are tailored to the short-term scenario, where the target object is always present. These works have focused exclusively on short sequences of a few tens of seconds, which is poorly representative of practitioners' needs. Beyond the challenging situations in short-term tracking, severe out-of-view and full occlusion introduce extra challenges in long-term tracking. Since conventional Siamese trackers lack discriminative features and adopt a local search region, they are unable to handle these challenges. Benefiting from the learned distractor-aware features in DaSiamRPN, we extend the proposed approach to long-term tracking by introducing a simple yet effective local-to-global search region strategy, which significantly improves the performance of our tracker under out-of-view and full occlusion challenges.

We validate the effectiveness of the proposed DaSiamRPN framework on extensive short-term and long-term tracking benchmarks: VOT2016 [14], VOT2017 [12], OTB2015 [38], UAV20L and UAV123 [22]. On the short-term VOT2016 dataset, DaSiamRPN achieves a 9.6% relative gain in Expected Average Overlap compared to the top-ranked method ECO [3]. On the long-term UAV20L dataset, DaSiamRPN obtains 61.7% in Area Under Curve, which outperforms the current best-performing tracker by a relative 35.9%. Besides the favorable performance, our tracker runs at far beyond real-time speed: 160 FPS on short-term datasets and 110 FPS on long-term datasets. All these consistent improvements demonstrate that the proposed approach establishes a new state-of-the-art in visual tracking.

1.1 Contributions

The contributions of this paper can be summarized in three folds as follows:

1. The features used in conventional Siamese trackers are analyzed in detail, and we find that the imbalance between non-semantic background and semantic distractors in the training data is the main obstacle to representation learning.

2. We propose a novel Distractor-aware Siamese Region Proposal Networks (DaSiamRPN) framework to learn distractor-aware features during offline training, and to explicitly suppress distractors during the inference of online tracking.

3. We extend DaSiamRPN to perform long-term tracking by introducing a simple yet effective local-to-global search region strategy, which significantly improves the performance of our tracker under out-of-view and full occlusion challenges. In comprehensive experiments on short-term and long-term visual tracking benchmarks, the proposed DaSiamRPN framework obtains state-of-the-art accuracy while performing at far beyond real-time speed.

2 Related Work

Siamese Networks Based Tracking.

Siamese trackers follow a tracking-by-similarity-comparison strategy. The pioneering work is SINT [31], which simply searches for the candidate most similar to the exemplar given in the starting frame, using a deep Siamese similarity function that is learned a priori and fixed at run time. As a follow-up work, Bertinetto et al. [2] propose a fully convolutional Siamese network (SiamFC) to estimate the region-wise feature similarity between two frames. RASNet [36] advances this similarity metric by learning an attention mechanism with a Residual Attentional Network. Different from SiamFC and RASNet, the GOTURN tracker [8] predicts the motion between successive frames using a deep regression network. These three trackers are able to perform at 86 FPS, 83 FPS and 100 FPS respectively on a GPU because no fine-tuning is performed online. CFNet [33] interprets the correlation filter as a differentiable layer in a Siamese tracking framework, thus achieving end-to-end representation learning, but the performance improvement is limited compared with SiamFC. FlowTrack [40] exploits motion information in the Siamese architecture to improve the feature representation and tracking accuracy. It is worth noting that CFNet and FlowTrack can efficiently update the tracking model online. Recently, SiamRPN [16] formulates tracking as a one-shot local detection task by introducing a region proposal network after a Siamese network, which is trained end-to-end offline with large-scale image pairs.

Features for Tracking.

Visual features play a significant role in computer vision tasks, including visual tracking. Possegger et al. [26] propose a distractor-aware model term to suppress visually distracting regions, but the color histogram features used in their framework are less robust than deep features. DLT [35] is the seminal deep learning tracker, which uses a multi-layer autoencoder network whose features are pretrained on part of the 80M Tiny Images dataset [32] in an unsupervised fashion. Wang et al. [34] learn a two-layer neural network on a video repository, where temporal slowness constraints are imposed for feature learning. DeepTrack [17] learns two-layer CNN classifiers from binary samples and does not require a pre-training procedure. UCT [39] formulates feature learning and the tracking process into a unified framework, so that the learned features are tightly coupled to the tracking process.

Long-Term Tracking.

Traditional long-term tracking frameworks can be divided into two groups: earlier methods regard tracking as matching local keypoint descriptors with a geometric model [21, 24, 25], while recent approaches perform long-term tracking by combining a short-term tracker with a detector. The seminal work of the latter category is TLD [10], which runs a memory-less flock-of-flows short-term tracker and a template-based detector in parallel. Ma et al. [20] propose a combination of the KCF tracker and a random ferns classifier as a detector that is used to correct the tracker. Similarly, MUSTer [9] is a long-term tracking framework that combines the KCF tracker with a SIFT-based detector, which is also used to detect occlusions. Fan and Ling [6] combine a DSST tracker [4] with a CNN detector [31] that verifies and potentially corrects the proposals of the short-term tracker.

Fig. 1. Visualization of the response heatmaps of Siamese network trackers. (a) shows the search images. (b)–(e) show the heatmaps produced by SiamFC, SiamRPN, SiamRPN+ (trained with distractors) and DaSiamRPN, respectively.

3 Distractor-Aware Siamese Networks

3.1 Features and Drawbacks in Traditional Siamese Networks

Before the detailed discussion of our proposed framework, we first revisit the features of conventional Siamese network based tracking [2, 16]. Siamese trackers use metric learning at their core. The goal is to learn an embedding space that can maximize the interclass inertia between different objects and minimize the intraclass inertia for the same object. The key contribution leading to the popularity and success of Siamese trackers is their balanced accuracy and speed.

Figure 1 visualizes the response maps of SiamFC and SiamRPN. It can be seen that background objects quite different from the target also achieve high scores, and even some extraneous objects get high scores. The representations obtained in SiamFC mainly serve the discriminative learning of the categories in the training data. In SiamFC and SiamRPN, pairs of training data come from different frames of the same video, and in each search area the non-semantic background occupies the majority while semantic entities and distractors occupy less. This imbalanced distribution makes it hard for the trained model to learn instance-level representations; instead, it tends to learn the differences between foreground and background.

During inference, nearest neighbour search is used to find the most similar object in the search region, while the background information labelled in the first frame is ignored. However, the background information in the tracking sequences can be effectively utilized to increase the discriminative capability, as shown in Fig. 1(e).

To address these issues, we propose to actively generate more semantic pairs in the offline training process and to explicitly suppress distractors during online tracking.

3.2 Distractor-Aware Training

High-quality training data is crucial for the success of end-to-end representation learning in visual tracking. We introduce a series of strategies to improve the generalization of the learned features and to eliminate the imbalanced distribution of the training data.

Fig. 2. (a) Positive pairs generated from detection datasets by augmenting still images. (b) Negative pairs from the same category. (c) Negative pairs from different categories.

Diverse Categories of Positive Pairs Can Promote the Generalization Ability.

The original SiamFC is trained on the ILSVRC video detection dataset, which consists of only about 4,000 videos annotated frame-by-frame [28]. Recently, SiamRPN [16] explores the use of the sparsely labelled Youtube-BB [27] videos, which consist of more than 200,000 videos annotated once every 30 frames. In these two methods, target pairs of training data come from different frames of the same video. However, these video detection datasets contain only a few categories (20 for VID [28], 30 for Youtube-BB [27]), which is not sufficient to train high-quality, generalized features for Siamese tracking. Besides, the bounding box regression branch of SiamRPN may produce inferior predictions when encountering new categories. Since labelling videos is time-consuming and expensive, in this paper we greatly expand the categories of positive pairs by introducing the large-scale ImageNet Detection [28] and COCO Detection [18] datasets. As shown in Fig. 2(a), through augmentation techniques (translation, resizing, grayscale conversion, etc.), still images from detection datasets can be used to generate image pairs for training; a minimal sketch is given below. The diversity of positive pairs improves the tracker's discriminative ability and regression accuracy.
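The following is a hedged sketch of how such a pair could be generated from a still image. The helper name and crop geometry are our own assumptions; the translation range (12 pixels), resize range (0.85–1.15) and grayscale ratio (25%) follow the training settings reported in Sect. 4.1.

```python
import random
from PIL import Image, ImageOps

def make_positive_pair(image, box, exemplar_size=127, search_size=255):
    """Crop two differently augmented patches around the same annotated box,
    so a single still image yields an (exemplar, search) training pair."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2

    def crop(out_size, max_shift, scale_range):
        # random translation and resize around the object centre
        x = cx + random.uniform(-max_shift, max_shift)
        y = cy + random.uniform(-max_shift, max_shift)
        s = out_size * random.uniform(*scale_range)
        patch = image.crop((int(x - s / 2), int(y - s / 2),
                            int(x + s / 2), int(y + s / 2)))
        return patch.resize((out_size, out_size), Image.BILINEAR)

    exemplar = crop(exemplar_size, max_shift=0, scale_range=(1.0, 1.0))
    # translation within 12 pixels, resize in [0.85, 1.15] (Sect. 4.1)
    search = crop(search_size, max_shift=12, scale_range=(0.85, 1.15))
    if random.random() < 0.25:  # 25% of pairs converted to grayscale
        search = ImageOps.grayscale(search).convert('RGB')
    return exemplar, search
```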

Semantic Negative Pairs Can Improve the Discriminative Ability.

We attribute the less discriminative representation in SiamFC [2] and SiamRPN [16] to two levels of imbalance in the training data distribution. The first imbalance is the rarity of semantic negative pairs. Since the background occupies the majority of the training data in SiamFC and SiamRPN, most negative samples are non-semantic (not real objects, just background) and can be easily classified. That is to say, SiamFC and SiamRPN learn the differences between foreground and background, and the losses between semantic objects are overwhelmed by the vast number of easy negatives. The other imbalance comes from intraclass distractors, which usually act as hard negative samples in the tracking process. In this paper, semantic negative pairs are added to the training process. The constructed negative pairs consist of labelled targets both from the same category and from different categories. Negative pairs from different categories help the tracker avoid drifting to arbitrary objects under challenges such as out-of-view and full occlusion, while negative pairs from the same category make the tracker focus on fine-grained representations. Negative examples are shown in Fig. 2(b) and (c).

Customizing Effective Data Augmentation for Visual Tracking.

To unleash the full potential of the Siamese network, we customize several data augmentation strategies for training. Besides the common translation, scale variation and illumination changes, we observe that the motion pattern can be easily modeled by the shallow layers of the network, so we explicitly introduce motion blur into the data augmentation; a sketch follows.
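One common way to synthesize motion blur, sketched below, is to convolve the patch with a randomly oriented linear kernel; the kernel length range is an assumed value, not a setting reported in the paper.

```python
import numpy as np
import cv2

def motion_blur(patch, max_len=15):
    """Apply a randomly oriented linear blur kernel to an image patch."""
    length = np.random.randint(3, max_len)
    angle = np.random.uniform(0.0, 180.0)
    kernel = np.zeros((length, length), dtype=np.float32)
    kernel[length // 2, :] = 1.0  # horizontal line through the centre
    rot = cv2.getRotationMatrix2D(((length - 1) / 2, (length - 1) / 2),
                                  angle, 1.0)
    kernel = cv2.warpAffine(kernel, rot, (length, length))
    kernel /= kernel.sum()        # keep overall brightness unchanged
    return cv2.filter2D(patch, -1, kernel)
```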

Fig. 3. Illustration of the proposed Distractor-aware Siamese Region Proposal Networks (DaSiamRPN). The target and background information are fully utilized in DaSiamRPN, which can suppress the influence of distractors during tracking.

3.3 Distractor-Aware Incremental Learning

The training strategy in the last subsection can significantly improve the discriminative power learned in the offline training process. However, it is still hard to distinguish two objects with similar attributes, as in Fig. 3(a). SiamFC and SiamRPN use a cosine window to suppress distractors, so their performance is not guaranteed when the motion of objects is erratic. Most existing Siamese network based approaches deliver inferior performance when encountering fast motion or background clutter. In summary, the potential flaw is mainly due to the misalignment between the general representation domain and the specific target domain. In this section, we propose a distractor-aware module to effectively transfer the general representation to the video domain.

The Siamese tracker learns a similarity metric \(f(z, x)\) to compare an exemplar image z to a candidate image x in the embedding space \(\varphi \):

$$\begin{aligned} f(z, x) = \varphi (z) \star \varphi (x) + b \cdot \mathbb {1} \end{aligned}$$
(1)

where \(\star \) denotes the cross correlation between two feature maps and \(b \cdot \mathbb {1}\) denotes a bias which takes the same value at every location. The most similar object to the exemplar is selected as the target.
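In a Siamese framework this cross correlation reduces to a single convolution in which the exemplar embedding acts as the kernel. A minimal PyTorch sketch follows; the feature shapes are our assumptions, chosen to match the usual SiamFC sizes.

```python
import torch
import torch.nn.functional as F

def similarity(z_feat, x_feat, b):
    """Eq. (1): f(z, x) = phi(z) * phi(x) + b.
    z_feat: exemplar embedding phi(z), e.g. shape (1, C, 6, 6)
    x_feat: search embedding  phi(x), e.g. shape (1, C, 22, 22)"""
    response = F.conv2d(x_feat, z_feat)  # cross correlation -> (1, 1, 17, 17)
    return response + b                  # bias, equal at every location

# usage with dummy features
z, x = torch.randn(1, 256, 6, 6), torch.randn(1, 256, 22, 22)
print(similarity(z, x, 0.0).shape)       # torch.Size([1, 1, 17, 17])
```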

To make full use of the label information, we integrate the hard negative samples (distractors) in the context of the target into the similarity metric. In DaSiamRPN, Non-Maximum Suppression (NMS) is adopted to select the potential distractors \(d_i\) in each frame, and we then collect a distractor set \({\mathcal {D}}:=\{d_i \mid f(z,d_i) > h,\, d_i \ne z_t\}\), where h is a predefined threshold, \(z_t\) is the selected target in frame t, and the cardinality of this set is \(|{\mathcal {D}}|=n\). Specifically, we first obtain \(17\times 17\times 5\) proposals in each frame and then use NMS to reduce redundant candidates. The proposal with the highest score is selected as the target \(z_t\); among the remaining proposals, those with scores greater than the threshold h are selected as distractors. A hedged sketch of this selection step is given below.
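The sketch below uses torchvision's NMS; the IoU threshold and h are illustrative placeholders rather than the paper's tuned values.

```python
import torch
from torchvision.ops import nms

def select_target_and_distractors(boxes, scores, iou_thr=0.6, h=0.5):
    """boxes: (N, 4) proposals from the 17x17x5 anchors; scores: (N,).
    Returns the index of the target z_t and the distractor indices d_i."""
    keep = nms(boxes, scores, iou_thr)  # survivors, sorted by descending score
    target = keep[0]                    # highest-scoring proposal -> z_t
    distractors = [i for i in keep[1:] if scores[i] > h]
    return target, distractors
```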

After that, we introduce a novel distractor-aware objective function to re-rank the proposals \({\mathcal {P}}\) that have the top-k similarities with the exemplar. The final selected object is denoted as q:

$$\begin{aligned} q = \underset{p_k \in {\mathcal {P}}}{argmax} ~~f(z,p_k)- \frac{{\hat{\alpha }}\sum _{i=1}^{n} \alpha _i f(d_i,p_k)}{\sum _{i=1}^{n}\alpha _i } \end{aligned}$$
(2)

where the weight factor \({\hat{\alpha }}\) controls the influence of the distractor learning, and the weight factor \(\alpha _i\) controls the influence of each distractor \(d_i\). It is worth noting that direct computation would increase the computational complexity and memory usage by a factor of n. Since the cross correlation in Eq. (1) is a linear operator, we utilize this property to speed up the distractor-aware objective:

$$\begin{aligned} q = \underset{p_k \in {\mathcal {P}}}{argmax} ~~ (\varphi (z) - \frac{{\hat{\alpha }}\sum _{i=1}^{n} \alpha _i \varphi (d_i)}{\sum _{i=1}^{n}\alpha _i} )\star \varphi (p_k) \end{aligned}$$
(3)

This enables the tracker to run at a speed comparable to SiamRPN. The associative law also inspires us to incrementally learn the target template and distractor templates with a learning rate \(\beta _t\):

$$\begin{aligned} q_{T+1} = \underset{p_k \in {\mathcal {P}}}{argmax} ~~ (\frac{\sum _{t=1}^{T} \beta _t \varphi (z_t)}{\sum _{t=1}^{T} \beta _t} - \frac{\sum _{t=1}^{T}\beta _t{\hat{\alpha }}\sum _{i=1}^{n} \alpha _i \varphi (d_{i,t})}{\sum _{t=1}^{T} \beta _t\sum _{i=1}^{n}\alpha _i} )\star \varphi (p_k) \end{aligned}$$
(4)

This distractor-aware tracker adapts the existing general similarity metric to a similarity metric for the new, specific domain. The weight factors \(\alpha _i\) can be viewed as dual variables with sparse regularization, and the exemplars and distractors can be viewed as positive and negative samples in correlation filters. In effect, an online classifier is modeled in our framework, which is therefore expected to perform better than trackers that only use the general similarity metric. A minimal sketch of Eqs. (3) and (4) follows.
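The sketch below implements Eqs. (3) and (4) under our own bookkeeping assumptions (class name, running sums, feature shapes); because cross correlation is linear, the weighted distractor sum is folded into a single adjusted template, so re-ranking costs one correlation per proposal regardless of n.

```python
import torch
import torch.nn.functional as F

class DistractorAwareReranker:
    """Hedged sketch of Eqs. (3)-(4); variable names follow the paper."""

    def __init__(self, alpha_hat=0.5):
        self.alpha_hat = alpha_hat
        self.z_sum = None    # running sum of beta_t * phi(z_t)
        self.d_sum = None    # running sum of beta_t * weighted distractor mean
        self.beta_sum = 0.0  # running sum of beta_t

    def update(self, z_feat, distractor_feats, alphas, beta_t):
        """Accumulate the frame-t target and distractor templates."""
        if distractor_feats:
            d_mean = sum(a * d for a, d in zip(alphas, distractor_feats))
            d_mean = d_mean / sum(alphas)
        else:
            d_mean = torch.zeros_like(z_feat)  # no distractors this frame
        if self.z_sum is None:
            self.z_sum, self.d_sum = beta_t * z_feat, beta_t * d_mean
        else:
            self.z_sum = self.z_sum + beta_t * z_feat
            self.d_sum = self.d_sum + beta_t * d_mean
        self.beta_sum += beta_t

    def rerank(self, proposal_feats):
        """Score each proposal phi(p_k) against the adjusted template."""
        template = (self.z_sum - self.alpha_hat * self.d_sum) / self.beta_sum
        scores = [F.conv2d(p, template).max().item() for p in proposal_feats]
        return max(range(len(scores)), key=scores.__getitem__)  # argmax over k
```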

Fig. 4. Tracking results on the video person7 under the out-of-view challenge. First row: tracking snapshots of SiamRPN and DaSiamRPN. Second row: detection scores and corresponding overlaps of the two methods. The overlaps are defined as the intersection-over-union (IoU) between tracking results and ground truth. (Color figure online)

3.4 DaSiamRPN for Long-Term Tracking

In this section, the DaSiamRPN framework is extended to long-term tracking. Besides the challenging situations in short-term tracking, severe out-of-view and full occlusion introduce extra challenges in long-term tracking, as shown in Fig. 4. The search region used in short-term tracking (SiamRPN) cannot cover the target when it reappears, so tracking fails in the following frames. We propose a simple yet effective method to switch between the short-term tracking phase and failure cases. In failure cases, an iterative local-to-global search strategy is designed to re-detect the target.

In order to perform the switches, we need to identify the beginning and the end of failed tracking. Since distractor-aware training and inference yield high-quality detection scores, these scores can be adopted to indicate the quality of the tracking results. Figure 4 shows the detection scores and corresponding tracking overlaps of SiamRPN and DaSiamRPN. The detection scores of SiamRPN are not indicative: they can remain high even under out-of-view and full occlusion. That is to say, SiamRPN tends to latch onto arbitrary objects under these challenges, which causes drift in tracking. In DaSiamRPN, the detection scores successfully indicate the status of the tracking phase.

During failure cases, we gradually increase the search region using the local-to-global strategy. Specifically, the size of the search region iteratively grows by a constant step when failed tracking is indicated. As shown in Fig. 4, the local-to-global search region covers the target and recovers normal tracking. It is worth noting that our tracker employs bounding box regression to detect the target, so the time-consuming image pyramid strategy can be discarded. In experiments, the proposed DaSiamRPN performs at 110 FPS on the long-term tracking benchmark. A sketch of the switching logic is given below.
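The following is a minimal sketch of the switch; the search sizes (255 and 767) and score thresholds (0.8 and 0.95) follow Sect. 4.1, and jumping directly to the global size reflects our finding there that a single local-to-global step suffices.

```python
def next_search_size(score, size, short_size=255, global_size=767,
                     enter_thr=0.8, leave_thr=0.95):
    """Switch between the short-term phase and the failure case.
    Below enter_thr the tracker enters a failure case and the search region
    grows to the global size; above leave_thr it returns to local search.
    Scores in between keep the current size (hysteresis band)."""
    if score < enter_thr:      # failed tracking indicated
        return global_size     # one local-to-global step suffices (Sect. 4.1)
    if score > leave_thr:      # confident re-detection
        return short_size
    return size
```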

4 Experiments

Experiments are performed on extensive challenging tracking datasets, including VOT2015 [13], VOT2016 [14] and VOT2017 [12], each with 60 videos, UAV20L [22] with 20 long-term videos, UAV123 [22] with 123 videos and OTB2015 [38] with 100 videos. All the tracking results are provided by official implementations to ensure a fair comparison.

Fig. 5. Expected average overlap plots for VOT2016 (a) and VOT2017 (b).

4.1 Experimental Details

The modified AlexNet [15] pretrained on ImageNet [28] is used, as described in SiamRPN [16]. The parameters of the first three convolution layers are fixed and only the last two convolution layers are fine-tuned. A total of 50 epochs are performed, and the learning rate is decreased in log space from \(10^{-2}\) to \(10^{-4}\); a one-line sketch is given below. We extract image pairs from VID [28] and Youtube-BB [27] by choosing frames with an interval of less than 100 and performing the crop procedure described in Sect. 3.2. From the ImageNet Detection [28] and COCO Detection [18] datasets, image pairs are generated for training by augmenting still images. To handle the gray videos in the benchmarks, 25% of the pairs are converted to grayscale during training. Translation is randomly performed within 12 pixels, and the random resize factor varies from 0.85 to 1.15.
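The stated schedule can be reproduced in one line (a sketch, not the authors' code):

```python
import numpy as np

# 50 epochs, learning rate decayed in log space from 1e-2 to 1e-4
learning_rates = np.logspace(-2, -4, num=50)
```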

During the inference phase, the distractor factor \({\hat{\alpha }}\) in Eq. (2) is set to 0.5, \(\alpha _i\) is set to 1 for each distractor, and the incremental learning factor \(\beta _t\) in Eq. (4) is set to \(\sum _{i=0}^{t-1}(\frac{\eta }{1-\eta })^i\), where \(\eta =0.01\). In long-term tracking, we find that one iteration step of local-to-global search is sufficient. Specifically, the sizes of the search region in the short-term phase and in failure cases are set to 255 and 767, respectively. The thresholds to enter and leave failure cases are set to 0.8 and 0.95. Our experiments are implemented using PyTorch on a PC with an Intel i7 CPU, 48 GB RAM and an NVIDIA TITAN X GPU. The proposed tracker performs at 160 FPS on short-term benchmarks and 110 FPS on long-term benchmarks.
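For concreteness, a short sketch of the incremental learning factor: \(\beta _t\) is a geometric series in \(\eta /(1-\eta )\), so with \(\eta =0.01\) the normalized per-frame weights \(\beta _t/\sum \beta _t\) in Eq. (4) are nearly uniform, with later frames weighted only slightly higher (our reading of the formula).

```python
eta = 0.01
r = eta / (1 - eta)

def beta(t):
    """beta_t = sum_{i=0}^{t-1} (eta / (1 - eta))**i, in closed form."""
    return (1.0 - r ** t) / (1.0 - r)

betas = [beta(t) for t in range(1, 6)]
print([b / sum(betas) for b in betas])  # near-uniform normalized weights
```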

4.2 State-of-the-Art Comparisons on VOT Datasets

In this section, the latest version of the Visual Object Tracking toolkit (vot2017-challenge) is used. The toolkit applies a reset-based methodology: whenever a failure (zero overlap with the ground truth) is detected, the tracker is re-initialized five frames after the failure. The performance is measured in terms of accuracy (A), robustness (R) and expected average overlap (EAO). In addition, VOT2017 also introduces a real-time experiment. We report all these metrics against a number of the latest state-of-the-art trackers on VOT2015, VOT2016 and VOT2017.

The EAO curve evaluated on VOT2016 is presented in Fig. 5(a), where 70 other state-of-the-art trackers are compared. The EAO of our baseline tracker SiamRPN on VOT2016 is 0.3441, which already outperforms most of the state-of-the-art, although there is still a gap to the top-ranked tracker ECO (0.375), which applies continuous convolution operators on multi-level feature maps. Most remarkably, the proposed DaSiamRPN obtains an EAO of 0.411, outperforming the state-of-the-art by a relative 9.6%. Furthermore, our tracker runs at 160 FPS, which is \(500\times \) faster than C-COT and \(20\times \) faster than ECO.

For the evaluation on VOT2017, Fig. 5(b) reports our results against 51 other state-of-the-art trackers with respect to the EAO score. DaSiamRPN ranks first with an EAO score of 0.326. Among the top five trackers, CFWCR, CFCF, ECO and Gnet apply continuous convolution operators as their baseline approach. The top performer LSART [30] decomposes the target into patches and applies a weighted combination of patch-wise similarities in a kernelized ridge regression. In contrast, our method is conceptually much simpler, yet powerful and easy to follow.

Figure 5(b) also shows the EAO values of the real-time experiment, denoted by red points. Our tracker is clearly the top performer, with a real-time EAO of 0.326, outperforming the latest state-of-the-art real-time tracker CSRDCF++ by a relative 53.8%.

Table 1 shows the accuracy (A), robustness (R) and expected average overlap (EAO) on VOT2015, VOT2016 and VOT2017. The baseline approach SiamRPN can process an astounding 200 frames per second while still obtaining performance comparable with the state-of-the-art. We find that the performance gains of SiamRPN are mainly due to its accurate multi-anchor regression mechanism. The proposed distractor-aware module improves the robustness, making our tracker well balanced between accuracy and robustness. As a result, our approach, with EAOs of 0.446, 0.411 and 0.326 on the three benchmarks, outperforms all existing trackers by a large margin. We believe these consistent improvements demonstrate that our approach makes real contributions through both the training process and the online inference.

Table 1. Performance comparisons on public short-term benchmarks. OP: mean overlap precision at the threshold of 0.5; DP: mean distance precision at 20 pixels; EAO: expected average overlap; and mean speed (FPS). The best and the second best performance are highlighted.

4.3 State-of-the-Art Comparisons on UAV Datasets

The UAV [22] videos are captured from low-altitude unmanned aerial vehicles. The dataset contains a long-term evaluation subset, UAV20L, and a short-term evaluation subset, UAV123. The evaluation is based on two metrics: the precision plot and the success plot.

Fig. 6. Success and precision plots on the UAV [22] dataset. The first and second sub-figures show results on UAV20L; the third and fourth show results on UAV123.

Results on UAV20L.

UAV20L is a long-term tracking benchmark that contains 20 sequences with an average sequence length of 2934 frames. Besides the challenging situations in short-term tracking, severe out-of-view and full occlusion introduce extra challenges. In this experiment, the proposed method is compared against the recent trackers in [22]. In addition, ECO [3] (a state-of-the-art short-term tracker), PTAV [6] (a state-of-the-art long-term tracker), SiamRPN [16] (the baseline), and SiamFC [2] and CFNet [33] (representative Siamese trackers) are added for comparison.

The results, including success plots and precision plots, are illustrated in Fig. 6. They clearly show that our algorithm, denoted DaSiamRPN, outperforms the state-of-the-art trackers significantly in both measures. In the success plot, our approach obtains an AUC score of 0.617, significantly outperforming the state-of-the-art short-term trackers SiamRPN [16] and ECO [3] by a relative 35.9% and 41.8%, respectively. Compared with PTAV [6], MUSTer [9] and TLD [10], which are qualified to perform long-term tracking, the proposed DaSiamRPN outperforms these trackers by a relative 45.8%, 87.5% and 213.2%. In the precision plot, our approach obtains a score of 0.838, outperforming the state-of-the-art long-term tracker (PTAV [6]) and short-term tracker (SiamRPN [16]) by a relative 34.3% and 35.8%, respectively. The excellent performance of DaSiamRPN on this long-term tracking dataset can be attributed to the distractor-aware features and the local-to-global search strategy.

For a detailed performance analysis, we also report results on various challenge attributes in UAV20L, i.e. full occlusion, out-of-view, background clutter and partial occlusion. Figure 7 demonstrates that our tracker effectively handles these challenging situations while other trackers obtain lower scores. In particular, on the full occlusion and background clutter attributes, the proposed DaSiamRPN outperforms SiamRPN [16] by a relative 153.1% and 393.2%.

Fig. 7. Success plots with attributes on UAV20L. Best viewed on a color display.

Results on UAV123.

The UAV123 dataset includes 123 sequences with an average sequence length of 915 frames. Besides the recent trackers in [22], ECO [3], PTAV [6], SiamRPN [16], SiamFC [2] and CFNet [33] are added for comparison. Figure 6 illustrates the precision and success plots of the compared trackers. The proposed DaSiamRPN outperforms all the other trackers in terms of success and precision scores. Specifically, our method achieves a success score of 0.586, which outperforms SiamRPN (0.527) and ECO (0.525) by a large margin.

4.4 State-of-the-Art Comparisons on OTB Datasets

We evaluate the proposed algorithm against numerous fast and state-of-the-art trackers including SiamFC [2], CFNet [33], Staple [1], CSRDCF [19], BACF [11], ECO-HC [3], CREST [29], MDNet [23], C-COT [5], ECO [3], and the baseline tracker SiamRPN [16]. All the trackers are initialized with the ground-truth object state in the first frame. Mean overlap precision (OP) and mean distance precision (DP) are reported in Table 1.

Among the real-time trackers, SiamFC and CFNet are the latest Siamese network based trackers, but their accuracy still lags far behind the state-of-the-art BACF and ECO-HC with HOG features. The proposed DaSiamRPN tracker outperforms all these trackers by a large margin in both accuracy and speed.

For the state-of-the-art comparisons on OTB, MDNet, trained on visual tracking datasets, performs best among the other trackers at a speed of 1 FPS. C-COT and ECO achieve state-of-the-art performance, but their tracking speeds are not fast enough for real-time applications. The baseline tracker SiamRPN obtains an OP score of \(81.9\%\), which is slightly less accurate than C-COT. The bottleneck of SiamRPN is its inferior robustness. Since the distractor-aware mechanisms in both training and inference focus on improving the robustness, the proposed DaSiamRPN tracker achieves a \(3.0\%\) improvement in DP and the best OP score of \(86.5\%\) on OTB2015.

4.5 Ablation Analyses

To verify the contribution of each component of our algorithm, we implement and evaluate four variations of our approach. The analysis reports EAO on VOT2016 [14] and AUC on UAV20L [22].

As shown in Table 2, SiamRPN is our baseline algorithm. On VOT2016, the EAO criterion increases from 0.344 to 0.368 when detection data is added in training. Similarly, when negative pairs and distractor-aware learning are adopted in training and inference, each increases the performance by nearly 2%. On UAV20L, detection data, negative pairs in training and distractor-aware inference each improve the performance by 1%–2%. The AUC criterion increases from 49.8% to 61.7% when the long-term tracking module is adopted.

Table 2. Ablation analyses of our algorithm on VOT2016 [14] and UAV20L [22]

5 Conclusions

In this paper, we propose a distractor-aware Siamese framework for accurate and long-term tracking. During offline training, a distractor-aware feature learning scheme is proposed, which significantly boosts the discriminative power of the networks. During inference, a novel distractor-aware module is designed, effectively transferring the general embedding to the current video domain. In addition, we extend the proposed approach to long-term tracking by introducing a simple yet effective local-to-global search strategy. The proposed tracker obtains state-of-the-art accuracy in comprehensive experiments on short-term and long-term visual tracking benchmarks, while the overall system still runs at far beyond real-time speed.