1 Introduction

Visual object tracking has consistently been a popular research area over the last two decades. The popularity has been propelled by the significant research challenges tracking offers as well as by the industrial potential of tracking-based applications. Several initiatives have been established to promote tracking, such as PETS [95], CAVIAR, i-LIDS, ETISEO, CDC [25], CVBASE, FERET [67], LTDT, MOTC [44, 76] and Videonet, and since 2013 short-term single-target visual object tracking has received a strong push toward performance evaluation standardisation from the VOT initiative. The primary goal of VOT is to establish datasets, evaluation measures and toolkits, as well as to create a platform for discussing evaluation-related issues through the organization of tracking challenges. Since 2013, five challenges have taken place in conjunction with ICCV2013 (VOT2013 [41]), ECCV2014 (VOT2014 [42]), ICCV2015 (VOT2015 [40]), ECCV2016 (VOT2016 [39]) and ICCV2017 (VOT2017 [38]).

This paper presents the VOT2018 challenge, organized in conjunction with the ECCV2018 Visual Object Tracking Workshop, and the results obtained. The VOT2018 challenge addresses two classes of trackers. The first class has been considered in the past five challenges: single-camera, single-target, model-free, causal trackers, applied to short-term tracking. The model-free property means that the only training information provided is the bounding box in the first frame. Short-term tracking means that trackers are assumed not to be capable of performing successful re-detection after the target is lost; they are therefore reset after such an event. Causality requires that the tracker does not use any future frames, or frames prior to re-initialization, to infer the object position in the current frame. The second class of trackers is introduced this year in the first VOT long-term sub-challenge. This sub-challenge considers single-camera, single-target, model-free long-term trackers. Long-term tracking means that the trackers are required to perform re-detection after the target has been lost and are therefore not reset after such an event. In the following, we overview the most closely related works and point out the contributions of VOT2018.

1.1 Related Work in Short-Term Tracking

A large amount of research has been invested in benchmarking and performance evaluation of short-term visual object tracking [38,39,40,41,42,43, 47, 51, 61, 62, 75, 83, 92, 96, 101]. The currently most widely-used methodologies have been popularized by two benchmark papers: “Online Tracking Benchmark” (OTB) [92] and “Visual Object Tracking challenge” (VOT) [41]. The methodologies differ in the evaluation protocols as well as the performance measures.

The OTB-based evaluation approaches initialize the tracker in the first frame and let it run until the end of the sequence. The benefit of this protocol is its implementation simplicity. However, after the initial failure the target predictions become irrelevant to the tracking accuracy of short-term trackers, which introduces variance and bias into the results [43]. The VOT evaluation approach addresses this issue by resetting the tracker after each failure.

All recent performance evaluation protocols measure tracking accuracy primarily by the intersection over union (IoU) between the ground truth and tracker-predicted bounding boxes. A legacy center-based measure, initially promoted by Babenko et al. [3] and later adopted by [90], is still often used, but it is theoretically brittle and inferior to the overlap-based measure [83]. In the no-reset protocols the overall performance is summarized by the average IoU over the dataset (i.e., average overlap) [83, 90]. In the VOT reset-based protocols, two measures are used to probe the performance: (i) accuracy and (ii) robustness. These measure the overlap during successful tracking periods and the number of times the tracker fails, respectively. Since 2015, the primary VOT measure is the expected average overlap (EAO) – a principled combination of accuracy and robustness. The VOT reports the so-called state-of-the-art bound (SotA bound) in all its annual challenges. Any tracker exceeding the SotA bound is considered state-of-the-art by the VOT standard. This bound was introduced to counter the trend of considering as state-of-the-art only those trackers that rank first on benchmarks. The SotA bound was intended to remove the need for fine-tuning to benchmarks and to encourage community-wide exploration of a wider spectrum of trackers, not necessarily aimed at the top rank.
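To make the overlap notion concrete, the following is a minimal sketch (not the VOT toolkit code) of the IoU between two axis-aligned boxes given in (x, y, width, height) format; note that VOT ground truth is annotated with rotated bounding boxes, for which the toolkit evaluates the overlap on the polygons, so this sketch is only illustrative.

```python
# Minimal illustrative sketch: IoU for axis-aligned (x, y, width, height) boxes.
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```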

Tracking speed was recognized as an important factor in VOT2014 [42]. Initially, speed was measured in terms of equivalent filtering operations [42] to reduce the influence of varying hardware. This measure was abandoned due to its limited normalization capability and the fact that speed often varies considerably during tracking. Since VOT2017 [38], speed aspects have been measured by a protocol that requires real-time processing of incoming frames.

Most tracking datasets [47, 51, 61, 75, 92] have partially followed the computer vision trend of increasing the number of sequences. But quantity does not necessarily reflect diversity or richness in attributes. Over the years, the VOT [38,39,40,41,42,43] has developed a methodology for constructing moderately large, challenging datasets from a large pool of sequences. Through annual discussions at VOT workshops, the community expressed a request for evaluating trackers on a sequestered dataset. In response, the VOT2017 challenge introduced a sequestered dataset evaluation for winner identification in the main short-term challenge. In 2015, VOT introduced a sub-challenge for evaluating short-term trackers on thermal and infra-red sequences (VOT-TIR2015), with a dataset specially designed for that purpose [21]. Recently, datasets focusing on various short-term tracking aspects have been introduced. The UAV123 [61] and [101] datasets address tracking from drones. Lin et al. [94] proposed a dataset for tracking faces with mobile phones. Galoogahi et al. [22] introduced a high-frame-rate dataset to analyze trade-offs between tracker speed and robustness. Čehovin et al. [96] proposed a dataset with active camera view control using omnidirectional videos. Mueller et al. [62] recently re-annotated selected sequences from YouTube bounding boxes [69] to consider tracking in the wild. Despite significant activity in dataset construction, the VOT dataset remains unique for its carefully chosen and curated sequences, guaranteeing a relatively unbiased assessment of performance with respect to attributes.

1.2 Related Work in Long-Term Tracking

Long-term (LT) trackers have received far less attention than short-term (ST) trackers. A major difference between ST and LT trackers is that LT trackers are required to handle situations in which the target may leave the field of view for a longer duration. This means that LT trackers have to detect target absence and re-detect the target when it reappears. Therefore a natural evaluation protocol for LT tracking is a no-reset protocol.

A typical structure of a long-term tracker is a short-term component with a relatively small search range, responsible for frame-to-frame association, and a detector component, responsible for detecting target reappearance. In addition, an interaction mechanism between the short-term component and the detector is required that appropriately updates the visual models and switches between target tracking and detection. This structure originates from two seminal long-term tracking papers, TLD [37] and Alien [66], and has been reused in all subsequent LT trackers (e.g., [20, 34, 57, 59, 65, 100]).

The set of performance measures in long-term tracking is quite diverse and has not converged as it has in short-term tracking. The early long-term tracking papers [37, 66] adopted measures from the object detection literature, since detectors play a central role in LT tracking. The primary performance measures were precision, recall and F-measure computed at a 0.5 IoU (overlap) threshold. But for tracking, an overlap of 0.5 is over-restrictive, as discussed in [37, 43], and does not faithfully reflect the overall tracking capabilities. Furthermore, the approach requires a binary output – the target is either present or absent. In general, a tracker can report the target position along with a presence certainty score, which allows a more accurate analysis, but this is prevented by the binary output requirement. In addition to precision/recall measures, the authors of [37, 66] proposed using the average center error to analyze tracking accuracy. But center-error-based measures are even more brittle than IoU-based measures, are resolution-dependent, and are computed only in frames where the target is present and the tracker reports its position. Thus most papers published in the last few years (e.g., [20, 34, 57]) have simply used the short-term average overlap performance measure from [61, 90]. But this measure does not account for the tracker’s ability to correctly report target absence and favors reporting target positions in every frame. Attempts were made to address this drawback [60, 79] by setting the overlap to 1 when the tracker correctly predicts target absence, but this does not clearly separate re-detection ability from tracking accuracy. Recently, Lukežič et al. [56] proposed tracking precision, tracking recall and tracking F-measure, which avoid dependence on the IoU threshold and allow analyzing trackers with presence certainty outputs without assuming a predefined scale of the outputs. They showed that their primary measure, the tracking F-measure, reduces to a standard short-term measure (average overlap) when computed in a short-term setup.

Only a few datasets have been proposed for long-term tracking. The first dataset was introduced by the LTDT challenge, which offered a collection of specific videos from [37, 45, 66, 75]. These videos were chosen using the following definition of a long-term sequence: “a long-term sequence is a video that is at least 2 min long (at 25–30 fps), but ideally 10 min or longer”. Mueller et al. [61] proposed the UAV20L dataset containing twenty long sequences with many target disappearances recorded from drones. Recently, three benchmarks that propose datasets with many target disappearances have appeared almost concurrently as preprints [36, 56, 60]. The benchmark [60] primarily analyzes the performance of short-term trackers on long sequences, and [36] proposes a huge dataset constructed from YouTube bounding boxes [69]. To cope with the significant dataset size, [36] annotates the tracked object only every few frames. The benchmark [60] does not distinguish between short-term and long-term tracker architectures but treats LT tracking as the ability to track long sequences, attributing most performance boosts to robust visual models. The benchmarks [36, 56], on the other hand, point out the importance of re-detection, and [56] uses this as a guideline to construct a moderately sized dataset with many long-term specific attributes. In fact, [56] argues that long-term tracking does not refer merely to the sequence length, but more importantly to the sequence properties (number of target disappearances, etc.) and the type of tracking output expected. The authors argue that there are several levels of tracker types between pure short-term and long-term trackers and propose a new short-term/long-term tracking taxonomy covering four classes of ST/LT trackers. For these reasons, we base the VOT long-term dataset and evaluation protocols described in Sect. 3 on [56].

1.3 The VOT2018 Challenge

VOT2018 considers short-term as well as long-term trackers in separate sub-challenges. The evaluation toolkit and the datasets are provided by the VOT2018 organizers. These were released on April 26th 2018 for beta-testing. The challenge officially opened on May 5th 2018 with approximately a month available for results submission.

The authors participating in the challenge were required to integrate their tracker into the VOT2018 evaluation kit, which automatically performed a set of standardized experiments. The results were analyzed according to the VOT2018 evaluation methodology.

Participants were encouraged to submit their own new or previously published trackers as well as modified versions of third-party trackers. In the latter case, modifications had to be significant enough for acceptance. Participants were expected to submit a single set of results per tracker. Changes in the parameters did not constitute a different tracker. The tracker was required to run with fixed parameters in all experiments. The tracking method itself was allowed to internally change specific parameters, but these had to be set automatically by the tracker, e.g., from the image size and the initial size of the bounding box, and were not to be set by detecting a specific test sequence and then selecting the parameters that were hand-tuned for this sequence.

Each submission was accompanied by a short abstract describing the tracker, which was used for the short tracker descriptions in Appendix A. In addition, participants filled out a questionnaire on the VOT submission page to categorize their tracker along various design properties. Authors had to agree to help the VOT technical committee reproduce their results in case their tracker was selected for further validation. Participants with sufficiently well-performing submissions who contributed text for this paper and agreed to make their tracker code publicly available from the VOT page were offered co-authorship of this results paper.

To counter attempts of intentionally reporting large bounding boxes to avoid resets, the VOT committee analyzed the submitted tracker outputs. The committee reserved the right to disqualify the tracker should such or a similar strategy be detected.

To compete for the winner of the VOT2018 challenge, learning from the tracking datasets (OTB, VOT, ALOV, NUSPRO and TempleColor) was prohibited. The use of class labels specific to VOT was not allowed (i.e., identifying a target class in each sequence and applying pre-trained class-specific trackers was not allowed). An agreement to publish the code online on the VOT webpage was required. The organizers of VOT2018 were allowed to participate in the challenge, but did not compete for the VOT2018 challenge winner title. Further details are available from the challenge homepage.

As in VOT2017, VOT2018 ran the main VOT2018 short-term sub-challenge and the VOT2018 short-term real-time sub-challenge, but did not run the short-term thermal and infrared VOT-TIR sub-challenge. As a significant novelty, VOT2018 introduces a new long-term tracking sub-challenge, adopting the methodology from [56]. The VOT2018 toolkit has been updated to allow seamless use in short-term and long-term tracking evaluation. In the following, we overview the sub-challenges.

2 The VOT2018 Short-Term Challenge

The VOT2018 short-term challenge comprises the main VOT2018 short-term sub-challenge and the VOT2018 real-time sub-challenge. Both sub-challenges used the same dataset, but different evaluation protocols.

The VOT2017 results indicated that the 2017 dataset was not saturated, so it was used unchanged in the VOT2018 short-term challenge. The dataset contains 60 public sequences (the VOT2017 public dataset) and another 60 sequestered sequences (the VOT2017 sequestered dataset). Only the former was released to the public, while the latter was not disclosed and was used only to identify the winner of the main VOT2018 short-term challenge. The target in each sequence is annotated by a rotated bounding box, and all sequences are annotated per-frame with the following visual attributes: (i) occlusion, (ii) illumination change, (iii) motion change, (iv) size change and (v) camera motion. Frames that did not correspond to any of the five attributes were denoted as (vi) unassigned.

2.1 Performance Measures and Evaluation Protocol

As in VOT2017 [38], three primary measures were used to analyze the short-term tracking performance: accuracy (A), robustness (R) and expected average overlap (EAO). In the following, these are briefly overviewed and we refer to [40, 43, 83] for further details.

The VOT short-term challenges apply a reset-based methodology. Whenever a tracker predicts a bounding box with zero overlap with the ground truth, a failure is detected and the tracker is re-initialized five frames after the failure. Accuracy and robustness [83] are the basic measures used to probe tracker performance in the reset-based experiments. Accuracy is the average overlap between the predicted and ground truth bounding boxes during successful tracking periods. Robustness measures how many times the tracker loses the target (fails) during tracking. The potential bias due to resets is reduced by ignoring ten frames after re-initialization in the accuracy measure (note that a tracker is re-initialized five frames after a failure), which is quite a conservative margin [43]. Average accuracy and failure rates are reported for stochastic trackers, which are run 15 times.
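The following is a hedged sketch of the reset-based bookkeeping described above, not the official toolkit implementation. It assumes a list of per-frame overlaps from a single run in which a zero overlap marks a failure; it skips the five frames until re-initialization, ignores the ten-frame burn-in in the accuracy, and counts failures for the robustness.

```python
REINIT_DELAY = 5   # the tracker is re-initialized this many frames after a failure
BURN_IN = 10       # frames ignored in the accuracy after each re-initialization

def accuracy_and_robustness(overlaps):
    """overlaps: per-frame IoU of one reset-based run; 0.0 marks a failure."""
    failures, valid = 0, []
    t, ignore_until = 0, 0
    while t < len(overlaps):
        if overlaps[t] == 0.0:          # failure: zero overlap with the ground truth
            failures += 1
            t += REINIT_DELAY           # frames until re-initialization are skipped
            ignore_until = t + BURN_IN  # burn-in after the re-initialization
            continue
        if t >= ignore_until:
            valid.append(overlaps[t])   # counted towards the accuracy
        t += 1
    accuracy = sum(valid) / len(valid) if valid else 0.0
    return accuracy, failures           # robustness is reported via the failure count
```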

The third, primary measure, called the expected average overlap (EAO), is an estimator of the average overlap a tracker is expected to attain on a large collection of short-term sequences with the same visual properties as the given dataset. The measure addresses the problem of increased variance and bias of the AO [92] measure due to variable sequence lengths. Please see [40] for further details on the expected average overlap measure. For reference, the toolkit also ran a no-reset experiment in which the AO [92] was computed (available in the online results).
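As an intuition aid only, the sketch below illustrates the idea behind EAO under simplifying assumptions: each run is an overlap sequence started at an initialization, overlaps after a failure (i.e., after the run ends) are treated as zero, and the expected-average-overlap curve is averaged over an interval of sequence lengths [n_lo, n_hi]. The official computation additionally derives this interval from the dataset's sequence-length distribution, so the function name and inputs here are hypothetical.

```python
import numpy as np

def eao_sketch(runs, n_lo, n_hi):
    """runs: list of per-frame overlap arrays, one per tracker run started at an
    initialization; a run ends at the first failure (overlaps after it count as zero)."""
    curve = np.zeros(n_hi)
    for n_s in range(1, n_hi + 1):
        vals = []
        for r in runs:
            padded = np.zeros(n_s)
            m = min(len(r), n_s)
            padded[:m] = r[:m]            # zero overlap once the run has ended
            vals.append(padded.mean())    # average overlap on an n_s-long sequence
        curve[n_s - 1] = np.mean(vals)    # expected average overlap at length n_s
    return curve[n_lo - 1:n_hi].mean()    # average over typical sequence lengths
```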

2.2 The VOT2018 Real-Time Sub-challenge

The VOT2018 real-time sub-challenge was introduced in VOT2017 [38] and is a variation of the main VOT2018 short-term sub-challenge. The main VOT2018 short-term sub-challenge does not place any constraint on the time for processing a single frame. In contrast, the VOT2018 real-time sub-challenge requires predicting bounding boxes at or faster than the video frame-rate. The toolkit sends images to the tracker via the Trax protocol [10] at 20 fps. If the tracker does not respond in time, the last reported bounding box is assumed as the tracker output for the current frame (a zero-order hold dynamic model).
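The zero-order hold can be illustrated with the following hedged sketch; the `tracker.initialize`/`tracker.track` interface and the timing loop are hypothetical stand-ins for the Trax-based toolkit and only serve to show that a tracker exceeding the 50 ms frame budget has its last reported box repeated for the missed frames.

```python
import time

FRAME_INTERVAL = 1.0 / 20.0   # the toolkit sends frames at 20 fps

def realtime_run(tracker, frames, init_box):
    tracker.initialize(frames[0], init_box)   # hypothetical tracker interface
    last_box = init_box
    outputs = [init_box]
    next_due = time.time() + FRAME_INTERVAL
    for frame in frames[1:]:
        if time.time() <= next_due:           # tracker is still on schedule
            last_box = tracker.track(frame)   # may overrun the frame budget
        # otherwise the last reported box is re-used (zero-order hold)
        outputs.append(last_box)
        next_due += FRAME_INTERVAL
    return outputs
```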

The toolkit applies a reset-based VOT evaluation protocol by resetting the tracker whenever the tracker bounding box does not overlap with the ground truth. The VOT frame skipping is applied as well to reduce the correlation between resets.

2.3 Winner Identification Protocol

On the main VOT2018 short-term sub-challenge, the winner is identified as follows. Trackers are ranked according to the EAO measure on the public dataset. Top ten trackers are re-run by the VOT2018 committee on the sequestered dataset. The top ranked tracker on the sequestered dataset not submitted by the VOT2018 committee members is the winner of the main VOT2018 short-term challenge. The winner of the VOT2018 real-time challenge is identified as the top-ranked tracker not submitted by the VOT2018 committee members according to the EAO on the public dataset.

3 The VOT2018 Long-Term Challenge

The VOT2018 long-term challenge focuses on the long-term tracking properties. In a long-term setup, the object may leave the field of view or become fully occluded for a long period. Thus in principle, a tracker is required to report the target absence. To make the integration with the toolkit compatible with the short-term setup, we require the tracker to report the target position in each frame and provide a confidence score of target presence. The VOT2018 adapts long-term tracker definitions, dataset and the evaluation protocol from [56]. We summarize these in the following and direct the reader to the original paper for more details.

3.1 The Short-Term/Long-Term Tracking Spectrum

The following definitions from [56] are used to position the trackers on the short-term/long-term spectrum:

1. Short-term tracker (\(\mathrm {ST}_0\)). The target position is reported at each frame. The tracker does not implement target re-detection and does not explicitly detect occlusion. Such trackers are likely to fail at the first occlusion as their representation is affected by any occluder.

2. Short-term tracker with conservative updating (\(\mathrm {ST}_1\)). The target position is reported at each frame. Target re-detection is not implemented, but tracking robustness is increased by selectively updating the visual model depending on a tracking confidence estimation mechanism.

3. Pseudo long-term tracker (\(\mathrm {LT}_0\)). The target position is not reported in frames when the target is not visible. The tracker does not implement explicit target re-detection but uses an internal mechanism to identify and report tracking failure.

4. Re-detecting long-term tracker (\(\mathrm {LT}_1\)). The target position is not reported in frames when the target is not visible. The tracker detects tracking failure and implements explicit target re-detection.

3.2 The Dataset

Trackers are evaluated on the LTB35 dataset [56]. This dataset contains 35 sequences, carefully selected to obtain long sequences with many target disappearances. Twenty sequences were obtained from UAV20L [61], three from [37], six were taken from YouTube and six were generated with the omnidirectional view generator AMP [96] to ensure many target disappearances. Sequence resolutions range between \(1280 \times 720\) and \(290 \times 217\). The dataset contains 14687 frames, with 433 target disappearances. Each sequence contains on average 12 long-term target disappearances, each lasting on average 40 frames.

The targets are annotated by axis-aligned bounding boxes. Sequences are annotated by the following visual attributes: (i) Full occlusion, (ii) Out-of-view, (iii) Partial occlusion, (iv) Camera motion, (v) Fast motion, (vi) Scale change, (vii) Aspect ratio change, (viii) Viewpoint change, (ix) Similar objects. Note that this is a per-sequence, not a per-frame, annotation, and a sequence can be annotated by several attributes.

3.3 Performance Measures

We use three long-term tracking performance measures proposed in [56]: tracking precision (Pr), tracking recall (Re) and tracking F-score. These are briefly described in the following.

Let \(G_t\) be the ground truth target pose, \(A_t(\tau _\theta )\) the pose predicted by the tracker, \(\theta _t\) the prediction certainty score at time-step t, and \(\tau _\theta \) a classification (detection) threshold. If the target is absent, the ground truth is an empty set, i.e., \(G_t=\emptyset \). Similarly, if the tracker did not predict the target or the prediction certainty score is below the classification threshold, i.e., \(\theta _t < \tau _\theta \), the output is \(A_t(\tau _\theta )=\emptyset \). Let \(\varOmega (A_t(\tau _\theta ), G_t)\) be the intersection over union between the tracker prediction and the ground truth, let \(N_g\) be the number of frames with \(G_t\ne \emptyset \), and let \(N_p\) be the number of frames with an existing prediction, i.e., \(A_t(\tau _\theta ) \ne \emptyset \).

In the detection literature, a prediction matches the ground truth if the overlap \(\varOmega (A_t(\tau _\theta ), G_t)\) exceeds a threshold \(\tau _\varOmega \), which makes precision and recall dependent on the minimal classification certainty as well as on the minimal overlap threshold. This problem is addressed in [56] by integrating the precision and recall over all possible overlap thresholds. The tracking precision and tracking recall at classification threshold \(\tau _\theta \) are defined as

$$\begin{aligned} Pr(\tau _\theta ) = \frac{1}{N_p} \sum _{t \in \{ t : A_t(\tau _\theta ) \ne \emptyset \} } \varOmega (A_t(\tau _\theta ), G_t), \end{aligned}$$
(1)
$$\begin{aligned} Re(\tau _\theta ) = \frac{1}{N_g} \sum _{t \in \{ t : G_t \ne \emptyset \} } \varOmega (A_t(\tau _\theta ), G_t). \end{aligned}$$
(2)

Precision and recall are combined into a single score by computing the tracking F-measure:

$$\begin{aligned} F(\tau _\theta ) = 2 Pr(\tau _\theta ) Re(\tau _\theta ) / (Pr(\tau _\theta ) + Re(\tau _\theta )). \end{aligned}$$
(3)

Long-term tracking performance can thus be visualized by tracking precision, tracking recall and tracking F-measure plots, obtained by computing these scores for all thresholds \(\tau _\theta \).

The primary long-term tracking measure [56] is the F-score, defined as the highest score on the F-measure plot, i.e., taken at the tracker-specific optimal threshold. This avoids a manually set threshold in the primary performance measure.
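A minimal sketch of Eqs. (1)–(3) for a single sequence is given below; it assumes per-frame IoU values (zero when either the prediction or the ground truth is absent), per-frame presence certainty scores and a ground-truth visibility flag, and it is not the official evaluation code. In the challenge itself the F-score is taken from the dataset-averaged F-measure plot (Sect. 3.5) rather than per sequence.

```python
import numpy as np

def pr_re_f(overlaps, confidences, target_present, tau):
    """Per-frame inputs for one sequence: overlaps[t] is the IoU between the
    prediction and the ground truth (0 when either is absent), confidences[t]
    is the tracker's presence certainty score, target_present[t] is a bool."""
    overlaps = np.asarray(overlaps, dtype=float)
    predicted = np.asarray(confidences) >= tau        # A_t(tau_theta) != empty set
    present = np.asarray(target_present, dtype=bool)  # G_t != empty set
    n_p, n_g = predicted.sum(), present.sum()
    pr = overlaps[predicted].sum() / n_p if n_p else 0.0   # Eq. (1)
    re = overlaps[present].sum() / n_g if n_g else 0.0     # Eq. (2)
    f = 2 * pr * re / (pr + re) if (pr + re) else 0.0      # Eq. (3)
    return pr, re, f

# F-score for a single sequence: the maximum over all thresholds tau_theta
def f_score(overlaps, confidences, target_present):
    return max(pr_re_f(overlaps, confidences, target_present, tau)[2]
               for tau in sorted(set(confidences)))
```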

3.4 Re-detection Experiment

We also adapt an experiment from [56] designed to test the tracker’s re-detection capability separately from its short-term component. This experiment generates an artificial sequence in which the target does not change appearance but only location. The initial frame of a sequence is padded with zeros to the right and down, to three times the original size. This frame is repeated for the first five frames of the artificial sequence. For the remaining frames, the target is cropped from the initial image and placed in the bottom-right corner of the frame, with all other pixels set to zero.

A tracker is initialized in the first frame and the experiment measures the number of frames required to re-detect the target after position change. This experiment is re-run over artificial sequences generated from all sequences in the LTB35 dataset.
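A possible construction of such an artificial sequence is sketched below, assuming a NumPy image and an integer (x, y, w, h) target box; the sequence length parameter and the exact placement follow the description above, but this is an illustrative sketch rather than the code used in the challenge.

```python
import numpy as np

def make_redetection_sequence(frame, target_box, length=50):
    """Artificial re-detection sequence: the initial frame, zero-padded to three
    times its size, is repeated for five frames; afterwards only the cropped
    target appears in the bottom-right corner of an otherwise zero frame."""
    h, w = frame.shape[:2]
    x, y, bw, bh = target_box                  # axis-aligned (x, y, w, h), integers
    big_shape = (3 * h, 3 * w) + frame.shape[2:]
    first = np.zeros(big_shape, dtype=frame.dtype)
    first[:h, :w] = frame                      # original frame padded right and down
    target = frame[y:y + bh, x:x + bw]
    moved = np.zeros(big_shape, dtype=frame.dtype)
    moved[-bh:, -bw:] = target                 # target placed in the bottom-right corner
    return [first] * 5 + [moved] * (length - 5)
```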

3.5 Evaluation Protocol

A tracker is evaluated on a dataset of several sequences by initializing it on the first frame of each sequence and running it until the end of the sequence without resets. The precision-recall graph from (1)–(2) is calculated for each sequence and averaged into a single plot. This guarantees that the result is not dominated by extremely long sequences. The F-measure plot is computed according to (3) from the average precision-recall plot. The maximal score on the F-measure plot (F-score) is taken as the primary long-term tracking performance measure.
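The dataset-level aggregation can be sketched as follows, assuming per-sequence precision and recall values have already been computed for a common set of thresholds (e.g., with the per-sequence sketch in Sect. 3.3); sequences are averaged with equal weight so that long sequences do not dominate, and the F-score is the maximum of the resulting F-measure plot.

```python
import numpy as np

def dataset_f_score(per_sequence_pr, per_sequence_re, taus):
    """per_sequence_pr / per_sequence_re: one Pr / Re value per threshold in
    taus, for every sequence (shape: num_sequences x num_thresholds)."""
    pr = np.mean(per_sequence_pr, axis=0)          # average precision plot
    re = np.mean(per_sequence_re, axis=0)          # average recall plot
    f = 2 * pr * re / np.maximum(pr + re, 1e-12)   # F-measure plot, Eq. (3)
    best = int(np.argmax(f))
    return f[best], taus[best]                     # F-score and its optimal threshold
```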

3.6 Winner Identification Protocol

The winner of the VOT2018 long-term tracking challenge is identified as the top-ranked tracker not submitted by the VOT2018 committee members according to the F-score on the LTB35 dataset.

4 The VOT2018 Short-Term Challenge Results

This section summarizes the trackers submitted to the VOT short-term (VOT2018 ST) challenge, results analysis and winner identification.

4.1 Trackers Submitted

In all, 56 valid entries were submitted to the VOT2018 short-term challenge. Each submission included the binaries or source code that allowed verification of the results if required. The VOT2018 committee and associates additionally contributed 16 baseline trackers. For these, the default parameters were selected, or, when not available, were set to reasonable values. Thus in total 72 trackers were tested on the VOT2018 short-term challenge. In the following we briefly overview the entries and provide the references to original papers in the Appendix A where available.

Of all participating trackers, 51 (\(71\%\)) were categorized as ST\(_0\), 18 (\(25\%\)) as ST\(_1\), and three (\(4\%\)) as LT\(_1\). \(76\%\) applied discriminative and \(24\%\) applied generative models. Most trackers – \(75\%\) – used a holistic model, while \(25\%\) used part-based models. Most trackers applied either a locally uniform dynamic model (\(76\%\)), a nearly-constant-velocity model (\(7\%\)), or a random-walk dynamic model (\(15\%\)), while only a single tracker applied a higher-order dynamic model (\(1\%\)).

The trackers were based on various tracking principles: 4 trackers (\(6\%\)) were based on CNN matching (ALAL A.2, C3DT A.72, LSART A.40, RAnet A.57), one tracker was based on a recurrent neural network (ALAL A.2), 14 trackers (\(18\%\)) applied Siamese networks (ALAL A.2, DensSiam A.23, DSiam A.30, LWDNTm A.41, LWDNTthi A.42, MBSiam A.48, SA\(\_\)Siam\(\_\)P A.59, SA\(\_\)Siam\(\_\)R A.60, SiamFC A.34, SiamRPN A.35, SiamVGG A.63, STST A.66, UpdateNet A.1), 3 trackers (\(4\%\)) applied support vector machines (BST A.6, MEEM A.47, struck2011 A.68), 38 trackers (\(53\%\)) applied discriminative correlation filters (ANT A.3, BoVW\(\_\)CFT A.4, CCOT A.11, CFCF A.13, CFTR A.15, CPT A.7, CPT\(\_\)fast A.8, CSRDCF A.24, CSRTPP A.25, CSTEM A.9, DCFCF A.22, DCFNet A.18, DeepCSRDCF A.17, DeepSTRCF A.20, DFPReco A.29, DLSTpp A.28, DPT A.21, DRT A.16, DSST A.26, ECO A.31, HMMTxD A.53, KCF A.38, KFebT A.37, LADCF A.39, MCCT A.50, MFT A.51, MRSNCC A.49, R\(\_\)MCPF A.56, RCO A.12, RSECF A.14, SAPKLTF A.62, SRCT A.58, SRDCF A.64, srdcf\(\_\)deep A.19, srdcf\(\_\)dif A.32, Staple A.67, STBACF A.65, TRACA A.69, UPDT A.71), 6 trackers (\(8\%\)) applied mean shift (ASMS A.61, CPOINT A.10, HMMTxD A.53, KFebT A.37, MRSNCC A.49, SAPKLTF A.62) and 8 trackers (\(11\%\)) applied optical flow (ANT A.3, CPOINT A.10, FoT A.33, Fragtrac A.55, HMMTxD A.53, LGT A.43, MRSNCC A.49, SAPKLTF A.62).

Many trackers used combinations of several features. CNN features were used in 62\(\%\) of trackers – these were either trained for discrimination (32 trackers) or localization (13 trackers). Hand-crafted features were used in 44\(\%\) of trackers, keypoints in 14\(\%\) of trackers, color histograms in 19\(\%\) and grayscale features were used in \(24\%\) of trackers.

4.2 The Main VOT2018 Short-Term Sub-challenge Results

The results are summarized in the AR-raw plots and EAO curves in Fig. 1 and the expected average overlap plots in Fig. 2. The values are also reported in Table 2. The top ten trackers according to the primary EAO measure (Fig. 2) are LADCF A.39, MFT A.51, SiamRPN A.35, UPDT A.71, RCO A.12, DRT A.16, DeepSTRCF A.20, SA_Siam_R A.60, CPT A.7 and DLSTpp A.28. All of these trackers apply a discriminatively trained correlation filter on top of multidimensional features, except for SiamRPN and SA\(\_\)Siam\(\_\)R, which apply Siamese networks. Common networks used by the top ten trackers are AlexNet, VGG and ResNet, in addition to networks pre-trained for localization. Many trackers combine the deep features with HOG, Colornames and a grayscale patch.

Fig. 1. The AR-raw plots generated by sequence pooling (left) and EAO curves (right). (Color figure online)

The top performer on the public dataset is LADCF (A.39). This tracker trains a low-dimensional DCF using an adaptive spatial regularizer. Adaptive spatial regularization and temporal consistency are combined into a single objective function. The tracker uses HOG, Colornames and ResNet-50 features. Data augmentation by flipping, rotating and blurring is applied to the ResNet features. The second-best ranked tracker is MFT (A.51). This tracker adopts CFWCR [31] as a baseline feature learning algorithm and applies the continuous convolution operator [15] to fuse multi-resolution features. The different resolutions are trained independently for target position prediction, which, according to the authors, significantly boosts the robustness. The tracker uses ResNet-50, SE-ResNet-50, HOG and Colornames features.

The top trackers in EAO are also among the most robust, which means that they are able to track longer without failing. The top trackers in robustness (Fig. 1) are MFT A.51, LADCF A.39, RCO A.12, UPDT A.71, DRT A.16, LSART A.40, DeepSTRCF A.20, DLSTpp A.28, CPT A.7 and SA_Siam_R A.60. On the other hand, the top performers in accuracy are SiamRPN A.35, SA_Siam_R A.60, FSAN A.70, DLSTpp A.28, UPDT A.71, MCCT A.50, SiamVGG A.63, ALAL A.2, DeepSTRCF A.20 and SA\(\_\)Siam\(\_\)P A.59.

Fig. 2. Expected average overlap graph with trackers ranked from right to left. The right-most tracker is the top performer according to the VOT2018 expected average overlap values. The dashed horizontal line denotes the average performance of ten state-of-the-art trackers published in 2017 and 2018 at major computer vision venues. These trackers are denoted by a gray circle in the bottom part of the graph. (Color figure online)

The trackers that were considered as baselines or state-of-the-art even a few years ago, i.e., MIL (A.52), IVT (A.36), Struck [28] and KCF (A.38), are positioned in the lower part of the AR-plots and at the tail of the EAO rank list. This speaks of the significant quality of the trackers submitted to VOT2018. In fact, 19 tested trackers (\(26\%\)) have recently (2017/2018) been published at computer vision conferences and journals. These trackers are indicated in Fig. 2, along with their average performance, which constitutes a very strict VOT2018 state-of-the-art bound. Approximately \(26\%\) of submitted trackers exceed this bound.

Table 1. Tracking difficulty with respect to the following visual attributes: camera motion (CM), illumination change (IC), motion change (MC), occlusion (OC) and size change (SC).
Table 2. The table shows the expected average overlap (EAO), as well as accuracy and robustness raw values (A, R) for the baseline and the realtime experiments. For the unsupervised experiment the no-reset average overlap AO [91] is used. The last column contains implementation details (first letter: (D)eterministic or (S)tochastic; second letter: tracker implemented in (M)atlab, (C)++, or (P)ython; third letter: tracker using a (G)PU or only a (C)PU).
Fig. 3. Failure rate with respect to the visual attributes.

The number of failures with respect to the visual attributes is shown in Fig. 3. The overall top performers remain at the top of the per-attribute ranks as well, but none of the trackers consistently outperforms all others with respect to each attribute. According to the median robustness and accuracy over each attribute (Table 1), the most challenging attributes in terms of failures are occlusion, illumination change and motion change, followed by camera motion and size change. Occlusion is the most challenging attribute for tracking accuracy.

The VOT-ST2018 Winner Identification. The top 10 trackers from the baseline experiment (Table 2) were selected to be re-run on the sequestered dataset. Despite significant effort, our team was unable to re-run DRT and SA\(\_\)Siam\(\_\)R, due to library incompatibility errors in one case and significant system modification requirements in the other. These two trackers were thus removed from the winner identification process, on account of the provided code not being reproduction-ready. The scores of the remaining trackers are shown in Table 3. The top tracker according to the EAO is MFT A.51, which is thus the VOT2018 short-term challenge winner.

Table 3. The top eight trackers from Table 2 re-ranked on the VOT2018 sequestered dataset.

4.3 The VOT2018 Short-Term Real-Time Sub-challenge Results

The EAO scores and AR-raw plots for the real-time experiment are shown in Figs. 4 and 5. The top ten real-time trackers are SiamRPN A.35, SA_Siam_R A.60, SA\(\_\)Siam\(\_\)P A.59, SiamVGG A.63, CSRTPP A.25, LWDNTm A.41, LWDNTthi A.42, CSTEM A.9, MBSiam A.48 and UpdateNet A.1. Eight of these (SiamRPN, SA\(\_\)Siam\(\_\)R, SA\(\_\)Siam\(\_\)P, SiamVGG, LWDNTm, LWDNTthi, MBSiam, UpdateNet) are extensions of the Siamese architecture SiamFC [6]. These trackers apply pre-trained CNN features that maximize correlation localization accuracy and require a GPU. But since feature extraction as well as correlation are carried out on the GPU, they achieve significant speed in addition to extracting highly discriminative features. The remaining two trackers (CSRTPP and CSTEM) are extensions of CSRDCF [53] – a correlation filter with boundary constraints and segmentation for identifying reliable target pixels. These two trackers apply hand-crafted features, i.e., HOG and Colornames.

Fig. 4. The AR plot (left) and the EAO curves (right) for the VOT2018 realtime experiment. (Color figure online)

Fig. 5. The EAO plot (right) for the realtime experiment. (Color figure online)

The VOT-RT2018 Winner Identification. The winning real-time tracker of the VOT2018 is the Siamese region proposal network SiamRPN [48] (A.35). The tracker is based on a Siamese subnetwork for feature extraction and a region proposal subnetwork which includes a classification branch and a regression branch. The inference is formulated as a local one-shot detection task.

5 The VOT2018 Long-Term Challenge Results

The VOT2018 LT challenge received 11 valid entries. The VOT2018 committee contributed 4 additional baselines, so 15 trackers were considered in the VOT2018 LT challenge. In the following we briefly overview the entries and provide references to the original papers in Appendix B where available.

Some of the submitted trackers were in principle ST\(_0\) trackers. But the submission rules required exposing a target localization/presence certainty score, which can be thresholded to form a target presence classifier. In this way, these trackers were elevated to the LT\(_0\) level according to the ST-LT taxonomy from Sect. 3.1. Five trackers were from the ST\(_0\) (elevated to LT\(_0\)) class: SiamVGG B.15, SiamFC B.5, ASMS B.11, FoT B.3 and SLT B.14. Ten trackers were from the LT\(_1\) class: DaSiam\(\_\)LT B.2, MMLT B.1, PTAVplus B.10, MBMD B.8, SAPKLTF B.12, LTSINT B.7, SYT B.13, SiamFCDet B.4, FuCoLoT B.6 and HMMTxD B.9.

Ten trackers applied CNN features (nine of these in Siamese architecture) and four trackers applied DCFs. Six trackers never updated the short-term component (DaSiam\(\_\)LT, SYT, SiamFCDet, SiamVGG, SiamFC and SLT), four updated the component only when confident (MMLT, SAPKLTF, LTSINT, FuCoLoT), two applied exponential forgetting (HMMTxD, ASMS), two applied updates at fixed intervals (PTAVplus, MBMD) and one applied robust partial updates (FoT). Seven trackers never updated the long-term component (DaSiam\(\_\)LT, MBMD, SiamFCDet, HMMTxD, SiamVGG, SiamFC, SLT), and six updated the model only when confident (MMLT, PTAVplus, SAPKLTF, LTSINT, SYT, FuCoLoT).

Table 4. List of trackers that participated in the VOT2018 long-term challenge along with their performance scores (F-score, Pr, Re), ST/LT categorization and results of the re-detection experiment in the last column with the average number of frames required for re-detection (Frames) and the percentage of sequences with successful re-detection (Success).

Results of the re-detection experiment are summarized in the last column of Table 4. MMLT, SLT, MBMD, FuCoLoT and LTSINT consistently re-detect the target, while SiamFCDet succeeded in all but one sequence. Some trackers (SYT, PTAVplus) were capable of re-detection in only a few cases, which indicates a potential issue with the detector. All these eight trackers pass the re-detection test and are classified as LT\(_1\) trackers. The trackers DaSiam\(\_\)LT, SAPKLTF, SiamVGG and SiamFC did not pass the test, which means that they do not perform image-wide re-detection, but only re-detect in an extended local region. These trackers are classified as LT\(_0\).

Fig. 6. Long-term tracking performance. The average tracking precision-recall curves (left), the corresponding F-score curves (right). Tracker labels are sorted according to the maximum of the F-score. (Color figure online)

Fig. 7. Maximum F-score averaged over overlap thresholds for the visual attributes. The most challenging attributes are fast motion, out of view, aspect ratio change and full occlusion. (Color figure online)

The overall performance is summarized in Fig. 6. The highest ranked tracker is the MobileNet-based tracking-by-detection algorithm (MBMD), which applies a bounding box regression network and an MDNet-based verifier [64]. The bounding box regression network is trained on the ILSVRC 2015 video detection dataset and the ILSVRC 2014 detection dataset; by ignoring the classification labels, it learns to regress to any object in a search region. The bounding box regression result is verified by MDNet [64]. If the score of the regression module is below a threshold, MDNet localizes the target using a particle filter. MDNet is updated online, while the bounding box regression network is not.

The second highest ranked tracker is DaSiam\(\_\)LT – an LT\(_1\)-class tracker. This tracker is an extension of the Siamese Region Proposal Network (SiamRPN) [48]. The original SiamRPN cannot recover the target after it re-appears, so the extension implements an effective global-to-local search strategy: the search region size is gradually grown at a constant rate after target loss, akin to [55]. Distractor-aware training and inference are also added to provide a high-quality tracking reliability score.

Figure 7 shows tracking performance with respect to nine visual attributes from Sect. 3.2. The most challenging attributes are fast motion, out of view, aspect ratio change and full occlusion.

The VOT-LT2018 Winner Identification. According to the F-score, MBMD (F-score = 0.610) is slightly ahead of DaSiam\(\_\)LT (F-score = 0.607). The trackers reach approximately the same tracking recall (0.588216 for MBMD vs 0.587921 for DaSiam\(\_\)LT), which implies comparable target re-detection success. But MBMD has a greater tracking precision, which implies better target localization capability. Overall, the best tracking precision is obtained by SiamFC, while the best tracking recall is obtained by MBMD. According to the VOT winner rules, the VOT2018 long-term challenge winner is therefore MBMD B.8.

6 Conclusion

Results of the VOT2018 challenge were presented. The challenge is composed of the following three sub-challenges: the main VOT2018 short-term tracking challenge (VOT-ST2018), the VOT2018 real-time short-term tracking challenge (VOT-RT2018) and VOT2018 long-term tracking challenge (VOT-LT2018), which is a new challenge introduced this year.

The overall results of the challenges indicate that discriminative correlation filters and deep networks remain the dominant methodologies in visual object tracking. Deep features in DCFs and the use of CNNs as classifiers in trackers were already recognized as efficient tracking ingredients in VOT2015, but their use among top performers has become widespread over the following years. In contrast to previous years, we observe wider use of localization-trained CNN features and of CNN trackers based on Siamese architectures. Bounding box regression is also used in trackers more frequently than in previous challenges.

The top performer on the VOT-ST2018 public dataset is LADCF (A.39) – a regularized discriminative correlation filter trained on a low-dimensional projection of ResNet-50, HOG and Colornames features. The top performer on the sequestered dataset and the VOT-ST2018 challenge winner is MFT (A.51) – a continuous-convolution discriminative correlation filter with localization-trained features, in which the filters for the different resolutions are trained independently. This tracker uses ResNet-50, SE-ResNet-50, HOG and Colornames features.

The top performer and the winner of the VOT-RT2018 challenge is SiamRPN (A.35) – a Siamese region proposal network. The tracker requires a GPU, but otherwise offers the best trade-off between robustness and processing speed. Note that nearly all of the top ten trackers in the real-time challenge applied Siamese networks (two applied DCFs and ran on the CPU). The dominant methodology in real-time tracking therefore appears to be Siamese CNNs.

The top performer and the winner of the VOT-LT2018 challenge is MBMD (B.8) – a bounding box regression network with MDNet [64] used for regression verification and for localization upon target loss. This tracker belongs to the LT\(_1\) class: it identifies potential target loss, performs target re-detection and applies conservative updates of the visual model.

The primary VOT objective is to establish a platform for discussing tracking performance evaluation and to contribute to the tracking community verified annotated datasets, performance measures and evaluation toolkits. VOT2018 was the sixth effort toward this, following the very successful VOT2013, VOT2014, VOT2015, VOT2016 and VOT2017.