1 Introduction

Drones, or general UAVs, equipped with cameras have been rapidly deployed in a wide range of applications, including agriculture, aerial photography, fast delivery, and surveillance. Consequently, automatic understanding of visual data collected by drones is in high demand, bringing computer vision and drones ever closer together. Despite great progress in general computer vision algorithms, such as tracking and detection, these algorithms are usually not optimal for dealing with drone-captured sequences or images, due to various challenges such as viewpoint changes and scale variations.

Developing and evaluating new vision algorithms for drone-generated visual data is a key problem in drone-based applications. However, as pointed out in recent studies (e.g., [26, 43]), the lack of public large-scale benchmarks or datasets is the bottleneck to achieving this goal. Some recent preliminary efforts [26, 43, 49] have been devoted to constructing datasets with drone platforms focusing on single-object tracking. These datasets are still limited in size and in the scenarios covered, due to the difficulties in data collection and annotation. Thus, a more general and comprehensive benchmark is desired to further boost research on computer vision problems with drones.

To advance development in single-object tracking, we organize the Vision Meets Drone Single-Object Tracking (VisDrone-SOT2018) challenge, which is one track of the “Vision Meets Drone: A Challenge” workshop held on September 8, 2018, in conjunction with the 15th European Conference on Computer Vision (ECCV 2018) in Munich, Germany. In particular, we collected a single-object tracking dataset with various drone models (e.g., DJI Mavic and Phantom series 3, 3A) in different scenarios and under various weather and lighting conditions. All video sequences are labelled per-frame with different visual attributes to aid a less biased analysis of the tracking results. The objects to be tracked are of various types, including pedestrians, cars, buses, and sheep. We invited authors to submit their tracking results on the VisDrone-SOT2018 dataset. The authors of algorithms submitted to the challenge have the opportunity to share their ideas at the workshop and to further publish their source code at our website: http://www.aiskyeye.com/, which helps push forward the development of the single-object tracking field.

2 Related Work

Single-object tracking, or visual tracking, is one of the fundamental problems in computer vision, which aims to estimate the trajectory of a target in a video sequence, given its initial state. In this section, we briefly review related datasets and recent tracking algorithms.

Existing Datasets. In recent years, numerous datasets have been developed for single-object tracking. Wu et al. [65] create a standard benchmark to evaluate single-object tracking algorithms, which includes 50 video sequences. After that, they further extend the dataset to 100 video sequences. Concurrently, Liang et al. [36] collect 128 video sequences for evaluating color-enhanced trackers. To track progress in the single-object tracking field, Kristan et al. [29,30,31, 56] organize the VOT competitions from 2013 to 2018, where new datasets and evaluation strategies are proposed for tracking evaluation. This series of competitions has promoted the development of visual tracking. Smeulders et al. [52] present the ALOV300 dataset, containing 314 video sequences with 14 visual attributes, such as long duration, zooming camera, moving camera, and transparency. Li et al. [32] construct a large-scale dataset with 365 video sequences, covering 12 different kinds of objects captured from moving cameras. Du et al. [15] design a dataset with 50 fully annotated video sequences, focusing on deformable object tracking in unconstrained environments. To evaluate tracking algorithms in high frame rate videos (e.g., 240 frames per second), Galoogahi et al. [21] propose a dataset containing 100 video clips (380,000 frames in total), recorded in real-world scenarios. Besides video sequences captured by RGB cameras, Felsberg et al. [20, 30, 57] organize a series of competitions from 2015 to 2017, focusing on single-object tracking in thermal video sequences recorded by 8 different types of sensors. In [53], an RGB-D tracking dataset is presented, which includes 100 RGB-D video clips with manually annotated ground truth bounding boxes. UAV123 [43] is a large UAV dataset including 123 fully annotated high-resolution video sequences captured from low-altitude aerial viewpoints. Similarly, UAVDT [16] is a UAV benchmark focusing on several different complex scenarios. Müller et al. [45] present a large-scale benchmark for object tracking in the wild, which includes more than 30,000 videos with more than 14 million dense bounding box annotations. Recently, Fan et al. [18] propose a large tracking benchmark with 1,400 videos, with each frame manually annotated. Most of the above datasets cover a large set of object categories but, unlike our dataset, do not focus on drone-based scenarios.

Review of Recent Single-Object Tracking Methods. Single-object tracking is a hot topic with various applications (e.g., video surveillance, behavior analysis, and human-computer interaction). It has attracted much research based on graph models [4, 15, 35, 64], subspace learning [28, 50, 62, 63], and sparse coding [39, 42, 47, 69]. Recently, correlation filter algorithms have become popular in the visual tracking field due to their high efficiency. Henriques et al. [25] derive a kernelized correlation filter and propose a fast multi-channel extension of linear correlation filters using a linear kernel. Danelljan et al. [10] propose to learn discriminative correlation filters based on a scale pyramid representation to improve tracking performance. To model the distribution of feature attention, Choi et al. [7] develop an attentional feature-based correlation filter built upon multiple trained elementary trackers. The Staple method [2] achieves a large gain in performance over previous methods by combining color statistics with a correlation filter. Danelljan et al. [11] demonstrate that learning the correlation filter coefficients with spatial regularization is effective for the tracking task. Li et al. [34] integrate temporal regularization into the SRDCF framework [11] with a single sample, and propose spatial-temporal regularized correlation filters to provide a more robust appearance model in the case of large appearance variations. Du et al. [17] design a correlation filter based method that integrates target part selection, part matching, and state estimation into a unified energy minimization framework.
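For readers unfamiliar with the correlation filter framework, the following is a minimal single-channel, MOSSE-style sketch in NumPy; it assumes grayscale patches and omits the cosine window, multi-channel features, scale estimation, and online updates used by the methods cited above, so it illustrates the general idea only and does not reproduce any particular tracker.

```python
import numpy as np

def gaussian_response(shape, sigma=2.0):
    """Desired correlation output: a 2-D Gaussian peaked at the patch centre."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def train_filter(patch, target, lam=1e-2):
    """Closed-form filter in the Fourier domain: H* = (G F*) / (F F* + lambda)."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def track(patch, H_conj):
    """Correlate a new search patch with the filter and return the peak offset."""
    Z = np.fft.fft2(patch)
    response = np.real(np.fft.ifft2(H_conj * Z))
    py, px = np.unravel_index(np.argmax(response), response.shape)
    h, w = response.shape
    # Offset of the peak from the patch centre approximates the target
    # displacement (ignoring circular wrap-around).
    dy = py - (h - 1) / 2.0
    dx = px - (w - 1) / 2.0
    return dy, dx, response.max()
```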

On the other hand, deep learning based methods have achieved a dominant position in the single-object tracking field with impressive performance. Some methods directly use deep Convolutional Neural Networks (CNNs) to extract features that replace the hand-crafted features in the correlation filter framework, such as CF2 [41], C-COT [13], ECO [9], CFNet [59], and PTAV [19]. In [60], different types of features are combined to construct multiple experts using the discriminative correlation filter algorithm, and each of them tracks the target independently. With the proposed robustness evaluation strategy, the most confident expert is selected to produce the tracking result in each frame. Another line of work constructs end-to-end deep models that jointly learn target appearance and perform tracking [3, 14, 33, 46, 54, 55, 67]. In SiamFC [3] and SINT [55], the researchers employ siamese deep neural networks to learn a matching function between the initial patch of the target in the first frame and candidates in the subsequent frames. Li et al. [33] propose the siamese region proposal network, which consists of a siamese sub-network for feature extraction and a region proposal sub-network for classification and regression. MDNet [46] pre-trains a CNN model on a large set of video sequences with manually annotated ground truths to obtain a generic target representation, and then evaluates candidate windows randomly sampled around the previous target state to find the optimal location for tracking. After that, Song et al. [54] present the VITAL algorithm to generate more discriminative training samples via adversarial learning. Yun et al. [67] design a tracker controlled by sequentially pursued actions learned via deep reinforcement learning. Dong et al. [14] propose a hyperparameter optimization method that is able to find the optimal hyperparameters for a given sequence using an action-prediction network based on continuous deep Q-learning.

3 The VisDrone-SOT2018 Challenge

As described above, to track and promote progress in the single-object tracking field, we organized the Vision Meets Drone Single-Object Tracking (or VisDrone-SOT2018, for short) challenge, which is one track of the workshop challenge “Vision Meets Drone: A Challenge” held on September 8, 2018, in conjunction with the 15th European Conference on Computer Vision (ECCV 2018) in Munich, Germany. The VisDrone-SOT2018 challenge focuses on single-object tracking on the drone platform. Specifically, given an initial bounding box enclosing the target in the first frame, the submitted algorithm is required to estimate the region of the target in the subsequent video frames. We released a single-object tracking dataset, i.e., the VisDrone-SOT2018 dataset, which consists of 132 video sequences formed by 106,354 frames, captured by various drone-mounted cameras, covering a wide range of aspects including location (taken from 14 different cities in China), environment (urban and country), objects (pedestrians, vehicles, bicycles, etc.), and density (sparse to crowded scenes). We invited researchers to participate in the challenge and to evaluate and discuss their research on the VisDrone-SOT2018 dataset at the workshop. We believe the workshop challenge will be helpful to research in the video object tracking community (Table 1).

Table 1. Comparison of Current State-of-the-Art Benchmarks and Datasets. Note that the resolution indicates the maximum resolution of the video frames included in the dataset. Notably, we have \(1k=1,000\).

3.1 Dataset

The VisDrone-SOT2018 dataset released in this workshop includes 132 video clips with 106,354 frames, divided into three non-overlapping subsets, i.e., the training set (86 sequences with 69,941 frames), the validation set (11 sequences with 7,046 frames), and the testing set (35 sequences with 29,367 frames). The video clips in these three subsets are taken at different locations, but share similar environments and attributes. The dataset is collected in various real-world scenarios by various drone platforms (i.e., different drone models) under various weather and lighting conditions, which helps researchers develop algorithms that perform well in real-world scenarios. We manually annotated the bounding boxes of targets (e.g., pedestrians, dogs, and vehicles) as well as several useful attributes (e.g., occlusion, background clutter, and camera motion) for algorithm analysis. We present the number of frames vs. the aspect ratio (i.e., object height divided by width) change rate with respect to the first frame in Fig. 2(a), and show the number of frames vs. the area change rate with respect to the first frame in Fig. 2(b). We plot the distributions of the number of frames of video clips in the training, validation, and testing sets in Fig. 2(c). In addition, some annotated examples from the VisDrone-SOT2018 dataset are presented in Fig. 1.
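Assuming each annotation is stored as an axis-aligned box (x, y, w, h), the per-frame statistics plotted in Fig. 2(a) and (b) can be computed as in the sketch below; the exact definitions used to produce the figures may differ slightly.

```python
def aspect_ratio_change_rate(boxes):
    """Aspect ratio (height / width) of each frame relative to the first frame.

    `boxes` is a list of (x, y, w, h) tuples, one per frame.
    """
    _, _, w0, h0 = boxes[0]
    r0 = h0 / w0
    return [(h / w) / r0 for (_, _, w, h) in boxes]

def area_change_rate(boxes):
    """Bounding-box area of each frame relative to the first frame."""
    _, _, w0, h0 = boxes[0]
    a0 = w0 * h0
    return [(w * h) / a0 for (_, _, w, h) in boxes]
```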

Fig. 1. Some annotated example video frames for single-object tracking. The first frame, with the bounding box of the target object, is shown for each sequence.

3.2 Evaluation Protocol

Following the evaluation methodology in [66], we use the success and precision scores to evaluate the performance of the trackers. The success score is defined as the area under the success plot. That is, for each bounding box overlap threshold \(t_o\) in the interval \([0, 1]\), we compute the percentage of successfully tracked frames (frames whose overlap exceeds \(t_o\)), which generates the plot of successfully tracked frames vs. bounding box overlap threshold. The overlap between the tracker prediction \(B_t\) and the ground truth bounding box \(B_g\) is defined as \({ O}=\frac{|B_t\bigcap {B}_g|}{|B_t\bigcup {B}_g|}\), where \(\bigcap \) and \(\bigcup \) represent the intersection and union of the two regions, respectively, and \(|\cdot |\) counts the number of pixels in a region. Meanwhile, the precision score is defined as the percentage of frames whose estimated location is within a given threshold distance of the ground truth, based on the Euclidean distance in the image plane. Here, we set the distance threshold to 20 pixels in the evaluation. Notably, the success score is used as the primary metric for ranking methods.
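As an illustration, the two metrics can be computed from per-frame predicted and ground-truth boxes in (x, y, w, h) format as in the following sketch; the threshold sampling and tie handling of the official evaluation toolkit may differ.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    xa1, ya1, wa, ha = box_a
    xb1, yb1, wb, hb = box_b
    xa2, ya2 = xa1 + wa, ya1 + ha
    xb2, yb2 = xb1 + wb, yb1 + hb
    iw = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    ih = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def success_score(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 101)):
    """Area under the success plot: mean fraction of frames whose IoU exceeds each threshold."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return np.mean([(overlaps > t).mean() for t in thresholds])

def precision_score(pred_boxes, gt_boxes, dist_threshold=20.0):
    """Fraction of frames whose predicted centre lies within `dist_threshold` pixels of the ground truth."""
    def centre(b):
        x, y, w, h = b
        return np.array([x + w / 2.0, y + h / 2.0])
    dists = np.array([np.linalg.norm(centre(p) - centre(g))
                      for p, g in zip(pred_boxes, gt_boxes)])
    return (dists <= dist_threshold).mean()
```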

Fig. 2. (a) The number of frames vs. the aspect ratio (height divided by width) change rate with respect to the first frame, (b) the number of frames vs. the area change rate with respect to the first frame, and (c) the distributions of the number of frames of video clips, in the training, validation, and testing sets for single-object tracking.

3.3 Trackers Submitted

We received 17 entries from 26 different institutes in the VisDrone-SOT2018 challenge. The VisDrone committee additionally evaluated 5 baseline trackers with their default parameters on the VisDrone-SOT2018 dataset; if default parameters were not available, reasonable values were used for evaluation. Thus, in total, 22 algorithms are included in the single-object tracking task of the VisDrone2018 challenge. In the following, we briefly overview the submitted algorithms; their detailed descriptions are provided in Appendix A.

Fig. 3. The success and precision plots of the submitted trackers. The success and precision scores of each tracker are presented in the legend.

Among the submitted algorithms, four trackers are built upon the correlation filter algorithm, including CFWCRKF (A.3), CKCF (A.6), DCST (A.16), and STAPLE_SRCA (A.17). Four trackers, i.e., C3DT (A.4), VITALD (A.5), DeCom (A.8), and BTT (A.10), are developed based on the MDNet [46] algorithm, the winner of the VOT2015 challenge [31]. Seven trackers combine CNN models with the correlation filter algorithm, namely OST (A.1), CFCNN (A.7), TRACA+ (A.9), LZZ-ECO (A.11), SECFNet (A.12), SDRCO (A.14), and DCFNet (A.15), where OST (A.1), CFCNN (A.7), and LZZ-ECO (A.11) apply object detectors to conduct target re-detection. One tracker (i.e., AST (A.2)) is based on a saliency map, and another tracker (i.e., IMT3 (A.13)) is based on normalized cross-correlation.

3.4 Overall Performance

The overall success and precision plots of all submissions are shown in Fig. 3. Meanwhile, we also report the success and precision scores, tracking speed, implementation details, pre-training dataset, and reference of each method in Table 2. As shown in Table 2 and Appendix A, we find that the majority of the top 5 trackers rely on deep CNN models. LZZ-ECO (A.11) employs the deep detector YOLOv3 [48] as the re-detection module and uses the ECO [9] algorithm as the tracking module, achieving the best results among all 22 submitted trackers. VITALD (A.5) (rank 2), BTT (A.10) (rank 4), and DeCom (A.8) (rank 5) are all improved from the MDNet [46] algorithm, and VITALD (A.5) fine-tunes the state-of-the-art object detector RefineDet [68] on the VisDrone-SOT2018 training set to re-detect the target and mitigate the drifting problem in tracking. Only STAPLE_SRCA (A.17) (rank 3) among the top 5 is a variant of the correlation filter integrated with context information. SDRCO (A.14) (rank 6) is an improved version of the correlation filter based tracker CFWCR [24], which uses the ResNet50 [23] network to extract discriminative features. AST (A.2) (rank 7) calculates a saliency map via the aggregation signature for target re-detection, which is effective for tracking small targets. CFCNN (A.7) combines multiple BACF trackers [22] with a CNN model (i.e., VGG16) by accumulating the weighted responses of both trackers, and ranks 8th among the 22 submissions. Notably, most of the submitted trackers are improved from recent (post-2015) work published at leading computer vision conferences and journals.

Fig. 4. The success plots of the submitted trackers for different attributes (e.g., aspect ratio change, background clutter, and camera motion). The number in each title indicates the number of sequences with that attribute.

4 Results and Analysis

According to the success scores, the best tracker is LZZ-ECO (A.11), followed by the VITALD method (A.5). STAPLE_SRCA (A.17) performs slightly worse, with a gap of \(0.9\%\). In terms of precision scores, LZZ-ECO (A.11) also performs the best. The second and third best trackers based on the precision score are STAPLE_SRCA (A.17) and VITALD (A.5). It is worth pointing out that the top two trackers combine a state-of-the-art object detector (e.g., YOLOv3 [48] or RefineDet [68]) for target re-detection with an accurate tracking algorithm (e.g., ECO [9] or VITAL [54]) for object tracking.

In addition, the baseline trackers (i.e., KCF (A.18), Staple (A.19), ECO (A.20), MDNet (A.21), and SRDCF (A.22)) submitted by the VisDrone committee rank at the lower middle level of all 22 submissions based on the success and precision scores. This demonstrates that the submitted methods achieve significant improvements over the baseline algorithms.

Fig. 5. The precision plots of the submitted trackers for different attributes (e.g., aspect ratio change, background clutter, and camera motion). The number in each title indicates the number of sequences with that attribute.

4.1 Performance Analysis by Attributes

Similar to [43], we annotate each sequence with 12 attributes and construct subsets with different dominant attributes, which facilitates the analysis of tracker performance under different challenging factors. We show the performance of each tracker on the 12 attribute subsets in Figs. 4 and 5. We present the descriptions of the 12 attributes used in the evaluation, and report the median success and precision scores of all 22 submissions under the different attributes, in Table 3. We find that the most challenging attributes in terms of the success score are Similar Object (\(36.1\%\)), Background Clutter (\(41.2\%\)), and Out-of-View (\(41.5\%\)).

Table 2. Comparison of all submissions in the VisDrone-SOT2018 challenge. The success score, precision score, tracking speed (in FPS), implementation details (M indicates Matlab, P indicates Python, and G indicates GPU), pre-trained dataset (I indicates ImageNet, L indicates ILSVRC, P indicates PASCAL VOC, V indicates the VisDrone-SOT2018 training set, O indicates other additional datasets, and \({\times }\) indicates that the methods do not use the pre-trained datasets) and the references are reported. The \(*\) mark indicates the methods submitted by the VisDrone committee.
Table 3. Attributes used to characterize each sequence from the drone-based tracking perspective. The median success and precision scores under different attributes of all 22 submissions are reported to describe the tracking difficulties. The three most challenging attributes are presented in bold, italic and bolditalic fonts, respectively.

As shown in Figs. 4 and 5, LZZ-ECO (A.11) achieves the best performance in all 12 attribute subsets, while the second-best tracker varies across attributes. Specifically, VITALD (A.5) achieves the second best success score on the Aspect Ratio Change, Camera Motion, Fast Motion, Illumination Variation, Out-of-View, and Scale Variation attributes. We speculate that the object detection module in VITALD is effective in re-detecting the target, mitigating the drift problem and producing more accurate results. STAPLE_SRCA (A.17) performs second best on the Background Clutter, Full Occlusion, Low Resolution, Partial Occlusion, and Similar Object attributes, which demonstrates the effectiveness of the proposed sparse response context-aware correlation filters. BTT (A.10) performs worse than only LZZ-ECO (A.11) on the Viewpoint Change attribute, which benefits from its backtracking-term, short-term, and long-term model updating mechanism based on discriminative training samples.

We also report the comparison between the MDNet and ECO trackers on the subsets of different attributes in Fig. 6. The MDNet and ECO trackers are two popular methods in the single-object tracking field. We believe this analysis is important to understand the progress of tracking algorithms on the drone-based platform. As shown in Fig. 6, ECO achieves favorable performance against MDNet on the subsets of the fast motion (FM), illumination variation (IV), and low resolution (LR) attributes, while MDNet performs better than ECO on the other attribute subsets. In general, the deep CNN based MDNet is able to produce more accurate results than ECO. However, the ECO tracker still has some advantages worth learning from. For the FM subset, it is difficult for MDNet to train a reliable model using such limited training data. To address this issue, BTT (A.10) uses an extra backtracking-term updating strategy when the tracking score is not reliable. For the IV subset, ECO constructs a compact appearance representation of the target to prevent overfitting, producing better performance than MDNet. For the LR subset, the appearance of a small object is no longer informative after several convolutional layers, resulting in inferior performance of deep CNN based methods (e.g., MDNet and VITALD (A.5)). Improved from MDNet, DeCoM (A.8) introduces an auxiliary tracking algorithm based on color template matching when the deep tracker fails. It seems that the color cue is effective in distinguishing small objects.

Fig. 6. Comparison of the MDNet and ECO algorithms on each attribute subset. The x-axis shows the abbreviations of the 12 attributes, and the y-axis shows the success scores of MDNet and ECO.

4.2 Discussion

Compared to previous single-object tracking datasets and benchmarks, such as OTB100 [66], VOT2016 [29], and UAV123 [43], the VisDrone-SOT2018 dataset involves very wide viewpoints, small objects, and fast camera motion, which places higher demands on single-object tracking algorithms. To make trackers more effective in such scenarios, there are several directions worth exploring, described as follows.

  • Object detector based target re-identification. Since the target appearance changes easily in the drone view, it is quite difficult for traditional trackers to describe the appearance variations accurately over a long time. State-of-the-art object detectors, such as YOLOv3 [48], R-FCN [8], and RefineDet [68], are able to help trackers recover from the drifting problem and generate more accurate results, especially for targets with large deformation or under fast camera motion (a schematic sketch of such a detector-assisted tracking loop is given after this list). For example, LZZ-ECO (A.11) outperforms the ECO (A.20) tracker by a large margin, i.e., it achieves a \(19\%\) higher success score and a \(26.8\%\) higher precision score.

  • Searching region. Since the video sequences in the VisDrone-SOT2018 dataset often involve wide viewpoints, it is critical to expand the search region to ensure that the target can still be found by the tracker, even when fast motion or occlusion occurs. For example, BTT (A.10) achieves \(8.1\%\) and \(7.3\%\) higher success and precision scores, respectively, compared to MDNet (A.21).

  • Spatio-temporal context. The majority of CNN-based trackers only consider the appearance features of individual video frames, and can hardly benefit from the consistent information contained in consecutive frames. Spatio-temporal context information is useful to improve the robustness of trackers, as in the optical flow [1], RNN [61], and 3DCNN [58] algorithms. In addition, the spatio-temporal regularized correlation filter (e.g., DCST (A.16)) is another effective way to deal with appearance variations by exploiting spatio-temporal information.

  • Multi-modal features. It is important for trackers to employ multiple types of features (e.g., deep features, texture features, and color features) to improve robustness in the different scenarios encountered in tracking. The comparison between DeCoM (A.8) and MDNet (A.21) shows that integrating different features is very useful for improving tracking accuracy. Moreover, appropriately weighting the responses of correlation filters is effective for the tracking task (see SDRCO (A.14)).

  • Long-term and short-term updating. During the tracking process, foreground and background samples are usually exploited to update the appearance model to prevent drifting when fast motion or occlusion occurs. Long-term and short-term updates are commonly used to capture gradual and instantaneous variations of object appearance, respectively (see LZZ-ECO (A.11)). It is important to design an appropriate updating mechanism covering both long-term and short-term updates for better performance.
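To make the first direction above concrete, the sketch below shows a schematic detector-assisted tracking loop in the spirit of LZZ-ECO (A.11) and OST (A.1). The `tracker` and `detector` objects, their methods (`init`, `update`, `detect`), and the confidence threshold are hypothetical interfaces introduced only for illustration; they do not correspond to the actual implementations of the submitted methods.

```python
def expand_region(box, factor, frame_shape):
    """Enlarge (x, y, w, h) around its centre by `factor`, clipped to the frame."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    nw, nh = w * factor, h * factor
    H, W = frame_shape[:2]
    nx = max(0.0, min(cx - nw / 2.0, W - nw))
    ny = max(0.0, min(cy - nh / 2.0, H - nh))
    return (nx, ny, min(nw, W), min(nh, H))

def detector_assisted_tracking(frames, init_box, tracker, detector,
                               conf_threshold=0.3, expand=2.0):
    """Schematic tracking loop with detector-based re-detection.

    `tracker.update` is assumed to return a box and a confidence score per
    frame; `detector.detect` is assumed to return candidate objects with
    `.box` and `.score` attributes inside a given search region.
    """
    tracker.init(frames[0], init_box)
    box = init_box
    results = [init_box]
    for frame in frames[1:]:
        box, conf = tracker.update(frame)
        if conf < conf_threshold:
            # Tracker is unreliable: expand the search region around the last
            # estimate and ask the detector for candidates.
            region = expand_region(box, expand, frame.shape)
            candidates = detector.detect(frame, region)
            if candidates:
                # Re-initialize the tracker on the most confident detection.
                box = max(candidates, key=lambda c: c.score).box
                tracker.init(frame, box)
        results.append(box)
    return results
```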

5 Conclusions

In this paper, we give a brief review of the VisDrone-SOT2018 challenge. The challenge releases a dataset formed by 132 video sequences, i.e., 86 sequences with 69,941 frames for training, 11 sequences with 7,046 frames for validation, and 35 sequences with 29,367 frames for testing. We provide fully annotated bounding boxes of targets as well as several useful attributes, e.g., occlusion, background clutter, and camera motion. A total of 22 trackers have been evaluated on the collected dataset, a large percentage of which are inspired by state-of-the-art algorithms. The top three trackers are LZZ-ECO (A.11), VITALD (A.5), and STAPLE_SRCA (A.17), achieving success scores of 68.0, 62.8, and 61.9, respectively.

We were glad to successfully organize the VisDrone-SOT2018 challenge in conjunction with ECCV 2018 in Munich, Germany. A large number of researchers participated in the workshop to share their research progress. The workshop not only serves as a meeting place for researchers in this area but also highlights major issues and potential opportunities. We believe the released dataset allows for the development and comparison of algorithms in the single-object tracking field, and the workshop challenge provides a way to track their progress. Our future work will focus on improving the dataset and the evaluation kit based on feedback from the community.