1 Introduction

Drones, or general UAVs, equipped with cameras have been rapidly deployed in a wide range of applications, including agriculture, aerial photography, fast delivery, and surveillance. Consequently, automatic understanding of visual data collected by drones is in high demand, bringing computer vision and drones ever closer together. Despite great progress in general computer vision algorithms, such as tracking and detection, these algorithms are usually not optimal for dealing with drone-captured sequences or images, due to various challenges such as viewpoint changes and scale variations.

Developing and evaluating new vision algorithms for drone-generated visual data is a key problem in drone-based applications. However, as pointed out in recent studies (e.g., [26, 43]), the lack of public large-scale benchmarks or datasets is the bottleneck to achieving this goal. Some recent preliminary efforts [26, 43, 49] have been devoted to constructing datasets with drone platforms focusing on single-object tracking. These datasets are still limited in size and in the scenarios covered, due to the difficulties in data collection and annotation. Thus, a more general and comprehensive benchmark is desired to further boost research on computer vision problems with drones.

To advance development in single-object tracking, we organize the Vision Meets Drone Single-Object Tracking (VisDrone-SOT2018) challenge, which is one track of the “Vision Meets Drone: A Challenge” workshop held on September 8, 2018, in conjunction with the 15th European Conference on Computer Vision (ECCV 2018) in Munich, Germany. In particular, we collected a single-object tracking dataset with various drone models (e.g., DJI Mavic and Phantom series 3, 3A) in different scenarios and under various weather and lighting conditions. All video sequences are labelled per-frame with different visual attributes to aid a less biased analysis of the tracking results. The objects to be tracked are of various types, including pedestrians, cars, buses, and sheep. We invited authors to submit their tracking results on the VisDrone-SOT2018 dataset. The authors of algorithms submitted to the challenge have the opportunity to share their ideas at the workshop and to further publish their source code at our website: http://www.aiskyeye.com/, which helps push forward the development of the single-object tracking field.

2 Related Work

Single-object tracking, or visual tracking, is one of the fundamental problems in computer vision, which aims to estimate the trajectory of a target in a video sequence, given its initial state. In this section, we briefly review related datasets and recent tracking algorithms.

Existing Datasets. In recent years, numerous datasets have been developed for single-object tracking. Wu et al. [65] create a standard benchmark to evaluate single-object tracking algorithms, which includes 50 video sequences. After that, they further extend the dataset to 100 video sequences. Concurrently, Liang et al. [36] collect 128 video sequences for evaluating color-enhanced trackers. To track progress in the single-object tracking field, Kristan et al. [29,30,31, 56] organize the VOT competitions from 2013 to 2018, where new datasets and evaluation strategies are proposed for tracking evaluation. This series of competitions has promoted the development of visual tracking. Smeulders et al. [52] present the ALOV300 dataset, containing 314 video sequences with 14 visual attributes, such as long duration, zooming camera, moving camera, and transparency. Li et al. [32] construct a large-scale dataset with 365 video sequences, covering 12 different kinds of objects captured from moving cameras. Du et al. [15] design a dataset with 50 fully annotated video sequences, focusing on deformable object tracking in unconstrained environments. To evaluate tracking algorithms in high frame rate videos (e.g., 240 frames per second), Galoogahi et al. [21] propose a dataset containing 100 video clips (380,000 frames in total), recorded in real-world scenarios. Besides video sequences captured by RGB cameras, Felsberg et al. [20, 30, 57] organize a series of competitions from 2015 to 2017, focusing on single-object tracking in thermal video sequences recorded by 8 different types of sensors. In [53], an RGB-D tracking dataset is presented, which includes 100 RGB-D video clips with manually annotated ground truth bounding boxes. UAV123 [43] is a large UAV dataset including 123 fully annotated high-resolution video sequences captured from low-altitude aerial viewpoints. Similarly, UAVDT [16] is a UAV benchmark focusing on several different complex scenarios. Müller et al. [45] present a large-scale benchmark for object tracking in the wild, which includes more than 30,000 videos with more than 14 million dense bounding box annotations. Recently, Fan et al. [18] propose a large tracking benchmark with 1,400 videos, with each frame manually annotated. Most of the above datasets cover a large set of object categories but, unlike our dataset, do not focus on drone-based scenarios.

Review of Recent Single-Object Tracking Methods. Single-object tracking is a hot topic with various applications (e.g., video surveillance, behavior analysis, and human-computer interaction). It has attracted much research based on graph models [4, 15, 35, 64], subspace learning [28, 50, 62, 63], and sparse coding [39, 42, 47, 69]. Recently, correlation filter algorithms have become popular in the visual tracking field due to their high efficiency. Henriques et al. [25] derive a kernelized correlation filter and propose a fast multi-channel extension of linear correlation filters using a linear kernel. Danelljan et al. [10] propose to learn discriminative correlation filters based on a scale pyramid representation to improve tracking performance. To model the distribution of feature attention, Choi et al. [7] develop an attentional feature-based correlation filter built upon multiple trained elementary trackers. The Staple method [2] achieves a large gain in performance over previous methods by combining color statistics with a correlation filter. Danelljan et al. [11] demonstrate that learning the correlation filter coefficients with spatial regularization is effective for the tracking task. Li et al. [34] integrate temporal regularization into the SRDCF framework [11] with a single sample, and propose spatial-temporal regularized correlation filters to provide a more robust appearance model in the case of large appearance variations. Du et al. [17] design a correlation filter based method that integrates target part selection, part matching, and state estimation into a unified energy minimization framework.
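For readers unfamiliar with the correlation filter framework, the following is a minimal single-channel, MOSSE-style sketch in NumPy; it assumes grayscale patches and omits the cosine window, multi-channel features, scale estimation, and online updates used by the methods cited above, so it illustrates the general idea only and does not reproduce any particular tracker.

```python
import numpy as np

def gaussian_response(shape, sigma=2.0):
    """Desired correlation output: a 2-D Gaussian peaked at the patch centre."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def train_filter(patch, target, lam=1e-2):
    """Closed-form filter in the Fourier domain: H* = (G F*) / (F F* + lambda)."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def track(patch, H_conj):
    """Correlate a new search patch with the filter and return the peak offset."""
    Z = np.fft.fft2(patch)
    response = np.real(np.fft.ifft2(H_conj * Z))
    py, px = np.unravel_index(np.argmax(response), response.shape)
    h, w = response.shape
    # Offset of the peak from the patch centre approximates the target
    # displacement (ignoring circular wrap-around).
    dy = py - (h - 1) / 2.0
    dx = px - (w - 1) / 2.0
    return dy, dx, response.max()
```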

On the other hand, deep learning based methods have achieved a dominant position in the single-object tracking field with impressive performance. Some methods directly use deep Convolutional Neural Networks (CNNs) to extract features that replace the hand-crafted features in the correlation filter framework, such as CF2 [41], C-COT [13], ECO [9], CFNet [59], and PTAV [19]. In [60], different types of features are combined to construct multiple experts using the discriminative correlation filter algorithm, and each of them tracks the target independently. With the proposed robustness evaluation strategy, the most confident expert is selected to produce the tracking result in each frame. Another line of work constructs end-to-end deep models that jointly learn target appearance and perform tracking [3, 14, 33, 46, 54, 55, 67]. In SiamFC [3] and SINT [55], the researchers employ siamese deep neural networks to learn a matching function between the initial patch of the target in the first frame and candidates in the subsequent frames. Li et al. [33] propose the siamese region proposal network, which consists of a siamese sub-network for feature extraction and a region proposal sub-network for classification and regression. MDNet [46] pre-trains a CNN model on a large set of video sequences with manually annotated ground truths to obtain a generic target representation, and then evaluates candidate windows randomly sampled around the previous target state to find the optimal location for tracking. After that, Song et al. [54] present the VITAL algorithm to generate more discriminative training samples via adversarial learning. Yun et al. [67] design a tracker controlled by sequentially pursued actions learned via deep reinforcement learning. Dong et al. [14] propose a hyperparameter optimization method that is able to find the optimal hyperparameters for a given sequence using an action-prediction network based on continuous deep Q-learning.

3 The VisDrone-SOT2018 Challenge

As described above, to track and promote progress in the single-object tracking field, we organized the Vision Meets Drone Single-Object Tracking (or VisDrone-SOT2018, for short) challenge, which is one track of the workshop challenge “Vision Meets Drone: A Challenge” held on September 8, 2018, in conjunction with the 15th European Conference on Computer Vision (ECCV 2018) in Munich, Germany. The VisDrone-SOT2018 challenge focuses on single-object tracking on the drone platform. Specifically, given an initial bounding box enclosing the target in the first frame, the submitted algorithm is required to estimate the region of the target in the subsequent video frames. We released a single-object tracking dataset, i.e., the VisDrone-SOT2018 dataset, which consists of 132 video sequences formed by 106,354 frames, captured by various drone-mounted cameras, covering a wide range of aspects including location (taken from 14 different cities in China), environment (urban and country), objects (pedestrians, vehicles, bicycles, etc.), and density (sparse to crowded scenes). We invited researchers to participate in the challenge and to evaluate and discuss their research on the VisDrone-SOT2018 dataset at the workshop. We believe the workshop challenge will be helpful to research in the video object tracking community (Table 1).

Table 1. Comparison of Current State-of-the-Art Benchmarks and Datasets. Note that the resolution indicates the maximum resolution of the video frames included in the dataset. Notably, we have \(1k=1,000\).

3.1 Dataset

The VisDrone-SOT2018 dataset released in this workshop includes 132 video clips with 106,354 frames, divided into three non-overlapping subsets, i.e., the training set (86 sequences with 69,941 frames), the validation set (11 sequences with 7,046 frames), and the testing set (35 sequences with 29,367 frames). The video clips in these three subsets are taken at different locations, but share similar environments and attributes. The dataset is collected in various real-world scenarios by various drone platforms (i.e., different drone models) under various weather and lighting conditions, which helps researchers develop algorithms that perform well in real-world scenarios. We manually annotated the bounding boxes of targets (e.g., pedestrians, dogs, and vehicles) as well as several useful attributes (e.g., occlusion, background clutter, and camera motion) for algorithm analysis. We present the number of frames vs. the aspect ratio (i.e., object height divided by width) change rate with respect to the first frame in Fig. 2(a), and show the number of frames vs. the area change rate with respect to the first frame in Fig. 2(b). We plot the distributions of the number of frames of video clips in the training, validation, and testing sets in Fig. 2(c). In addition, some annotated examples from the VisDrone-SOT2018 dataset are presented in Fig. 1.
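Assuming each annotation is stored as an axis-aligned box (x, y, w, h), the per-frame statistics plotted in Fig. 2(a) and (b) can be computed as in the sketch below; the exact definitions used to produce the figures may differ slightly.

```python
def aspect_ratio_change_rate(boxes):
    """Aspect ratio (height / width) of each frame relative to the first frame.

    `boxes` is a list of (x, y, w, h) tuples, one per frame.
    """
    _, _, w0, h0 = boxes[0]
    r0 = h0 / w0
    return [(h / w) / r0 for (_, _, w, h) in boxes]

def area_change_rate(boxes):
    """Bounding-box area of each frame relative to the first frame."""
    _, _, w0, h0 = boxes[0]
    a0 = w0 * h0
    return [(w * h) / a0 for (_, _, w, h) in boxes]
```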

Fig. 1. Some annotated example video frames for single-object tracking. The first frame, with the bounding box of the target object, is shown for each sequence.

3.2 Evaluation Protocol

Following the evaluation methodology in [66], we use the success and precision scores to evaluate the performance of the trackers. The success score is defined as the area under the success plot. That is, for each bounding box overlap threshold \(t_o\) in the interval \([0, 1]\), we compute the percentage of successfully tracked frames (frames whose overlap exceeds \(t_o\)), which generates the plot of successfully tracked frames vs. bounding box overlap threshold. The overlap between the tracker prediction \(B_t\) and the ground truth bounding box \(B_g\) is defined as \({ O}=\frac{|B_t\bigcap {B}_g|}{|B_t\bigcup {B}_g|}\), where \(\bigcap \) and \(\bigcup \) represent the intersection and union of the two regions, respectively, and \(|\cdot |\) counts the number of pixels in a region. Meanwhile, the precision score is defined as the percentage of frames whose estimated location is within a given threshold distance of the ground truth, based on the Euclidean distance in the image plane. Here, we set the distance threshold to 20 pixels in the evaluation. Notably, the success score is used as the primary metric for ranking methods.
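As an illustration, the two metrics can be computed from per-frame predicted and ground-truth boxes in (x, y, w, h) format as in the following sketch; the threshold sampling and tie handling of the official evaluation toolkit may differ.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    xa1, ya1, wa, ha = box_a
    xb1, yb1, wb, hb = box_b
    xa2, ya2 = xa1 + wa, ya1 + ha
    xb2, yb2 = xb1 + wb, yb1 + hb
    iw = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    ih = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def success_score(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 101)):
    """Area under the success plot: mean fraction of frames whose IoU exceeds each threshold."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return np.mean([(overlaps > t).mean() for t in thresholds])

def precision_score(pred_boxes, gt_boxes, dist_threshold=20.0):
    """Fraction of frames whose predicted centre lies within `dist_threshold` pixels of the ground truth."""
    def centre(b):
        x, y, w, h = b
        return np.array([x + w / 2.0, y + h / 2.0])
    dists = np.array([np.linalg.norm(centre(p) - centre(g))
                      for p, g in zip(pred_boxes, gt_boxes)])
    return (dists <= dist_threshold).mean()
```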

Fig. 2. (a) The number of frames vs. the aspect ratio (height divided by width) change rate with respect to the first frame, (b) the number of frames vs. the area change rate with respect to the first frame, and (c) the distributions of the number of frames of video clips, in the training, validation, and testing sets for single-object tracking.

3.3 Trackers Submitted

We received 17 entries from 26 different institutes in the VisDrone-SOT2018 challenge. The VisDrone committee additionally evaluated 5 baseline trackers with their default parameters on the VisDrone-SOT2018 dataset; if default parameters were not available, reasonable values were used for evaluation. Thus, in total, 22 algorithms are included in the single-object tracking task of the VisDrone2018 challenge. In the following, we briefly overview the submitted algorithms; their detailed descriptions are provided in Appendix A.

Fig. 3. The success and precision plots of the submitted trackers. The success and precision scores of each tracker are presented in the legend.

Among the submitted algorithms, four trackers are built upon the correlation filter algorithm, including CFWCRKF (A.3), CKCF (A.6), DCST (A.16), and STAPLE_SRCA (A.17). Four trackers, i.e., C3DT (A.4), VITALD (A.5), DeCom (A.8), and BTT (A.10), are developed based on the MDNet [46] algorithm, the winner of the VOT2015 challenge [31]. Seven trackers combine CNN models with the correlation filter algorithm, namely OST (A.1), CFCNN (A.7), TRACA+ (A.9), LZZ-ECO (A.11), SECFNet (A.12), SDRCO (A.14), and DCFNet (A.15), where OST (A.1), CFCNN (A.7), and LZZ-ECO (A.11) apply object detectors to conduct target re-detection. One tracker (i.e., AST (A.2)) is based on a saliency map, and another tracker (i.e., IMT3 (A.13)) is based on normalized cross-correlation.

3.4 Overall Performance

The overall success and precision plots of all submissions are shown in Fig. 3. Meanwhile, we also report the success and precision scores, tracking speed, implementation details, pre-training dataset, and reference of each method in Table 2. As shown in Table 2 and Appendix A, we find that the majority of the top 5 trackers rely on deep CNN models. LZZ-ECO (A.11) employs the deep detector YOLOv3 [48] as the re-detection module and uses the ECO [9] algorithm as the tracking module, achieving the best results among all 22 submitted trackers. VITALD (A.5) (rank 2), BTT (A.10) (rank 4), and DeCom (A.8) (rank 5) are all improved from the MDNet [46] algorithm, and VITALD (A.5) fine-tunes the state-of-the-art object detector RefineDet [68] on the VisDrone-SOT2018 training set to re-detect the target and mitigate the drifting problem in tracking. Only STAPLE_SRCA (A.17) (rank 3) among the top 5 is a variant of the correlation filter integrated with context information. SDRCO (A.14) (rank 6) is an improved version of the correlation filter based tracker CFWCR [24], which uses the ResNet50 [23] network to extract discriminative features. AST (A.2) (rank 7) calculates a saliency map via the aggregation signature for target re-detection, which is effective for tracking small targets. CFCNN (A.7) combines multiple BACF trackers [22] with a CNN model (i.e., VGG16) by accumulating the weighted responses of both trackers, and ranks 8th among the 22 submissions. Notably, most of the submitted trackers are improved from recent (post-2015) work published at leading computer vision conferences and journals.

Fig. 4. The success plots of the submitted trackers for different attributes (e.g., aspect ratio change, background clutter, and camera motion). The number in each title indicates the number of sequences with that attribute.

4 Results and Analysis

According to the success scores, the best tracker is LZZ-ECO (A.11), followed by the VITALD method (A.5). STAPLE_SRCA (A.17) performs slightly worse, with a gap of \(0.9\%\). In terms of precision scores, LZZ-ECO (A.11) also performs the best. The second and third best trackers based on the precision score are STAPLE_SRCA (A.17) and VITALD (A.5). It is worth pointing out that the top two trackers combine a state-of-the-art object detector (e.g., YOLOv3 [48] or RefineDet [68]) for target re-detection with an accurate tracking algorithm (e.g., ECO [9] or VITAL [54]) for object tracking.

In addition, the baseline trackers (i.e., KCF (A.18), Staple (A.19), ECO (A.20), MDNet (A.21), and SRDCF (A.22)) submitted by the VisDrone committee rank at the lower middle level of all 22 submissions based on the success and precision scores. This demonstrates that the submitted methods achieve significant improvements over the baseline algorithms.

Fig. 5. The precision plots of the submitted trackers for different attributes (e.g., aspect ratio change, background clutter, and camera motion). The number in each title indicates the number of sequences with that attribute.

4.1 Performance Analysis by Attributes

Similar to [43], we annotate each sequence with 12 attributes and construct subsets with different dominant attributes, which facilitates the analysis of tracker performance under different challenging factors. We show the performance of each tracker on the 12 attribute subsets in Figs. 4 and 5. We present the descriptions of the 12 attributes used in the evaluation, and report the median success and precision scores of all 22 submissions under the different attributes, in Table 3. We find that the most challenging attributes in terms of the success score are Similar Object (\(36.1\%\)), Background Clutter (\(41.2\%\)), and Out-of-View (\(41.5\%\)).

Table 2. Comparison of all submissions in the VisDrone-SOT2018 challenge. The success score, precision score, tracking speed (in FPS), implementation details (M indicates Matlab, P indicates Python, and G indicates GPU), pre-trained dataset (I indicates ImageNet, L indicates ILSVRC, P indicates PASCAL VOC, V indicates the VisDrone-SOT2018 training set, O indicates other additional datasets, and \({\times }\) indicates that the methods do not use the pre-trained datasets) and the references are reported. The \(*\) mark indicates the methods submitted by the VisDrone committee.
Table 3. Attributes used to characterize each sequence from the drone-based tracking perspective. The median success and precision scores under different attributes of all 22 submissions are reported to describe the tracking difficulties. The three most challenging attributes are presented in bold, italic and bolditalic fonts, respectively.

As shown in Figs. 4 and 5, LZZ-ECO (A.11) achieves the best performance in all 12 attribute subsets, while the second-best tracker varies across attributes. Specifically, VITALD (A.5) achieves the second best success score on the Aspect Ratio Change, Camera Motion, Fast Motion, Illumination Variation, Out-of-View, and Scale Variation attributes. We speculate that the object detection module in VITALD is effective in re-detecting the target, mitigating the drift problem and producing more accurate results. STAPLE_SRCA (A.17) performs second best on the Background Clutter, Full Occlusion, Low Resolution, Partial Occlusion, and Similar Object attributes, which demonstrates the effectiveness of the proposed sparse response context-aware correlation filters. BTT (A.10) performs worse than only LZZ-ECO (A.11) on the Viewpoint Change attribute, which benefits from its backtracking-term, short-term, and long-term model updating mechanism based on discriminative training samples.

We also report the comparison between the MDNet and ECO trackers on the subsets of different attributes in Fig. 6. The MDNet and ECO trackers are two popular methods in the single-object tracking field. We believe this analysis is important to understand the progress of tracking algorithms on the drone-based platform. As shown in Fig. 6, ECO achieves favorable performance against MDNet on the subsets of the fast motion (FM), illumination variation (IV), and low resolution (LR) attributes, while MDNet performs better than ECO on the other attribute subsets. In general, the deep CNN based MDNet is able to produce more accurate results than ECO. However, the ECO tracker still has some advantages worth learning from. For the FM subset, it is difficult for MDNet to train a reliable model using such limited training data. To address this issue, BTT (A.10) uses an extra backtracking-term updating strategy when the tracking score is not reliable. For the IV subset, ECO constructs a compact appearance representation of the target to prevent overfitting, producing better performance than MDNet. For the LR subset, the appearance of a small object is no longer informative after several convolutional layers, resulting in inferior performance of deep CNN based methods (e.g., MDNet and VITALD (A.5)). Improved from MDNet, DeCoM (A.8) introduces an auxiliary tracking algorithm based on color template matching when the deep tracker fails. It seems that the color cue is effective in distinguishing small objects.

Fig. 6. Comparison of the MDNet and ECO algorithms on each attribute subset. The x-axis shows the abbreviations of the 12 attributes, and the y-axis shows the success scores of MDNet and ECO.

4.2 Discussion

Compared to previous single-object tracking datasets and benchmarks, such as OTB100 [66], VOT2016 [29], and UAV123 [43], the VisDrone-SOT2018 dataset involves very wide viewpoints, small objects, and fast camera motion, which places higher demands on single-object tracking algorithms. To make trackers more effective in such scenarios, there are several directions worth exploring, described as follows.

  • Object detector based target re-identification. Since the target appearance changes easily in the drone view, it is quite difficult for traditional trackers to describe the appearance variations accurately over a long time. State-of-the-art object detectors, such as YOLOv3 [48], R-FCN [8], and RefineDet [68], are able to help trackers recover from the drifting problem and generate more accurate results, especially for targets with large deformation or under fast camera motion (a schematic sketch of such a detector-assisted tracking loop is given after this list). For example, LZZ-ECO (A.11) outperforms the ECO (A.20) tracker by a large margin, i.e., it achieves a \(19\%\) higher success score and a \(26.8\%\) higher precision score.

  • Searching region. Since the video sequences in the VisDrone-SOT2018 dataset often involve wide viewpoints, it is critical to expand the search region to ensure that the target can still be found by the tracker, even when fast motion or occlusion occurs. For example, BTT (A.10) achieves \(8.1\%\) and \(7.3\%\) higher success and precision scores, respectively, compared to MDNet (A.21).

  • Spatio-temporal context. The majority of CNN-based trackers only consider the appearance features of individual video frames, and can hardly benefit from the consistent information contained in consecutive frames. Spatio-temporal context information is useful to improve the robustness of trackers, as in the optical flow [1], RNN [61], and 3DCNN [58] algorithms. In addition, the spatio-temporal regularized correlation filter (e.g., DCST (A.16)) is another effective way to deal with appearance variations by exploiting spatio-temporal information.

  • Multi-modal features. It is important for trackers to employ multiple types of features (e.g., deep features, texture features, and color features) to improve robustness in the different scenarios encountered in tracking. The comparison between DeCoM (A.8) and MDNet (A.21) shows that integrating different features is very useful for improving tracking accuracy. Moreover, appropriately weighting the responses of correlation filters is effective for the tracking task (see SDRCO (A.14)).

  • Long-term and short-term updating. During the tracking process, foreground and background samples are usually exploited to update the appearance model to prevent drifting when fast motion or occlusion occurs. Long-term and short-term updates are commonly used to capture gradual and instantaneous variations of object appearance, respectively (see LZZ-ECO (A.11)). It is important to design an appropriate updating mechanism covering both long-term and short-term updates for better performance.
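To make the first direction above concrete, the sketch below shows a schematic detector-assisted tracking loop in the spirit of LZZ-ECO (A.11) and OST (A.1). The `tracker` and `detector` objects, their methods (`init`, `update`, `detect`), and the confidence threshold are hypothetical interfaces introduced only for illustration; they do not correspond to the actual implementations of the submitted methods.

```python
def expand_region(box, factor, frame_shape):
    """Enlarge (x, y, w, h) around its centre by `factor`, clipped to the frame."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    nw, nh = w * factor, h * factor
    H, W = frame_shape[:2]
    nx = max(0.0, min(cx - nw / 2.0, W - nw))
    ny = max(0.0, min(cy - nh / 2.0, H - nh))
    return (nx, ny, min(nw, W), min(nh, H))

def detector_assisted_tracking(frames, init_box, tracker, detector,
                               conf_threshold=0.3, expand=2.0):
    """Schematic tracking loop with detector-based re-detection.

    `tracker.update` is assumed to return a box and a confidence score per
    frame; `detector.detect` is assumed to return candidate objects with
    `.box` and `.score` attributes inside a given search region.
    """
    tracker.init(frames[0], init_box)
    box = init_box
    results = [init_box]
    for frame in frames[1:]:
        box, conf = tracker.update(frame)
        if conf < conf_threshold:
            # Tracker is unreliable: expand the search region around the last
            # estimate and ask the detector for candidates.
            region = expand_region(box, expand, frame.shape)
            candidates = detector.detect(frame, region)
            if candidates:
                # Re-initialize the tracker on the most confident detection.
                box = max(candidates, key=lambda c: c.score).box
                tracker.init(frame, box)
        results.append(box)
    return results
```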

5 Conclusions

In this paper, we give a brief review of the VisDrone-SOT2018 challenge. The challenge releases a dataset formed by 132 video sequences, i.e., 86 sequences with 69,941 frames for training, 11 sequences with 7,046 frames for validation, and 35 sequences with 29,367 frames for testing. We provide fully annotated bounding boxes of targets as well as several useful attributes, e.g., occlusion, background clutter, and camera motion. A total of 22 trackers have been evaluated on the collected dataset, a large percentage of which are inspired by state-of-the-art algorithms. The top three trackers are LZZ-ECO (A.11), VITALD (A.5), and STAPLE_SRCA (A.17), achieving success scores of 68.0, 62.8, and 61.9, respectively.

We were glad to successfully organize the VisDrone-SOT2018 challenge in conjunction with ECCV 2018 in Munich, Germany. A large number of researchers participated in the workshop to share their research progress. The workshop not only serves as a meeting place for researchers in this area but also highlights major issues and potential opportunities. We believe the released dataset allows for the development and comparison of algorithms in the single-object tracking field, and the workshop challenge provides a way to track their progress. Our future work will focus on improving the dataset and the evaluation kit based on feedback from the community.