
1 Introduction

Developing autonomous drone systems that can assist humans in everyday tasks, e.g., agriculture, aerial photography, fast delivery, and surveillance, is one of the grand challenges in computer science. An example is an autonomous drone system that helps farmers spray pesticides regularly. Consequently, automatic understanding of visual data collected from drone platforms is in high demand, bringing computer vision and drones increasingly closer together. Video object detection and tracking are critical steps in these applications and have attracted much research attention in recent years.

Several benchmark datasets have been proposed for video object detection and tracking, such as ImageNet-VID [43] and UA-DETRAC [30, 51] for object detection in videos, and KITTI [16] and MOTChallenge [25] for multi-object tracking, to promote developments in the related fields. However, the challenges posed by drone-captured video, such as large viewpoint changes and scale variations, differ considerably from those in these datasets, so existing video object detection and tracking algorithms are usually not optimal for video sequences captured by drones. As pointed out in recent studies (e.g., [20, 34]), progress in autonomous video object detection and tracking is seriously limited by the lack of public large-scale benchmarks and datasets. Some preliminary efforts [20, 34, 42] have been devoted to constructing datasets captured with drone platforms, but these remain limited in size and covered scenarios due to the difficulties of data collection and annotation. Thus, a more general and comprehensive benchmark is desired to further boost research on computer vision problems with drone platforms. Moreover, the thorough evaluation of existing and newly developed algorithms remains an open problem.

To this end, we organized a challenge workshop, “Vision Meets Drone Video Object Detection and Tracking” (VisDrone-VDT2018), held as part of the “Vision Meets Drone: A Challenge” (VisDrone2018) workshop on September 8, 2018, in conjunction with the 15th European Conference on Computer Vision (ECCV 2018) in Munich, Germany. This challenge focuses on two tasks, i.e., (1) video object detection and (2) multi-object tracking, which are described as follows.

  • Video object detection aims to detect objects of a predefined set of object categories (e.g., pedestrian, car, and van) from videos taken from drones.

  • Multi-object tracking aims to recover the object trajectories in video sequences.

We collected a large-scale video object detection and tracking dataset using several drone models, e.g., DJI Mavic and Phantom series (3, 3A), in various scenarios. The video sequences are taken at different locations but share similar environments and attributes.

We invite researchers to submit the results of their algorithms on the proposed VisDrone-VDT2018 dataset and to share their research at the workshop. We also present the evaluation protocol of the VisDrone-VDT2018 challenge and a comparison of the submitted algorithms on the benchmark dataset on the challenge website: www.aiskyeye.com/. The authors of the submitted algorithms have the opportunity to publish their source code on our website, which will help track and boost research on video object detection and tracking with drones.

2 Related Work

2.1 Existing Datasets and Benchmarks

The ILSVRC 2015 challenge [43] opened the “object detection in video” track, which contains a total of 3,862 snippets for training, 555 snippets for validation, and 937 snippets for testing. The YouTube-Object dataset [37] is another large-scale dataset for video object detection, which consists of 155 videos with over 720,152 frames covering 10 classes of moving objects. However, only 1,258 frames are annotated with a bounding box around an object instance. Based on this dataset, Kalogeiton et al. [23] further provide instance segmentation annotations for the YouTube-Object dataset.

Multi-object tracking is a hot topic in computer vision with many applications, such as surveillance, sports video analysis, and behavior analysis. Several datasets have been presented to promote developments in this field. The MOTChallenge team releases a series of datasets, i.e., MOT15 [25], MOT16 [31], and MOT17 [1], for multi-pedestrian tracking evaluation. Wen et al. [51] collect the UA-DETRAC dataset for multi-vehicle detection and tracking evaluation, which contains 100 challenging videos captured from real-world traffic scenes (over 140,000 frames with rich annotations, including illumination, vehicle type, occlusion, truncation ratio, and vehicle bounding boxes). Recently, Du et al. [12] construct a UAV dataset with approximately 80,000 fully annotated video frames and 14 different kinds of attributes (e.g., weather condition, flying altitude, vehicle category, and occlusion) for object detection, single-object tracking, and multi-object tracking evaluation. We summarize the related datasets in Table 1.

2.2 Brief Review of Video Object Detection Methods

Object detection has achieved significant improvements in recent years with the advent of convolutional neural networks (CNNs), such as R-CNN [17], Faster R-CNN [40], YOLO [38], SSD [29], and RefineDet [57]. However, the aforementioned methods focus on detecting objects in still images. Object detection accuracy in videos suffers from appearance deterioration that is seldom observed in still images, such as motion blur and video defocus. To that end, some previous methods are designed to detect specific classes of objects in videos, such as pedestrians [49] and cars [26]. Kang et al. [24] develop a multi-stage framework based on deep CNN detection and tracking for object detection in videos [43], which uses a tubelet proposal module that combines object detection and tracking for tubelet object proposals, and a tubelet classification and re-scoring module to incorporate temporal consistency. The Seq-NMS method [18] uses high-scoring object detections from nearby frames to boost the scores of weaker detections within the same clip, improving video detection accuracy. Zhu et al. [59] design an end-to-end learning framework for video object detection based on flow-guided feature aggregation and temporal coherence. Galteri et al. [14] connect detectors and object proposal generating functions to exploit the ordered and continuous nature of video sequences in a closed loop. Bertasius et al. [5] propose to spatially sample features from adjacent frames, which is robust to occlusion and motion blur in individual frames.

2.3 Brief Review of Multi-object Tracking Methods

Multi-object tracking aims to recover the target trajectories in video sequences. Most previous methods formulate tracking as a data association problem [11, 32, 36, 56]. Some methods [3, 9, 45, 55] attempt to learn the association affinity for better performance. In addition, Sadeghian et al. [44] design a Recurrent Neural Network (RNN) structure that jointly integrates multiple cues based on the appearance, motion, and interactions of objects over a temporal window. Wen et al. [52] formulate the multi-object tracking task as dense structure exploitation on a hypergraph, whose nodes are detections and whose hyperedges describe the corresponding high-order relations. Tang et al. [46] use a graph-based formulation that links and clusters person hypotheses over time by solving an instance of a minimum cost lifted multicut problem. Feichtenhofer et al. [13] set up a CNN architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression.

3 The VisDrone-VDT2018 Challenge

As described above, the VisDrone-VDT2018 challenge focuses on two tasks in computer vision, i.e., (1) video object detection and (2) multi-object tracking, which use the same video data. We release a large-scale video object detection and tracking dataset, including 79 video clips with approximately 1.5 million annotated bounding boxes in 33,366 frames. Some other useful annotations, such as object category, occlusion, and truncation ratios, are also provided for better data usage. Participants are expected to submit a single set of results per algorithm on the VisDrone-VDT2018 dataset. We also allow participants to submit the results of multiple different algorithms; however, variants obtained merely by changing algorithm parameters are not considered different algorithms. Notably, participants are allowed to use additional training data to optimize their models; the use of external data must be explained in the submission.

Table 1. Comparison of current state-of-the-art benchmarks and datasets. Note that the resolution indicates the maximum resolution of the videos/images included in each dataset.
Fig. 1. The number of objects with different occlusion degrees of each object category in the training, validation, and testing subsets for the video object detection and multi-object tracking tasks.

Fig. 2. The number of objects per frame vs. the percentage of video frames in the training, validation, and testing subsets for the video object detection and multi-object tracking tasks. The maximum, mean, and minimum numbers of objects per frame in the three subsets are given in the legend.

3.1 Dataset

The VisDrone-VDT2018 dataset consists of 79 challenging sequences with a total of 33,366 frames, which are divided into three non-overlapping subsets, i.e., a training set (56 video clips with 24,198 frames), a validation set (7 video clips with 2,846 frames), and a testing set (16 video clips with 6,322 frames). These video sequences are captured in different cities under various weather and lighting conditions. The manually generated annotations for the training and validation subsets are made available to users, while the annotations of the testing set are withheld to avoid (over)fitting of algorithms. The video sequences of the three subsets are captured at different locations but share similar environments and attributes. We focus on five object categories in this challenge, i.e., pedestrian, car, van, bus, and truck, and carefully annotate more than 1 million bounding boxes of object instances in the video sequences. Some annotated example frames are shown in Fig. 3. We present the number of objects with different occlusion degrees of each object category in the training, validation, and testing subsets in Fig. 1, and plot the number of objects per frame vs. the percentage of video frames in the three subsets in Fig. 2 to show the distribution of the number of objects per frame.

In addition, we provide occlusion and truncation ratio annotations for better data usage. Specifically, we annotate the occlusion relationships between objects and use the fraction of occluded pixels to define the occlusion ratio. Three degrees of occlusion are provided, i.e., no occlusion (occlusion ratio \(0\%\)), partial occlusion (occlusion ratio \(1\%\sim 50\%\)), and heavy occlusion (occlusion ratio \(>50\%\)). We also provide the truncation ratio of objects, which indicates the degree to which an object extends outside the frame. If an object is not fully captured within a frame, we label the bounding box inside the frame boundary and estimate the truncation ratio based on the region outside the image. A target trajectory is regarded as ended once its truncation ratio becomes larger than \(50\%\).
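The following Python sketch illustrates how these two annotation rules could be applied when processing raw annotations; the function names and the representation of a trajectory as a per-frame list of truncation ratios are our own illustrative assumptions, not part of the released toolkit.

    def occlusion_degree(occlusion_ratio):
        """Map an occlusion ratio in [0, 1] to one of the three annotated degrees."""
        if occlusion_ratio == 0.0:
            return "no occlusion"
        elif occlusion_ratio <= 0.5:
            return "partial occlusion"   # 1% ~ 50%
        else:
            return "heavy occlusion"     # > 50%

    def trajectory_end(truncation_ratios, threshold=0.5):
        """Return the index of the frame at which a trajectory is regarded as
        ended, i.e., the first frame whose truncation ratio exceeds 50%."""
        for t, ratio in enumerate(truncation_ratios):
            if ratio > threshold:
                return t
        return len(truncation_ratios)  # never truncated beyond the threshold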

3.2 Video Object Detection

Video object detection aims to locate object instances from a predefined set of five object categories in the videos. For this task, the participating algorithms are required to predict the bounding boxes of each predefined object class in each video frame.

Evaluation Protocol. For the video object detection task, each algorithm is required to produce bounding boxes of objects in every frame of each video clip. Motivated by the evaluation protocols of MS COCO [28] and the ILSVRC 2015 challenge [43], we use the AP\(^{\text {IoU}=0.50:0.05:0.95}\), AP\(^{\text {IoU}=0.50}\), AP\(^{\text {IoU}=0.75}\), AR\(^{\text {max}=1}\), AR\(^{\text {max}=10}\), AR\(^{\text {max}=100}\), and AR\(^{\text {max}=500}\) metrics to evaluate the results of the video detection algorithms. Specifically, AP\(^{\text {IoU}=0.50:0.05:0.95}\) is computed by averaging over all 10 intersection over union (IoU) thresholds (i.e., the range [0.50 : 0.95] with a uniform step size of 0.05) and all object categories, and is used as the primary metric for ranking. AP\(^{\text {IoU}=0.50}\) and AP\(^{\text {IoU}=0.75}\) are computed at the single IoU thresholds 0.50 and 0.75 over all object categories, respectively. The AR\(^{\text {max}=1}\), AR\(^{\text {max}=10}\), AR\(^{\text {max}=100}\), and AR\(^{\text {max}=500}\) scores are the maximum recalls given 1, 10, 100, and 500 detections per frame, averaged over all categories and IoU thresholds. Please refer to [28] for more details.
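As an illustration of how the primary metric aggregates scores, the following sketch averages per-category, per-threshold AP values; the nested dictionary ap[category][iou_threshold] is a hypothetical intermediate result that would be produced by a COCO-style matcher (e.g., pycocotools), which performs the actual matching of detections to ground truth.

    import numpy as np

    # The 10 IoU thresholds 0.50, 0.55, ..., 0.95 used by the primary metric.
    IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

    def primary_ap(ap):
        """ap[category][iou_threshold] -> AP of one category at one threshold.
        The primary metric AP^{IoU=0.50:0.05:0.95} is the mean over all
        categories and all 10 thresholds."""
        scores = [ap[c][t] for c in ap for t in IOU_THRESHOLDS]
        return float(np.mean(scores))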

Detectors Submitted. We received 6 entries for the video object detection task of the VisDrone-VDT2018 challenge. Four submitted detectors are derived directly from image object detectors, including CERTH-ODV (A.1), CFE-SSDv2 (A.2), RetinaNet_s (A.3), and RD (A.4). The EODST (A.5) detector combines an image object detector with a visual tracker, and the FGFA+ (A.6) detector is an end-to-end learning framework for video object detection. We summarize the submitted algorithms in Table 2 and present brief descriptions in Appendix A.

Table 2. Descriptions of the video object detection algorithms submitted to the VisDrone-VDT2018 challenge. The running speed (in FPS), GPUs used for training, implementation details, training datasets, and references of each method are reported.

Results and Analysis. The results of the submitted algorithms are presented in Table 3. CFE-SSDv2 (A.2) achieves the best performance among all submissions; it designs a comprehensive feature enhancement module to enhance the features for small object detection, and uses a multi-scale inference strategy to further improve performance. The EODST (A.5) detector produces the second best results, closely followed by FGFA+ (A.6). EODST (A.5) considers the co-occurrence of objects, and FGFA+ (A.6) employs temporal context to improve detection accuracy. RD (A.4) performs slightly better than FGFA+ (A.6) in AP\(_{50}\) but produces worse results on the other metrics. CERTH-ODV (A.1) performs on par with RetinaNet_s (A.3), with AP scores below \(10\%\).

Table 3. Video object detection results on the VisDrone-VDT2018 testing set. The submitted algorithms are ranked based on the AP score.

3.3 Multi-object Tracking

Given an input video sequence, multi-object tracking aims to recover the trajectories of objects. Depending on whether prior object detection results are available in each video frame, we divide the multi-object tracking task into two sub-tasks, denoted MOT-a (without prior detection) and MOT-b (with prior detection). Specifically, for the MOT-b task, we provide, as part of the VisDrone2018 challenge, the object detection results of the Faster R-CNN algorithm [40] trained on the VisDrone-VDT2018 dataset, and require the participants to submit the tracking results for evaluation. Some annotated video frames of the multi-object tracking task are shown in Fig. 3.

Fig. 3. Some annotated example video frames of multi-object tracking. The bounding boxes and the corresponding attributes of objects are shown for each sequence.

Evaluation Protocol. For the MOT-a task, we use the tracking evaluation protocol of [35] to evaluate the performance of the submitted algorithms. Each algorithm is required to produce a list of bounding boxes with confidence scores and corresponding identities. We sort the tracklets (formed by the bounding box detections with the same identity) according to the average confidence of their bounding box detections. A tracklet is considered correct if its intersection over union (IoU) overlap with a ground-truth tracklet is larger than a threshold. Similar to [35], we use three evaluation thresholds, i.e., 0.25, 0.50, and 0.75. The performance of an algorithm is evaluated by averaging the mean average precision (mAP) per object class over the different thresholds. Please refer to [35] for more details.
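The spatio-temporal overlap between a predicted tracklet and a ground-truth tracklet can be scored in several ways; the sketch below uses one common choice, averaging the per-frame box IoU over the union of frames covered by either tracklet (frames covered by only one of them contribute zero). This is our assumption for illustration; the exact definition follows [35].

    import numpy as np

    def box_iou(a, b):
        """IoU of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-12)

    def tracklet_iou(pred, gt):
        """pred, gt: dicts mapping frame index -> bounding box."""
        frames = set(pred) | set(gt)
        ious = [box_iou(pred[f], gt[f]) if f in pred and f in gt else 0.0
                for f in frames]
        return float(np.mean(ious))

A predicted tracklet whose overlap with a not-yet-matched ground-truth tracklet exceeds 0.25, 0.50, or 0.75 is then counted as a true positive at the corresponding threshold, and class-wise AP is computed over the confidence-sorted tracklets as in standard detection evaluation.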

For the MOT-b task, we follow the evaluation protocol of [31] to evaluate the performance of the submitted algorithms. Specifically, the average rank over 10 metrics (i.e., MOTA, MOTP, IDF1, FAF, MT, ML, FP, FN, IDS, and FM) is used to rank the algorithms. The MOTA metric combines three error sources, i.e., FP, FN, and IDS. The MOTP metric is the average dissimilarity between all true positives and their corresponding ground-truth targets. The IDF1 metric indicates the ratio of correctly identified detections to the average number of ground-truth and predicted detections. The FAF metric indicates the average number of false alarms per frame. The FP metric is the total number of tracker outputs that are false alarms, and FN is the total number of targets missed by the tracked trajectories over all frames. The IDS metric counts the total number of times the matched identity of a tracked trajectory changes, while FM counts the number of times trajectories are interrupted; both IDS and FM reflect the accuracy of the tracked trajectories. The MT and ML metrics measure the percentage of ground-truth trajectories that are covered for more than \(80\%\) and less than \(20\%\) of their time span, respectively.
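For reference (these definitions are not restated in the protocol above), the three headline metrics are commonly defined in the CLEAR-MOT and identity-measure literature as \(\text {MOTA} = 1 - \frac{\sum _t (\text {FN}_t + \text {FP}_t + \text {IDS}_t)}{\sum _t \text {GT}_t}\), \(\text {MOTP} = \frac{\sum _{t,i} d_{t,i}}{\sum _t c_t}\), and \(\text {IDF1} = \frac{2\,\text {IDTP}}{2\,\text {IDTP} + \text {IDFP} + \text {IDFN}}\), where \(\text {GT}_t\) is the number of ground-truth objects in frame \(t\), \(d_{t,i}\) is the dissimilarity between the \(i\)-th matched pair in frame \(t\), \(c_t\) is the number of matches in frame \(t\), and IDTP, IDFP, and IDFN are the identity-level true positives, false positives, and false negatives.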

Table 4. Multi-object tracking results without prior object detection in each video frame on the VisDrone-VDT2018 testing set. The submitted algorithms are ranked based on the AP metric.
Table 5. Multi-object tracking results with prior object detection in each frame on the VisDrone-VDT2018 testing set. The submitted algorithms are ranked based on the average rank of the ten metrics. \(*\) indicates that the tracking algorithm is submitted by the committee.

Trackers Submitted. In total, 8 different multi-object tracking methods were submitted to the VisDrone-VDT2018 challenge. The VisDrone committee also reports 6 baseline methods (i.e., GOG (B.9) [36], IHTLS (B.13) [11], TBD (B.10) [15], H\(^2\)T (B.14) [53], CMOT (B.12) [3], and CEM (B.11) [33]) using the default parameters; if default parameters are not available, we select reasonable values for evaluation. The Ctrack (B.7), TrackCG (B.5), and V-IOU (B.6) trackers aim to exploit motion information to improve tracking performance. GOG_EOC (B.2), SCTrack (B.3), and FRMOT (B.4) are designed to learn discriminative appearance features of objects to help tracking. The remaining two trackers, MAD (B.1) and deep-sort_v2 (B.8), combine detectors (e.g., RetinaNet [27] and YOLOv3 [38]) with tracking algorithms (e.g., Deep-SORT [54] and CFNet [50]) to complete the tracking task. We summarize the submitted algorithms in Table 6 and present their descriptions in Appendix B.

Table 6. Descriptions of the algorithms submitted to the multi-object tracking task of the VisDrone-VDT2018 challenge. The running speed (in FPS), CPU and GPU platforms used for training and testing, implementation details (P indicates Python, M indicates Matlab, and C indicates C/C++), training datasets, and references of each method are reported. The \(*\) mark indicates methods submitted by the VisDrone committee.

Results and Analysis. The results of the submissions of the MOT-a and MOT-b tasks are presented in Tables 4 and 5, respectively.

As shown in Table 4, Ctrack (B.7) achieves the top AP score among all submissions in the MOT-a task. In terms of object categories, it performs best on the bus and truck categories; we suspect that the complex motion models used in Ctrack (B.7) are effective for tracking large objects. deep-sort_v2 (B.8) produces the best results for cars and vans. Since these two categories of objects usually move smoothly, its IoU similarity and deep appearance features are effective at capturing their discriminative motion and appearance cues. MAD (B.1) produces the top AP\(_\text {ped}\) score, which demonstrates the effectiveness of its model ensemble strategy.

As shown in Table 5, V-IOU (B.6) produces the top average rank of 2.7 over the 10 metrics. The TrackCG method (B.5) achieves the best MOTA and IDF1 scores among all submissions. GOG_EOC (B.2) considers the exchanging context of objects to improve performance; it performs much better than the original GOG method (B.9) in terms of the MOTP, IDF1, FAF, ML, FP, IDS, and FM metrics, and ranks third. Ctrack (B.7) performs on par with SCTrack (B.3) but produces better MT, ML, and FN scores. Ctrack (B.7) uses the aggregation of prediction events in grouped targets and a stitching procedure with temporal constraints to help tracking, which is able to recover target objects that disappear for a long time in crowded scenes.

Fig. 4. Comparison of all the submissions based on the MOTA metric for each object category.

Fig. 5. Comparison of all the submissions based on the IDF1 metric for each object category.

To analyze the performance of the submissions on different object categories in more detail, we present the MOTA and IDF1 scores for the 5 evaluated object categories (i.e., car, bus, truck, pedestrian, and van) in Figs. 4 and 5. The two best trackers, V-IOU (B.6) and TrackCG (B.5), produce the best results on all object categories. We also observe that V-IOU (B.6) and FRMOT (B.4) produce the best results on the bus category, which may be attributed to the effectiveness of IoU- and deep-feature-based similarities in tracking large objects.

4 Conclusions

This paper concludes the VisDrone-VDT2018 challenge, which focuses on two tasks, i.e., (1) video object detection and (2) multi-object tracking. A large-scale video object detection and tracking dataset is released, which consists of 79 challenging sequences with 33,366 frames in total. We fully annotate the dataset with bounding boxes and corresponding attributes such as object category, occlusion status, and truncation ratio. 6 algorithms were submitted to the video object detection task and 14 algorithms to the multi-object tracking task (3 methods for the sub-task without prior object detection in video frames and 12 methods for the sub-task with prior object detection). The CFE-SSDv2 (A.2) method achieves the best results in the video object detection task, Ctrack (B.7) achieves the best results in the MOT-a task, and V-IOU (B.6) and TrackCG (B.5) perform better than the other submitted methods in the MOT-b task. The VisDrone-VDT2018 challenge was successfully held on September 8, 2018, as part of the VisDrone2018 challenge workshop. We hope this challenge provides a unified platform for evaluating video object detection and tracking on drones. Our future work will focus on revising the dataset and evaluation kit based on feedback from the community.