1 Introduction

Object detection in images, which aims to localize objects from a predefined set of categories (e.g., cars and pedestrians), is a problem with a long history [9, 17, 32, 40, 50]. Accurate object detection would have an immediate and far-reaching impact on many applications, such as image understanding, video surveillance, and anomaly detection. Although object detection has attracted much research attention and achieved significant advances with deep learning techniques in recent years, existing algorithms are usually not optimal for sequences or images captured by drone-based platforms, due to challenges such as viewpoint changes, scale variations, and occlusion.

To narrow the gap between current object detection performance and real-world requirements, we organized the “Vision Meets Drone: Object Detection in Images” (VisDrone-DET2018) challenge, which is one track of the “Vision Meets Drone: A Challenge” workshop (VisDrone2018 for short), held on September 8, 2018, in conjunction with the 15th European Conference on Computer Vision (ECCV 2018) in Munich, Germany. We collected a large-scale object detection dataset in real scenarios with detailed annotations. The VisDrone2018 challenge mainly focuses on humans and vehicles in daily life. Comparisons between the proposed dataset and previous datasets are presented in Table 1.

We invite researchers to submit algorithms that detect objects of ten predefined categories (e.g., pedestrian and car) from individual images in the VisDrone-DET2018 dataset, and to share their research results at the workshop. We believe this comprehensive challenge benchmark is useful to further boost research on object detection on drone platforms. The authors of the detection algorithms in this challenge have an opportunity to share their ideas and publish their source code at our website http://www.aiskyeye.com/, which helps promote the development of object detection algorithms.

2 Related Work

2.1 Existing Datasets

Several object detection benchmarks have been collected for evaluating object detection algorithms. Enzweiler and Gavrila [12] present the Daimler dataset, captured by a vehicle driving through urban environments. The dataset includes 3,915 manually annotated pedestrians in video images in the training set, and 21,790 video images with 56,492 annotated pedestrians in the testing set. The Caltech dataset [11] consists of approximately 10 hours of \(640\times 480\) 30 Hz video taken from a vehicle driving through regular traffic in an urban environment. It contains \(\sim \)250,000 frames with a total of 350,000 annotated bounding boxes of 2,300 unique pedestrians. The KITTI-D benchmark [19] is designed to evaluate car, pedestrian, and cyclist detection algorithms in autonomous driving scenarios, with 7,481 training and 7,518 testing images. Mundhenk et al. [34] create a large dataset for the classification, detection, and counting of cars, which contains 32,716 unique cars from six image sets with different geographical locations and different imagers. The recent UA-DETRAC benchmark [33, 47] provides 1,210k objects in 140k frames for vehicle detection.

Table 1. Comparisons of current state-of-the-art benchmarks and datasets for object detection. Note that the resolution indicates the maximum resolution of the videos/images included in each dataset.

The PASCAL VOC dataset [15, 16] is one of the pioneering works in generic object detection, designed to provide a standardized test bed for object detection, image classification, object segmentation, person layout, and action classification. ImageNet [10, 41] follows in the footsteps of the PASCAL VOC dataset by scaling up the number of object classes and images by more than an order of magnitude, i.e., PASCAL VOC 2012 with 20 object classes and 21,738 images vs. ILSVRC2012 with 1,000 object classes and 1,431,167 annotated images. Recently, Lin et al. [31] released the MS COCO dataset, containing more than 328,000 images with 2.5 million manually segmented object instances. It has 91 object categories with 27.5k instances on average per category. Notably, it contains object segmentation annotations that are not available in ImageNet.

2.2 Review of Object Detection Methods

Classical Object Detectors. Early object detection methods were built on the sliding-window paradigm, using hand-crafted features and classifiers on dense image grids to locate objects. As one of the most popular early frameworks, Viola and Jones [45] use Haar features and the AdaBoost algorithm to learn a series of cascaded classifiers for face detection, achieving accurate results with high efficiency. Felzenszwalb et al. [17] develop an effective object detection method based on mixtures of multi-scale deformable part models. Specifically, they compute Histograms of Oriented Gradients (HOG) features on each part of the object and train a latent SVM (a reformulation of MI-SVM in terms of latent variables) for robust performance. However, classical object detectors do not perform well in challenging scenarios. In recent years, with the advance of deep Convolutional Neural Networks (CNNs), the object detection field has become dominated by CNN-based detectors, which can be roughly divided into two categories, i.e., the two-stage approach and the one-stage approach.

Two-Stage CNN-Based Methods. The two-stage approach first generates a pool of object proposals using a separate proposal generator and then predicts the accurate object regions and the corresponding class labels, such as R-CNN [21], SPP-Net [24], Fast R-CNN [20], Faster R-CNN [40], R-FCN [7], Mask R-CNN [23], and FPN [29].

R-CNN [21] is one of the pioneering works using a CNN model pre-trained on ImageNet; it extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. SPP-Net [24] proposes the SPP layer, which pools the features and generates fixed-length outputs, to remove the fixed input size constraint of the CNN model. In contrast to SPP [24], Fast R-CNN [20] designs a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations in an end-to-end way. Faster R-CNN [40] further improves Fast R-CNN by using a region proposal network instead of the selective search algorithm [44] to extract region proposals. The R-FCN method [7] develops a fully convolutional network (FCN) for object detection, which constructs a set of position-sensitive score maps using a bank of specialized convolutional layers to incorporate translation variance into the FCN. Recently, Lin et al. [29] exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost, improving detection performance. In [28], the head of the network is designed to be as light as possible to decrease the computation cost, by using a thin feature map and a cheap R-CNN subnet (pooling and a single fully-connected layer). Zhang et al. [49] propose an occlusion-aware R-CNN to improve pedestrian detection in crowded scenes, which designs an aggregation loss to enforce proposals to be close to, and locate compactly around, the corresponding objects. In general, the aforementioned methods share almost the same pipeline for object detection (i.e., object proposal generation, feature extraction, object classification, and bounding box regression), and the region proposal generation stage is the bottleneck for improving running efficiency.

One-Stage CNN-Based Methods. Different from the two-stage approach, the one-stage approach directly predicts object locations, shapes, and class labels without a proposal extraction stage, and can therefore run with high efficiency. The community has witnessed noticeable improvements in this direction, including YOLO [37], SSD [32], DSSD [18], RefineDet [50], and RetinaNet [30].

Specifically, YOLO [37] formulates object detection as a regression problem over spatially separated bounding boxes and associated class probabilities. After that, Redmon et al. [38] improve YOLO in various aspects, such as adding batch normalization to all convolutional layers, using anchor boxes to predict bounding boxes, and using multi-scale training. SSD [32] takes advantage of a set of default anchor boxes with different aspect ratios and scales to discretize the output space of bounding boxes, and fuses predictions from multiple feature maps with different resolutions. DSSD [18] augments SSD with deconvolution layers to introduce additional large-scale context into object detection and improve accuracy, especially for small objects. Zhang et al. [51] enrich the semantics of object detection features within SSD via a semantic segmentation branch and a global activation module. Lin et al. [30] propose the Focal Loss (RetinaNet) to address the class imbalance issue in object detection by reshaping the standard cross-entropy loss such that it down-weights the loss assigned to well-classified examples. In addition, Zhang et al. [50] propose a single-shot detector, RefineDet, formed by two inter-connected modules, i.e., the anchor refinement module and the object detection module, which achieves high accuracy and efficiency. Moreover, Chen et al. [6] propose a dual refinement network to boost the performance of one-stage detectors, which considers anchor refinement and feature offset refinement in anchor-offset detection.

Fig. 1. The number of objects with different occlusion degrees for each object category in the training, validation, and testing sets of the object detection in images task.

Fig. 2. The number of objects per image vs. the percentage of images in the training, validation, and testing sets for object detection in images. The maximal, mean, and minimal numbers of objects per image in the three subsets are presented in the legend.

3 The VisDrone-DET2018 Challenge

As mentioned above, to track and advance developments in object detection, we designed the VisDrone-DET2018 challenge, which focuses on detecting ten predefined categories of objects (i.e., pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle) in images taken from drones. We require each participating algorithm to predict bounding boxes of objects in the predefined classes with a real-valued confidence. Some rarely occurring special vehicles (e.g., machineshop truck, forklift truck, and tanker) are ignored in the evaluation. The VisDrone-DET2018 dataset consists of 8,599 images (6,471 for training, 548 for validation, and 1,580 for testing) with rich annotations, including object bounding boxes, object categories, occlusion, and truncation ratios. Featuring diverse real-world scenarios, the dataset was collected using various drone platforms (i.e., drones with different models), in different scenarios (across 14 different cities spanning thousands of kilometres), and under various weather and lighting conditions. The manually annotated ground truths for the training and validation sets are made available to users, while the ground truths of the testing set are withheld to avoid (over)fitting of algorithms. We encourage the participants to use the provided training data, while also allowing them to use additional training data; the use of external data must be indicated during submission.

Fig. 3. Annotated example images from the object detection in images task. A dashed bounding box indicates that the object is occluded. Different bounding box colors indicate different object classes. For better visualization, only some attributes are displayed. (Color figure online)

3.1 Dataset

The dataset and annotations presented in this workshop are expected to be a significant contribution to the community. As mentioned above, we collected and annotated a benchmark dataset consisting of 8,599 images captured by drone platforms at different locations and heights, which is much larger than any previously published drone-based dataset. Specifically, we manually annotated more than 540k bounding boxes of targets from the ten predefined categories. Some example images are shown in Fig. 3. We present the number of objects with different occlusion degrees for each object category in the training, validation, and testing sets in Fig. 1, and plot the number of objects per image vs. the percentage of images in each subset in Fig. 2 to show the distribution of the number of objects per image. The images of the three subsets are taken at different locations, but share similar environments and attributes.

In addition, we provide two kinds of useful annotations, occlusion ratio and truncation ratio. Specifically, we use the fraction of occluded pixels to define the occlusion ratio, and define three degrees of occlusion: no occlusion (occlusion ratio \(0\%\)), partial occlusion (occlusion ratio 1%–50%), and heavy occlusion (occlusion ratio \({>}{50\%}\)). The truncation ratio indicates the degree to which an object extends outside the image frame. If an object is not fully captured within a frame, we annotate the bounding box inside the frame boundary and estimate the truncation ratio based on the region outside the image. It is worth mentioning that a target is skipped during evaluation if its truncation ratio is larger than \(50\%\).
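
To make these conventions concrete, the following minimal Python sketch (not part of the official toolkit) maps an annotated occlusion ratio to the three occlusion degrees defined above and filters out targets whose truncation ratio exceeds \(50\%\); the dictionary field names are illustrative assumptions, not the official annotation format.

```python
# Minimal sketch: map an occlusion ratio to the three occlusion degrees and
# drop targets truncated by more than 50% before evaluation. Field names
# ("occlusion_ratio", "truncation_ratio") are illustrative assumptions.

def occlusion_degree(occlusion_ratio: float) -> str:
    """Return 'none', 'partial', or 'heavy' following the thresholds above."""
    if occlusion_ratio == 0.0:
        return "none"
    elif occlusion_ratio <= 0.5:
        return "partial"
    return "heavy"

def keep_for_evaluation(annotation: dict) -> bool:
    """A target is ignored during evaluation if it is truncated by more than 50%."""
    return annotation.get("truncation_ratio", 0.0) <= 0.5

# Example usage on a few hypothetical annotations.
annotations = [
    {"bbox": [10, 20, 50, 80], "occlusion_ratio": 0.0, "truncation_ratio": 0.1},
    {"bbox": [200, 40, 30, 60], "occlusion_ratio": 0.3, "truncation_ratio": 0.7},
]
evaluated = [a for a in annotations if keep_for_evaluation(a)]
degrees = [occlusion_degree(a["occlusion_ratio"]) for a in annotations]
print(degrees, len(evaluated))  # ['none', 'partial'] 1
```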

3.2 Evaluation Protocol

We require each participating algorithm to output a list of detected bounding boxes with confidence scores for each test image. Following the evaluation protocol of MS COCO [31], we use the AP\(^{\text {IoU}=0.50:0.05:0.95}\), AP\(^{\text {IoU}=0.50}\), AP\(^{\text {IoU}=0.75}\), AR\(^{\text {max}=1}\), AR\(^{\text {max}=10}\), AR\(^{\text {max}=100}\) and AR\(^{\text {max}=500}\) metrics to evaluate the results of the detection algorithms. These criteria penalize missed detections as well as duplicate detections (two detections for the same object instance). Specifically, AP\(^{\text {IoU}=0.50:0.05:0.95}\) is computed by averaging over all 10 Intersection over Union (IoU) thresholds (i.e., in the range [0.50, 0.95] with a uniform step size of 0.05) and all categories, and is used as the primary metric for ranking. AP\(^{\text {IoU}=0.50}\) and AP\(^{\text {IoU}=0.75}\) are computed at the single IoU thresholds 0.5 and 0.75 over all categories, respectively. The AR\(^{\text {max}=1}\), AR\(^{\text {max}=10}\), AR\(^{\text {max}=100}\) and AR\(^{\text {max}=500}\) scores are the maximum recalls given 1, 10, 100, and 500 detections per image, averaged over all categories and IoU thresholds. Please refer to [31] for more details.
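
For illustration only, the simplified Python sketch below computes IoU and an average precision at a single IoU threshold for one image and one category, then averages it over the ten thresholds. The official evaluation follows the full MS COCO protocol [31] (multi-image, per-category matching with maximum-detection limits), so this is merely a sketch of the underlying idea, and the example boxes are hypothetical.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def average_precision(dets, scores, gts, thr):
    """Simplified AP at one IoU threshold for a single image and category."""
    order = np.argsort(-np.asarray(scores))
    matched = [False] * len(gts)
    tp, fp = [], []
    for i in order:                      # greedy matching in descending score order
        best_iou, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if matched[j]:
                continue
            o = iou(dets[i], g)
            if o > best_iou:
                best_iou, best_j = o, j
        if best_j >= 0 and best_iou >= thr:
            matched[best_j] = True
            tp.append(1.0); fp.append(0.0)
        else:
            tp.append(0.0); fp.append(1.0)   # unmatched detections are duplicates/false alarms
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(len(gts), 1)
    precision = tp / np.maximum(tp + fp, 1e-12)
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))  # uninterpolated PR area

# The primary metric averages AP over the ten IoU thresholds 0.50:0.05:0.95.
dets = [[10, 10, 50, 50], [60, 60, 100, 100]]     # hypothetical detections
scores = [0.9, 0.8]
gts = [[12, 12, 48, 52]]                          # hypothetical ground truth
thresholds = np.arange(0.50, 0.951, 0.05)
ap = np.mean([average_precision(dets, scores, gts, t) for t in thresholds])
print(round(float(ap), 3))
```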

4 Results and Analysis

4.1 Submitted Detectors

In total, 34 different object detection methods from 31 different institutes were submitted to the VisDrone-DET2018 challenge. The VisDrone committee also reports the results of 4 baseline methods, i.e., FPN (A.35) [29], R-FCN (A.36) [7], Faster R-CNN (A.37) [40], and SSD (A.38) [32]. For these baselines, the default parameters are used or set to reasonable values. Thus, 38 algorithms in total are included in the VisDrone-DET2018 challenge. We present a brief overview of the entries and provide the algorithm descriptions in Appendix A.

Table 2. Object detection results on the VisDrone-DET2018 testing set. The submitted algorithms are ranked based on the AP score. \(*\) indicates that the detection algorithm is submitted by the committee.

Nine submitted detectors improve the Faster R-CNN method [40], namely JNU_Faster RCNN (A.5), Faster R-CNN3 (A.7), MMN (A.9), CERTH-ODI (A.13), MFaster-RCNN (A.14), Faster R-CNN2 (A.16), IITH DODO (A.18), Faster R-CNN+ (A.19), and DPNet (A.34). Seven detectors are based on the FPN method [29], including FPN+ (A.1), DE-FPN (A.3), DFS (A.4), FPN2 (A.11), DDFPN (A.17), FPN3 (A.21), and DenseFPN (A.22). Three detectors are inspired by RetinaNet [30], including Keras-RetinaNet (A.27), RetinaNet2 (A.28), and HAL-Retina-Net (A.32). Three detectors, i.e., RefineDet+ (A.10), RD\(^4\)MS (A.24), and R-SSRN (A.30), are based on the RefineDet method [50]. Five detectors, i.e., YOLOv3+ (A.6), YOLOv3++ (A.12), YOLOv3_DP (A.26), MSYOLO (A.29), and SODLSY (A.33), are based on the YOLOv3 method [39]. CFE-SSDv2 (A.15) is based on the SSD method [32]. SOD (A.23) is based on the R-FCN method [7]. L-H RCNN+ (A.25) is modified from the light-head R-CNN method [28]. AHOD (A.31) uses a feature fusion backbone network with the capability of modeling geometric transformations. MSCNN (A.20) is formed by two sub-networks: a multi-scale object proposal network (MS-OPN) [4] and an accurate object detection network (AODN) [5]. YOLO-R-CNN (A.2) and MMF (A.8) are combinations of YOLOv3 and Faster R-CNN. We summarize the submitted algorithms in Table 3.

Table 3. Descriptions of the algorithms submitted to the VisDrone-DET2018 challenge. The running speed (in FPS), GPUs used for training, backbone network, training datasets (I is ImageNet, L is ILSVRC, P is COCO, and V is the VisDrone-DET2018 train set), and implementation details are reported. The \(*\) mark indicates methods submitted by the VisDrone committee.

4.2 Overall Results

The overall results of the submissions are presented in Table 2. As shown in Table 2, HAL-Retina-Net (A.32) and DPNet (A.34) are the only two algorithms achieving more than a \(30\%\) AP score. HAL-Retina-Net (A.32) uses the SE module [27] and downsampling-upsampling [46] to learn channel attention and spatial attention. DPNet (A.34) employs the FPN framework [29] to capture context information in feature maps of different scales. DE-FPN (A.3) and CFE-SSDv2 (A.15) rank third and fourth, respectively, each with more than \(25\%\) AP. We also report the detection results for each object category in Table 4. As shown in Table 4, all of the top three results for each object category are produced by the detectors with the top four overall AP scores (see Table 2), i.e., HAL-Retina-Net (A.32), DPNet (A.34), DE-FPN (A.3), and CFE-SSDv2 (A.15).

Among the 4 baseline methods provided by the VisDrone committee, FPN (A.35) achieves the best performance, SSD (A.38) performs the worst, and R-FCN (A.36) performs better than Faster R-CNN (A.37). These results are consistent with those on the MS COCO dataset [31].

  • SSD (A.38) performs the worst, producing only a \(2.52\%\) AP score. CFE-SSDv2 (A.15) is an improvement of SSD (A.38), which uses a new comprehensive feature enhancement mechanism to highlight the weak features of small objects and adopts multi-scale testing to further improve performance. Specifically, it brings a significant improvement in AP score (i.e., \(26.48\%\)), ranking in fourth place.

  • Faster R-CNN (A.37) performs slightly better than SSD, producing a \(2.89\%\) AP score. DPNet (A.34) uses three Faster R-CNN models to detect objects at different scales. Specifically, the authors train FPN [29]-based Faster R-CNN models with multiple input scales (i.e., \(1000\times 1000\), \(800\times 800\), and \(600\times 600\)), achieving the second best AP score (\(30.92\%\)). Faster R-CNN2 (A.16) and Faster R-CNN+ (A.19) design anchor sizes adapted to the distribution of objects, producing \(21.34\%\) and \(9.67\%\) AP scores, respectively. MFaster-RCNN (A.14) replaces the RoI pooling layer with the RoI align layer proposed in Mask R-CNN [23] to obtain better results for small object detection, achieving an \(18.08\%\) AP score.

  • R-FCN (A.36) achieves much better performance than SSD and Faster R-CNN, producing a \(7.20\%\) AP score. However, its accuracy is still not satisfactory. SOD (A.23) uses a pyramid-like prediction network for the RPN and R-FCN [7] to improve object detection performance. In this way, the predictions made by higher-level feature maps contain stronger contextual semantics, while the lower-level ones integrate more localized information at finer spatial resolution. It achieves a \(0.93\%\) higher AP score than R-FCN (A.36), i.e., \(8.27\%\) vs. \(7.20\%\).

  • FPN (A.35) performs the best among the 4 baseline methods, achieving a \(13.36\%\) AP score and ranking in the middle of all submissions. We speculate that the semantic feature maps extracted at all scales are effective for dealing with objects at various scales. To further improve accuracy, DE-FPN (A.3) enhances the data augmentation stage with image cropping and color jitter, achieving \(27.10\%\) AP and ranking in third place. DDFPN (A.17) uses the DBPN [22] super-resolution network to up-sample the images, producing \(21.05\%\) AP. FPN2 (A.11) implements an additional keypoint classification module to help locate objects, improving the AP score by \(2.79\%\) compared to FPN (A.35).

Table 4. The AP\(^{\text {IoU}=0.50:0.05:0.95}\) scores on the VisDrone2018 testing set for each object category. \(*\) indicates detection algorithms submitted by the VisDrone committee. The top three results are highlighted in bold, italic, and underlined fonts.

4.3 Discussion

As shown in Table 2, we find that 18 detectors perform better than all the baseline methods. The best detector, HAL-Retina-Net (A.32), achieves a \(31.88\%\) AP score, which is still far from satisfactory for real applications. In the following, we discuss some critical issues in object detection on drone platforms.

Large Scale Variations. As shown in Fig. 3, the objects exhibit substantial differences in scale, even within the same category. For example, as shown in the top-left of Fig. 3, cars at the bottom of the image appear larger than cars at the top-right of the image. This factor greatly challenges the performance of detectors. For better performance, it is necessary to redesign the anchor scales to adapt to the scales of objects in the dataset, and it would also be interesting to design an automatic mechanism to handle objects with large scale variations. Meanwhile, fusing multi-level convolutional features to integrate contextual semantic information is also effective for handling scale variations, as in the architecture of FPN (A.35). In addition, multi-scale testing and model ensembles are effective ways to deal with scale variations.
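
As a rough illustration of such anchor redesign, the sketch below generates anchor boxes at several scales and aspect ratios per feature-pyramid stride; the specific strides, scales, and ratios are assumptions for illustration, not the configuration of any submitted detector.

```python
import numpy as np

def make_anchors(base_size, scales, ratios):
    """Return (len(scales)*len(ratios), 4) anchors centered at the origin, [x1, y1, x2, y2]."""
    anchors = []
    for s in scales:
        for r in ratios:                 # r is the height/width aspect ratio
            area = (base_size * s) ** 2
            w = np.sqrt(area / r)
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

# Smaller base sizes than the typical COCO setting may better match the many
# small objects seen from a drone; the numbers below are assumptions.
strides = [8, 16, 32, 64]                # hypothetical feature-pyramid strides
scales = [0.5, 1.0, 2.0]                 # relative anchor scales per level
ratios = [0.5, 1.0, 2.0]                 # height/width aspect ratios

anchors_per_level = {s: make_anchors(base_size=4 * s, scales=scales, ratios=ratios)
                     for s in strides}
print({s: a.shape for s, a in anchors_per_level.items()})  # nine anchors per level
```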

Occlusion. Occlusion is one of the critical issues challenging detection performance, especially in our VisDrone2018 dataset (see Fig. 3). For example, as shown in Fig. 1, most of the instances in the bus and motor categories are occluded by other objects or background obstacles, which greatly hurts detection performance. Specifically, the best detector HAL-Retina-Net (A.32) produces less than \(20\%\) AP in these two categories, and all the other detectors produce less than \(1\%\) AP on the motor class. In summary, it is important and urgent to design an effective strategy to handle the occlusion challenge and improve detection performance.

Class Imbalance. Class imbalance is another issue in object detection. As shown in Fig. 1, there are far fewer awning-tricycle, tricycle, and bus instances in the training set than instances of the car and pedestrian classes. Most detectors perform much better on the car and pedestrian classes than on the awning-tricycle, tricycle, and bus classes. For example, DPNet (A.34) produces \(45.06\%\) and \(54.62\%\) AP on the car and pedestrian classes, but only \(11.85\%\), \(21.79\%\), and \(3.78\%\) AP on the awning-tricycle, tricycle, and bus classes; see Table 4 for more details. The most straightforward and common approach is to use a sampling strategy to balance the samples of different classes. Meanwhile, some methods (e.g., Keras-RetinaNet (A.27) and RetinaNet2 (A.28)) integrate the weights of different object classes into the loss function to handle this issue, such as the Focal Loss [30]. How to solve the class imbalance issue remains an open problem.
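
For reference, a minimal NumPy sketch of the binary focal loss of [30] is given below; the \(\alpha\) and \(\gamma\) values are the commonly used defaults rather than those of any particular submission, and the inputs are hypothetical.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss per example; p: predicted probability of the positive class, y: binary label."""
    p_t = np.where(y == 1, p, 1.0 - p)               # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)   # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)

# A confidently correct prediction contributes far less loss than a hard one,
# which is how well-classified (majority-class) examples get down-weighted.
print(focal_loss(np.array([0.9, 0.6]), np.array([1, 1])))
```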

5 Conclusions

This paper reviews the VisDrone-DET2018 challenge and its results. The challenge is based on a large-scale drone-based object detection dataset, including 8,599 images (6,471 for training, 548 for validation, and 1,580 for testing) with rich annotations, including object bounding boxes, object categories, occlusion status, and truncation ratios. A set of 38 detectors has been evaluated on the released dataset. A large percentage of them have been published in recent top conferences and journals, such as ICCV, CVPR, and TPAMI, and some of them have not yet been published (available on arXiv). The top three detectors are HAL-Retina-Net (A.32), DPNet (A.34), and DE-FPN (A.3), achieving \(31.88\%\), \(30.92\%\), and \(27.10\%\) AP, respectively.

The primary objective of VisDrone-DET2018 is to establish a community-based common platform for the discussion and evaluation of detection performance on drones. This challenge not only serves as a meeting place for researchers in this area but also highlights major issues and potential opportunities. We hope the released dataset enables the development and comparison of algorithms in the object detection field, and that the workshop challenge provides a way to track progress. Our future work will focus on revising the evaluation kit and dataset, as well as including more challenging vision tasks on the drone platform, based on feedback from the community.