In this appendix, we provide a short summary of all algorithms that participated in the VisDrone2018 competition. They are ordered according to the submission order of their final results.
A.1 Improved Feature Pyramid Network (FPN+)
Karthik Suresh, Hongyu Xu, Nitin Bansal, Chase Brown, Yunchao Wei, Zhangyang Wang, Honghui Shi
k21993@tamu.edu, xuhongyu2006@gmail.com, bansa01@tamu.edu
chasebrown42@tamu.edu, wychao1987@gmail.com, atlaswang@tamu.edu
honghui.shi@ibm.com
FPN+ is improved from the Feature Pyramid Network (FPN) model [29]. The main changes we made are summarized as follows: (1) We resize the input images to different scales; (2) We use more scales of smaller anchors; (3) We ensemble FPN models with different anchors and parameters; (4) We employ NMS as an additional post-processing step to remove overlapping boxes, and use multi-scale testing. Specifically, we use an FPN with a ResNet-101 backbone pre-trained on ImageNet. We also make some changes to the training data (resizing it to different shapes, cutting it into pieces, etc.).
A.2 Fusion of Faster R-CNN and YOLOv3 (YOLO-R-CNN)
Wenchi Ma, Yuanwei Wu, Usman Sajid, Guanghui Wang
{wenchima, y262w558,usajid,ghwang}@ku.edu
YOLO-R-CNN is essentially a voting algorithm designed for object detection. Instead of the widely used feature-level fusion for deep neural networks, our approach works at the detection level. We train two different DCNN models, i.e., Faster R-CNN [40] and YOLOv3 [39]. The final detection results are then produced by voting, i.e., weighted averaging of the detections of the two models.
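As a rough illustration of this kind of detection-level fusion (not the authors' exact code), the sketch below matches same-class boxes from the two detectors by IoU and averages matched pairs with fixed weights; the weights, the IoU threshold, and all function names are assumptions.

```python
import numpy as np

def iou(a, b):
    """IoU between two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def fuse_detections(dets_a, dets_b, w_a=0.5, w_b=0.5, iou_thr=0.5):
    """Detection-level fusion: average matched boxes, keep unmatched ones.

    dets_a, dets_b: lists of (box, score, label) from the two detectors.
    """
    fused, used_b = [], set()
    for box_a, score_a, label_a in dets_a:
        best_j, best_iou = -1, iou_thr
        for j, (box_b, score_b, label_b) in enumerate(dets_b):
            if j in used_b or label_b != label_a:
                continue
            o = iou(box_a, box_b)
            if o > best_iou:
                best_j, best_iou = j, o
        if best_j >= 0:
            box_b, score_b, _ = dets_b[best_j]
            used_b.add(best_j)
            box = w_a * np.asarray(box_a) + w_b * np.asarray(box_b)
            fused.append((box.tolist(), w_a * score_a + w_b * score_b, label_a))
        else:
            fused.append((box_a, score_a, label_a))
    # keep detections found only by the second model
    fused += [d for j, d in enumerate(dets_b) if j not in used_b]
    return fused
```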
A.3 Data Enhanced Feature Pyramid Network (DE-FPN)
Jingkai Zhou, Yi Luo, Hu Lin, Qiong Liu
{201510105876, 201721045510, 201721045497}@mail.scut.edu.cn
liuqiong@scut.edu.cn
DE-FPN is based on the Feature Pyramid Network (FPN) model [29] with data enhancement. Specifically, we enhance the training data by image cropping and color jitter. We use ResNeXt-101 64×4d as the backbone of FPN, pre-trained on COCO. We remove level 6 of FPN to improve small object detection.
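A minimal sketch of this kind of data enhancement (random cropping plus color jitter) using torchvision is given below; the crop size and jitter strengths are assumptions, and the box annotations are shifted and clipped to the crop, a detail the submission does not describe.

```python
import random
from PIL import Image
from torchvision import transforms

# Jitter strengths are assumptions; the submission does not report them.
color_jitter = transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3)

def random_crop_with_boxes(img, boxes, crop_w=800, crop_h=800):
    """Randomly crop a PIL image and shift/clip the box annotations accordingly.

    boxes: list of [x1, y1, x2, y2] in absolute pixel coordinates.
    """
    W, H = img.size
    x0 = random.randint(0, max(0, W - crop_w))
    y0 = random.randint(0, max(0, H - crop_h))
    crop = img.crop((x0, y0, x0 + crop_w, y0 + crop_h))
    kept = []
    for x1, y1, x2, y2 in boxes:
        # shift into the crop's coordinate frame and clip to its borders
        nx1, ny1 = max(x1 - x0, 0), max(y1 - y0, 0)
        nx2, ny2 = min(x2 - x0, crop_w), min(y2 - y0, crop_h)
        if nx2 > nx1 and ny2 > ny1:   # drop boxes that fall outside the crop
            kept.append([nx1, ny1, nx2, ny2])
    return crop, kept

def enhance(img, boxes):
    crop, kept = random_crop_with_boxes(img, boxes)
    return color_jitter(crop), kept
```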
A.4 Focal Loss for Object Detection (DFS)
Ke Bo
kebo3@mail2.sysu.edu.cn
DFS is based on ResNet-101 and Feature Pyramid Networks [29]. The features from Conv2_x are also used to detect objects, which gains about \(1\%\) improvement in mAP. Our model uses other techniques including multi-scale training and testing, deformable convolutions, and Soft-NMS.
A.5 Faster R-CNN by Jiangnan University (JNU_Faster RCNN)
Haipeng Zhang
6161910043@vip.jiangnan.edu.cn
JNU_Faster RCNN is based on the Faster R-CNN algorithm [40]. The source code is taken from the GitHub repository faster-rcnn.pytorch. We use the train and validation sets of the VisDrone2018-DET dataset without additional training data to train this model. The pre-trained model is Faster R-CNN with a ResNet-101 backbone.
A.6 Improved YOLOv3 (YOLOv3+)
Siwei Wang, Xintao Lian
285111284@qq.com
YOLOv3+ is improved from YOLO [37]. Specifically, we fine-tune our model on the VisDrone2018-DET train set, starting from models pre-trained on the COCO dataset.
A.7 Improved Faster R-CNN (Faster R-CNN3)
Yiling Liu, Ying Li
liulingyi601@mail.nwpu.edu.cn, lybyp@nwpu.edu.cn
Faster R-CNN3 is based on Faster R-CNN [40]. We only use the VisDrone2018 train set as the training set. Our algorithm is implemented in PyTorch on Ubuntu with two TITAN Xp GPUs. The testing speed is about 7s per image. The base network of Faster R-CNN is ResNet-101.
A.8 The Object Detection Algorithm Based on Multi-Model Fusion (MMF)
Yuqin Zhang, Weikun Wu, Zhiyao Guo, Minyu Huang
{23020161153381,23020171153097}@stu.xmu.edu.cn
{23020171153021,23020171153029}@stu.xmu.edu.cn
MMF is a multi-model fusion based on Faster R-CNN [40] and YOLOv3 [39]. The Faster R-CNN algorithm is a modification of a published implementation. We rewrite the code and reset the parameters, including the learning rate, gamma, step size, scales, anchors, and ratios. We use ResNet-152 as the backbone. The YOLOv3 algorithm is also a modification of a published implementation. We modify the anchor setting with the k-means++ algorithm.
Since the number of objects in different categories is very unbalanced in the train set, we adopt the multi-model fusion method to improve accuracy. Specifically, the car category is trained using the Faster R-CNN algorithm and the remaining categories are trained using the YOLOv3 algorithm. Moreover, the remaining categories are divided into two types: one for pedestrian and people, and the other for bicycle, van, truck, tricycle, awning-tricycle, bus, and motor. Finally, the detection result is determined by the three models, as sketched below.
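A rough sketch of this category-wise fusion, assuming each model returns its per-image detections as (box, score, class-name) tuples; the class grouping mirrors the description above, while the function and variable names are placeholders.

```python
# Class groups handled by each of the three models, following the description above.
CAR_CLASSES = {"car"}
PERSON_CLASSES = {"pedestrian", "people"}
OTHER_CLASSES = {"bicycle", "van", "truck", "tricycle",
                 "awning-tricycle", "bus", "motor"}

def fuse_by_category(dets_faster_rcnn, dets_yolo_person, dets_yolo_other):
    """Keep each detection only if its class belongs to the model trained for it.

    Each argument is a list of (box, score, cls) tuples from one model.
    """
    fused = []
    fused += [d for d in dets_faster_rcnn if d[2] in CAR_CLASSES]
    fused += [d for d in dets_yolo_person if d[2] in PERSON_CLASSES]
    fused += [d for d in dets_yolo_other if d[2] in OTHER_CLASSES]
    return fused
```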
A.9 Multi-Model Net Based on Faster R-CNN (MMN)
Xin Sun
sunxin@ouc.edu.cn
MMN is based on the Faster R-CNN network [40]. We first crop the training images into small sizes to avoid the resize operation. These cropped images are then used to train different Faster R-CNN networks. Finally, we merge the results to obtain the best detection result.
A.10 An Improved Object Detector Based on Single-Shot Refinement Neural Network (RefineDet+)
Kaiwen Duan, Honggang Qi, Qingming Huang
duankaiwen17@mails.ucas.ac.cn, hgqi@jdl.ac.cn, qmhuang@ucas.ac.cn
RefineDet+ improves the Single-Shot Refinement Neural Network (RefineDet) [50] by proposing a new anchor matching strategy based on the center point translation of anchors (CPTMatching). During the training phase, the detector needs to determine which anchors correspond to an object bounding box. RefineDet first matches each object to the anchor with the highest jaccard overlap and then matches each anchor to an object with jaccard overlap higher than a threshold (usually 0.5). However, some nearby anchors whose jaccard overlap is lower than the threshold may also help the bounding box regression. In our CPTMatching, we first select the bounding boxes predicted by the anchor refinement module (ARM) [50] that have a jaccard overlap higher than 0.5 with any ground-truth object. For each selected bounding box, we compute a measurement \(\beta \), the ratio of the center point distance between its corresponding anchor and its matched ground-truth box to the scale of the anchor. Anchors whose \(\beta \) is larger than a threshold are discarded; the remaining anchors are called potential valid anchors. Finally, we align the center point of each potential valid anchor to the center of its nearest ground-truth box. An anchor is preserved if its jaccard overlap with the aligned ground-truth is higher than 0.6.
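The sketch below illustrates the \(\beta \) test, computing it for an anchor and its matched ground truth; the box format, the use of the geometric mean of width and height as the anchor "scale", and the threshold value are assumptions.

```python
import math

def center(box):
    """Center point of a box [x1, y1, x2, y2]."""
    return (0.5 * (box[0] + box[2]), 0.5 * (box[1] + box[3]))

def anchor_scale(anchor):
    """Scale of an anchor, taken here as the geometric mean of width and height."""
    w, h = anchor[2] - anchor[0], anchor[3] - anchor[1]
    return math.sqrt(w * h)

def cpt_beta(anchor, gt_box):
    """beta = (center distance between anchor and matched ground truth) / anchor scale."""
    (ax, ay), (gx, gy) = center(anchor), center(gt_box)
    return math.hypot(ax - gx, ay - gy) / anchor_scale(anchor)

def is_potential_valid(anchor, gt_box, beta_thr=1.0):
    """Keep an anchor as a 'potential valid anchor' if its beta is below the threshold.

    The threshold value and the definition of 'scale' above are assumptions.
    """
    return cpt_beta(anchor, gt_box) <= beta_thr
```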
A.11 An Improved Object Detector Based on Feature Pyramid Networks (FPN2)
Zhenwei He, Lei Zhang
{hzw, leizhang}@cqu.edu.cn
FPN2 is based on the Feature Pyramid Networks (FPN) object detection framework [29]. To obtain better detection results, we improve the original FPN in three aspects:
-
Data expansion. We extend the training set by clipping the images. Each clipped image contains at least one object. The new images have different proportions relative to the original images. In our implementation, the proportions of the width and height are set to 0.5 and 0.7, resulting in 4 combinations of ratios ([0.5, 0.5], [0.5, 0.7], [0.7, 0.7], and [0.7, 0.5] for width and height, respectively). As a result, the extended dataset has 5 times the number of training images of the original dataset.
-
Keypoint classification. We implement an auxiliary keypoint classification task to further improve the detection accuracy. Since the bounding box is the border between foreground and background, we take the 4 corners and the center of the bounding box as keypoints of the corresponding object. In our implementation, the 4 corners of the bounding box are annotated as background while the center is annotated as the category of the corresponding object.
-
Fusion of different models. We train our deep model with different expanded datasets to obtain different models. First, we apply NMS to generate the detection results of each deep model. Then, we count how many deep models produce a bounding box with a score greater than the threshold. If the number is more than half of the deep models, we keep the bounding box; otherwise we discard it. Finally, we perform NMS again to generate the final detection results, as sketched below.
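A minimal sketch of the majority-voting step, assuming per-model detection lists after the first NMS pass; the score threshold and the IoU used to decide whether two models agree on a box are assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def majority_vote(per_model_dets, score_thr=0.5, agree_iou=0.5):
    """Keep a detection only if more than half of the models produce an
    overlapping box of the same class above the score threshold.

    per_model_dets: list (one entry per model) of lists of (box, score, cls).
    """
    n_models = len(per_model_dets)
    kept = []
    for m, dets in enumerate(per_model_dets):
        for box, score, cls in dets:
            if score <= score_thr:
                continue
            votes = 1  # the model that produced the box votes for it
            for other in range(n_models):
                if other == m:
                    continue
                if any(c == cls and s > score_thr and box_iou(box, b) >= agree_iou
                       for b, s, c in per_model_dets[other]):
                    votes += 1
            if votes > n_models / 2:
                kept.append((box, score, cls))
    return kept  # a final NMS over `kept` then removes duplicates across models
```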
A.12 Modified YOLOv3 (YOLOv3++)
Yuanwei Wu, Wenchi Ma, Usman Sajid, Guanghui Wang
YOLOv3++ is based on YOLOv3 [39], a one-stage detection method that does not use object proposals. We follow the default settings of YOLOv3 during training. To improve detection performance, we conduct experiments that increase the network resolution at training and inference time and recalculate the anchor box priors on the VisDrone dataset. We only use the provided training dataset to train YOLOv3 without adding additional training data, and evaluate the algorithm on the validation dataset. Then, the training and validation sets are combined to train a new YOLOv3 model, and the predicted class probabilities and bounding box positions on the testing dataset are submitted as our final submission.
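Recomputing YOLO anchor priors is commonly done by clustering ground-truth box sizes with k-means under a 1 - IoU distance, as in the original YOLO papers; the sketch below follows that recipe, with the number of clusters and the iteration count as assumptions rather than the authors' actual settings.

```python
import numpy as np

def wh_iou(wh, centroids):
    """IoU between a (w, h) pair and each centroid, assuming boxes share a corner."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(box_wh, k=9, iters=100, seed=0):
    """Cluster ground-truth box (width, height) pairs into k anchor priors."""
    rng = np.random.default_rng(seed)
    box_wh = np.asarray(box_wh, dtype=np.float64)
    centroids = box_wh[rng.choice(len(box_wh), k, replace=False)]
    for _ in range(iters):
        # assign each box to the centroid it overlaps most (distance = 1 - IoU)
        assign = np.array([np.argmax(wh_iou(wh, centroids)) for wh in box_wh])
        new_centroids = np.array([
            box_wh[assign == c].mean(axis=0) if np.any(assign == c) else centroids[c]
            for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]  # sorted by area
```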
A.13 CERTH’s Object Detector in Images (CERTH-ODI)
Emmanouil Michail, Konstantinos Avgerinakis, Panagiotis Giannakeris, Stefanos Vrochidis, Ioannis Kompatsiaris
{michem, koafgeri, giannakeris, stefanos, ikom}@iti.gr
CERTH-ODI is trained on the whole training set of the VisDrone2018-DET dataset. However, since pedestrians and cars dominate the other classes, we remove several thousand car and pedestrian annotations to balance the training set. For training we use the Inception ResNet-v2 Faster R-CNN model pre-trained on the MSCOCO dataset. To provide more accurate results, we use a combination of different training set-ups: one with all available object classes trained for 800,000 training steps, one with four-wheel vehicles only (i.e., car, van, truck, and bus, because they share similar characteristics), and one with the remaining classes. We apply each model separately to each image, perform NMS on its results, and then merge all resulting bounding boxes from the different models. Subsequently, we reject overlapping bounding boxes with an IoU above 0.6 (chosen empirically), excluding several class combinations, such as people-bicycle and people-motor, which tend to overlap strongly.
A.14 Modified Faster-RCNN for Small Objects Detection (MFaster-RCNN)
Wenrui He, Feng Zhu
{hewenrui,zhufeng}@bupt.edu.cn
MFaster-RCNN is improved from the Faster R-CNN model [40]. Our method only uses the VisDrone2018-DET train set with data augmentation, including cropping, zooming, and flipping. We use a pre-trained ResNet-101 as the backbone due to GPU memory limits. The tuned hyper-parameters are mainly as follows: (1) The anchor ratios are adjusted from [0.5, 1, 2] to [0.5, 1.5, 2.5], calculated by k-means on the training data. (2) The base size of the anchors remains 16, but the multiplicative scales are adjusted from [4, 8, 16] to [1, 2, 4, 8, 16] to detect very small objects. (3) The RPN positive overlap threshold, which decides whether a proposal is regarded as a positive sample to train the RPN, is adjusted from 0.7 to 0.5, while the RPN negative overlap threshold is adjusted from 0.3 to 0.2. (4) The foreground and background thresholds for the Fast R-CNN part are 0.5 and 0.1, respectively. The foreground fraction is adjusted from 0.25 to 0.4, as we find these values perform best in practice. (5) The maximal number of ground-truth boxes allowed for training in one input image is adjusted from 20 to 60, as we have a large number of training samples per image on average.
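For reference, the listed changes map onto a configuration roughly like the following; the key names are illustrative and do not correspond to any particular codebase.

```python
# Illustrative configuration reflecting the hyper-parameters described above;
# key names are hypothetical placeholders, not the submission's actual config file.
mfaster_rcnn_cfg = {
    "anchor_ratios": [0.5, 1.5, 2.5],      # from [0.5, 1, 2], via k-means on training data
    "anchor_base_size": 16,
    "anchor_scales": [1, 2, 4, 8, 16],     # from [4, 8, 16], to catch very small objects
    "rpn_positive_overlap": 0.5,           # from 0.7
    "rpn_negative_overlap": 0.2,           # from 0.3
    "rcnn_fg_thresh": 0.5,                 # Fast R-CNN foreground threshold
    "rcnn_bg_thresh": 0.1,                 # Fast R-CNN background threshold
    "rcnn_fg_fraction": 0.4,               # from 0.25
    "max_gt_boxes_per_image": 60,          # from 20
}
```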
A.15 SSD with Comprehensive Feature Enhancement (CFE-SSDv2)
Qijie Zhao, Feng Ni, Yongtao Wang
{zhaoqijie,nifeng,wyt}@pku.edu.cn
CFE-SSDv2 is an end-to-end one-stage object detector with a specially designed novel module, namely the Comprehensive Feature Enhancement (CFE) module. We first improve the original SSD model [32] by enhancing the weak features used for detecting small objects. Our CFE-SSDv2 is designed to enhance the detection ability for small objects. In addition, we apply a multi-scale inference strategy. Although training on an input size of \(800\times 800\), we broaden the input size to \(2200\times 2200\) at inference time, leading to further improvement in detecting small objects.
A.16 Faster R-CNN based object detection (Faster R-CNN2)
Fan Zhang
zhangfan_1@stu.xidian.edu.cn
Faster R-CNN2 is based on Faster R-CNN [40], trained on the VisDrone2018-DET dataset with some adjusted parameters. For example, we add a small anchor scale \(64^2\) to detect small objects and reduce the mini-batch size from 256 to 128.
A.17 DBPN+Deformable FPN+Soft NMS (DDFPN)
Liyu Lu
coordinate@tju.edu.cn
DDFPN is designed for small object detection. Since the dataset contains a large number of small objects, we scale up the original image first and then detect the objects. We use the DBPN [22] super-resolution network to upsample the image. The model used for the detection task is Deformable FPN [8, 29]. Besides, we use Soft-NMS [3] as our non-maximum suppression algorithm.
For network training, we first divide the input images into patches of size \(1024\times 1024\), and obtain 23,602 training images with their corresponding labels as the training set for Deformable FPN. Our training process uses the OHEM training method [42]. The learning rate used in training is 0.001, and the image input size used for training is \(1024\times 1024\). ResNet-101 is used as the backbone and the weights are initialized using a model pre-trained on ImageNet.
For network testing, we use the same method as for the training set to divide each test image into patches of size \(512\times 512\). Next, we up-sample these test patches to \(1024\times 1024\) via the DBPN network. Then we send the patches to our trained Deformable FPN to obtain detections at \(1024\times 1024\); each patch corresponds to a \(512\times 512\) region of the original image. Since the results at different scales complement each other's blind spots, we use multi-scale images for testing, i.e., [688, 688], [800, 800], [1200, 1200], [1400, 1400], [1600, 1600], and [2000, 2000]. Finally, we merge the results of all scales derived from the same image back into a single image to obtain the final test results.
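A simplified sketch of the tile-and-merge testing procedure: split the image into fixed-size patches, super-resolve and detect on each patch, then scale the boxes back and shift them by the patch offsets. The `detector` and `upsample` callables stand in for the trained Deformable FPN and the DBPN network, and the non-overlapping tiling is an assumption.

```python
def split_into_patches(width, height, patch=512):
    """Yield (x0, y0) offsets of patch tiles covering the image."""
    for y0 in range(0, height, patch):
        for x0 in range(0, width, patch):
            yield x0, y0

def detect_full_image(image, detector, upsample, patch=512, up_factor=2):
    """Run detection patch-wise and map boxes back to original image coordinates.

    image: HxWx3 array; `detector(img)` and `upsample(img)` are placeholders for
    the trained Deformable FPN and the DBPN super-resolution network.
    """
    H, W = image.shape[:2]
    all_dets = []
    for x0, y0 in split_into_patches(W, H, patch):
        tile = image[y0:y0 + patch, x0:x0 + patch]
        sr_tile = upsample(tile)                       # e.g. 512x512 -> 1024x1024
        for (x1, y1, x2, y2), score, cls in detector(sr_tile):
            # undo the super-resolution scaling, then shift by the tile offset
            box = (x1 / up_factor + x0, y1 / up_factor + y0,
                   x2 / up_factor + x0, y2 / up_factor + y0)
            all_dets.append((box, score, cls))
    return all_dets  # a final NMS over all_dets removes duplicates at tile borders
```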
A.18 IIT-H Drone Object DetectiOn (IITH DODO)
Nehal Mamgain, Naveen Kumar Vedurupaka, K. J. Joseph, Vineeth N. Balasubramanian
cs17mtech11023@iith.ac.in, naveenkumarvedurupaka@gmail.com
{cs17m18p100001, vineethnb}@iith.ac.in
IITH DODO is based on the Faster R-CNN architecture [40]. Faster R-CNN has a Region Proposal Network which is trained end-to-end and shares convolutional features with the detection network, thus reducing the computational cost of high-quality region proposals. Our model uses the Inception ResNet-v2 [43] backbone for Faster R-CNN, pre-trained on the COCO dataset. The anchor sizes are adapted to improve the performance of the detector on small objects. To reduce the complexity of the model, only anchors of a single aspect ratio are used. Non-maximum suppression is applied both to the region proposals and to the final bounding box predictions. Atrous convolutions are also used. No external data is used for training and no test-time augmentation is performed. The performance is the result of the detection pipeline with no ensemble used.
A.19 Adjusted Faster Region-Based Convolutional Neural Networks (Faster R-CNN+)
Tiaojio Lee, Yue Fan, Han Deng, Lin Ma, Wei Zhang
{tianjiao.lee, fanyue}@mail.sdu.edu.cn, 67443542@qq.com
forest.linma@gmail.com, davidzhang@sdu.edu.cn
Faster R-CNN+ basically follows the original Faster R-CNN algorithm [40]. However, we make a few adjustments to adapt it to the VisDrone-DET dataset. The dataset contains objects of widely varying sizes, which leads to a multi-scale object detection problem. To mitigate the impact of rapid changes in the scales of bounding boxes, we add more anchors with large sizes to fit larger objects and keep small anchors unchanged for detecting tiny objects, such as people and cars at long distance. Moreover, the VisDrone-DET dataset has an unbalanced object distribution. When testing on the validation dataset, we find that the classification performance for cars is much better than for other classes because cars appear more frequently. To alleviate this problem, we mask out some car bounding boxes by hand to pursue better classification performance.
A.20 Multi-Scale Convolutional Neural Networks (MSCNN)
Dongdong Li, Yangliu Kuai, Hao Liu, Zhipeng Deng, Juanping Zhao
moqimubai@sina.cn
MSCNN is a unified and effective deep CNN based approach for simultaneously detecting multi-class objects in UAV images with large scale variability. Similar to Faster R-CNN, our method consists of two sub-networks: a multi-scale object proposal network (MS-OPN) [4] and an accurate object detection network (AODN) [5]. Firstly, we redesign the architecture of the feature extractor by adopting some recent building blocks, such as the inception module, which can increase the variety of receptive field sizes. To ease the inconsistency between the size variability of objects and the fixed filter receptive fields, MS-OPN is performed on several intermediate feature maps, according to the scale ranges of different objects. That is, larger objects are proposed in deeper feature maps with highly abstracted information, whereas smaller objects are proposed in shallower feature maps with fine-grained details. The object proposals from the various intermediate feature maps are combined to form the outputs of MS-OPN. These object proposals are then sent to the AODN for accurate object detection. For detecting small objects that appear in groups, AODN combines several outputs of intermediate layers to increase the resolution of feature maps, enabling small and densely packed objects to produce larger regions of strong response.
A.21 Feature Pyramid Networks for Object Detection (FPN3)
Chengzheng Li, Zhen Cui
czhengli@njust.edu.cn, zhen.cui@njust.edu.cn
FPN3 follows Faster R-CNN [40] with the feature pyramid [29]. We make some modifications to the algorithm. First of all, since most of the objects in the VisDrone-DET2018 dataset are quite small, we add another pyramid stage to the original P2-P6 layers: we take the output of conv1 in ResNet [25], before the pooling layer, as C1, and transform it into P1 at 1/2 resolution, as is done in FPN. The anchor size of this stage is 16, and the additional stage is used to detect smaller objects in images. Secondly, we replace the parameter-free nearest-neighbor up-sampling with a deconvolution layer, which has learnable parameters like a convolution layer, since layers with parameters perform better than those without. In the training phase, we train two models based on ResNet-50 and ResNet-101, respectively; all training images are artificially occluded and flipped to make the models more robust. In the testing phase, we combine the results from ResNet-50 and ResNet-101 as the final results.
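The replacement of nearest-neighbor up-sampling by a deconvolution in the top-down pathway could look like the following PyTorch module; the channel counts, kernel size, and smoothing convolution are assumptions.

```python
import torch
import torch.nn as nn

class TopDownBlock(nn.Module):
    """One FPN top-down merge step with learnable (deconvolution) upsampling.

    Instead of parameter-free nearest-neighbor interpolation, a ConvTranspose2d
    doubles the resolution of the coarser map before adding the lateral connection.
    """
    def __init__(self, c_lateral, c_pyramid=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_lateral, c_pyramid, kernel_size=1)
        self.upsample = nn.ConvTranspose2d(c_pyramid, c_pyramid,
                                           kernel_size=4, stride=2, padding=1)
        self.smooth = nn.Conv2d(c_pyramid, c_pyramid, kernel_size=3, padding=1)

    def forward(self, top, lateral_feat):
        # top: coarser pyramid level; lateral_feat: backbone feature at the finer level
        merged = self.upsample(top) + self.lateral(lateral_feat)
        return self.smooth(merged)

# Example: merging a stride-8 pyramid level with a stride-4 backbone feature map.
if __name__ == "__main__":
    block = TopDownBlock(c_lateral=256)
    p_coarse = torch.randn(1, 256, 64, 64)
    c_fine = torch.randn(1, 256, 128, 128)
    print(block(p_coarse, c_fine).shape)  # torch.Size([1, 256, 128, 128])
```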
A.22 Dense Feature Pyramid Net (DenseFPN)
Xin Sun
sunxin@ouc.edu.cn
DenseFPN is inspired by Feature Pyramid Networks [29] to detect small objects in the VisDrone2018 dataset. The original FPN uses low-level features to predict small objects. We use the same strategy and fuse high-level and low-level features in a dense feature pyramid network. Meanwhile, we crop the training images into small sizes to avoid the resize operation. Then we merge the results to obtain the best detection result.
A.23 SJTU-Ottawa-Detector (SOD)
Lu Ding, Yong Wang, Chen Qian, Robert Laganière, Xinbin Luo
dinglu@sjtu.edu.cn, ywang6@uottawa.ca, qian_chen@sjtu.edu.cn
laganier@eecs.uottawa.ca, losinbin@sjtu.edu.cn
SOD employs a pyramid-like prediction network to detect objects with a large range of scales, because pyramid-like representations are widely used in recognition systems for detecting objects at different scales [29]. The predictions made by higher-level feature maps contain stronger contextual semantics, while the lower-level ones integrate more localized information at finer spatial resolution. These predictions are hierarchically fused together to make pyramid-like decisions. We use this pyramid-like prediction network for the RPN and a region-based fully convolutional network (R-FCN) [7] to perform object detection.
A.24 Ensemble of Four RefineDet Models with Multi-scale Deployment (RD\(^4\)MS)
Oliver Acatay, Lars Sommer, Arne Schumann
{oliver.acatay, lars.sommer, arne.schumann}@iosb.fraunhofer.de
RD\(^4\)MS is a variant of the RefineDet detector [50], using the novel Squeeze-and-Excitation Network (SENet) [27] as the base network. We train four variants of the detector: three with SEResNeXt-50 and one with ResNet-50 as base network, each with its own set of anchor sizes. Multi-scale testing is employed and the detection results of the four detectors are combined via weighted averaging.
A.25 Improved Light-Head RCNN (L-H RCNN+)
Li Yang, Qian Wang, Lin Cheng, Shubo Wei
liyang16361@163.com, {844021514,2643105823,914417478}@qq.com
L-H RCNN+ modifies the published light-head RCNN algorithm [28]. Firstly, we modify the parameter “anchor_scales”, replacing \(32\times 32\), \(64\times 64\), \(128\times 128\), \(256\times 256\), and \(512\times 512\) with \(16\times 16\), \(32\times 32\), \(64\times 64\), \(128\times 128\), and \(256\times 256\). Secondly, we modify the parameter “max_boxes_of_image”, replacing 50 with 600. Thirdly, we perform NMS over all detected objects that belong to the same category.
A.26 Improved YOLOv3 with Data Processing (YOLOv3_DP)
Qiuchen Sun, Sheng Jiang
345412791@qq.com
YOLOv3_DP is based on the YOLOv3 model [39]. We process the images of the training set as follows. Firstly, we remove some images containing pedestrians and cars. Secondly, we increase the brightness of darker images to enhance the data. Thirdly, we black out the ignored regions in each image and cut the image into crops of size \(512\times 512\) with a step size of 400; crops without objects are removed. The final training set thus contains 31,406 images of size \(512\times 512\).
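A sketch of this pre-processing, assuming VisDrone-style box and ignored-region annotations in pixel coordinates; the exact object-retention test and the handling of image borders are assumptions.

```python
import numpy as np

def preprocess_image(image, boxes, ignored_regions, crop=512, step=400):
    """Black out ignored regions, then cut overlapping crops of size `crop`
    with stride `step`, keeping only crops that contain at least one object.

    image: HxWx3 uint8 array; boxes / ignored_regions: lists of [x1, y1, x2, y2].
    """
    img = image.copy()
    for x1, y1, x2, y2 in ignored_regions:
        img[int(y1):int(y2), int(x1):int(x2)] = 0      # black out ignored pixels
    H, W = img.shape[:2]
    crops = []
    for y0 in range(0, max(H - crop, 1), step):
        for x0 in range(0, max(W - crop, 1), step):
            tile = img[y0:y0 + crop, x0:x0 + crop]
            kept = []
            for x1, y1, x2, y2 in boxes:
                nx1, ny1 = max(x1 - x0, 0), max(y1 - y0, 0)
                nx2, ny2 = min(x2 - x0, crop), min(y2 - y0, crop)
                if nx2 > nx1 and ny2 > ny1:            # box (partially) inside the crop
                    kept.append([nx1, ny1, nx2, ny2])
            if kept:                                   # discard crops without objects
                crops.append((tile, kept))
    return crops
```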
A.27 RetinaNet implemented by Keras (Keras-RetinaNet)
Qiuchen Sun, Sheng Jiang
345412791@qq.com
Keras-RetinaNet is based on RetinaNet [30], implemented with the Keras toolkit. The source code can be found at: https://github.com/facebookresearch/Detectron.
A.28 Focal Loss for Dense Object Detection (RetinaNet2)
Li Yang, Qian Wang, Lin Cheng, Shubo Wei
liyang16361@163.com, 844021514@qq.com
2643105823@qq.com, 914417478@qq.com
RetinaNet2 is based on the RetinaNet [30] algorithm. The short side of images is set to 800, and the maximum size of the image is set to 1,333. Each mini-batch has 1 image per GPU for training/testing.
A.29 Multiple-Scale YOLO Network (MSYOLO)
Haoran Wang, Zexin Wang, Ke Wang, Xiufang Li
18629585405@163.com, 1304180668@qq.com
MSYOLO is a multiple-scale YOLO network [39]. We divide the categories into three cases according to the scale of the object categories. First of all, ignored regions and the others category form the first case, which is excluded from training. Second, since many categories are not at the same scale, we divide them into big objects and small objects on the basis of their box scales. The big objects include car, truck, van, and bus, and the small objects include pedestrian, people, bicycle, motor, tricycle, and awning-tricycle. Images cropped around big objects have a scale of \(480\times 480\), and those around small objects have a scale of \(320\times 320\).
A.30 Region-Based Single-Shot Refinement Network (R-SSRN)
Wenzhe Yang, Jianxiu Yang
wzyang@stu.xidian.edu.cn, jxyang_xidian@outlook.com
R-SSRN is based on the deep learning method RefineDet [50]. We make the following modifications: (1) We remove the deep convolutional layers after fc7 because they are not useful for detecting the small objects in VisDrone; (2) We add additional small-scale default boxes at conv3_3 and set new aspect ratios using the k-means clustering algorithm on the VisDrone dataset. The changed scales and aspect ratios make the default boxes more suitable for the objects; (3) Due to the small and dense objects, we split each image into 5 sub-images (i.e., bottom left, bottom right, middle, top left, top right), where the size of each sub-image is 1/4 of that of the original image. After testing on the sub-images, we merge the results using NMS.
A.31 A Highly Accurate Object Detector in Drone Scenarios (AHOD)
Jianqiang Wang, Yali Li, Shengjin Wang
wangjian16@mails.tsinghua.edu.cn, liyali@ocrserv.ee.tsinghua.edu.cn
wgsgj@tsinghua.edu.cn
AHOD is a novel detection method with high accuracy in drone scenarios. First, a feature fusion backbone network with the capability of modelling geometric transformations is proposed to extract object features. Second, a special object proposal sub-network is applied to generate candidate proposals using multi-level semantic feature maps. Finally, a head network refines the categories and locations of these proposals.
A.32 Hybrid Attention based Low-Resolution Retina-Net (HAL-Retina-Net)
Yali Li, Zhaoyue Xia, Shengjin Wang
{liyali13, wgsgj}@tsinghua.edu.cn
HAL-Retina-Net is improved from RetinaNet [30]. To detect low-resolution objects, we remove P6 and P7 from the pyramid, so the pyramid of the network includes three pathways, named P1, P3, and P5. We inherit the head design of RetinaNet. Furthermore, the post-processing steps include Soft-NMS [3] and bounding box voting. We find that bounding box voting improves the detection accuracy significantly. We also note that increasing the normalized size of images brings significant improvement. To make full use of the training samples, we split the images into patches of size \(640\times 640\). To avoid running out of memory during detection, we use SE-ResNeXt-50 [27] as the backbone network and train RetinaNet with the cropped sub-images. To further improve the detection accuracy, we add a hybrid attention mechanism: we use an additional SE module [27] and downsample-upsample [46] to learn channel attention and spatial attention, respectively. Our final detection results on the test-challenge set are based on an ensemble of the modified RetinaNet with the above two kinds of attention.
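The channel-attention part of the hybrid attention, i.e., the SE module of [27], can be sketched in PyTorch as below; the reduction ratio is an assumption, and the downsample-upsample spatial-attention branch is omitted.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention: global pooling ('squeeze')
    followed by a two-layer bottleneck that produces per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w   # re-weight feature channels

# Example: re-weighting a 256-channel feature map.
if __name__ == "__main__":
    feat = torch.randn(2, 256, 32, 32)
    print(SEBlock(256)(feat).shape)  # torch.Size([2, 256, 32, 32])
```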
A.33 Small Object Detection in Large Scene based on YOLOv3 (SODLSY)
Sujuan Wang, Yifan Zhang, Jian Cheng
Wangsujuan@airia.cn, {yfzhang,jcheng}@nlpr.ia.ac.cn
SODLSY is used to detect objects in various weather and lighting conditions, representing diverse scenarios in our daily life. The maximum resolution of VOC images is \(469\times 387\), and \(640\times 640\) for COCO images, whereas the static images in VisDrone2018 reach \(2000\times 1500\). Our algorithm first increases the size of the training images to 1184, ensuring the information of small objects is not lost during image resizing. We adopt a multi-scale (\(800,832,864,\cdots ,1376\)) training method to improve the detection results. We also re-generate the anchors for VisDrone-DET2018.
A.34 Drone Pyramid Networks (DPNet)
HongLiang Li, Qishang Cheng, Wei Li, Xiaoyu Chen, Heqian Qiu, Zichen Song
hlli@uestc.edu.cn, cqs@std.uestc.edu.cn, weili.cv@gmail.com
xychen9459@gmail.com, hqqiu@std.uestc.edu.cn, szc.uestc@gmail.com
DPNet consists of three object detectors based on the Faster R-CNN [40] method, implemented in the Caffe2 deep learning framework and trained in parallel on 8 GPUs. The design of DPNet follows the idea of FPN [29]; the feature extractors are ResNet-50 [25], ResNet-101, and ResNeXt [48], respectively, which are pre-trained on ImageNet only. To make full use of the data, the methods are designed as follows:
-
No additional data other than the train + val dataset are used for network training.
-
We train Faster R-CNN with FPN using multiple scales (\(1000\times 1000\), \(800\times 800\), \(600\times 600\)) to naturally handle objects of various sizes, yielding an improvement of \(4\%\).
-
When selecting the prior boxes, we set multiple specific aspect ratios based on the scale distribution of the training data.
-
We change the IoU threshold from 0.5 to 0.6 and remove the last FPN layer, yielding an improvement of \(1.5\%\).
We use Soft-NMS [3] instead of conventional NMS to select predicted boxes. We replace RoIPooling with RoIAlign [23] to avoid quantization of region features. We use multi-scale training and testing. On the validation set, our best single detector obtains an mAP of \(49.6\%\), and the ensemble of three detectors achieves an mAP of \(50.0\%\).
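For reference, a minimal per-class linear Soft-NMS [3] sketch is given below; unlike hard NMS, overlapping boxes have their scores decayed rather than being removed, and the threshold values here are assumptions.

```python
import numpy as np

def soft_nms(boxes, scores, iou_thr=0.3, score_thr=0.001):
    """Linear Soft-NMS: decay the scores of boxes overlapping the current
    top-scoring box instead of discarding them outright.

    boxes: Nx4 array [x1, y1, x2, y2]; scores: length-N array. Returns kept indices.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64).copy()
    idxs = list(range(len(boxes)))
    keep = []
    while idxs:
        best = max(idxs, key=lambda i: scores[i])
        if scores[best] < score_thr:
            break
        keep.append(best)
        idxs.remove(best)
        bx = boxes[best]
        for i in idxs:
            b = boxes[i]
            ix1, iy1 = max(bx[0], b[0]), max(bx[1], b[1])
            ix2, iy2 = min(bx[2], b[2]), min(bx[3], b[3])
            inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
            union = ((bx[2] - bx[0]) * (bx[3] - bx[1]) +
                     (b[2] - b[0]) * (b[3] - b[1]) - inter)
            iou = inter / (union + 1e-9)
            if iou > iou_thr:
                scores[i] *= (1.0 - iou)   # linear score decay, as in Soft-NMS
    return keep
```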
A.35 Feature Pyramid Networks for Object Detection (FPN)
Submitted by the VisDrone Committee
FPN exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids. It combines low-resolution but semantically strong features with high-resolution but semantically weak features. Thus it exploits rich semantic information at all scales and is trained end-to-end. The experimental results show that this architecture can significantly improve generic deep models in several applications. Please refer to [29] for more details.
A.36 Object Detection via Region-based Fully Convolutional Networks (R-FCN)
Submitted by the VisDrone Committee
R-FCN is a region-based fully convolutional network for object detection without a RoI-wise sub-network. Different from previous methods such as Fast R-CNN and Faster R-CNN, which use a costly per-region subnetwork, R-FCN addresses the dilemma between translation invariance in image classification and translation variance in object detection using position-sensitive score maps. Thus, almost all of the computation is shared over the whole image. It can also adopt recent state-of-the-art classification network backbones (e.g., ResNet and Inception) for better performance. Please refer to [7] for more details.
A.37 Towards Real-Time Object Detection with Region Proposal Networks (Faster R-CNN)
Submitted by the VisDrone Committee
Faster R-CNN improves Fast R-CNN [20] by adding a Region Proposal Network (RPN). The RPN shares full-image convolutional features with the detection network in a nearly cost-free way. Specifically, it is implemented as a fully convolutional network that predicts object bounding boxes and their objectness scores at the same time. Given the object proposals from the RPN, the Fast R-CNN model shares the convolutional features and detects objects efficiently. Please refer to [40] for more details.
A.38 Single Shot MultiBox Detector (SSD)
Submitted by the VisDrone Committee
SSD is a one-stage object detection method based on a single deep neural network without proposal generation. It uses a set of pre-set anchor boxes with different aspect ratios and scales to discretize the output space of bounding boxes. To deal with multi-scale object detection, the network combines predictions from several feature maps at different layers. Notably, it predicts the score of each object category and adjusts the corresponding bounding box simultaneously. The network is optimized via a multi-task loss, including a confidence loss and a localization loss. Finally, the multi-scale bounding boxes are converted to the detection results using the NMS strategy. Please refer to [32] for more details.