The Unmanned Aerial Vehicle Benchmark: Object Detection, Tracking and Baseline

  • Hongyang Yu
  • Guorong Li
  • Weigang Zhang
  • Qingming Huang
  • Dawei Du
  • Qi Tian
  • Nicu Sebe


As Unmanned Aerial Vehicles (UAVs) become more popular in computer vision applications, intelligent UAV video analysis is attracting a growing number of researchers. To facilitate research in the UAV field, this paper presents a UAV dataset of 100 videos featuring approximately 2700 vehicles recorded under unconstrained conditions, with 840k manually annotated bounding boxes. The videos were recorded in complex real-world scenarios and pose significant new challenges to existing object detection and tracking methods, such as complex scenes, high density, small objects, and large camera motion. These challenges encouraged us to define a benchmark on our UAV dataset for three fundamental computer vision tasks: object detection, single object tracking (SOT), and multiple object tracking (MOT). Specifically, the benchmark supports evaluation and detailed analysis of state-of-the-art detection and tracking methods on the proposed dataset. Furthermore, we propose a novel approach based on the Context-aware Multi-task Siamese Network (CMSN) model, which explores new cues in UAV videos by judging the degree of consistency between objects and their contexts and which can be used for both SOT and MOT. The experimental results demonstrate that our model makes tracking results more robust in both SOT and MOT. They also show that current tracking and detection methods have limitations on the proposed UAV benchmark and that further research is needed.


Keywords: UAV, Object detection, Single object tracking, Multiple object tracking



This work was supported in part by the National Natural Science Foundation of China under Grants 61620106009, 61836002, U1636214, 61931008, 61772494 and 61976069, in part by the Key Research Program of Frontier Sciences, CAS: QYZDJ-SSW-SYS013, in part by the Italy-China collaboration project TALENT: 2018YFE0118400, in part by the University of Chinese Academy of Sciences, in part by the Youth Innovation Promotion Association CAS, in part by ARO grant W911NF-15-1-0290, and in part by Faculty Research Gift Awards from NEC Laboratories of America and Blippar.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Harbin Institute of Technology, Harbin, China
  2. University of Chinese Academy of Sciences, Beijing, China
  3. Harbin Institute of Technology, Weihai, China
  4. University of Texas at San Antonio, San Antonio, USA
  5. University of Trento, Trento, Italy
  6. Key Laboratory of Big Data Mining and Knowledge Management, CAS, Beijing, China
  7. Key Laboratory of Intelligent Information Processing (IIP), Institute of Computing Technology, CAS, Beijing, China
