The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking

  • Dawei Du
  • Yuankai Qi
  • Hongyang Yu
  • Yifan Yang
  • Kaiwen Duan
  • Guorong LiEmail author
  • Weigang Zhang
  • Qingming Huang
  • Qi Tian
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11214)


With the advantage of high mobility, Unmanned Aerial Vehicles (UAVs) are used to fuel numerous important applications in computer vision, delivering more efficiency and convenience than surveillance cameras with fixed camera angle, scale and view. However, very limited UAV datasets are proposed, and they focus only on a specific task such as visual tracking or object detection in relatively constrained scenarios. Consequently, it is of great importance to develop an unconstrained UAV benchmark to boost related researches. In this paper, we construct a new UAV benchmark focusing on complex scenarios with new level challenges. Selected from 10 hours raw videos, about 80, 000 representative frames are fully annotated with bounding boxes as well as up to 14 kinds of attributes (e.g., weather condition, flying altitude, camera view, vehicle category, and occlusion) for three fundamental computer vision tasks: object detection, single object tracking, and multiple object tracking. Then, a detailed quantitative study is performed using most recent state-of-the-art algorithms for each task. Experimental results show that the current state-of-the-art methods perform relative worse on our dataset, due to the new challenges appeared in UAV based real scenes, e.g., high density, small object, and camera motion. To our knowledge, our work is the first time to explore such issues in unconstrained scenes comprehensively. The dataset and all the experimental results are available in


UAV Object detection Single object tracking Multiple object tracking 



This work was supported in part by National Natural Science Foundation of China under Grant 61620106009, Grant 61332016, Grant U1636214, Grant 61650202, Grant 61772494 and Grant 61429201, in part by Key Research Program of Frontier Sciences, CAS: QYZDJ-SSW-SYS013, in part by Youth Innovation Promotion Association CAS, in part by ARO grants W911NF-15-1-0290 and Faculty Research Gift Awards by NEC Laboratories of America and Blippar.


  1. 1.
    Mot17 challenge.
  2. 2.
    Bae, S.H., Yoon, K.: Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In: CVPR, pp. 1218–1225 (2014)Google Scholar
  3. 3.
    Barekatain, M., et al.: Okutama-action: an aerial view video dataset for concurrent human action detection. In: CVPRW, pp. 2153–2160 (2017)Google Scholar
  4. 4.
    Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP J. Image Video Process. 2008(2008)CrossRefGoogle Scholar
  5. 5.
    Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional siamese networks for object tracking. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 850–865. Springer, Cham (2016). Scholar
  6. 6.
    Bewley, A., Ge, Z., Ott, L., Ramos, F.T., Upcroft, B.: Simple online and realtime tracking. In: ICIP, pp. 3464–3468 (2016)Google Scholar
  7. 7.
    Bochinski, E., Eiselein, V., Sikora, T.: High-speed tracking-by-detection without using image information. In: AVSS, pp. 1–6 (2017)Google Scholar
  8. 8.
    Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: NIPS, pp. 379–387 (2016)Google Scholar
  9. 9.
    Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ECO: efficient convolution operators for tracking. CoRR abs/1611.09224 (2016)Google Scholar
  10. 10.
    Danelljan, M., Häger, G., Khan, F.S., Felsberg, M.: Learning spatially regularized correlation filters for visual tracking. In: ICCV, pp. 4310–4318 (2015)Google Scholar
  11. 11.
    Danelljan, M., Häger, G., Khan, F.S., Felsberg, M.: Adaptive decontamination of the training set: a unified formulation for discriminative visual tracking. In: CVPR, pp. 1430–1438 (2016)Google Scholar
  12. 12.
    Danelljan, M., Robinson, A., Shahbaz Khan, F., Felsberg, M.: Beyond correlation filters: learning continuous convolution operators for visual tracking. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 472–488. Springer, Cham (2016). Scholar
  13. 13.
    Dicle, C., Camps, O.I., Sznaier, M.: The way they move: tracking multiple targets with similar appearance. In: ICCV, pp. 2304–2311 (2013)Google Scholar
  14. 14.
    Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: an evaluation of the state of the art. TPAMI 34(4), 743–761 (2012)CrossRefGoogle Scholar
  15. 15.
    Kristan, M., et al.: The Visual Object Tracking VOT2016 Challenge Results. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 777–823. Springer, Cham (2016). Scholar
  16. 16.
    Everingham, M., Eslami, S.M.A., Gool, L.J.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: a retrospective. IJCV 111(1), 98–136 (2015)CrossRefGoogle Scholar
  17. 17.
    Fan, H., Ling, H.: Parallel tracking and verifying: a framework for real-time and high accuracy visual tracking. In: ICCV (2017)Google Scholar
  18. 18.
    Ferryman, J., Shahrokni, A.: Pets 2009: dataset and challenge. In: AVSS, pp. 1–6 (2009)Google Scholar
  19. 19.
    Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the KITTI vision benchmark suite. In: CVPR, pp. 3354–3361 (2012)Google Scholar
  20. 20.
    Held, D., Thrun, S., Savarese, S.: Learning to track at 100 FPS with deep regression networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 749–765. Springer, Cham (2016). Scholar
  21. 21.
    Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. TPAMI 37(3), 583–596 (2015)CrossRefGoogle Scholar
  22. 22.
    Hsieh, M., Lin, Y., Hsu, W.H.: Drone-based object counting by spatially regularized regional proposal network. In: ICCV (2017)Google Scholar
  23. 23.
    Hwang, S., Park, J., Kim, N., Choi, Y., Kweon, I.S.: Multispectral pedestrian detection: Benchmark dataset and baseline. In: CVPR, pp. 1037–1045 (2015)Google Scholar
  24. 24.
    Kahou, S.E., Michalski, V., Memisevic, R., Pal, C.J., Vincent, P.: RATM: recurrent attentive tracking model. In: CVPRW, pp. 1613–1622 (2017)Google Scholar
  25. 25.
    Kong, T., Sun, F., Yao, A., Liu, H., Lu, M., Chen, Y.: RON: reverse connection with objectness prior networks for object detection. In: CVPR (2017)Google Scholar
  26. 26.
    Leal-Taixé, L., Milan, A., Reid, I.D., Roth, S., Schindler, K.: Motchallenge 2015: Towards a benchmark for multi-target tracking. CoRR abs/1504.01942 (2015)Google Scholar
  27. 27.
    Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). Scholar
  28. 28.
    Ma, C., Huang, J., Yang, X., Yang, M.: Hierarchical convolutional features for visual tracking. In: ICCV, pp. 3074–3082 (2015)Google Scholar
  29. 29.
    Milan, A., Leal-Taixé, L., Reid, I.D., Roth, S., Schindler, K.: Mot16: a benchmark for multi-object tracking. CoRR abs/1603.00831 (2016)Google Scholar
  30. 30.
    Milan, A., Roth, S., Schindler, K.: Continuous energy minimization for multitarget tracking. TPAMI 36(1), 58–72 (2014)CrossRefGoogle Scholar
  31. 31.
    Mueller, M., Smith, N., Ghanem, B.: A benchmark and simulator for UAV tracking. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 445–461. Springer, Cham (2016). Scholar
  32. 32.
    Mueller, M., Smith, N., Ghanem, B.: Context-aware correlation filter tracking. In: CVPR (2017)Google Scholar
  33. 33.
    Nam, H., Han, B.: Learning multi-domain convolutional neural networks for visual tracking. In: CVPR, pp. 4293–4302 (2016)Google Scholar
  34. 34.
    Papageorgiou, C., Poggio, T.: A trainable system for object detection. IJCV 38(1), 15–33 (2000)CrossRefGoogle Scholar
  35. 35.
    Pirsiavash, H., Ramanan, D., Fowlkes, C.C.: Globally-optimal greedy algorithms for tracking a variable number of objects. In: CVPR, pp. 1201–1208 (2011)Google Scholar
  36. 36.
    Qi, Y., et al.: Hedged deep tracking. In: CVPR, pp. 4303–4311 (2016)Google Scholar
  37. 37.
    Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS, pp. 91–99 (2015)Google Scholar
  38. 38.
    Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 17–35. Springer, Cham (2016). Scholar
  39. 39.
    Robicquet, A., Sadeghian, A., Alahi, A., Savarese, S.: Learning social etiquette: human trajectory understanding in crowded scenes. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 549–565. Springer, Cham (2016). Scholar
  40. 40.
    Smeulders, A.W.M., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A., Shah, M.: Visual tracking: an experimental survey. TPAMI 36(7), 1442–1468 (2014)CrossRefGoogle Scholar
  41. 41.
    Song, Y., Ma, C., Gong, L., Zhang, J., Lau, R.W.H., Yang, M.: CREST: convolutional residual learning for visual tracking. CoRR abs/1708.00225 (2017)Google Scholar
  42. 42.
    Tao, R., Gavves, E., Smeulders, A.W.M.: Siamese instance search for tracking. In: CVPR, pp. 1420–1429 (2016)Google Scholar
  43. 43.
    Valmadre, J., Bertinetto, L., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: End-to-end representation learning for correlation filter based tracking. In: CVPR (2017)Google Scholar
  44. 44.
    Wang, L., Ouyang, W., Wang, X., Lu, H.: Visual tracking with fully convolutional networks. In: ICCV, pp. 3119–3127 (2015)Google Scholar
  45. 45.
    Wang, L., Ouyang, W., Wang, X., Lu, H.: STCT: sequentially training convolutional networks for visual tracking. In: CVPR, pp. 1373–1381 (2016)Google Scholar
  46. 46.
    Wen, L., et al.: DETRAC: a new benchmark and protocol for multi-object tracking. CoRR abs/1511.04136 (2015)Google Scholar
  47. 47.
    Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: NIPS, pp. 2074–2082 (2016)Google Scholar
  48. 48.
    Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. CoRR abs/1703.07402 (2017)Google Scholar
  49. 49.
    Wu, Y., Lim, J., Yang, M.: Object tracking benchmark. TPAMI 37(9), 1834–1848 (2015)CrossRefGoogle Scholar
  50. 50.
    Xiang, Y., Alahi, A., Savarese, S.: Learning to track: online multi-object tracking by decision making. In: ICCV, pp. 4705–4713 (2015)Google Scholar
  51. 51.
    Yang, T., Chan, A.B.: Recurrent filter learning for visual tracking. In: ICCVW, pp. 2010–2019 (2017)Google Scholar
  52. 52.
    Yun, S., Choi, J., Yoo, Y., Yun, K., Choi, J.Y.: Action-decision networks for visual tracking with deep reinforcement learning. In: CVPR (2017)Google Scholar
  53. 53.
    Zhang, T., Xu, C., Yang, M.H.: Multi-task correlation particle filter for robust visual tracking. In: CVPR (2017)Google Scholar
  54. 54.
    Zhang, X., Zou, J., Ming, X., He, K., Sun, J.: Efficient and accurate approximations of nonlinear convolutional networks. In: CVPR, pp. 1984–1992 (2015)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. 1.University of Chinese Academy of SciencesBeijingChina
  2. 2.Harbin Institute of TechnologyHarbinChina
  3. 3.Harbin Institute of TechnologyWeihaiChina
  4. 4.Huawei Noah’s Ark LabShenzhenChina
  5. 5.University of Texas at San AntonioSan AntonioUSA

Personalised recommendations