Deep Reinforcement Learning with Iterative Shift for Visual Tracking

  • Liangliang Ren
  • Xin Yuan
  • Jiwen Lu
  • Ming Yang
  • Jie Zhou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11213)


Visual tracking faces the dilemma of locating a target both accurately and efficiently while deciding online whether and how to adapt the appearance model, or even whether to restart tracking. In this paper, we propose a deep reinforcement learning with iterative shift (DRL-IS) method for single object tracking, in which an actor-critic network predicts the iterative shifts of object bounding boxes and evaluates those shifts to decide whether to update the object model or re-initialize tracking. Because an object is located by an iterative shift process rather than by online classification of many sampled locations, the proposed method is robust to large deformations and abrupt motion, and it is computationally efficient since finding a target takes at most 10 shifts. In offline training, the critic network guides the learning of joint decisions on motion estimation and tracking status in an end-to-end manner. Experimental results on the OTB benchmarks show that, on sequences with large deformation, the proposed method improves tracking precision by 1.7% and runs about 5 times faster than competing state-of-the-art methods.
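
The iterative shift process described in the abstract can be illustrated with a minimal Python sketch. This is not the DRL-IS implementation: the box parameterization, the stopping tolerance, the decision labels, the score thresholds, and the toy actor and critic functions are all hypothetical stand-ins for the learned networks. Only the bound of 10 shifts per frame and the split between shift prediction (actor) and shift evaluation (critic) come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]    # (cx, cy, w, h), an assumed parameterization
Shift = Tuple[float, float, float, float]  # (dx, dy, dw, dh), relative to the box size

MAX_SHIFTS = 10  # from the abstract: finding a target takes up to 10 shifts


@dataclass
class FrameResult:
    box: Box
    num_shifts: int
    decision: str  # hypothetical labels: "update", "keep", "reinit"


def apply_shift(box: Box, shift: Shift) -> Box:
    """Move and rescale a box by offsets expressed relative to its size."""
    cx, cy, w, h = box
    dx, dy, dw, dh = shift
    return (cx + dx * w, cy + dy * h, w * (1.0 + dw), h * (1.0 + dh))


def track_frame(
    box: Box,
    actor: Callable[[Box], Shift],
    critic: Callable[[Box], float],
    stop_eps: float = 0.01,      # assumed tolerance for "shift is negligible"
    update_thresh: float = 0.5,  # assumed score above which the model is updated
    reinit_thresh: float = 0.0,  # assumed score below which tracking restarts
) -> FrameResult:
    """Locate the target in one frame by repeatedly shifting the box."""
    num_shifts = 0
    for _ in range(MAX_SHIFTS):
        shift = actor(box)            # actor proposes the next shift
        box = apply_shift(box, shift)
        num_shifts += 1
        if max(abs(s) for s in shift) < stop_eps:
            break                     # proposed shift is negligible, stop early
    score = critic(box)               # critic evaluates the final box
    if score >= update_thresh:
        decision = "update"
    elif score < reinit_thresh:
        decision = "reinit"
    else:
        decision = "keep"
    return FrameResult(box, num_shifts, decision)


# Toy stand-ins for the learned networks (illustration only).
TARGET = (8.0, 4.0)  # hypothetical true target center


def toy_actor(box: Box) -> Shift:
    """Propose a shift that covers half of the remaining center offset."""
    cx, cy, w, h = box
    return (0.5 * (TARGET[0] - cx) / w, 0.5 * (TARGET[1] - cy) / h, 0.0, 0.0)


def toy_critic(box: Box) -> float:
    """Score a box: 1.0 at the target center, lower with distance."""
    cx, cy, w, h = box
    return 1.0 - (abs(TARGET[0] - cx) / w + abs(TARGET[1] - cy) / h)


result = track_frame((0.0, 0.0, 10.0, 10.0), toy_actor, toy_critic)
```

In this sketch each frame costs at most MAX_SHIFTS actor evaluations plus one critic evaluation, in contrast to scoring many sampled candidate locations per frame.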


Visual object tracking, Reinforcement learning, Actor-critic algorithm



This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, in part by the National Natural Science Foundation of China under Grant 61672306, Grant U1713214, and Grant 61572271, and in part by the Shenzhen Fundamental Research Fund (Subject Arrangement) under Grant JCYJ20170412170602564.

Supplementary material

Supplementary material 1: 474192_1_En_42_MOESM1_ESM.pdf (PDF, 260 KB)
Supplementary material 2: 474192_1_En_42_MOESM2_ESM.mp4 (MP4, 23.5 MB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Liangliang Ren (1)
  • Xin Yuan (1)
  • Jiwen Lu (1), corresponding author
  • Ming Yang (2)
  • Jie Zhou (1)

  1. Tsinghua University, Beijing, China
  2. Horizon Robotics, Inc., Beijing, China
