Multimedia Tools and Applications, Volume 77, Issue 17, pp 22131–22143

Robust real-time visual object tracking via multi-scale fully convolutional Siamese networks

  • Longchao Yang
  • Peilin Jiang
  • Fei Wang
  • Xuan Wang


Robust visual object tracking under occlusion and deformation remains a very challenging task. Existing trackers based on Convolutional Neural Networks (CNNs) either fail to handle these cases or run only at low speed. In this paper, we present a real-time tracker that is robust to occlusions and deformations, based on a Region-based, Multi-Scale Fully Convolutional Siamese Network (R-MSFCN). In the proposed R-MSFCN, region information is extracted separately by introducing position-sensitive score maps on multiple convolutional layers. Combining these score maps via adaptive weights yields an accurate location of the target in a new frame. Experiments show that our method outperforms state-of-the-art approaches and handles object deformation and occlusion at about 31 FPS.
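The fusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the adaptive weights are computed, so the rule used here (a softmax over each map's peak response) is purely an assumption for illustration, as are the function names `fuse_score_maps` and `locate_target` and the toy score maps. It is not the authors' implementation.

```python
import math


def fuse_score_maps(score_maps):
    """Fuse same-sized 2D score maps into one map by a weighted sum.

    ASSUMPTION: weights are a softmax over each map's peak value.
    The paper's actual adaptive weighting scheme is not given in the abstract.
    """
    peaks = [max(max(row) for row in m) for m in score_maps]
    top = max(peaks)
    exps = [math.exp(p - top) for p in peaks]  # subtract max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]

    rows, cols = len(score_maps[0]), len(score_maps[0][0])
    fused = [
        [sum(w * m[r][c] for w, m in zip(weights, score_maps)) for c in range(cols)]
        for r in range(rows)
    ]
    return fused, weights


def locate_target(fused):
    """Return (row, col) of the first maximum in the fused score map."""
    best_val, best_pos = None, None
    for r, row in enumerate(fused):
        for c, val in enumerate(row):
            if best_val is None or val > best_val:
                best_val, best_pos = val, (r, c)
    return best_pos


# Hypothetical score maps from two convolutional layers (toy values).
shallow = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.3], [0.1, 0.2, 0.1]]
deep = [[0.0, 0.1, 0.0], [0.1, 0.5, 0.4], [0.0, 0.1, 0.0]]

fused, weights = fuse_score_maps([shallow, deep])
print(locate_target(fused))  # (1, 1)
```

In this toy case both layers peak at the same cell, so any convex combination of the maps places the target there; the weights only matter when the layers disagree.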


Keywords: Visual tracking · Region-based · Fully convolutional · Siamese-network · Deep learning



This work was supported in part by the Natural Science Foundation of China (No. 61231018), the National Science and Technology Support Program (2015BAH31F01), and the Program of Introducing Talents of Discipline to University under grant B13043.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Longchao Yang (1)
  • Peilin Jiang (2)
  • Fei Wang (1)
  • Xuan Wang (1)
  1. Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China
  2. School of Software Engineering, Xi’an Jiaotong University, Xi’an, China
