Multimedia Tools and Applications, Volume 78, Issue 2, pp 2269–2309

Explorations on visual localization from active to passive

  • Yongquan Yang
  • Yang Wu
  • Ning Chen


In this paper, we consider visual localization in two ways, active and passive, under a simple definition: active localization helps a device estimate the location of an object or region of interest, while passive localization helps a device estimate its own location in the environment. To offer some insight into visual localization, we carried out two explorations of active localization and, more importantly, explored upgrading each of them from active to passive localization when extra geometric information is available. To produce unconstrained and accurate 2D location estimates of an object of interest, we constructed an active localization system by fusing detection, tracking and recognition. Based on recognition, we proposed a collaborative strategy that lets detection and tracking enhance each other, yielding better 2D location estimates. Meanwhile, to actively estimate the semantic location of a visual region of interest, we employed recent state-of-the-art lightweight CNN models designed for efficiency and trained two of them on a large place dataset from the perspective of scene recognition. Furthermore, using depth information from an RGB-D camera, we upgraded the active system for 2D object location to a passive system that estimates the 3D location of the device relative to the object of interest. The 3D location of the object was first estimated in the coordinate system of the device; the location of the device relative to the object in the world coordinate system was then deduced under an appropriate assumption. Evaluations, both subjective on an RGB-D sequence captured in a lab environment and practical on a robotic platform in an office environment, indicated that the improved system is suitable for an autonomous following robot.
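As an illustration of the depth-based step described above, the following minimal sketch back-projects a pixel with a metric depth reading into a 3D point in the camera frame using the standard pinhole model, then derives a range and bearing that a following robot could act on. The `Intrinsics` type, the function names, and the intrinsic values used in any call are hypothetical; the abstract does not state the camera model or the assumption used to obtain world coordinates, so this should not be read as the authors' implementation.

```python
import math
from typing import NamedTuple, Tuple


class Intrinsics(NamedTuple):
    """Pinhole camera intrinsics in pixels (values are supplied by the caller)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(u: float, v: float, depth_m: float, k: Intrinsics) -> Tuple[float, float, float]:
    """Map pixel (u, v) with metric depth to a 3D point (x, y, z) in the camera frame."""
    if depth_m <= 0:
        # Many RGB-D sensors report 0 where no depth could be measured.
        raise ValueError("depth must be positive")
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def range_and_bearing(point: Tuple[float, float, float]) -> Tuple[float, float]:
    """Ground-plane distance and horizontal angle (radians) from camera to point.

    Assumes the camera's x axis points right and its z axis points forward.
    """
    x, _, z = point
    return (math.hypot(x, z), math.atan2(x, z))
```

For example, with hypothetical intrinsics `Intrinsics(500.0, 500.0, 320.0, 240.0)`, the center of a detected bounding box and its depth reading can be passed to `backproject`, and the result to `range_and_bearing`, to obtain a distance and a steering angle.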
Likewise, the active system for rough semantic location estimation of a visual region of interest was upgraded to a passive system for fine location estimation of the device, given a 3D map describing the visited environment. From the perspective of place recognition, we first adopted one of the efficient CNN models previously trained for semantic location estimation as a base to generate CNN features, both for retrieving candidate loops in the map and for checking the geometric consistency of the retrieved loops; the true loops were then used to deduce the fine location of the device itself in the environment. Comparison with state-of-the-art results showed that the upgraded system is adequate for long-term robotic autonomy. The four explorations presented achieved favorable performance, which suggests they are adequate for drawing some insights into visual localization.
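The retrieval of candidate loops from CNN features can be pictured with a small sketch. It ranks stored map features by cosine similarity to a query feature and keeps the best matches above a threshold. The similarity measure, the `top_k` and `min_score` parameters, and the function names are assumptions made for illustration; the abstract does not specify them, and the geometric consistency check that follows retrieval is not shown.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def candidate_loops(
    query: Sequence[float],
    map_features: Sequence[Sequence[float]],
    top_k: int = 2,
    min_score: float = 0.8,
) -> List[Tuple[int, float]]:
    """Return up to top_k (map index, score) pairs with score >= min_score, best first."""
    scored = [(i, cosine_similarity(query, f)) for i, f in enumerate(map_features)]
    kept = [s for s in scored if s[1] >= min_score]
    kept.sort(key=lambda s: s[1], reverse=True)
    return kept[:top_k]
```

A brute-force scan like this is only practical for small maps; a larger map would typically need an index structure, which is outside the scope of this sketch.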


Visual localization; Following robot; Loop closure detection; Long-term robotic autonomy



This work was supported by JSPS KAKENHI Grant Number 15K16024. We gratefully acknowledge Intel China Lab and Beijing Qfeel Technology Co., Ltd., China for equipment support.

Supplementary material

ESM 1: 11042_2018_6347_MOESM1_ESM.mp4 (MP4, 9095 kb)
ESM 2: 11042_2018_6347_MOESM2_ESM.mp4 (MP4, 10,062 kb)



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Nara Institute of Science and Technology, Ikoma, Japan
  2. Xi’an Polytechnic University, Xi’an, China
