
DeepTAM: Deep Tracking and Mapping

  • Huizhong Zhou
  • Benjamin Ummenhofer
  • Thomas Brox
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11220)

Abstract

We present a system for keyframe-based dense camera tracking and depth map estimation that is entirely learned. For tracking, we estimate small pose increments between the current camera image and a synthetic viewpoint. This significantly simplifies the learning problem and alleviates the dataset bias for camera motions. Further, we show that generating a large number of pose hypotheses leads to more accurate predictions. For mapping, we accumulate information in a cost volume centered at the current depth estimate. The mapping network then combines the cost volume and the keyframe image to update the depth prediction, thereby effectively making use of depth measurements and image-based priors. Our approach yields state-of-the-art results with few images and is robust with respect to noisy camera poses. We demonstrate that the performance of our 6 DOF tracking competes with RGB-D tracking algorithms. We compare favorably against strong classic and deep-learning-powered dense depth algorithms.
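
To make the mapping step concrete, below is a minimal NumPy sketch of the kind of plane-sweep cost volume the abstract describes, accumulated in a narrow fractional band around the current per-pixel depth estimate. It is an illustrative sketch, not the authors' implementation: the absolute-intensity cost, nearest-neighbour sampling, the constant out-of-view cost, and the parameters n_labels and band are assumptions made here for brevity. In DeepTAM the volume is not minimised directly; it is passed, together with the keyframe image, to a learned mapping network that regresses the depth update.

    import numpy as np

    def build_cost_volume(keyframe, image, K, R, t, depth_est,
                          n_labels=32, band=0.5):
        """Plane-sweep photoconsistency costs in a fractional depth band
        centred at the current per-pixel depth estimate (sketch only)."""
        h, w = keyframe.shape
        # Depth candidates as fractions of the estimate, e.g. 0.5*d .. 1.5*d.
        fractions = np.linspace(1.0 - band, 1.0 + band, n_labels)
        # Homogeneous pixel grid of the keyframe, shape (3, h*w).
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
        rays = np.linalg.inv(K) @ pix              # viewing rays at unit depth
        cost = np.empty((n_labels, h, w))
        for i, f in enumerate(fractions):
            d = (f * depth_est).ravel()            # candidate depth per pixel
            pts = rays * d                         # 3D points in keyframe coords
            proj = K @ (R @ pts + t[:, None])      # project into the second camera
            z = proj[2]
            in_front = z > 1e-6
            x = np.zeros(h * w, dtype=int)
            y = np.zeros(h * w, dtype=int)
            x[in_front] = np.round(proj[0, in_front] / z[in_front]).astype(int)
            y[in_front] = np.round(proj[1, in_front] / z[in_front]).astype(int)
            ok = in_front & (x >= 0) & (x < w) & (y >= 0) & (y < h)
            c = np.ones(h * w)                     # max cost where out of view
            # Absolute intensity difference; images assumed grayscale in [0, 1].
            c[ok] = np.abs(image[y[ok], x[ok]] - keyframe.ravel()[ok])
            cost[i] = c.reshape(h, w)
        return cost, fractions

A winner-takes-all readout would be depth = fractions[cost.argmin(axis=0)] * depth_est; replacing this hard argmin with a network that also sees the keyframe image is what lets the method combine depth measurements with image-based priors.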

Keywords

Camera tracking · Multi-view stereo · ConvNets

Supplementary material

Supplementary material 2 (PDF, 6.2 MB): 474218_1_En_50_MOESM2_ESM.pdf

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Huizhong Zhou¹
  • Benjamin Ummenhofer¹
  • Thomas Brox¹

  1. University of Freiburg, Freiburg, Germany
