Abstract
We present a system for dense keyframe-based camera tracking and depth map estimation that is entirely learned. For tracking, we estimate small pose increments between the current camera image and a synthetic viewpoint. This formulation significantly simplifies the learning problem and alleviates the dataset bias with respect to camera motion. Further, we show that generating a large number of pose hypotheses leads to more accurate predictions. For mapping, we accumulate information in a cost volume centered at the current depth estimate. The mapping network then combines the cost volume and the keyframe image to update the depth prediction, thereby effectively making use of both depth measurements and image-based priors. Our approach yields state-of-the-art results with few images and is robust with respect to noisy camera poses. We demonstrate that the performance of our 6-DOF tracking competes with RGB-D tracking algorithms, and we compare favorably against strong classic and deep-learning-based dense depth estimation methods.
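To make the mapping step concrete, the following is a minimal NumPy sketch of a narrow-band plane-sweep cost volume of the kind described above: photometric costs from neighboring frames are accumulated over a small set of depth labels centered at the current depth estimate. The function names, the SAD photometric cost, nearest-neighbor sampling, and the fixed relative band are illustrative assumptions, not the paper's implementation.

import numpy as np

def warp_to_keyframe(img, depth, K, R, t):
    # Backward-warp a grayscale neighbor image into the keyframe view for a
    # hypothesized keyframe depth map. Pinhole intrinsics K; relative pose
    # (R, t) maps keyframe coordinates into the neighbor camera.
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix                  # rays through each pixel
    pts = rays * depth.reshape(1, -1)              # 3D points in keyframe frame
    proj = K @ (R @ pts + t.reshape(3, 1))         # project into neighbor view
    uv = proj[:2] / np.clip(proj[2], 1e-6, None)   # clip guards points at/behind camera
    u = np.clip(uv[0].round().astype(int), 0, w - 1).reshape(h, w)
    v = np.clip(uv[1].round().astype(int), 0, h - 1).reshape(h, w)
    return img[v, u]                               # nearest-neighbor lookup

def build_cost_volume(keyframe, neighbors, poses, K, depth_est, band=0.5, n=32):
    # Accumulate a sum-of-absolute-differences photometric cost over n depth
    # labels spread in a relative band around the current depth estimate.
    h, w = keyframe.shape
    volume = np.zeros((n, h, w))
    for i, off in enumerate(np.linspace(-band, band, n)):
        d = depth_est * (1.0 + off)                # depth label near the estimate
        for img, (R, t) in zip(neighbors, poses):
            volume[i] += np.abs(warp_to_keyframe(img, d, K, R, t) - keyframe)
    return volume / max(len(neighbors), 1)         # average over neighbor frames

In the paper's pipeline, a network fuses such a volume with the keyframe image to regress the updated depth; the sketch only illustrates where the "centered at the current depth estimate" structure enters.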
Notes
https://github.com/magican/OpenDTAM.git (SHA: 1f92a54334c233f9c4ce7d8cbaf9a81dee5e69a6)
Additional information
Communicated by Cristian Sminchisescu.
This project was in large part funded by the EU Horizon 2020 project Trimbot2020. We also thank the bwHPC initiative for computing resources and Facebook for their P100 server donation and gift funding.
Electronic supplementary material
Supplementary material 1 (mp4, 16091 KB)
Cite this article
Zhou, H., Ummenhofer, B. & Brox, T. DeepTAM: Deep Tracking and Mapping with Convolutional Neural Networks. Int J Comput Vis 128, 756–769 (2020). https://doi.org/10.1007/s11263-019-01221-0