Depth Estimation via Affinity Learned with Convolutional Spatial Propagation Network

  • Xinjing Cheng
  • Peng Wang
  • Ruigang Yang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11220)


Depth estimation from a single image is a fundamental problem in computer vision. In this paper, we propose a simple yet effective convolutional spatial propagation network (CSPN) to learn the affinity matrix for depth prediction. Specifically, we adopt an efficient linear propagation model, where the propagation is performed as a recurrent convolutional operation and the affinity among neighboring pixels is learned through a deep convolutional neural network (CNN). We apply the designed CSPN to two depth estimation tasks given a single image: (1) refining the depth output of existing state-of-the-art (SOTA) methods, and (2) converting sparse depth samples to a dense depth map by embedding the depth samples within the propagation procedure. The second task is inspired by the availability of LiDAR, which provides sparse but accurate depth measurements. We evaluated the proposed CSPN on the popular NYU v2 [1] and KITTI [2] datasets, where we show that our approach improves not only the quality (e.g., 30% further reduction in depth error) but also the speed (e.g., 2 to 5× faster) of depth map generation compared with previous SOTA methods. The code of CSPN is available at:


Keywords: Depth estimation · Convolutional spatial propagation
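The linear propagation model summarized in the abstract can be sketched in code: each pixel repeatedly updates its depth as a weighted combination of its 3×3 neighbors, with CNN-predicted affinities as the weights, and (for the sparse-to-dense task) known depth samples re-embedded after every step. The sketch below is a minimal NumPy illustration under assumed details (absolute-sum normalization of the neighbor affinities, edge padding, and the hypothetical function names `cspn_step` and `cspn_refine`), not the authors' implementation:

```python
import numpy as np

def cspn_step(depth, affinity):
    """One convolutional spatial propagation step over a 3x3 neighborhood.

    depth:    (H, W) current depth estimate
    affinity: (8, H, W) raw affinities for the 8 neighbors (as a CNN might output)
    """
    H, W = depth.shape
    # Normalize neighbor weights so their absolute sum is at most 1,
    # which keeps the linear propagation stable.
    denom = np.abs(affinity).sum(axis=0, keepdims=True) + 1e-8
    kappa = affinity / denom                 # (8, H, W) normalized affinities
    center = 1.0 - kappa.sum(axis=0)         # weight the pixel keeps for itself

    # Offsets of the 8 neighbors in a 3x3 window, excluding the center.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    padded = np.pad(depth, 1, mode='edge')
    out = center * depth
    for k, (dy, dx) in enumerate(offsets):
        out += kappa[k] * padded[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
    return out

def cspn_refine(depth, affinity, sparse=None, mask=None, iters=16):
    """Run several propagation steps; optionally re-embed sparse depth samples."""
    d = depth.copy()
    for _ in range(iters):
        d = cspn_step(d, affinity)
        if sparse is not None:
            # Sparse-to-dense task: keep the accurate LiDAR samples fixed.
            d = np.where(mask, sparse, d)
    return d
```

Because each step is a convolution with spatially varying, normalized kernels, a constant depth map is a fixed point of the propagation, and the `np.where` re-embedding is what anchors the diffusion to the sparse measurements.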


References

  1. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012)
  2. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: CVPR (2012)
  3. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV (2015)
  4. Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 3DV, pp. 239–248. IEEE (2016)
  5. Xiao, J., Owens, A., Torralba, A.: SUN3D: a database of big spaces reconstructed using SfM and object labels. In: ICCV (2013)
  6. Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. In: 3DV (2017)
  7. Wang, S., et al.: TorontoCity: seeing the world with a million eyes. In: ICCV (2017)
  8. Huang, X., et al.: The ApolloScape dataset for autonomous driving. arXiv preprint arXiv:1803.06184 (2018)
  9. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  11. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
  12. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: MICCAI, pp. 234–241 (2015)
  13. Ma, F., Karaman, S.: Sparse-to-dense: depth prediction from sparse depth samples and a single image. In: ICRA (2018)
  14. Liu, S., De Mello, S., Gu, J., Zhong, G., Yang, M.H., Kautz, J.: Learning affinity via spatial propagation networks. In: NIPS, pp. 1519–1529 (2017)
  15. Liao, Y., Huang, L., Wang, Y., Kodagoda, S., Yu, Y., Liu, Y.: Parse geometry from a line: monocular depth estimation with partial laser observation. In: ICRA (2017)
  16. Wang, X., Fouhey, D., Gupta, A.: Designing deep networks for surface normal estimation. In: CVPR (2015)
  17. Li, J., Klein, R., Yao, A.: A two-streamed network for estimating fine-scaled depth maps from single RGB images. In: ICCV (2017)
  18. Kuznietsov, Y., Stuckler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: CVPR (2017)
  19. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR (2017)
  20. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR (2017)
  21. Yang, Z., Wang, P., Xu, W., Zhao, L., Ram, N.: Unsupervised learning of geometry from videos with edge-aware depth-normal consistency. In: AAAI (2018)
  22. Yang, Z., Wang, P., Wang, Y., Xu, W., Nevatia, R.: LEGO: learning edge with geometry all at once by watching videos. In: CVPR, pp. 225–234 (2018)
  23. Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B.L., Yuille, A.L.: Towards unified depth and semantic prediction from a single image. In: CVPR (2015)
  24. Liu, F., Shen, C., Lin, G.: Deep convolutional neural fields for depth estimation from a single image. In: CVPR (2015)
  25. Li, B., Shen, C., Dai, Y., van den Hengel, A., He, M.: Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In: CVPR (2015)
  26. Zheng, S., et al.: Conditional random fields as recurrent neural networks. In: ICCV (2015)
  27. Wang, P., Shen, X., Russell, B., Cohen, S., Price, B.L., Yuille, A.L.: SURGE: surface regularized geometry estimation from a single image. In: NIPS (2016)
  28. Barron, J.T., Poole, B.: The fast bilateral solver. In: ECCV (2016)
  29. Matsuo, K., Aoki, Y.: Depth image enhancement using local tangent plane approximations. In: CVPR, pp. 3574–3583 (2015)
  30. Ferstl, D., Reinbacher, C., Ranftl, R., Rüther, M., Bischof, H.: Image guided depth upsampling using anisotropic total generalized variation. In: ICCV, pp. 993–1000. IEEE (2013)
  31. Ferstl, D., Ruther, M., Bischof, H.: Variational depth superresolution using example-based edge representations. In: ICCV, pp. 513–521 (2015)
  32. Liu, R., Zhong, G., Cao, J., Lin, Z., Shan, S., Luo, Z.: Learning to diffuse: a new perspective to design PDEs for visual analysis. IEEE Trans. Pattern Anal. Mach. Intell. 38(12), 2457–2471 (2016)
  33. Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 184–199. Springer, Cham (2014)
  34. Yang, J., Ye, X., Li, K., Hou, C., Wang, Y.: Color-guided depth recovery from RGB-D data using an adaptive autoregressive model. IEEE TIP 23(8), 3443–3458 (2014)
  35. Song, X., Dai, Y., Qin, X.: Deep depth super-resolution: learning depth super-resolution using deep convolutional neural network. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10114, pp. 360–376. Springer, Cham (2017)
  36. Hui, T.-W., Loy, C.C., Tang, X.: Depth map super-resolution by deep multi-scale guidance. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 353–369. Springer, Cham (2016)
  37. Kwon, H., Tai, Y.W., Lin, S.: Data-driven depth map refinement via multi-scale sparse representation. In: CVPR, pp. 159–167 (2015)
  38. Riegler, G., Rüther, M., Bischof, H.: ATGV-Net: accurate depth super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 268–284. Springer, Cham (2016)
  39. Weickert, J.: Anisotropic Diffusion in Image Processing, vol. 1. Teubner, Stuttgart (1998)
  40. Maire, M., Narihira, T., Yu, S.X.: Affinity CNN: learning pixel-centric pairwise relations for figure/ground embedding. In: CVPR, pp. 174–182 (2016)
  41. Bertasius, G., Torresani, L., Yu, S.X., Shi, J.: Convolutional random walk networks for semantic image segmentation. arXiv preprint arXiv:1605.07681 (2016)
  42. Chen, L.C., Barron, J.T., Papandreou, G., Murphy, K., Yuille, A.L.: Semantic image segmentation with task-specific edge detection using CNNs and a discriminatively trained domain transform. In: CVPR, pp. 4545–4554 (2016)
  43. Zimmermann, K., Petricek, T., Salansky, V., Svoboda, T.: Learning for active 3D mapping. In: ICCV (2017)
  44. Ladicky, L., Saurer, O., Jeong, S., Maninchedda, F., Pollefeys, M.: From point clouds to mesh using regression. In: ICCV (2017)
  45. Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.: Sparsity invariant CNNs. In: 3DV (2017)
  46. Kschischang, F.R., Frey, B.J., Loeliger, H.A.: Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theory 47(2), 498–519 (2001)
  47. Sobel, I.: History and definition of the Sobel operator. Retrieved from the World Wide Web (2014)
  48. Krähenbühl, P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: NIPS (2012)
  49. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255. IEEE (2009)
  50. Igarashi, T., Moscovich, T., Hughes, J.F.: As-rigid-as-possible shape manipulation. ACM Trans. Graph. (TOG) 24(3), 1134–1141 (2005)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Baidu Research, Baidu Inc., Beijing, China