Deep Modular Network Architecture for Depth Estimation from Single Indoor Images

  • Seiya Ito
  • Naoshi KanekoEmail author
  • Yuma Shinohara
  • Kazuhiko Sumi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11129)


We propose a novel deep modular network architecture for indoor scene depth estimation from single RGB images. The proposed architecture consists of a main depth estimation network and two auxiliary semantic segmentation networks. Our insight is that semantic and geometrical structures in a scene are strongly correlated, thus we utilize global (i.e. room layout) and mid-level (i.e. objects in a room) semantic structures to enhance depth estimation. The first auxiliary network, or layout network, is responsible for room layout estimation to infer the positions of walls, floor, and ceiling of a room. The second auxiliary network, or object network, estimates per-pixel class labels of the objects in a scene, such as furniture, to give mid-level semantic cues. Estimated semantic structures are effectively fed into the depth estimation network using newly proposed discriminator networks, which discern the reliability of the estimated structures. The evaluation result shows that our architecture achieves significant performance improvements over previous approaches on the standard NYU Depth v2 indoor scene dataset.


Depth estimation Convolutional Neural Network 



This work was partially supported by Aoyama Gakuin University-Supported Program “Early Eagle Program”. The authors are grateful to Prof. M.J. Dürst from Aoyama Gakuin University for his careful proofreading and kind advices on the manuscript.


  1. 1.
    Chen, W., Fu, Z., Yang, D., Deng, J.: Single-image depth perception in the wild. In: NIPS, pp. 730–738 (2016)Google Scholar
  2. 2.
    Dasgupta, S., Fang, K., Chen, K., Savarese, S.: Delay: robust spatial layout estimation for cluttered indoor scenes. In: CVPR, pp. 616–624 (2016)Google Scholar
  3. 3.
    Dellaert, F., Seitz, S.M., Thorpe, C.E., Thrun, S.: Structure from motion without correspondence. In: CVPR, pp. 557–564 (2000)Google Scholar
  4. 4.
    Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV, pp. 2650–2658 (2015)Google Scholar
  5. 5.
    Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NIPS, pp. 2366–2374 (2014)Google Scholar
  6. 6.
    Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.
  7. 7.
    Felzenszwalb, P., Huttenlocher, D.: Distance transforms of sampled functions. Technical report, Cornell University (2004)Google Scholar
  8. 8.
    Foix, S., Alenya, G., Torras, C.: Lock-in time-of-flight (ToF) cameras: a survey. IEEE Sens. J. 11(9), 1917–1926 (2011)CrossRefGoogle Scholar
  9. 9.
    Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32(11), 1231–1237 (2013)CrossRefGoogle Scholar
  10. 10.
    Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR, pp. 270–279 (2017)Google Scholar
  11. 11.
    Goodfellow, I., et al.: Generative adversarial nets. In: NIPS, pp. 2672–2680 (2014)Google Scholar
  12. 12.
    Hoiem, D., Efros, A.A., Hebert, M.: Automatic photo pop-up. ACM Trans. Graph. 24(3), 577–584 (2005)CrossRefGoogle Scholar
  13. 13.
    Jafari, O.H., Groth, O., Kirillov, A., Yang, M.Y., Rother, C.: Analyzing modular CNN architectures for joint depth prediction and semantic segmentation. In: ICRA, pp. 4620–4627 (2017)Google Scholar
  14. 14.
    Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)Google Scholar
  15. 15.
    Krähenbühl, P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: NIPS, pp. 109–117 (2011)Google Scholar
  16. 16.
    Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)Google Scholar
  17. 17.
    Kuznietsov, Y., Stückler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: CVPR, pp. 6647–6655 (2017)Google Scholar
  18. 18.
    Ladicky, L., Shi, J., Pollefeys, M.: Pulling things out of perspective. In: CVPR, pp. 89–96 (2014)Google Scholar
  19. 19.
    Liu, B., Gould, S., Koller, D.: Single image depth estimation from predicted semantic labels. In: CVPR, pp. 1253–1260 (2010)Google Scholar
  20. 20.
    Liu, F., Shen, C., Lin, G.: Deep convolutional neural fields for depth estimation from a single image. In: CVPR, pp. 5162–5170 (2015)Google Scholar
  21. 21.
    Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 38(10), 2024–2039 (2016)CrossRefGoogle Scholar
  22. 22.
    Liu, M., Salzmann, M., He, X.: Discrete-continuous depth estimation from a single image. In: CVPR, pp. 716–723 (2014)Google Scholar
  23. 23.
    Roy, A., Todorovic, S.: Monocular depth estimation using neural regression forest. In: CVPR, pp. 5506–5514 (2016)Google Scholar
  24. 24.
    Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)MathSciNetCrossRefGoogle Scholar
  25. 25.
    Saxena, A., Chung, S.H., Ng, A.Y.: Learning depth from single monocular images. In: NIPS, pp. 1161–1168 (2005)Google Scholar
  26. 26.
    Saxena, A., Sun, M., Ng, A.Y.: Make3D: learning 3D scene structure from a single still image. IEEE Trans. Pattern Anal. Mach. Intell. 31(5), 824–840 (2009)CrossRefGoogle Scholar
  27. 27.
    Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light. In: CVPR, pp. 195–202 (2003)Google Scholar
  28. 28.
    Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: CVPR, pp. 519–528 (2006)Google Scholar
  29. 29.
    Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 640–651 (2017)CrossRefGoogle Scholar
  30. 30.
    Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). Scholar
  31. 31.
    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)Google Scholar
  32. 32.
    Song, S., Lichtenberg, S.P., Xiao, J.: SUN RGB-D: a RGB-D scene understanding benchmark suite. In: CVPR (2015)Google Scholar
  33. 33.
    Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., Yuille, A.L.: Towards unified depth and semantic prediction from a single image. In: CVPR, pp. 2800–2809 (2015)Google Scholar
  34. 34.
    Yu, F., Zhang, Y., Song, S., Seff, A., Xiao, J.: LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015)
  35. 35.
    Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 2881–2890 (2017)Google Scholar
  36. 36.
    Zhuo, W., Salzmann, M., He, X., Liu, M.: Indoor scene structure analysis for single image depth estimation. In: CVPR, pp. 614–622 (2015)Google Scholar
  37. 37.
    Zoran, D., Isola, P., Krishnan, D., Freeman, W.T.: Learning ordinal relationships for mid-level vision. In: ICCV, pp. 388–396 (2015)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Seiya Ito
    • 1
  • Naoshi Kaneko
    • 1
    Email author
  • Yuma Shinohara
    • 1
  • Kazuhiko Sumi
    • 1
  1. 1.Aoyama Gakuin UniversityKanagawaJapan

Personalised recommendations