Exploiting Single Image Depth Prediction for Mono-stixel Estimation

  • Fabian Brickwedde
  • Steffen Abraham
  • Rudolf Mester
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11129)


The stixel-world is a compact and detailed environment representation specially designed for street scenes and automotive vision applications. A recent work proposes a monocamera-based stixel estimation method that combines the structure-from-motion principle with a scene model to predict the depth and translational motion of the static and dynamic parts of the scene. In this paper, we propose to exploit recent advances in deep-learning-based single image depth prediction for mono-stixel estimation. In our approach, the mono-stixels are estimated from single image depth predictions, a dense optical flow field, and a semantic segmentation, supported by prior knowledge about the characteristics of typical street scenes. To provide a meaningful estimate, it is crucial to model the statistical distribution of all measurements, which is especially challenging for single image depth predictions. Therefore, we present a semantic-class-dependent measurement model for the single image depth prediction, derived from the empirical error distribution on the Kitti dataset.
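The class-dependent measurement model described above can be illustrated with a small sketch. It assumes a Laplacian error distribution on inverse depth whose spread varies with the semantic class; the class names, the Laplacian choice, and the scale values below are illustrative placeholders, not the empirically fitted distributions from the paper.

```python
import math

# Hypothetical per-class scale parameters for a Laplacian error model on
# inverse depth. The paper fits such distributions empirically on the Kitti
# dataset; these numbers are placeholders for illustration only.
CLASS_SCALE = {
    "road": 0.05,        # depth predictions assumed reliable on road surfaces
    "vehicle": 0.10,
    "vegetation": 0.20,  # assumed much noisier on vegetation
}

def depth_log_likelihood(predicted_inv_depth, hypothesis_inv_depth, semantic_class):
    """Log-likelihood of a single image inverse-depth prediction under a
    Laplacian measurement model whose spread depends on the semantic class."""
    b = CLASS_SCALE.get(semantic_class, 0.15)  # fallback scale for other classes
    residual = abs(predicted_inv_depth - hypothesis_inv_depth)
    # Laplacian log-density: -log(2b) - |residual| / b
    return -math.log(2.0 * b) - residual / b

# The same small residual is scored as more likely on 'road' (tight model)
# than on 'vegetation' (wide model), so depth predictions contribute more
# strongly to the stixel estimate where they are assumed to be accurate.
ll_road = depth_log_likelihood(0.10, 0.11, "road")
ll_veg = depth_log_likelihood(0.10, 0.11, "vegetation")
```

In a stixel estimator, such per-measurement log-likelihoods would be summed with the corresponding terms for optical flow and semantic segmentation when scoring a stixel hypothesis.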

Our experiments on the Kitti-Stereo’2015 dataset show that we can significantly improve the quality of mono-stixel estimation by exploiting single image depth predictions. Furthermore, our approach is able to handle partly occluded moving objects as well as scenarios without translational camera motion.


Keywords: Mono-stixel · Single image depth prediction · Scene reconstruction · Scene flow · Monocamera · Automotive



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Robert Bosch GmbH, Hildesheim, Germany
  2. VSI Laboratory, Goethe University, Frankfurt, Germany
