Exploiting Single Image Depth Prediction for Mono-stixel Estimation

  • Fabian Brickwedde
  • Steffen Abraham
  • Rudolf Mester
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11129)


The stixel-world is a compact and detailed environment representation specially designed for street scenes and automotive vision applications. A recent work proposes a monocamera-based stixel estimation method that combines the structure-from-motion principle with a scene model to predict the depth and translational motion of the static and dynamic parts of the scene. In this paper, we propose to exploit recent advances in deep-learning-based single image depth prediction for mono-stixel estimation. In our approach, the mono-stixels are estimated from single image depth predictions, a dense optical flow field, and a semantic segmentation, supported by prior knowledge about the characteristics of typical street scenes. To provide a meaningful estimate, it is crucial to model the statistical distribution of all measurements, which is especially challenging for single image depth predictions. Therefore, we present a semantic-class-dependent measurement model for the single image depth prediction, derived from the empirical error distribution on the Kitti dataset.
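The class-dependent measurement model described above can be illustrated with a small sketch. It assumes a Laplacian error distribution on inverse depth whose spread varies with the semantic class; the class names, the Laplacian choice, and the scale values below are illustrative placeholders, not the empirically fitted distributions from the paper.

```python
import math

# Hypothetical per-class scale parameters for a Laplacian error model on
# inverse depth. The paper fits such distributions empirically on the Kitti
# dataset; these numbers are placeholders for illustration only.
CLASS_SCALE = {
    "road": 0.05,        # depth predictions assumed reliable on road surfaces
    "vehicle": 0.10,
    "vegetation": 0.20,  # assumed much noisier on vegetation
}

def depth_log_likelihood(predicted_inv_depth, hypothesis_inv_depth, semantic_class):
    """Log-likelihood of a single image inverse-depth prediction under a
    Laplacian measurement model whose spread depends on the semantic class."""
    b = CLASS_SCALE.get(semantic_class, 0.15)  # fallback scale for other classes
    residual = abs(predicted_inv_depth - hypothesis_inv_depth)
    # Laplacian log-density: -log(2b) - |residual| / b
    return -math.log(2.0 * b) - residual / b

# The same small residual is scored as more likely on 'road' (tight model)
# than on 'vegetation' (wide model), so depth predictions contribute more
# strongly to the stixel estimate where they are assumed to be accurate.
ll_road = depth_log_likelihood(0.10, 0.11, "road")
ll_veg = depth_log_likelihood(0.10, 0.11, "vegetation")
```

In a stixel estimator, such per-measurement log-likelihoods would be summed with the corresponding terms for optical flow and semantic segmentation when scoring a stixel hypothesis.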

Our experiments on the Kitti-Stereo’2015 dataset show that we can significantly improve the quality of mono-stixel estimation by exploiting single image depth predictions. Furthermore, our approach is able to handle partly occluded moving objects as well as scenarios without translational camera motion.


Keywords: Mono-stixel · Single image depth prediction · Scene reconstruction · Scene flow · Monocamera · Automotive



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Robert Bosch GmbH, Hildesheim, Germany
  2. VSI Laboratory, Goethe University, Frankfurt, Germany
