Abstract
We present a novel approach to improving 3D structure estimation from an image stream in urban scenes. We consider a particular setup in which the camera is mounted on a moving vehicle. Applying traditional structure from motion (SfM) techniques in this setting yields poor estimates of the 3D structure for several reasons, including texture-less images, small baseline variations and dominant forward camera motion. Our idea is to introduce the monocular depth cues present in a single image and to add temporal constraints on the estimated 3D structure. We assume that the scene is made up of small planar patches, obtained with an over-segmentation method, and our goal is to estimate the 3D position of each of these planes. We propose a fusion framework that employs a Markov Random Field (MRF) model to integrate both spatial and temporal depth information. An advantage of our model is that it performs well even when some depth information is missing. Spatial depth information is obtained through a global and local feature extraction method inspired by Saxena et al. [1]. Temporal depth information is obtained via a sparse optical flow based structure from motion approach, which reduces estimation ambiguity by imposing constraints on camera motion. Finally, we apply a fusion scheme to produce a single 3D structure estimate.
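The abstract describes fusing a monocular (spatial) depth cue with an SfM (temporal) depth cue in an MRF defined over over-segmented patches, and notes that the model still works when some depth information is missing. The sketch below illustrates that idea under simplifying assumptions that are not taken from the paper: each patch carries a single scalar log-depth instead of full plane parameters, the MRF energy is quadratic (a Gaussian MRF), and the names `fuse_depths`, `w_mono`, `w_sfm` and `smooth` are hypothetical. A patch with no SfM estimate simply drops its temporal term.

```python
from typing import Dict, List, Optional, Tuple


def fuse_depths(
    mono: List[float],
    sfm: List[Optional[float]],
    edges: List[Tuple[int, int]],
    w_mono: float = 1.0,
    w_sfm: float = 4.0,
    smooth: float = 0.5,
    iters: int = 200,
    tol: float = 1e-9,
) -> List[float]:
    """Hypothetical Gaussian-MRF fusion of per-patch (log-)depth cues.

    Minimises
        E(d) = sum_i w_mono * (d_i - mono_i)^2
             + sum_{i: sfm_i known} w_sfm * (d_i - sfm_i)^2
             + smooth * sum_{(i, j) in edges} (d_i - d_j)^2
    with iterated conditional modes (ICM). For this quadratic energy each
    ICM update is the exact conditional minimiser, i.e. a Gauss-Seidel sweep.
    """
    n = len(mono)
    # Adjacency lists of the patch graph (e.g. superpixels sharing a border).
    neighbours: Dict[int, List[int]] = {i: [] for i in range(n)}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    d = list(mono)  # initialise from the single-image (spatial) cue
    for _ in range(iters):
        max_change = 0.0
        for i in range(n):
            # Setting dE/dd_i = 0 gives a weighted average of the data
            # terms and the current neighbour depths.
            num = w_mono * mono[i]
            den = w_mono
            t = sfm[i]
            if t is not None:  # the temporal cue may be missing for a patch
                num += w_sfm * t
                den += w_sfm
            for j in neighbours[i]:
                num += smooth * d[j]
                den += smooth
            new = num / den
            max_change = max(max_change, abs(new - d[i]))
            d[i] = new
        if max_change < tol:  # converged: no patch moved noticeably
            break
    return d
```

For example, `fuse_depths([1.0, 2.0, 3.0], [1.5, None, 2.5], [(0, 1), (1, 2)])` fuses a three-patch chain in which the middle patch has no SfM depth; its value is then pulled only by its monocular estimate and its neighbours. A quadratic energy was chosen here only because its conditional update has a closed form; a robust (e.g. L1) energy would need a different solver.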
References
1. Saxena, A., Sun, M., Ng, A.: Learning 3-D scene structure from a single still image. In: IEEE 11th International Conference on Computer Vision, ICCV 2007, pp. 1–8. IEEE (2007)
2. Aanæs, H.: Methods for structure from motion. IMM, Informatik og Matematisk Modellering, Danmarks Tekniske Universitet (2003)
3. Vedaldi, A., Guidi, G., Soatto, S.: Moving forward in structure from motion. In: IEEE Conference on CVPR 2007, pp. 1–7. IEEE (2007)
4. Saxena, A., Chung, S., Ng, A.: 3-D depth reconstruction from a single still image. International Journal of Computer Vision 76, 53–69 (2008)
5. Liu, B., Gould, S., Koller, D.: Single image depth estimation from predicted semantic labels. In: IEEE Conference on CVPR, pp. 1253–1260. IEEE (2010)
6. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based image segmentation. International Journal of Computer Vision 59, 167–181 (2004)
7. Humayun, A., Mac Aodha, O., Brostow, G.: Learning to find occlusion regions. In: IEEE Conference on CVPR 2011, pp. 2161–2168. IEEE (2011)
8. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
9. Hartley, R., Zisserman, A.: Multiple view geometry in computer vision, 2nd edn. Cambridge Univ. Press (2003)
10. Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle Adjustment – A Modern Synthesis. In: Triggs, B., Zisserman, A., Szeliski, R. (eds.) ICCV-WS 1999. LNCS, vol. 1883, pp. 298–372. Springer, Heidelberg (2000)
11. Fischler, M., Bolles, R.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24, 381–395 (1981)
12. Lindeberg, T., Garding, J.: Shape from texture from a multi-scale perspective. In: Fourth International Conference on Computer Vision, pp. 683–691. IEEE (1993)
13. Torralba, A., Oliva, A.: Depth estimation from image structure. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 1226–1238 (2002)
14. Hoiem, D., Efros, A., Hebert, M.: Automatic photo pop-up. ACM Transactions on Graphics 24, 577–584 (2005)
15. Hoiem, D., Efros, A., Hebert, M.: Recovering surface layout from an image. International Journal of Computer Vision 75, 151–172 (2007)
16. Sturgess, P., Alahari, K., Ladicky, L., Torr, P.: Combining appearance and structure from motion features for road scene understanding (2009)
17. Bao, S., Savarese, S.: Semantic structure from motion. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2025–2032. IEEE (2011)
18. Saxena, A.: Monocular depth perception and robotic grasping of novel objects. Stanford University (2009)
19. Saxena, A.: State-of-the-art results of the depth prediction from single image. Website (2012), http://make3d.cs.cornell.edu/results_stateoftheart.html
20. Civera, J., Davison, A., Montiel, J.: Structure from Motion Using the Extended Kalman Filter, vol. 75. Springer (2011)
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nawaf, M.M., Trémeau, A. (2012). Joint Spatio-temporal Depth Features Fusion Framework for 3D Structure Estimation in Urban Environment. In: Fusiello, A., Murino, V., Cucchiara, R. (eds) Computer Vision – ECCV 2012. Workshops and Demonstrations. ECCV 2012. Lecture Notes in Computer Science, vol 7585. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33885-4_53
DOI: https://doi.org/10.1007/978-3-642-33885-4_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33884-7
Online ISBN: 978-3-642-33885-4
eBook Packages: Computer Science, Computer Science (R0)