Truly understanding a scene involves integrating information at multiple levels as well as studying the interactions between scene elements. Individual object detectors, layout estimators and scene classifiers are powerful but ultimately confounded by complicated real-world scenes with high variability, different viewpoints and occlusions. We propose a method that can automatically learn the interactions among scene elements and apply them to the holistic understanding of indoor scenes from a single image. This interpretation is performed within a hierarchical interaction model which describes an image by a parse graph, thereby fusing together object detection, layout estimation and scene classification. At the root of the parse graph are the scene type and layout, while the leaves are the individual object detections. In between is the core of the system, our 3D Geometric Phrases (3DGPs). We conduct extensive experimental evaluations on single image 3D scene understanding using both 2D and 3D metrics. The results demonstrate that our model with 3DGPs can provide robust estimation of scene type, 3D space, and 3D objects by leveraging the contextual relationships among the visual elements.
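As a rough illustration of the parse graph described above, the following minimal Python sketch places the scene type and layout at the root, groups of objects in the middle, and individual detections at the leaves. All class names, fields, example objects and numbers are hypothetical, and the plain sum in `total_score` is an assumption made for illustration; it is not the scoring function used in the article.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Detection:
    """Leaf node: one object detection (hypothetical fields)."""
    label: str
    box: Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max in pixels
    score: float  # detector confidence; may be negative


@dataclass
class GeometricPhrase:
    """Intermediate node: a group of objects treated as one unit."""
    name: str
    parts: List[Detection] = field(default_factory=list)
    bias: float = 0.0  # placeholder for a learned phrase-level term


@dataclass
class ParseGraph:
    """Root node: scene type and layout, plus everything below it."""
    scene_type: str
    layout_score: float
    phrases: List[GeometricPhrase] = field(default_factory=list)
    loose_objects: List[Detection] = field(default_factory=list)

    def leaves(self) -> List[Detection]:
        """Return all detections: grouped ones first, then ungrouped ones."""
        grouped = [d for p in self.phrases for d in p.parts]
        return grouped + self.loose_objects

    def total_score(self) -> float:
        # Assumption: a plain sum, only to show that evidence from the
        # root, the phrases and the leaves is combined into one value.
        phrase_terms = sum(p.bias for p in self.phrases)
        leaf_terms = sum(d.score for d in self.leaves())
        return self.layout_score + phrase_terms + leaf_terms


# Illustrative values only; none of these numbers come from the article.
sofa = Detection("sofa", (40, 200, 300, 380), 0.8)
table = Detection("table", (150, 300, 280, 400), 0.3)
lamp = Detection("lamp", (500, 100, 540, 220), -0.1)
graph = ParseGraph(
    scene_type="living room",
    layout_score=1.2,
    phrases=[GeometricPhrase("sofa-with-table", [sofa, table], bias=0.5)],
    loose_objects=[lamp],
)
print([d.label for d in graph.leaves()])
print(round(graph.total_score(), 2))
```

Keeping grouped and ungrouped detections in separate lists is one possible design; it makes explicit that an object may be explained either as part of a phrase or on its own.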
This representation ensures that all observation features associated with a detection have values distributed from negative to positive, making graphs with different numbers of objects comparable.
Although the view-dependent biases are not viewpoint invariant, they add only a few parameters (8 views per 3DGP).
The dataset is available at http://cvgl.stanford.edu/projects/3dgp/.
Communicated by Derek Hoiem, James Hays, Jianxiong Xiao and Aditya Khosla.
About this article
Cite this article
Choi, W., Chao, YW., Pantofaru, C. et al. Indoor Scene Understanding with Geometric and Semantic Contexts. Int J Comput Vis 112, 204–220 (2015). https://doi.org/10.1007/s11263-014-0779-4