Indoor Scene Understanding with Geometric and Semantic Contexts

International Journal of Computer Vision


Truly understanding a scene involves integrating information at multiple levels as well as studying the interactions between scene elements. Individual object detectors, layout estimators and scene classifiers are powerful but ultimately confounded by complicated real-world scenes with high variability, different viewpoints and occlusions. We propose a method that can automatically learn the interactions among scene elements and apply them to the holistic understanding of indoor scenes from a single image. This interpretation is performed within a hierarchical interaction model which describes an image by a parse graph, thereby fusing together object detection, layout estimation and scene classification. At the root of the parse graph is the scene type and layout while the leaves are the individual detections of objects. In between is the core of the system, our 3D Geometric Phrases (3DGP). We conduct extensive experimental evaluations on single image 3D scene understanding using both 2D and 3D metrics. The results demonstrate that our model with 3DGPs can provide robust estimation of scene type, 3D space, and 3D objects by leveraging the contextual relationships among the visual elements.
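
The parse graph described above can be pictured as a small tree: a root holding the scene type and layout, intermediate 3DGP nodes, and object detections as leaves. The following is a minimal sketch of that structure only, not the authors' implementation. All class names, field names, and the example values are hypothetical, and attaching standalone detections directly to the root is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Detection:
    """Leaf node: one object detection hypothesis (hypothetical fields)."""
    category: str
    score: float  # detector confidence; the scale is an assumption


@dataclass
class GeometricPhrase:
    """Intermediate node: a 3DGP that groups several detections."""
    name: str
    parts: List[Detection] = field(default_factory=list)


@dataclass
class ParseGraph:
    """Root node: scene type and layout, with 3DGPs and detections below."""
    scene_type: str
    layout_id: int  # index of a layout hypothesis (assumption)
    phrases: List[GeometricPhrase] = field(default_factory=list)
    # Detections that belong to no phrase hang off the root (assumption).
    objects: List[Detection] = field(default_factory=list)

    def leaves(self) -> List[Detection]:
        """Collect every detection: standalone ones first, then phrase parts."""
        out = list(self.objects)
        for phrase in self.phrases:
            out.extend(phrase.parts)
        return out


# Illustrative values only; none of these come from the article.
graph = ParseGraph(
    scene_type="dining room",
    layout_id=0,
    phrases=[
        GeometricPhrase(
            "table-and-chairs",
            [Detection("table", 0.8), Detection("chair", 0.5)],
        )
    ],
    objects=[Detection("sofa", -0.2)],
)
leaf_categories = [d.category for d in graph.leaves()]
```

Here `leaf_categories` lists the standalone sofa followed by the table and chair grouped under the phrase, which mirrors how a single graph ties scene type, layout, and detections together.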




  1. This representation ensures that all observation features associated with a detection have values distributed from negative to positive, making graphs with different numbers of objects comparable.

  2. Although the view-dependent biases are not viewpoint invariant, there are still only a few parameters (8 views per 3DGP).

  3. The dataset is available at

  4. The method in Schwing and Urtasun (2012) produces better layout estimation results; however, its code is not publicly available, so we use Hedau et al. (2009) as the baseline.


  • Bao, S., Sun, M., & Savarese, S. (2010). Toward coherent object detection and scene layout understanding. In Proceedings of the conference on Computer Vision and Pattern Recognition.

  • Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2, 27:1–27:27.

  • Chao, Y.W., Choi, W., Pantofaru, C., & Savarese, S. (2013). Layout estimation of highly cluttered indoor scenes using geometric and semantic cues. In Proceedings of the International Conference on Image Analysis and Processing.

  • Choi, W., Chao, Y., Pantofaru, C., & Savarese, S. (2013). Understanding indoor scenes using 3D geometric phrases. In CVPR.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.

  • Desai, C., Ramanan, D., & Fowlkes, C. C. (2011). Discriminative models for multi-class object layout. IJCV.

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The Pascal Visual Object Classes (VOC) challenge. IJCV.

  • Fei-Fei, L., & Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In CVPR (pp. 524–531).

  • Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. PAMI, 32(9), 1627–1645.

  • Fouhey, D. F., Delaitre, V., Gupta, A., Efros, A. A., Laptev, I., & Sivic, J. (2012). People watching: Human actions as a cue for single-view geometry. In ECCV.

  • Geiger, A., Wojek, C., & Urtasun, R. (2011). Joint 3D estimation of objects and scene layout. In NIPS.

  • Gupta, A., Efros, A., & Hebert, M. (2010). Blocks world revisited: Image understanding using qualitative geometry and mechanics. In ECCV.

  • Hartley, R. I., & Zisserman, A. (2004). Multiple View Geometry in Computer Vision (2nd ed.). Cambridge: Cambridge University Press, ISBN: 0521540518.

  • Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. In ICCV.

  • Hedau, V., Hoiem, D., & Forsyth, D. (2010). Thinking inside the box: Using appearance models and context based on room geometry. In ECCV.

  • Hedau, V., Hoiem, D., & Forsyth, D. (2012). Recovering free space of indoor scenes from a single image. In CVPR.

  • Hoiem, D., Efros, A. A., & Hebert, M. (2007). Recovering surface layout from an image. IJCV.

  • Hoiem, D., Efros, A. A., & Hebert, M. (2008). Putting objects in perspective. IJCV.

  • Lagarias, J. C., Reeds, J. A., Wright, M. H., & Wright, P. E. (1998). Convergence properties of the Nelder-Mead simplex method in low dimensions. SIAM Journal on Optimization, 9(1), 148–158.

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.

  • Lee, D., Gupta, A., Hebert, M., & Kanade, T. (2010). Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In NIPS.

  • Lee, D., Hebert, M., & Kanade, T. (2009). Geometric reasoning for single image structure recovery. In CVPR.

  • Leibe, B., Leonardis, A., & Schiele, B. (2004). Combined object categorization and segmentation with an implicit shape model. In Statistical Learning in Computer Vision, ECCV.

  • Li, C., Parikh, D., & Chen, T. (2012). Automatic discovery of groups of objects for scene understanding. In CVPR.

  • Li, L. J., Su, H., Xing, E. P., & Fei-Fei, L. (2010). Object bank: A high-level image representation for scene classification & semantic feature sparsification. In NIPS.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. IJCV, 60(2), 91–110. doi:10.1023/B:VISI.0000029664.99615.94.

  • Pandey, M., & Lazebnik, S. (2011). Scene recognition and weakly supervised object localization with deformable part-based models. In ICCV.

  • Pero, L. D., Bowdish, J., Fried, D., Kermgard, B., Hartley, E. L., & Barnard, K. (2012). Bayesian geometric modeling of indoor scenes. In CVPR.

  • Quattoni, A., & Torralba, A. (2009). Recognizing indoor scenes. In CVPR.

  • Rother, C. (2002). A new approach for vanishing point detection in architectural environments. Image and Vision Computing, 20, 647–656.

  • Sadeghi, A., & Farhadi, A. (2011). Recognition using visual phrases. In CVPR.

  • Satkin, S., Lin, J., & Hebert, M. (2012). Data-driven scene understanding from 3D models. In BMVC.

  • Schwing, A. G., & Urtasun, R. (2012). Efficient exact inference for 3D indoor scene understanding. In ECCV.

  • Wang, H., Gould, S., & Koller, D. (2010). Discriminative learning with latent variables for cluttered indoor scene understanding. In ECCV.

  • Wang, Y., & Mori, G. (2011). Hidden part models for human action recognition: Probabilistic versus max margin. PAMI.

  • Xiang, Y., & Savarese, S. (2012). Estimating the aspect layout of object categories. In CVPR.

  • Zhao, Y., & Zhu, S. C. (2011). Image parsing via stochastic scene grammar. In NIPS.

Author information

Corresponding author

Correspondence to Wongun Choi.

Additional information

Communicated by Derek Hoiem, James Hays, Jianxiong Xiao and Aditya Khosla.

Cite this article

Choi, W., Chao, YW., Pantofaru, C. et al. Indoor Scene Understanding with Geometric and Semantic Contexts. Int J Comput Vis 112, 204–220 (2015).
