PanoContext: A Whole-Room 3D Context Model for Panoramic Scene Understanding
Abstract
The field-of-view (FOV) of standard cameras is very small, which is one of the main reasons that contextual information is not as useful as it should be for object detection. To overcome this limitation, we advocate the use of 360° full-view panoramas in scene understanding, and propose a whole-room context model in 3D. For an input panorama, our method outputs 3D bounding boxes of the room and all major objects inside, together with their semantic categories. Our method generates 3D hypotheses based on contextual constraints and ranks the hypotheses holistically, combining both bottom-up and top-down context information. To train our model, we construct an annotated panorama dataset and reconstruct the 3D model from a single view using manual annotation. Experiments show that, based solely on 3D context and without any image region category classifier, we can achieve performance comparable to that of a state-of-the-art object detector. This demonstrates that when the FOV is large, context is as powerful as object appearance. All data and source code are available online.
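As a small illustration of why a 360° panorama lends itself to whole-room 3D reasoning, the sketch below maps a panorama pixel to a unit viewing direction in 3D. It assumes an equirectangular projection and a z-up coordinate frame, neither of which is stated in the abstract, and the function name `pixel_to_ray` is hypothetical. It is not the authors' code.

```python
import math


def pixel_to_ray(u, v, width, height):
    """Map pixel (u, v) of an equirectangular panorama to a unit 3D direction.

    Assumptions (not from the paper): columns span longitude [-pi, pi),
    rows span latitude from +pi/2 (top row) to -pi/2 (bottom row),
    and the z axis points up.
    """
    lon = (u / width) * 2.0 * math.pi - math.pi   # horizontal angle
    lat = math.pi / 2.0 - (v / height) * math.pi  # vertical angle
    x = math.cos(lat) * math.sin(lon)
    y = math.cos(lat) * math.cos(lon)
    z = math.sin(lat)
    return (x, y, z)


if __name__ == "__main__":
    # The image centre looks straight ahead along +y under these conventions.
    print(pixel_to_ray(1024, 512, 2048, 1024))
```

Because every pixel corresponds to a direction on the full sphere, all walls, the floor, and the ceiling are visible in one image, which is what makes room-level context available to the model.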
Keywords
Support Vector Machine, Context Model, Perspective Image, Match Cost, Indoor Scene