Semantic Structure from Motion: A Novel Framework for Joint Object Recognition and 3D Reconstruction

  • Sid Yingze Bao
  • Silvio Savarese
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7474)


Conventional rigid structure from motion (SFM) addresses the problem of recovering the camera parameters (motion) and the 3D locations of scene points (structure), given observed 2D image feature points. In this chapter, we propose a new formulation called Semantic Structure From Motion (SSFM). In addition to the geometrical constraints provided by SFM, SSFM takes advantage of both the semantic and the geometrical properties associated with objects in a scene. These properties allow us to jointly estimate the structure of the scene, the camera parameters, and the 3D locations, poses, and categories of the objects in the scene. We cast this as a maximum-likelihood problem in which geometry (cameras, points, objects) and semantic information (object classes) are estimated simultaneously. The key intuition is that, in addition to image features, the measurements of objects across views provide geometrical constraints that relate the cameras to the scene parameters. These constraints make the geometry estimation process more robust and, in turn, make object detection more accurate. Our framework has the unique ability to: (i) estimate camera poses from object detections alone; (ii) improve camera pose estimation relative to feature-point-based SFM algorithms; and (iii) improve object detection given multiple uncalibrated images, relative to detecting objects independently in single images. Extensive quantitative results on three datasets (LiDAR cars, street-view pedestrians, and Kinect office desktop) verify our theoretical claims.
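
The maximum-likelihood formulation described above can be sketched as follows. This is only an illustrative reading of the abstract, not the chapter's own equations: the symbols C (cameras), Q (3D points), O (3D objects with location, pose, and category), q (2D point measurements), and o (2D object detections) are notation assumed here, and the factorization in the second line additionally assumes that point and object measurements are conditionally independent given the scene.

```latex
% Illustrative sketch; symbols and the independence assumption are not taken from the source.
\begin{align*}
\{\hat{C}, \hat{Q}, \hat{O}\}
  &= \arg\max_{C,\,Q,\,O} \; \Pr(q, o \mid C, Q, O) \\
  &= \arg\max_{C,\,Q,\,O} \; \Pr(q \mid C, Q)\,\Pr(o \mid C, O)
\end{align*}
```

Under this reading, the first factor plays the role of the usual feature-point SFM term, while the second ties the cameras to the object detections. That coupling is one way to see why camera poses might be estimated from detections alone, and why better geometry can in turn sharpen detection.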


Keywords: Object Detection, Camera Parameter, Semantic Structure, Bundle Adjustment, Structure From Motion





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Sid Yingze Bao, The University of Michigan, Ann Arbor, USA
  • Silvio Savarese, The University of Michigan, Ann Arbor, USA
