A Generic Model to Compose Vision Modules for Holistic Scene Understanding

  • Congcong Li
  • Adarsh Kowdle
  • Ashutosh Saxena
  • Tsuhan Chen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6553)


The problem of holistic scene understanding involves many vision tasks, such as depth estimation, scene categorization, and event categorization. Each of these tasks explores some aspect of the scene, but the tasks are related in that they represent attributes of the same scene. The intuition is that one task can provide meaningful attributes to aid the learning process of another task. In this work, we propose a generic model (together with learning and inference techniques) for connecting different vision tasks in the form of a 2-layer cascade. Our model treats the first layer as a hidden layer, whose latent variables are inferred by feedback from the second layer. The feedback mechanism allows the first-layer classifiers to focus on more important image modes, and draws their output towards “attributes” rather than the original “labels”. Our model also automatically discovers sparse connections between the learned attributes on the first layer and the target task on the second layer. Note that in our model, the same vision tasks can act both as attribute learners and as target tasks, depending on the layer on which they are placed. In extensive experiments, we show that the same proposed model improves performance on all the tasks we consider: single-image depth estimation, scene categorization, saliency detection, and event categorization.
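To make the 2-layer cascade with feedback more concrete, the following is a minimal sketch under stated assumptions, not the authors' learning or inference procedure. It assumes linear first-layer learners fit by ridge regression, an L1-regularized linear second layer fit by ISTA (standing in for the sparse connections), and a simple gradient-style feedback step that nudges the first-layer targets to reduce the second-layer error. All function names (`ridge_fit`, `lasso_fit`, `train_cascade`, `predict`) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def ridge_fit(X, Y, lam=1e-2):
    """Closed-form ridge regression; returns weights of shape (d, k)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def lasso_fit(A, y, lam=0.05, n_iter=500):
    """L1-regularized least squares via ISTA (proximal gradient descent)."""
    n = A.shape[0]
    # Step size 1/L, where L = ||A||_2^2 / n is the gradient's Lipschitz constant.
    step = n / (np.linalg.norm(A, 2) ** 2 + 1e-12)
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ w - y) / n
        w = w - step * grad
        # Soft-thresholding drives weak connections to exactly zero.
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)
    return w

def train_cascade(X, Y1, y2, n_rounds=5, eta=0.5, lam=0.05):
    """Alternate between fitting both layers and updating hidden targets.

    X  : (n, d) image features (assumed shared by all first-layer tasks)
    Y1 : (n, k) original labels of the k first-layer tasks
    y2 : (n,)   labels of the second-layer target task
    """
    Z = Y1.copy()  # hidden targets start at the original labels
    for _ in range(n_rounds):
        W1 = ridge_fit(X, Z)            # first layer: one linear learner per task
        H = X @ W1                      # first-layer outputs ("attributes")
        w2 = lasso_fit(H, y2, lam=lam)  # second layer with sparse connections
        residual = H @ w2 - y2
        # Feedback (illustrative assumption): move the hidden targets along the
        # negative gradient of the second-layer squared error, with a normalized step.
        Z = H - eta * np.outer(residual, w2) / (w2 @ w2 + 1e-12)
    return W1, w2

def predict(X, W1, w2):
    """Run the cascade: first-layer outputs feed the second-layer predictor."""
    return (X @ W1) @ w2
```

In this sketch, the hidden targets `Z` begin as the original first-layer labels and drift towards whatever is most useful to the second layer, which loosely mirrors the idea of drawing outputs towards "attributes" rather than "labels". The zero entries of `w2` indicate first-layer outputs that the target task ignores.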


Vision Task, Sparse Code, Event Categorization, Depth Estimation, Salient Region
These keywords were added by machine and not by the authors.



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Congcong Li (1)
  • Adarsh Kowdle (1)
  • Ashutosh Saxena (2)
  • Tsuhan Chen (1)
  1. School of Electrical & Computer Engineering, Cornell University, USA
  2. Department of Computer Science, Cornell University, USA
