Unsupervised Learning of Semantics of Object Detections for Scene Categorization

  • Conference paper
Pattern Recognition Applications and Methods

Abstract

Classifying scenes (e.g., into “street”, “home” or “leisure”) is an important but difficult task, because images exhibit variability, ambiguity, and a wide range of illumination and scale conditions. Standard approaches build an intermediate representation of the global image and learn classifiers on it. Recently, it has been proposed to depict an image as an aggregation of the objects it contains: the representation on which classifiers are trained is composed of many heterogeneous feature vectors derived from various object detectors. In this paper, we study different approaches to efficiently learn contextual semantics from these object detections. We use the features provided by Object-Bank [24] (177 different object detectors producing 252 attributes each), and show on several scene categorization benchmarks that careful combinations, which take the structure of the data into account, greatly improve over the original results (from \(+5\) to \(+11\,\%\)) while drastically reducing the dimensionality of the representation by 97 % (from 44,604 to 1,000). We also show that the uncertainty of the object detectors hampers the use of external semantic knowledge to improve the combination of detectors, unlike our unsupervised learning approach.
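
To make the dimensions in the abstract concrete, here is a minimal sketch of one way a structure-aware reduction could look. It is not the method of the paper. It assumes, hypothetically, that the 44,604-dimensional Object-Bank vector is laid out detector by detector (177 blocks of 252 attributes), and it substitutes plain PCA for whatever learners the paper actually uses. The function names, the random stand-in data, and the toy output sizes are all illustrative.

```python
import numpy as np

N_DETECTORS = 177                        # object detectors (from the abstract)
N_ATTRIBUTES = 252                       # attributes per detector (from the abstract)
FULL_DIM = N_DETECTORS * N_ATTRIBUTES    # 177 * 252 = 44,604


def pca_fit(X, k):
    """Return (mean, components) of a k-dimensional PCA fitted on the rows of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]


def pca_apply(X, mean, components):
    """Project the rows of X onto the fitted principal directions."""
    return (X - mean) @ components.T


def two_stage_reduce(X, k_per_detector, k_final):
    """Hypothetical two-stage reduction: one PCA per detector block, then a joint PCA."""
    n = X.shape[0]
    # Assumed layout: detector-major, i.e. each detector owns 252 contiguous columns.
    blocks = X.reshape(n, N_DETECTORS, N_ATTRIBUTES)
    reduced = [
        pca_apply(blocks[:, d], *pca_fit(blocks[:, d], k_per_detector))
        for d in range(N_DETECTORS)
    ]
    pooled = np.concatenate(reduced, axis=1)    # shape (n, 177 * k_per_detector)
    return pca_apply(pooled, *pca_fit(pooled, k_final))


rng = np.random.default_rng(0)
X = rng.standard_normal((64, FULL_DIM))         # random stand-in, not real features
Z = two_stage_reduce(X, k_per_detector=10, k_final=32)

# Relative size reduction when going from 44,604 to 1,000 dimensions.
reduction = 1 - 1000 / FULL_DIM
```

The toy run keeps only 32 final dimensions because PCA cannot return more components than there are samples; reaching the 1,000 dimensions quoted in the abstract would require at least that many training images.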

Notes

  1. Available from http://scikits.appspot.com/.

References

  1. Baldi, P., Hornik, K.: Neural networks and principal component analysis: learning from examples without local minima. Neural Netw. 2, 53–58 (1989)

  2. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. Adv. Neural Inf. Proc. Sys. 19, 153–160 (2007)

  3. Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1), 1–127 (2009). Also published as a book. Now Publishers, 2009

  4. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation

  5. Bosch, A., Zisserman, A., Muñoz, X.: Scene classification via pLSA. In: Proceedings of the ECCV, pp. 517–530 (2006)

  6. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR09 (2009)

  7. Espinace, P., Kollar, T., Soto, A., Roy, N.: Indoor scene recognition through object detection. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Anchorage, AK (2010)

  8. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., Lin, C.-J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)

  9. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778–1785 (2009)

  10. Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2, pp. 524–531. IEEE Computer Society (2005)

  11. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR (2008)

  12. Gao, S., Tsang, I., Chia, L., Zhao, P.: Local features are not lonely: Laplacian sparse coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (2010)

  13. Goodfellow, I., Le, Q., Saxe, A., Ng, A.: Measuring invariances in deep networks. In: NIPS’09, pp. 646–654 (2009)

  14. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554 (2006)

  15. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42, 177–196 (2001)

  16. Hoiem, D., Efros, A.A., Hebert, M.: Automatic photo pop-up. SIGGRAPH 24(3), 577–584 (2005)

  17. Hotelling, H.: Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 417–441, 498–520 (1933)

  18. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: Proceedings of the International Conference on Computer Vision (ICCV’09), pp. 2146–2153. IEEE (2009)

  19. Kavukcuoglu, K., Ranzato, M., Fergus, R., LeCun, Y.: Learning invariant features through topographic filter maps. In: Proceedings of the CVPR’09, pp. 1605–1612. IEEE (2009)

  20. Larochelle, H., Bengio, Y., Louradour, J., Lamblin, P.: Exploring strategies for training deep neural networks. JMLR 10, 1–40 (2009)

  21. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE Conference on Computer Vision and Pattern Recognition (2006)

  22. LeCun, Y., Haffner, P., Bottou, L., Bengio, Y.: Object recognition with gradient-based learning. In: Shape, Contour and Grouping in Computer Vision, pp. 319–345. Springer (1999)

  23. Li, L.-J., Fei-Fei, L.: What, where and who? Classifying events by scene and object recognition. In: ICCV (2007)

  24. Li, L.-J., Su, H., Xing, E.P., Fei-Fei, L.: Object bank: a high-level image representation for scene classification and semantic feature sparsification. In: Proceedings of the Neural Information Processing Systems (NIPS) (2010)

  25. Li, L.-J., Su, H., Lim, Y., Fei-Fei, L.: Objects as attributes for scene classification. In: European Conference of Computer Vision (ECCV), International Workshop on Parts and Attributes, Crete, Greece, September 2010

  26. Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., Bergstra, J.: Unsupervised and transfer learning challenge: a deep learning approach. In: Guyon, I., Dror, G., Lemaire, V., Taylor, G., Silver, D. (eds.) JMLR W & CP: Proceedings of the Unsupervised and Transfer Learning challenge and workshop, vol. 27, pp. 97–110 (2012)

  27. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)

  28. Oliva, A., Torralba, A.: Building the gist of a scene: the role of global image features in recognition. In: Visual Perception, Progress in Brain Research, vol. 155 (2006)

  29. Pandey, M., Lazebnik, S.: Scene recognition and weakly supervised object localization with deformable part-based models. In: ICCV (2011)

  30. Pearson, K.: On lines and planes of closest fit to systems of points in space. Phil. Mag. 2(6), 559–572 (1901)

  31. Quattoni, A., Torralba, A.: Recognizing indoor scenes. In: CVPR (2009)

  32. Ranzato, M., Poultney, C., Chopra, S., LeCun, Y.: Efficient learning of sparse representations with an energy-based model. In: NIPS’06 (2007)

  33. Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., Glorot, X.: Higher order contractive auto-encoder. In: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) (2011)

  34. Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contracting auto-encoders: explicit invariance during feature extraction. In: Proceedings of the Twenty-eighth International Conference on Machine Learning (ICML’11), June 2011

  35. Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: LabelMe: a database and web-based tool for image annotation. Int. J. Comput. Vision 77, 157–173 (2008)

  36. Serre, T., Wolf, L., Poggio, T.: Object recognition with features inspired by visual cortex. In: IEEE Conference on Computer Vision and Pattern Recognition (2005)

  37. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 22, 1349–1380 (2000)

  38. Torralba, A.: Contextual priming for object detection. Int. J. Comput. Vis. 53(2), 169–191 (2003)

  39. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A.: Extracting and composing robust features with denoising autoencoders. In: Cohen, W.W., McCallum, A., Roweis, S.T. (eds.) ICML’08, pp. 1096–1103. ACM (2008)

  40. Vogel, J., Schiele, B.: Natural scene retrieval based on a semantic modeling step. In: Proceedings of the International Conference on Image and Video Retrieval CIVR 2004, Dublin, Ireland, LNCS, vol. 3115, pp. 7 (2004)

  41. Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: large-scale scene recognition from abbey to zoo. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3485–3492. IEEE, June 2010

Acknowledgments

We would like to thank Gloria Zen for her helpful comments. This work was supported by NSERC, CIFAR, the Canada Research Chairs, Compute Canada and the French ANR Project ASAP ANR-09-EMER-001. The code for the experiments was implemented using the Theano [4] machine learning library.

Corresponding author

Correspondence to Grégoire Mesnil.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Mesnil, G., Rifai, S., Bordes, A., Glorot, X., Bengio, Y., Vincent, P. (2015). Unsupervised Learning of Semantics of Object Detections for Scene Categorization. In: Fred, A., De Marsico, M. (eds) Pattern Recognition Applications and Methods. Advances in Intelligent Systems and Computing, vol 318. Springer, Cham. https://doi.org/10.1007/978-3-319-12610-4_13

  • DOI: https://doi.org/10.1007/978-3-319-12610-4_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12609-8

  • Online ISBN: 978-3-319-12610-4

  • eBook Packages: Engineering, Engineering (R0)
