Learning extremely shared middle-level image representation for scene classification

Abstract

Learning middle-level image representations is important in computer vision, especially for scene classification. Currently available middle-level image representations are not sparse enough to keep training and testing times manageable as the number of classes to be recognized grows. In this work, we propose a middle-level image representation based on patterns that are extremely shared among different classes, reducing both training and test time. The proposed learning algorithm first finds class-specific patterns and then uses lasso regularization to select the most discriminative patterns shared among different classes. Experimental results on widely used scene classification benchmarks (15 Scenes, MIT-indoor 67, SUN 397) show that very few patterns are needed to achieve remarkable performance with reduced computation time.
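
As a rough illustration of the selection step described in the abstract, the sketch below fits a plain lasso (L1-regularized least squares) per class by coordinate descent and counts how many classes keep each candidate pattern. It is not the paper's objective or implementation: the response matrix `X`, the per-class `targets`, the value of `LAMBDA`, and all function names are hypothetical, and the data are synthetic and orthogonal so the result can be checked by hand.

```python
# Toy sketch only: a generic lasso solved by cyclic coordinate descent.
# This is NOT the paper's formulation; data, names and LAMBDA are hypothetical.

def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=50):
    """Minimise (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iters):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column can never be selected
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Rows: 4 images. Columns: responses of 4 candidate patterns (synthetic).
X = [
    [1.0,  1.0,  1.0,  1.0],
    [1.0, -1.0,  1.0, -1.0],
    [1.0,  1.0, -1.0, -1.0],
    [1.0, -1.0, -1.0,  1.0],
]

# One synthetic target vector per class; each depends strongly on pattern 1.
targets = {
    "class_a": [2.05, -1.95, 1.95, -2.05],
    "class_b": [-0.5, 0.5, -2.5, 2.5],
    "class_c": [1.25, -1.15, 1.25, -1.15],
}

LAMBDA = 0.1   # hypothetical regularization strength
TOL = 1e-8     # weights below this magnitude count as "not selected"

weights = {name: lasso_coordinate_descent(X, y, LAMBDA) for name, y in targets.items()}

# For every pattern, count the classes whose lasso solution keeps it.
share_counts = [
    sum(1 for w in weights.values() if abs(w[j]) > TOL)
    for j in range(len(X[0]))
]
print(share_counts)
```

With these synthetic numbers, pattern 1 is kept by all three classes, pattern 3 by one class, and the other two by none, so `share_counts` is `[0, 3, 0, 1]`. The L1 penalty drives weakly correlated patterns exactly to zero, which is why a small set of patterns shared across classes can survive selection.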

Notes

  1. The terms “words”, “parts” and “patterns” are interchangeable; this paper uses “patterns” throughout.

  2. 15 Scenes: http://www-cvr.ai.uiuc.edu/ponce_grp/data/scene_categories/. MIT-indoor 67: http://web.mit.edu/torralba/www/indoor.html. SUN 397: http://vision.princeton.edu/projects/2010/SUN/.

  3. The implementation code and trained models are available at https://github.com/hust-tp/ESMIR.

Acknowledgements

We thank the anonymous reviewers for their very useful comments and suggestions. This work was supported in part by the National Natural Science Foundation of China under Grant 61572207 and Grant 61503145, and by the CAST Young Talent Supporting Program.

Author information

Corresponding author

Correspondence to Xinggang Wang.

Cite this article

Tang, P., Zhang, J., Wang, X. et al. Learning extremely shared middle-level image representation for scene classification. Knowl Inf Syst 52, 509–530 (2017). https://doi.org/10.1007/s10115-016-1015-z

Keywords

  • Scene classification
  • Middle-level image representation
  • Extremely shared patterns