Multimedia Tools and Applications

Volume 71, Issue 3, pp 1033–1050

Beyond Bag-of-Words: combining generative and discriminative models for scene categorization

  • Zhen Li
  • Kim-Hui Yap


This paper proposes an efficient framework for scene categorization that combines a generative model with a discriminative model. A state-of-the-art approach to scene categorization is the Bag-of-Words (BoW) framework. However, scenes span many categories, and in general the BoW codebook must be regenerated whenever a new category is considered, which is computationally expensive. This paper addresses the issue by designing a new framework with good scalability: when an additional category is considered, much less computation is needed, while the resulting image signatures remain discriminative. The image signatures used to train the discriminative model are designed on the basis of the generative model. The soft relevance values of the extracted image signatures are estimated by modeling the image signature space and are incorporated into a Fuzzy Support Vector Machine (FSVM). The effectiveness of the proposed method is validated on the UIUC Scene-15 and NTU-25 datasets, where it is shown to outperform other state-of-the-art approaches to scene categorization.
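
The scalability argument above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the generative model, so a diagonal Gaussian per category is assumed here as a stand-in, and the category names, toy descriptors, and helper functions (fit_gaussian, mean_log_likelihood, signature) are all hypothetical. The sketch only shows why a signature built from per-category models can grow by one entry when a category is added, without refitting anything already trained. The FSVM stage is not shown.

```python
import math
from statistics import mean, pvariance


def fit_gaussian(descriptors):
    """Fit a diagonal Gaussian (assumed stand-in model) to descriptor vectors."""
    dims = list(zip(*descriptors))
    means = [mean(d) for d in dims]
    # Floor the variance so a constant dimension cannot cause division by zero.
    variances = [max(pvariance(d), 1e-6) for d in dims]
    return means, variances


def mean_log_likelihood(descriptors, model):
    """Average log-density of an image's descriptors under one category model."""
    means, variances = model
    total = 0.0
    for x in descriptors:
        for xi, m, v in zip(x, means, variances):
            total += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
    return total / len(descriptors)


def signature(descriptors, models):
    """One signature entry per category model, in insertion order."""
    return [mean_log_likelihood(descriptors, m) for m in models.values()]


# Toy 2-D local descriptors for two hypothetical categories.
training = {
    "coast": [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1]],
    "forest": [[2.0, 2.1], [1.9, 2.2], [2.1, 1.8]],
}
models = {name: fit_gaussian(d) for name, d in training.items()}

image = [[0.1, 0.25], [0.05, 0.2]]
sig_before = signature(image, models)

# Adding a category fits one new model; existing models are left untouched.
models["street"] = fit_gaussian([[5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
sig_after = signature(image, models)
```

In this sketch the first two entries of sig_after are identical to sig_before, so signatures computed earlier stay valid and only gain one dimension. A single shared codebook, by contrast, would have to be rebuilt from all descriptors.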


Scene categorization, Bag-of-Words, Generative model, Discriminative model, Scalability



This work is supported by the Agency for Science, Technology and Research (A*STAR), Singapore, under SERC Grant 062 130 0055. The authors thank Dr. J. C. van Gemert for kindly providing the source code of UNC in [15], and the anonymous reviewers for their valuable suggestions, which significantly improved the quality of the paper.


  1. Boiman O, Shechtman E, Irani M (2008) In defense of nearest-neighbor based image classification. In: IEEE conference on computer vision and pattern recognition, pp 1–8
  2. Bosch A, Zisserman A, Muñoz X (2008) Scene classification using a hybrid generative/discriminative approach. IEEE Trans Pattern Anal Mach Intell 30(4):712–727
  3. Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision, European conference on computer vision
  4. Deselaers T, Heigold G, Ney H (2010) Object classification by fusing SVMs and Gaussian mixtures. Pattern Recogn 43(7):2476–2484
  5. Dorko G, Schmid C (2005) Object class recognition using discriminative local features. In: INRIA Technical Report, RR-5497
  6.
  7. Jiang Y-G, Ngo C-W, Yang J (2007) Towards optimal bag-of-features for object categorization and semantic video retrieval. In: ACM international conference on image and video retrieval
  8. Li T, Mei T, Kweon I-S, Hua X-S (2011) Contextual bag-of-words for visual categorization. IEEE Trans Circuits Syst Video Technol 21(4):381–392
  9. Li Z, Yap K-H, Chen X (2011) Beyond bags of words: combining generative and discriminative models for natural scene categorization. In: International conference on acoustics, speech and signal processing, pp 965–968
  10. Nowak E, Jurie F, Triggs B (2006) Sampling strategies for bag-of-features image classification. In: European conference on computer vision, pp 490–503
  11. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: Proc. int. conf. comput. vis., vol 2, pp 1470–1477
  12. Smeulders AW, Worring M, Santini S, Gupta A, Jain R (2000) Content-based image retrieval at the end of the early years. IEEE Trans Pattern Anal Mach Intell 22(12):1349–1380
  13. Swain M, Ballard D (1991) Color indexing. Int J Comput Vis 7(1):11–32
  14. Szummer M, Picard RW (1998) Indoor-outdoor image classification. In: IEEE international workshop on content-based access of image and video database, pp 42–51
  15. van Gemert JC, Veenman CJ, Smeulders AWM, Geusebroek JM (2010) Visual word ambiguity. IEEE Trans Pattern Anal Mach Intell 32(7):1271–1283
  16. Wu L, Hoi SCH, Yu NH (2010) Semantics-preserving bag-of-words models and applications. IEEE Trans Image Process 19(7):1908
  17. Yu Z, Wong HS (2006) FEMA: A fast expectation maximization algorithm based on grid and PCA. In: IEEE international conference on multimedia & expo, pp 1913–1916
  18. Zhang J, Marszalek M, Lazebnik S, Schmid C (2007) Local features and kernels for classification of texture and object categories: a comprehensive study. Int J Comput Vis 73(2):213–238

Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  1. School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore
