Abstract
Scene image understanding has drawn much attention in the past few years for its intriguing applications. In this paper, we propose a unified probabilistic graphical model called Topic-based Coherent Region Annotation (TCRA) for weakly-supervised scene region annotation. The multiscale over-segmented regions within a scene image are treated as the “words” of our topic model, which imposes neighborhood contextual constraints at the topic level through spatial MRF modeling and incorporates an annotation reasoning mechanism for learning and inferring region labels automatically. Mean-field variational inference is derived for model learning. The proposed TCRA has two main advantages for understanding natural scene images. First, the spatial information of multiscale over-segmented regions is explicitly modeled to obtain coherent region annotations. Second, only image-level labels are needed to automatically infer the label of every region within the scene. This is particularly helpful for reducing the human burden of manually labeling pixel-level semantics in scene understanding research. Thus, given a scene image that has no textual prior, its regions can be automatically labeled using the learned TCRA model. Experimental results on three benchmarks, the MSRCORID image dataset, the UIUC Events image dataset and the SIFT FLOW dataset, show that the proposed model outperforms recent state-of-the-art methods.
Acknowledgments
The work described in this paper was supported by the Natural Science Foundation of China under Grant No. 61272218 and No. 61321491, and the Program for New Century Excellent Talents under NCET-11-0232.
Appendix
Here we briefly describe how the equations in our EM algorithm are derived. We use Maximum Likelihood Estimation (MLE) to estimate both the latent variables and the parameters. For the joint probability distribution, or likelihood, p(X), where X denotes the observations, we aim to maximize the equivalent log-likelihood ln p(X). Letting L denote the latent variables and q(L) an arbitrary distribution over them, Jensen’s inequality gives

$$\ln p(X) = \ln \sum_{L} p(X, L) \ge \sum_{L} q(L) \ln \frac{p(X, L)}{q(L)}.$$
To keep the optimization tractable, we maximize this lower bound of ln p(X). In our TCRA model, we substitute \(\prod _{d = 1}^{D} {p\left ({{\theta ^{d}},{\mathbf {z}^{d}},\beta ,{\mathbf {r}^{d}},\pi ,{\mathbf {w}^{d}},{\mathbf {y}^{d}}\mid \alpha ,\eta ,\rho ,\delta } \right )}\) in (5) and the variational posterior distribution q(β, π, 𝜃, z, y) in (6) into the above inequality.
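As a quick sanity check, the Jensen lower bound above can be verified numerically on a toy model. The sketch below uses a hypothetical two-component Gaussian mixture (not part of the TCRA model): any distribution q over the latent component gives a lower bound on ln p(x), and the exact posterior q(z) = p(z|x) makes the bound tight.

```python
import math

# Toy check of the Jensen lower bound used in the EM derivation:
#   ln p(x) = ln sum_z p(x, z) >= sum_z q(z) ln( p(x, z) / q(z) )
# for any distribution q over the latent variable z.
# The 2-component Gaussian mixture below is a hypothetical example.

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

weights = [0.4, 0.6]               # mixing proportions p(z)
params = [(0.0, 1.0), (3.0, 1.5)]  # (mean, std) of each component

x = 1.0
joint = [w * gauss(x, mu, s) for w, (mu, s) in zip(weights, params)]  # p(x, z)
log_px = math.log(sum(joint))      # exact log-likelihood ln p(x)

def lower_bound(q):
    # ELBO: sum_z q(z) * ln( p(x, z) / q(z) )
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

# Any q lower-bounds ln p(x); the posterior q(z) = p(z|x) makes it tight.
q_arbitrary = [0.5, 0.5]
posterior = [pz / sum(joint) for pz in joint]

assert lower_bound(q_arbitrary) <= log_px + 1e-12
assert abs(lower_bound(posterior) - log_px) < 1e-9
```

The gap between the bound and ln p(x) is exactly the KL divergence from q to the posterior, which is why variational inference maximizes the bound over q.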
In variational inference, each latent variable in {𝜃, z, β, π, y} is optimized by forming the Lagrangian of the lower bound with the appropriate normalization constraints and setting its derivative to zero. Denote the lower bound of the log-likelihood by LL. The update (10) of the latent variable \(\phi _{nk}^{d}\) is obtained by collecting, term by term, every part of LL that depends on \(\phi _{nk}^{d}\), adding a Lagrange multiplier for the normalization constraint on \(\phi _{nk}^{d}\), and setting the derivative with respect to \(\phi _{nk}^{d}\) to zero, which yields update (10). The other update equations can be derived in a similar way. Thus, in the E step we update one latent variable while fixing the others in each iteration. This process is repeated from the topic level to the scene-image level to parse the semantics of images.
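The coordinate-wise E step described above can be sketched on a toy model. The pairwise binary MRF below is a hypothetical stand-in for the spatial MRF over region topics, not the TCRA model itself: each mean-field factor is updated in turn while the other is held fixed, and the updates are iterated until they reach a mutually consistent fixed point.

```python
import math

# Minimal sketch of coordinate-ascent mean-field updates: each factor
# is updated while the others are held fixed, iterated to convergence.
# p(z1, z2) ∝ exp(a*z1 + b*z2 + c*z1*z2), with z_i in {-1, +1}
# (a hypothetical pairwise binary MRF).
a, b, c = 0.5, -0.3, 0.8

# Mean-field parameters m_i = E_q[z_i]; the fixed-point updates are
#   m1 <- tanh(a + c*m2),   m2 <- tanh(b + c*m1)
m1, m2 = 0.0, 0.0
for _ in range(200):
    m1_new = math.tanh(a + c * m2)       # update q(z1) with q(z2) fixed
    m2_new = math.tanh(b + c * m1_new)   # update q(z2) with q(z1) fixed
    if abs(m1_new - m1) + abs(m2_new - m2) < 1e-12:
        m1, m2 = m1_new, m2_new
        break
    m1, m2 = m1_new, m2_new

# At convergence the two updates are mutually consistent.
assert abs(m1 - math.tanh(a + c * m2)) < 1e-9
assert abs(m2 - math.tanh(b + c * m1)) < 1e-9
```

Each sweep monotonically improves the lower bound, which is the same guarantee that makes the alternating E-step updates in the paper converge.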
About this article
Cite this article
Wang, H., Lu, T., Wang, Y. et al. Weakly-supervised region annotation for understanding scene images. Multimed Tools Appl 75, 3027–3051 (2016). https://doi.org/10.1007/s11042-014-2420-5