Abstract
Scene image understanding has drawn much attention in the past few years for its intriguing applications. In this paper, we propose a unified probabilistic graphical model called Topic-based Coherent Region Annotation (TCRA) for weakly-supervised scene region annotation. The multiscale over-segmented regions within a scene image are treated as the “words” of our topic model, which imposes neighborhood contextual constraints at the topic level through spatial MRF modeling and incorporates an annotation reasoning mechanism for learning and inferring region labels automatically. Mean-field variational inference is derived for model learning. The proposed TCRA has two main advantages for understanding natural scene images. First, the spatial information of multiscale over-segmented regions is explicitly modeled to obtain coherent region annotations. Second, only image-level labels are needed to automatically infer the label of every region within the scene. This is particularly helpful for reducing the human burden of manually labeling pixel-level semantics in scene understanding research. Thus, given a scene image that has no textual prior, its regions can be automatically labeled using the learned TCRA model. Experimental results on three benchmarks, the MSRCORID image dataset, the UIUC Events image dataset and the SIFT FLOW dataset, show that the proposed model outperforms recent state-of-the-art methods.
Acknowledgments
The work described in this paper was supported by the Natural Science Foundation of China under Grant No. 61272218 and No. 61321491, and the Program for New Century Excellent Talents under NCET-11-0232.
Appendix
Here we briefly describe how the equations in our EM algorithm are derived. We use Maximum Likelihood Estimation (MLE) to estimate both the latent variables and the parameters. For the joint probability distribution, or likelihood, p(X), where X denotes the observations, we aim to maximize the equivalent log-likelihood ln p(X). Letting L denote the latent variables and q(L) an arbitrary distribution over them, Jensen’s inequality gives

$$\ln p(X) = \ln \sum_{L} p(X, L) \ge \sum_{L} q(L) \ln \frac{p(X, L)}{q(L)}.$$
To keep the optimization tractable, we maximize this lower bound of ln p(X). In our TCRA model, we substitute \(\prod _{d = 1}^{D} {p\left ({{\theta ^{d}},{\mathbf {z}^{d}},\beta ,{\mathbf {r}^{d}},\pi ,{\mathbf {w}^{d}},{\mathbf {y}^{d}}\mid \alpha ,\eta ,\rho ,\delta } \right )}\) in (5) and the variational posterior distribution q(β, π, 𝜃, z, y) in (6) into the above inequality.
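As a quick sanity check, the Jensen lower bound above can be verified numerically on a toy model. The sketch below uses a hypothetical two-component Gaussian mixture (not part of the TCRA model): any distribution q over the latent component gives a lower bound on ln p(x), and the exact posterior q(z) = p(z|x) makes the bound tight.

```python
import math

# Toy check of the Jensen lower bound used in the EM derivation:
#   ln p(x) = ln sum_z p(x, z) >= sum_z q(z) ln( p(x, z) / q(z) )
# for any distribution q over the latent variable z.
# The 2-component Gaussian mixture below is a hypothetical example.

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

weights = [0.4, 0.6]               # mixing proportions p(z)
params = [(0.0, 1.0), (3.0, 1.5)]  # (mean, std) of each component

x = 1.0
joint = [w * gauss(x, mu, s) for w, (mu, s) in zip(weights, params)]  # p(x, z)
log_px = math.log(sum(joint))      # exact log-likelihood ln p(x)

def lower_bound(q):
    # ELBO: sum_z q(z) * ln( p(x, z) / q(z) )
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

# Any q lower-bounds ln p(x); the posterior q(z) = p(z|x) makes it tight.
q_arbitrary = [0.5, 0.5]
posterior = [pz / sum(joint) for pz in joint]

assert lower_bound(q_arbitrary) <= log_px + 1e-12
assert abs(lower_bound(posterior) - log_px) < 1e-9
```

The gap between the bound and ln p(x) is exactly the KL divergence from q to the posterior, which is why variational inference maximizes the bound over q.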
In variational inference, each latent variable in {𝜃, z, β, π, y} is optimized by forming the Lagrangian of the lower bound with the appropriate normalization constraints and setting its derivative to zero. Denote the lower bound of the log-likelihood by LL. The update (10) of the latent variable \(\phi _{nk}^{d}\) is obtained by collecting, term by term, every part of LL that depends on \(\phi _{nk}^{d}\), adding a Lagrange multiplier for the normalization constraint on \(\phi _{nk}^{d}\), and setting the derivative with respect to \(\phi _{nk}^{d}\) to zero, which yields update (10). The other update equations can be derived in a similar way. Thus, in the E step we update one latent variable while fixing the others in each iteration. This process is repeated from the topic level to the scene-image level to parse the semantics of images.
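The coordinate-wise E step described above can be sketched on a toy model. The pairwise binary MRF below is a hypothetical stand-in for the spatial MRF over region topics, not the TCRA model itself: each mean-field factor is updated in turn while the other is held fixed, and the updates are iterated until they reach a mutually consistent fixed point.

```python
import math

# Minimal sketch of coordinate-ascent mean-field updates: each factor
# is updated while the others are held fixed, iterated to convergence.
# p(z1, z2) ∝ exp(a*z1 + b*z2 + c*z1*z2), with z_i in {-1, +1}
# (a hypothetical pairwise binary MRF).
a, b, c = 0.5, -0.3, 0.8

# Mean-field parameters m_i = E_q[z_i]; the fixed-point updates are
#   m1 <- tanh(a + c*m2),   m2 <- tanh(b + c*m1)
m1, m2 = 0.0, 0.0
for _ in range(200):
    m1_new = math.tanh(a + c * m2)       # update q(z1) with q(z2) fixed
    m2_new = math.tanh(b + c * m1_new)   # update q(z2) with q(z1) fixed
    if abs(m1_new - m1) + abs(m2_new - m2) < 1e-12:
        m1, m2 = m1_new, m2_new
        break
    m1, m2 = m1_new, m2_new

# At convergence the two updates are mutually consistent.
assert abs(m1 - math.tanh(a + c * m2)) < 1e-9
assert abs(m2 - math.tanh(b + c * m1)) < 1e-9
```

Each sweep monotonically improves the lower bound, which is the same guarantee that makes the alternating E-step updates in the paper converge.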
About this article
Cite this article
Wang, H., Lu, T., Wang, Y. et al. Weakly-supervised region annotation for understanding scene images. Multimed Tools Appl 75, 3027–3051 (2016). https://doi.org/10.1007/s11042-014-2420-5