Weakly-supervised region annotation for understanding scene images

  • Published in Multimedia Tools and Applications

Abstract

Scene image understanding has drawn much attention in recent years for its intriguing applications. In this paper, we propose a unified probabilistic graphical model called Topic-based Coherent Region Annotation (TCRA) for weakly-supervised scene region annotation. The multiscale over-segmented regions within a scene image are treated as the “words” of our topic model, which imposes neighborhood contextual constraints at the topic level through spatial MRF modeling and incorporates an annotation reasoning mechanism for learning and inferring region labels automatically. Mean field variational inference is derived for model learning. The proposed TCRA has two main advantages for understanding natural scene images. First, the spatial information of multiscale over-segmented regions is explicitly modeled to obtain coherent region annotations. Second, only image-level labels are needed to automatically infer the label of every region within a scene. This is particularly helpful for reducing the human burden of manually labeling pixel-level semantics in scene understanding research. Thus, given a scene image that has no textual prior, its regions can be automatically labeled using the learned TCRA model. Experimental results on three benchmarks, the MSRCORID image dataset, the UIUC Events image dataset and the SIFT FLOW dataset, show that the proposed model outperforms recent state-of-the-art methods.



Acknowledgments

The work described in this paper was supported by the Natural Science Foundation of China under Grant No. 61272218 and No. 61321491, and the Program for New Century Excellent Talents under NCET-11-0232.

Author information


Corresponding author

Correspondence to Tong Lu.

Appendix

Here we briefly describe how to derive the equations in our EM algorithm. We use Maximum Likelihood Estimation (MLE) to estimate both the latent variables and the parameters. For the likelihood p(X), where X denotes the observations, we aim to maximize the equivalent log likelihood log p(X). Let L denote the latent variables; applying Jensen’s inequality gives:

$$\begin{array}{@{}rcl@{}} \log p(\mathbf{X})&=&\log\sum\limits_{\mathbf{L}}{p(\mathbf{L},\mathbf{X})}=\log \sum\limits_{\mathbf{L}}{\frac{p(\mathbf{L},\mathbf{X})q(\mathbf{L})}{q(\mathbf{L})}}\\ &\geq&\sum\limits_{\mathbf{L}}{q(\mathbf{L})\log{\frac{p(\mathbf{L},\mathbf{X})}{q(\mathbf{L})}}} =E_{q}\left[\log{p(\mathbf{L},\mathbf{X})}\right]-E_{q}\left[\log{q(\mathbf{L})}\right] \end{array} $$
(16)
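As a numerical sanity check (not part of the paper), the bound in (16) can be verified on a toy discrete model: the evidence lower bound never exceeds the exact log evidence, and is tight when q equals the true posterior. All numbers below are illustrative.

```python
import math

# Toy check of Jensen's bound (16): log p(x) >= E_q[log p(z,x)] - E_q[log q(z)].
# Hypothetical 2-component discrete model; values chosen only for illustration.
prior = [0.6, 0.4]            # p(z)
lik   = [0.2, 0.7]            # p(x | z) for a single observation x

joint = [prior[k] * lik[k] for k in range(2)]   # p(z, x)
log_px = math.log(sum(joint))                   # exact log evidence log p(x)

def elbo(q):
    """E_q[log p(z,x)] - E_q[log q(z)] for a variational distribution q(z)."""
    return sum(q[k] * (math.log(joint[k]) - math.log(q[k])) for k in range(2))

posterior = [j / sum(joint) for j in joint]     # true p(z | x)

assert elbo([0.5, 0.5]) <= log_px + 1e-12       # the bound holds for any q
assert abs(elbo(posterior) - log_px) < 1e-9     # and is tight at q = p(z|x)
```

Maximizing the right-hand side over q is exactly what the mean field updates below do, one factor at a time.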

To keep the problem tractable, we maximize this lower bound of \(\log p(\mathbf{X})\). In our TCRA model, we substitute \(\prod \limits _{d = 1}^{D} {p\left ({{\theta ^{d}},{\mathbf {z}}^{d},\beta ,{\mathbf {r}}^{d},\pi ,{\mathbf {w}}^{d},{\mathbf {y}}^{d}|\alpha ,\eta ,\rho ,\delta } \right )}\) from (5) and the variational posterior distribution q(β, π, 𝜃, z, y) from (6) into the above equation.

In variational inference, to optimize over the latent variables { 𝜃, z, β, π, y}, we form the Lagrangian with the normalization constraints and set its derivative to zero. Denoting the lower bound of the log-likelihood by LL, the update (10) of the latent variable \(\phi _{nk}^{d}\) can be derived as follows:

$$\begin{array}{@{}rcl@{}} LL_{[\phi_{nk}^{d}]}^{(1)}&=&E_{q}\{\log p(z_{nk}^{d}|\theta^{d},\delta)\} \\ &=&E_{q}\left\{z_{nk}^{d} \log {{\theta_{k}^{d}}}+\delta\sum\limits_{m \in {\mathcal{N}}(n)}{z_{nk}^{d} z_{mk}^{d}}\right\} + \text{constant} \\ &=&\phi_{nk}^{d} E\left[\log{{\theta_{k}^{d}}}|\gamma^{d}\right] + \delta \sum\limits_{m \in {\mathcal{N}}(n)}{\phi_{nk}^{d} \phi_{mk}^{d}} + \text{constant} \end{array} $$
(17)

and

$$ \frac{\partial LL_{[\phi_{nk}^{d}]}^{(1)}}{\partial \phi_{nk}^{d}}=E \left[ \log{{\theta_{k}^{d}}} | \gamma^{d} \right] + \delta \sum\limits_{m \in {\mathcal{N}}(n)}{\phi_{mk}^{d}} $$
(18)
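The neighborhood term in (18) is the part of the update that couples a region's topic assignment to those of its spatial neighbors. A minimal sketch, using a hypothetical 3-region adjacency list and an assumed coupling strength delta (none of these values come from the paper):

```python
# Sketch of the MRF neighborhood term delta * sum_{m in N(n)} phi_{mk} in (18).
delta = 0.5                                   # MRF coupling strength (assumed)
neighbors = {0: [1], 1: [0, 2], 2: [1]}       # N(n): toy 3-region chain
phi = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]    # current phi_{mk}, K = 2 topics

def mrf_term(n, k):
    """delta * sum over spatial neighbors m of N(n) of phi_{mk}."""
    return delta * sum(phi[m][k] for m in neighbors[n])

# Region 1 borders regions 0 and 2, so its score for topic 0 is pulled
# toward the neighbors' current beliefs: 0.5 * (0.7 + 0.2) = 0.45.
assert abs(mrf_term(1, 0) - 0.45) < 1e-12
```

Larger delta makes neighboring regions more likely to share a topic, which is what yields the coherent annotations described in the abstract.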

Next,

$$\begin{array}{@{}rcl@{}} LL_{[\phi_{nk}^{d}]}^{(2)}&=&E_{q} \left\{ \log \prod\limits_{f=1}^{N_{f}}{ p({r_{n}^{d,f} | {z_{n}^{d}} , {\beta^{f}}}) } \right\} \\ &=& E_{q} \left\{ \sum\limits_{f=1}^{N_{f}} \sum\limits_{v=1}^{V_{f}} { r_{nv}^{d,f} z_{nk}^{d} \log \beta_{kv}^{f} } \right\} \\ &=& \phi_{nk}^{d} \sum\limits_{f=1}^{N_{f}} \sum\limits_{v=1}^{V_{f}} { r_{nv}^{d,f} E \left[ \log \beta_{kv}^{f} | {\lambda_{k}^{f}} \right]} \end{array} $$
(19)

and

$$ \frac{\partial LL_{[\phi_{nk}^{d}]}^{(2)}}{\partial \phi_{nk}^{d}} = \sum\limits_{f=1}^{N_{f}} \sum\limits_{v=1}^{V_{f}} { r_{nv}^{d,f} E \left[ \log \beta_{kv}^{f} | {\lambda_{k}^{f}} \right]} $$
(20)

Then,

$$\begin{array}{@{}rcl@{}} LL_{[\phi_{nk}^{d}]}^{(3)} &=& E_{q} \left\{ \log p({w_{m}^{d}} | y_{mn}^{d}, z_{nk}^{d}, \pi) \right\} \\ &=& E_{q} \left\{ \sum\limits_{m=1}^{M^{d}} \sum\limits_{u=1}^{U} {y_{mn}^{d} z_{nk}^{d} w_{mn}^{d} \log \pi_{ku}} \right\}\\ &=&\sum\limits_{m=1}^{M^{d}} \sum\limits_{u=1}^{U} {\kappa_{mn}^{d} \phi_{nk}^{d} w_{mn}^{d} E \left[ \log \pi_{ku} | \xi_{k} \right]} \end{array} $$
(21)

and

$$ \frac{\partial LL_{[\phi_{nk}^{d}]}^{(3)}}{\partial \phi_{nk}^{d}} = \sum\limits_{m=1}^{M^{d}} \sum\limits_{u=1}^{U} {\kappa_{mn}^{d} w_{mn}^{d} E \left[ \log \pi_{ku} | \xi_{k} \right]} $$
(22)

The last term is:

$$\begin{array}{@{}rcl@{}} -LL_{[\phi_{nk}^{d}]}^{(4)} &=& E_{q} \left\{ \log q(z_{nk}^{d} | \phi_{nk}^{d} ) \right\} + \lambda \left(\sum\limits_{k=1}^{K}{\phi_{nk}^{d}}-1\right)\\ &=& \phi_{nk}^{d} \log{\phi_{nk}^{d}} + \lambda \left(\sum\limits_{k=1}^{K}{\phi_{nk}^{d}}-1\right) \end{array} $$
(23)

and

$$ \frac{\partial LL_{[\phi_{nk}^{d}]}^{(4)}}{\partial \phi_{nk}^{d}} = -\log{\phi_{nk}^{d}} - 1 - \lambda $$
(24)

Finally, by combining all the above terms and setting the derivative to zero:

$$ \frac{\partial LL_{[\phi_{nk}^{d}]}}{\partial \phi_{nk}^{d}}=\frac{\partial LL_{[\phi_{nk}^{d}]}^{(1)}}{\partial \phi_{nk}^{d}} +\frac{\partial LL_{[\phi_{nk}^{d}]}^{(2)}}{\partial \phi_{nk}^{d}}+\frac{\partial LL_{[\phi_{nk}^{d}]}^{(3)}}{\partial \phi_{nk}^{d}} +\frac{\partial LL_{[\phi_{nk}^{d}]}^{(4)}}{\partial \phi_{nk}^{d}} = 0 $$
(25)

we obtain the update (10). The other update equations can be derived in a similar way. Thus, in the E step, we update one latent variable while fixing the others in each iteration. This process is repeated from the topic level to the scene image level to parse the semantics of images.
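Solving (25) for \(\phi_{nk}^{d}\) exponentiates the combined expectation terms, and the Lagrange multiplier in (23)-(24) enforces that the result sums to one over k, i.e. the update is a softmax. A minimal sketch, with illustrative stand-in scores rather than values from the model:

```python
import math

# Sketch of the fixed-point update implied by (25):
#   phi_{nk} ∝ exp(E[log theta_k] + delta * sum_{m in N(n)} phi_{mk}
#                  + feature and annotation expectation terms),
# normalized over topics k. Scores below are illustrative stand-ins.

def update_phi_n(scores):
    """Softmax over topics: the Lagrange multiplier lambda enforces
    sum_k phi_{nk} = 1, which is exactly this normalization."""
    m = max(scores)                       # subtract max for numerical stability
    exp_s = [math.exp(s - m) for s in scores]
    z = sum(exp_s)
    return [e / z for e in exp_s]

# Example: three topics; scores are the summed expectation terms for region n.
phi_n = update_phi_n([1.2, -0.3, 0.5])
assert abs(sum(phi_n) - 1.0) < 1e-12      # a valid distribution over topics
assert phi_n[0] > phi_n[2] > phi_n[1]     # ordering follows the scores
```

In the E step this update is applied to one region at a time while the other variational factors are held fixed, then the process cycles until the bound converges.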

Cite this article

Wang, H., Lu, T., Wang, Y. et al. Weakly-supervised region annotation for understanding scene images. Multimed Tools Appl 75, 3027–3051 (2016). https://doi.org/10.1007/s11042-014-2420-5
