A Statistical Model for General Contextual Object Recognition

  • Peter Carbonetto
  • Nando de Freitas
  • Kobus Barnard
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3021)


We consider object recognition as the process of attaching meaningful labels to specific regions of an image, and propose a model that learns spatial relationships between objects. Given a set of images and their associated text (e.g. keywords, captions, descriptions), the objective is to segment an image, in either a crude or sophisticated fashion, then to find the proper associations between words and regions. Previous models are limited by the scope of the representation. In particular, they fail to exploit spatial context in the images and words. We develop a more expressive model that takes this into account. We formulate a spatially consistent probabilistic mapping between continuous image feature vectors and the supplied word tokens. By learning both word-to-region associations and object relations, the proposed model augments scene segmentations due to smoothing implicit in spatial consistency. Context introduces cycles to the undirected graph, so we cannot rely on a straightforward implementation of the EM algorithm for estimating the model parameters and densities of the unknown alignment variables. Instead, we develop an approximate EM algorithm that uses loopy belief propagation in the inference step and iterative scaling on the pseudo-likelihood approximation in the parameter update step. The experiments indicate that our approximate inference and learning algorithm converges to good local solutions. Experiments on a diverse array of images show that spatial context considerably improves the accuracy of object recognition. Most significantly, spatial context combined with a nonlinear discrete object representation allows our models to cope well with over-segmented scenes.
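The inference step mentioned in the abstract relies on loopy belief propagation, which passes messages between neighbouring regions even though the graph contains cycles. The following is a minimal sketch of sum-product loopy belief propagation on a small pairwise Markov random field, not the authors' implementation. The function names (`loopy_bp`, `exact_marginals`), the two-label setup, the potentials, and the 2x2 region grid are all illustrative assumptions; the paper's actual model also includes translation potentials linking image features to words, which are stood in for here by fixed unary scores.

```python
from itertools import product
from math import prod


def loopy_bp(unary, edges, pairwise, n_iters=50, tol=1e-9):
    """Sum-product loopy belief propagation on a pairwise MRF (illustrative).

    unary[i][a]    : potential for region i taking label a (assumed given)
    edges          : undirected (i, j) pairs between neighbouring regions
    pairwise[a][b] : symmetric compatibility of labels a and b on any edge
    Returns approximate marginals beliefs[i][a].
    """
    n_nodes, n_labels = len(unary), len(unary[0])
    neighbours = {i: [] for i in range(n_nodes)}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # One message per directed edge, initialised to uniform.
    msgs = {(i, j): [1.0 / n_labels] * n_labels
            for i in neighbours for j in neighbours[i]}

    for _ in range(n_iters):
        new_msgs = {}
        for (i, j) in msgs:
            out = []
            for b in range(n_labels):
                total = 0.0
                for a in range(n_labels):
                    # Product of messages into i from every neighbour except j.
                    incoming = prod(msgs[(k, i)][a]
                                    for k in neighbours[i] if k != j)
                    total += unary[i][a] * pairwise[a][b] * incoming
                out.append(total)
            z = sum(out)
            new_msgs[(i, j)] = [v / z for v in out]  # normalise for stability
        delta = max(abs(new_msgs[e][b] - msgs[e][b])
                    for e in msgs for b in range(n_labels))
        msgs = new_msgs
        if delta < tol:
            break

    beliefs = []
    for i in range(n_nodes):
        raw = [unary[i][a] * prod(msgs[(k, i)][a] for k in neighbours[i])
               for a in range(n_labels)]
        z = sum(raw)
        beliefs.append([v / z for v in raw])
    return beliefs


def exact_marginals(unary, edges, pairwise):
    """Brute-force marginals by enumerating every labelling (tiny graphs only)."""
    n_nodes, n_labels = len(unary), len(unary[0])
    marg = [[0.0] * n_labels for _ in range(n_nodes)]
    for labels in product(range(n_labels), repeat=n_nodes):
        w = prod(unary[i][labels[i]] for i in range(n_nodes))
        w *= prod(pairwise[labels[i]][labels[j]] for i, j in edges)
        for i in range(n_nodes):
            marg[i][labels[i]] += w
    z = sum(marg[0])
    return [[v / z for v in row] for row in marg]


# Hypothetical 2x2 grid of regions: region 0 has strong evidence for label 0,
# the other three are ambiguous. The four edges form a single cycle.
unary = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
grid_edges = [(0, 1), (1, 3), (3, 2), (2, 0)]
smooth = [[2.0, 1.0], [1.0, 2.0]]  # neighbouring regions prefer equal labels

beliefs = loopy_bp(unary, grid_edges, smooth)
for i, b in enumerate(beliefs):
    print(f"region {i}: {b}")
```

On a graph without cycles the beliefs returned by this procedure coincide with the exact marginals; on the cyclic grid they are only approximations, which is why the brute-force routine is included for comparison on small cases. In this toy setup the smoothing potential pulls the ambiguous regions toward the label favoured by their well-supported neighbour, which is the kind of spatial consistency effect the abstract describes.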


Partition Function; Object Recognition; Spatial Context; Translation Model; Statistical Machine Translation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Peter Carbonetto (1)
  • Nando de Freitas (1)
  • Kobus Barnard (2)
  1. Dept. of Computer Science, University of British Columbia, Vancouver, Canada
  2. Dept. of Computer Science, University of Arizona, Tucson
