Translating Images to Words for Recognizing Objects in Large Image and Video Collections

  • Chapter

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4170)

Abstract

We present a new approach to the object recognition problem, motivated by the recent availability of large annotated image and video collections. This approach treats object recognition as the translation of visual elements into words, similar to the translation of text from one language to another. The visual elements, represented in feature space, are categorized into a finite set of blobs. The correspondences between the blobs and the words are learned using a method adapted from Statistical Machine Translation. Once learned, these correspondences can be used to predict words corresponding to particular image regions (region naming), to predict words associated with entire images (auto-annotation), or to associate speech transcript text with the correct video frames (video alignment). We present results on the Corel data set, which consists of annotated images, and on the TRECVID 2004 data set, which consists of video frames associated with speech transcript text and manual annotations.
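
As a rough illustration of how blob-to-word correspondences might be learned, the sketch below runs an IBM Model 1-style EM procedure over toy data. The abstract says only that the method is adapted from Statistical Machine Translation, so the choice of Model 1, the function names `train_translation_table` and `name_region`, the integer blob ids, and the three toy "images" are all assumptions made for illustration, not the chapter's actual implementation. Blob ids are taken as given here; the quantization of region features into blobs is not shown.

```python
from collections import defaultdict


def train_translation_table(pairs, iterations=20):
    """Estimate t(word | blob) with an IBM Model 1-style EM loop.

    pairs: list of (blobs, words) tuples, one per annotated image.
    Hypothetical helper; the chapter's own training setup may differ.
    """
    vocab = {w for _, words in pairs for w in words}
    uniform = 1.0 / len(vocab)
    t = defaultdict(lambda: uniform)  # keyed by (word, blob)

    for _ in range(iterations):
        count = defaultdict(float)  # expected (word, blob) co-alignments
        total = defaultdict(float)  # expected alignments per blob
        for blobs, words in pairs:
            for w in words:
                # E-step: share each word among the blobs of its image.
                norm = sum(t[(w, b)] for b in blobs)
                for b in blobs:
                    frac = t[(w, b)] / norm
                    count[(w, b)] += frac
                    total[b] += frac
        # M-step: renormalise so the words for each blob sum to one.
        t = defaultdict(
            float, {(w, b): c / total[b] for (w, b), c in count.items()}
        )
    return t, vocab


def name_region(t, blob, vocab):
    """Region naming: pick the most probable word for a single blob."""
    return max(sorted(vocab), key=lambda w: t[(w, blob)])


# Toy corpus (made up): each image is a bag of blob ids plus its annotation words.
toy_pairs = [
    ([0, 1], ["sky", "grass"]),
    ([0, 2], ["sky", "water"]),
    ([1, 3], ["grass", "tiger"]),
]

table, vocab = train_translation_table(toy_pairs)
names = {blob: name_region(table, blob, vocab) for blob in range(4)}
print(names)
```

No image states directly which blob goes with which word. The pairing emerges because co-occurrence across images pulls probability mass toward consistent blob-word pairs. Auto-annotation of a whole image could be sketched in the same spirit by combining the word probabilities of all its blobs.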

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Duygulu, P., Baştan, M., Forsyth, D. (2006). Translating Images to Words for Recognizing Objects in Large Image and Video Collections. In: Ponce, J., Hebert, M., Schmid, C., Zisserman, A. (eds) Toward Category-Level Object Recognition. Lecture Notes in Computer Science, vol 4170. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11957959_14

  • DOI: https://doi.org/10.1007/11957959_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68794-8

  • Online ISBN: 978-3-540-68795-5

  • eBook Packages: Computer Science (R0)
