Translating Images to Words for Recognizing Objects in Large Image and Video Collections

Duygulu, Pınar; Baştan, Muhammet; Forsyth, David

doi:10.1007/11957959_14


Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4170)

Abstract

We present a new approach to the object recognition problem, motivated by the recent availability of large annotated image and video collections. This approach treats object recognition as the translation of visual elements to words, similar to the translation of text from one language to another. The visual elements, represented in feature space, are categorized into a finite set of blobs. The correspondences between the blobs and the words are learned using a method adapted from Statistical Machine Translation. Once learned, these correspondences can be used to predict words corresponding to particular image regions (region naming), to predict words associated with entire images (auto-annotation), or to associate speech transcript text with the correct video frames (video alignment). We present our results on the Corel data set, which consists of annotated images, and on the TRECVID 2004 data set, which consists of video frames associated with speech transcript text and manual annotations.
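To make the translation analogy concrete, the sketch below learns a blob-to-word table with a simple expectation-maximization loop in the style of IBM Model 1, a standard lexicon model from statistical machine translation. It is an illustration under stated assumptions, not the chapter's implementation: the abstract does not say which translation model is used, the blobs are assumed to be already quantized into discrete labels, and the function names (`train_translation_table`, `annotate`) and the toy data are hypothetical.

```python
from collections import defaultdict


def train_translation_table(pairs, iterations=20):
    """Estimate p(word | blob) with IBM Model 1 style EM (no NULL token).

    pairs: list of (blobs, words) tuples, one per annotated image.
    This is an assumed, simplified model, not the authors' exact method.
    """
    blob_vocab = {b for blobs, _ in pairs for b in blobs}
    word_vocab = {w for _, words in pairs for w in words}
    # Start from a uniform distribution over words for every blob.
    t = {b: {w: 1.0 / len(word_vocab) for w in word_vocab} for b in blob_vocab}

    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        for blobs, words in pairs:
            for w in words:
                # E-step: share the word among the blobs of the same image
                # in proportion to the current p(w | b).
                norm = sum(t[b][w] for b in blobs)
                for b in blobs:
                    counts[b][w] += t[b][w] / norm
        # M-step: renormalize the expected counts for each blob.
        for b in blob_vocab:
            total = sum(counts[b].values())
            if total > 0:
                for w in word_vocab:
                    t[b][w] = counts[b][w] / total
    return t


def annotate(t, blobs, k=1):
    """Auto-annotation sketch: rank words by summed p(w | b) over the blobs."""
    scores = defaultdict(float)
    for b in blobs:
        for w, p in t.get(b, {}).items():
            scores[w] += p
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy collection: each image is (blob labels, annotation words).
toy_pairs = [
    (["b_sky", "b_grass"], ["sky", "grass"]),
    (["b_sky", "b_water"], ["sky", "water"]),
    (["b_grass", "b_tiger"], ["grass", "tiger"]),
]
table = train_translation_table(toy_pairs)
# Region naming: the most probable word for each blob.
region_names = {b: max(probs, key=probs.get) for b, probs in table.items()}
print(region_names)
print(annotate(table, ["b_water"], k=1))
```

Although no image states which blob goes with which word, co-occurrence across images is enough for EM to separate them: `b_water` only ever appears alongside `b_sky`, and because `b_sky` also explains "sky" in another image, the probability mass for "water" shifts to `b_water`. Region naming then reads off the top word per blob, and auto-annotation aggregates the same table over all blobs in an image.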



Author information

Authors and Affiliations

Department of Computer Engineering, Bilkent University, Ankara, Turkey
Pınar Duygulu & Muhammet Baştan
University of Illinois, 405 N. Mathews Avenue, Urbana, IL, 61801, USA
David Forsyth


Editor information

Editors and Affiliations

Département d’Informatique, Ecole Normale Supérieure, P.O. Box, Paris, France
Jean Ponce
Carnegie Mellon University, Pittsburgh, USA
Martial Hebert
GRAVIR-INRIA, 655 avenue de l’Europe, P.O. Box, 38330, Montbonnot, France
Cordelia Schmid
Department of Engineering Science, University of Oxford, Parks Road, OX1 3PJ, Oxford, UK
Andrew Zisserman


About this chapter

Cite this chapter

Duygulu, P., Baştan, M., Forsyth, D. (2006). Translating Images to Words for Recognizing Objects in Large Image and Video Collections. In: Ponce, J., Hebert, M., Schmid, C., Zisserman, A. (eds) Toward Category-Level Object Recognition. Lecture Notes in Computer Science, vol 4170. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11957959_14


DOI: https://doi.org/10.1007/11957959_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68794-8
Online ISBN: 978-3-540-68795-5
eBook Packages: Computer Science (R0)
