Building descriptive and discriminative visual codebook for large-scale image applications

Tian, Qi; Zhang, Shiliang; Zhou, Wengang; Ji, Rongrong; Ni, Bingbing; Sebe, Nicu

doi:10.1007/s11042-010-0636-6

Published: 18 November 2010

Multimedia Tools and Applications, Volume 51, pages 441–477 (2011)

Abstract

Inspired by the success of textual words in large-scale textual information processing, researchers are trying to extract visual words from images that function similarly to textual words. Visual words are commonly generated by clustering a large number of local image features and taking the cluster centers as visual words. This approach is simple and scalable, but it results in noisy visual words. Many works have been reported that try to improve the descriptive and discriminative ability of visual words. This paper gives a comprehensive survey of visual vocabularies and details several state-of-the-art algorithms. A comprehensive review and summary of the related work on visual vocabulary is presented first. Then, we introduce our recent algorithms for descriptive and discriminative visual word generation, i.e., latent visual context analysis for descriptive visual word identification [74], descriptive visual words and visual phrases generation [68], contextual visual vocabulary which combines both semantic contexts and spatial contexts [69], and visual vocabulary hierarchy optimization [18]. Additionally, we introduce two post-processing strategies that further improve the performance of a visual vocabulary: spatial coding [73] efficiently removes mismatched visual words between images for more reasonable image similarity computation, and user-preference-based visual word weighting [44] makes the image similarity computed from visual words more consistent with users’ preferences or habits.
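The baseline pipeline mentioned in the abstract (cluster local features, take the cluster centers as visual words) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_codebook`, `quantize`, and `bow_histogram` are hypothetical, plain k-means (Lloyd's algorithm) written with NumPy stands in for whatever clustering a real system would use, and the 8-dimensional synthetic descriptors are an assumption standing in for real local features such as SIFT.

```python
import numpy as np


def build_codebook(descriptors, num_words, num_iters=20, seed=0):
    """Cluster local descriptors with plain k-means (Lloyd's algorithm).

    Returns an array of shape (num_words, dim); each row is one visual word.
    """
    rng = np.random.default_rng(seed)
    # Initialise the centers with distinct, randomly chosen descriptors.
    idx = rng.choice(len(descriptors), size=num_words, replace=False)
    centers = descriptors[idx].astype(float)
    for _ in range(num_iters):
        labels = quantize(descriptors, centers)
        for k in range(num_words):
            members = descriptors[labels == k]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
    return centers


def quantize(descriptors, centers):
    """Assign each descriptor to the index of its nearest visual word."""
    # Squared Euclidean distances, shape (num_descriptors, num_words).
    diffs = descriptors[:, None, :] - centers[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def bow_histogram(descriptors, centers):
    """L1-normalised bag-of-visual-words histogram for one image."""
    labels = quantize(descriptors, centers)
    counts = np.bincount(labels, minlength=len(centers)).astype(float)
    return counts / counts.sum()


# Synthetic stand-in for local features: two well-separated blobs.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(200, 8))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(200, 8))
training = np.vstack([blob_a, blob_b])

codebook = build_codebook(training, num_words=2)

# One "image" containing 30 features from blob A and 10 from blob B.
image_descriptors = np.vstack([blob_a[:30], blob_b[:10]])
hist = bow_histogram(image_descriptors, codebook)
```

Each image is reduced to a histogram over visual words, which is what allows text-retrieval machinery to be reused. Because every descriptor is hard-assigned to its nearest center regardless of how well it fits, a sketch like this also shows where the noise discussed in the abstract can enter.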

References

Agarwal S, Roth D (2002) Learning a sparse representation for object detection. ECCV
Battiato S, Farinella G, Gallo G, Ravi D (2009) Spatial hierarchy of textons distribution for scene classification. Proc. Eurocom Multimedia Modeling, pp 333–342
Blei D, Ng A, Jordan M (2003) Latent dirichlet allocation. J MLR 3:993–1022
Brin S, Page L (1998) The anatomy of a large-scale hyper textual web search engine. WWW
Chum O, Philbin J, Sivic J, Isard M, Zisserman A (2007) Total recall: automatic query expansion with a generative feature model for object retrieval. ICCV
Deerwester S, Dumais S, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Duygulu P, Barnard K, Freitas J, Forsyth D (2002) Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. ECCV
Fischler M, Bolles R (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm ACM 24:381–395
Gemert V, Veenman C, Smeulders A, Geusebroek J (2010) Visual word ambiguity. T-PAMI 32(7):1271–1283
Grauman K, Darrell T (2007) Approximate correspondences in high dimensions. NIPS
Globerson A, Roweis S (2006) Metric learning by collapsing classes. Adv In Neu Info Proce Sys 18:451–458
Hofmann T (1999) Probabilistic latent semantic indexing. ACM SIGIR
Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. ML 41:177–196
Indyk P, Thaper N (1998) Fast image retrieval via embeddings. Symposium on Theory of Computing
Jegou H, Harzallah H, Schmid C (2007) A contextual dissimilarity measure for accurate and efficient image search. CVPR
Jegou H, Douze M, Schmid C (2008) Hamming embedding and weak geometric consistency for large scale image search. ECCV
Ji R, Xie X, Yao H, Wu Y, Ma W (2008) Incremental indexing of visual vocabulary for scalable retrieval. ICME
Ji R, Xie X, Yao H, Ma W (2009) Vocabulary hierarchy optimization for effective and transferable retrieval. CVPR
Ji R, Yao H, Sun X, Zhong B, Gao W (2010) Towards semantic embedding in visual vocabulary. CVPR
Jing Y, Baluja S (2008) VisualRank: applying pagerank to large-scale image search. IEEE Trans on PAMI 30:1877–1890
Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. ICCV, pp 604–610
Kim G, Faloutsos C, Hebert M (2008) Unsupervised modeling of object categories using link analysis techniques. CVPR
Kim G, Faloutsos C, Hebert M (2008) Unsupervised modeling and recognition of object categories with combination of visual contents and geometric similarity links. ACM MIR
Kohonen T (1986) Learning vector quantization for pattern recognition. Tech. Rep. TKK-F-A601, Helsinki Institute of Technology
Kohonen T (2000) Self-organizing maps, 3rd edition, Springer-Verlag
Lazebnik S, Raginsky M (2009) Supervised learning of quantizer codebook by information loss minimization. PAMI 31(7):1294–1309
Leibe B, Leonardis A, Schiele B (2004) Combined object categorization and segmentation with an implicit shape model. ECCV
Leordeanu M, Hebert M (2005) A spectral technique for correspondence problems using pairwise constraints. ICCV
Leung T, Malik J (2001) Representing and recognizing the visual appearance of materials using 3-d textons. IJCV
Li F, Pietro P (2007) A bayesian hierarchical model for learning natural scene categories. ICCV
Li T, Mei T, Kweon I, Hua X (2010) Contextual bag-of-words for visual categorization. IEEE Transactions on Circuits and Systems for Video Technology
Liu D, Hua G, Viola P, Chen T (2008) Integrated feature selection and higher-order spatial feature extraction for object categorization. CVPR, pp 1–8
Liu C, Yuen J, Torralba A (2009) Dense scene alignment using SIFT flow for object recognition. CVPR
Liu J, Yang Y, Shah M (2009) Learning semantic visual vocabularies using diffusion distance. CVPR
Liu D, Hua X, Yang L, Wang M, Zhang H (2009) Tag ranking. WWW
Lowe D (2004) Distinctive image features from scale-invariant keypoints. IJCV 60(2):91–110
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. 5th Berkeley Symposium on Mathematical Statistics and Probability, pp 281–297
Marszalek M, Schmid C (2006) Spatial weighting for bag-of-features. CVPR, pp 2118–2125
Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A (2008) Supervised dictionary learning. NIPS
Marszalek M, Schmid C (2007) Semantic hierarchies for visual object recognition. CVPR
Matas J, Chum O, Urban M, Pajdla T (2002) Robust wide baseline stereo from maximally stable extremal regions. BMVC
Moosmann F, Triggs B, Jurie F (2006) Fast discriminative visual codebooks using randomized clustering forests. NIPS
Moosmann F, Nowak E, Jurie F (2008) Randomized clustering forests for image classification. PAMI 30(9):1632–1646
Ni B, Tian Q, Yang L, Yan S (2010) Query-log aware content based image retrieval. To be submitted
Nister D, Stewenius H (2006) Scalable recognition with a vocabulary tree. CVPR, pp 2161–2168
Perronnin F (2008) Universal and adapted vocabularies for generic visual categorization. PAMI 30(7):1243–1256
Perronnin F, Dance C, Csurka G, Bressan M (2006) Adapted vocabularies for generic visual categorization. ECCV
Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. CVPR
Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: improving particular object retrieval in large scale image databases. CVPR
Rao A, Miller D, Rose K, Gersho A (1996) A generalized VQ method for combined compression and estimation. ICASSP
Salton G, McGill MJ (1983) Introduction to modern information retrieval. McGraw-Hill, New York
Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manage 24(5):513–523
Savarese S, Winn J, Criminisi A (2006) Discriminative object class models of appearance and shape by correlatons. CVPR, pp 2033–2040
Schindler G, Brown M (2007) City-scale location recognition. CVPR
Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. ICCV, pp 1470–1477
Viola P, Jones M (2001) Robust real-time face detection. ICCV, pp 7–14
Wang L (2007) Toward a discriminative codebook: codeword selection across multi-resolution. CVPR
Wang F, Jiang Y, Ngo C (2008) Video event detection using motion relativity and visual relatedness. ACM Multimedia, pp 239–248
Wang S, Huang Q, Jiang S, Qin L, Tian Q (2009) Visual context rank for web image re-ranking. ACM workshop on LSMRM
Wu Z, Ke Q, Sun J (2009) Bundling features for large-scale partial-duplicate web image search. CVPR
Wu L, Hoi S, Yu N (2009) Semantic-preserving bag-of-words models for efficient image annotation. ACM workshop on LSMRM, pp 19–26
Xu D, Chang S (2008) Video event recognition using kernel methods with multilevel temporal alignment. PAMI 30(11):1985–1997
Yang J (2007) Evaluating bag-of-visual-words representations in scene classification. ACM Multimedia
Yang L, Meer P, Foran D (2007) Multiple class segmentation using a unified framework over mean-shift patches. CVPR, pp 1–8
Yates R, Neto B (1999) Modern information retrieval, Addison Wesley Longman Publishing Co. Inc
Yuan J, Wu Y, Yang M (2007) Discovery of collocation patterns: from visual words to visual phrases. CVPR, pp 1–8
Zhang J, Marszalek M, Lazebnik S, Schmid C (2007) Local features and kernels for classification of texture and object categories: A comprehensive review. IJCV
Zhang S, Tian Q, Hua G, Huang Q, Li S (2009) Descriptive visual words and visual phrases for image applications. ACM Multimedia
Zhang S, Huang Q, Hua G, Jiang S, Gao W, Tian Q (2010) Building contextual visual vocabulary for large-scale image applications. ACM Multimedia
Zhang S, Huang Q, Lu Y, Gao W, Tian Q (2010) Building pair-wise visual word tree for efficient image re-ranking. ICASSP
Zheng Y, Zhao M, Neo S, Chua T, Tian Q (2008) Visual synset: a higher-level visual representation. CVPR, pp 1–8
Zhou W, Li H, Lu Y, Tian Q (2010) Large scale partial-duplicate image retrieval with bi-space quantization and geometric consistency. ICASSP
Zhou W, Lu Y, Song Y, Li H, Tian Q (2010) Spatial coding for large-scale partial-duplicate web image search. ACM Multimedia
Zhou W, Tian Q, Yang L, Li H (2010) Latent visual context analysis for image re-ranking. ACM International Conference on Image and Video Retrieval (CIVR), Xi’an, China

Acknowledgement

This work is supported in part by NSF IIS 1052851 and by Akiira Media Systems, Inc. The work of Nicu Sebe has been supported by the FP7 IP GLOCAL European project and by the FIRB S-PATTERN project.

Author information

Authors and Affiliations

Computer Science Department, University of Texas at San Antonio, San Antonio, TX, 78249, USA
Qi Tian
Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Shiliang Zhang
EEIS Department, University of Science and Technology of China, Hefei, 230027, China
Wengang Zhou
Harbin Institute of Technology, Harbin, 150001, Heilongjiang, China
Rongrong Ji
National University of Singapore, 4 Engineering Drive 3, Singapore, 117576, Singapore
Bingbing Ni
Department of Information Engineering and Computer Science, University of Trento, Via Sommarive 14-38100 Povo, Trento, Italy
Nicu Sebe

Corresponding author

Correspondence to Qi Tian.

About this article

Cite this article

Tian, Q., Zhang, S., Zhou, W. et al. Building descriptive and discriminative visual codebook for large-scale image applications. Multimed Tools Appl 51, 441–477 (2011). https://doi.org/10.1007/s11042-010-0636-6

Issue Date: January 2011
