Abstract
Inspired by the success of textual words in large-scale textual information processing, researchers are trying to extract visual words from images that function similarly to textual words. Visual words are commonly generated by clustering a large number of local image features and taking the cluster centers as visual words. This approach is simple and scalable, but it results in noisy visual words. Many works have tried to improve the descriptive and discriminative ability of visual words. This paper gives a comprehensive survey of visual vocabularies and details several state-of-the-art algorithms. We first present a comprehensive review and summary of related work on visual vocabularies. Then, we introduce our recent algorithms for descriptive and discriminative visual word generation, i.e., latent visual context analysis for descriptive visual word identification [74], descriptive visual word and visual phrase generation [68], a contextual visual vocabulary that combines both semantic contexts and spatial contexts [69], and visual vocabulary hierarchy optimization [18]. Additionally, we introduce two post-processing strategies that further improve the performance of a visual vocabulary: spatial coding [73], which efficiently removes mismatched visual words between images for more reasonable image similarity computation, and user-preference-based visual word weighting [44], which makes the image similarity computed from visual words more consistent with users’ preferences or habits.
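The baseline pipeline mentioned in the abstract (cluster local features, take the cluster centers as visual words) can be illustrated with a minimal sketch. It uses plain k-means on synthetic Gaussian vectors that stand in for real local descriptors; the function names, the vocabulary size of 16, and the toy data are assumptions made for illustration, and the code does not reproduce any of the surveyed algorithms.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(descriptor, vocabulary):
    """Return the index of the visual word nearest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda j: sq_dist(descriptor, vocabulary[j]))

def build_vocabulary(descriptors, k, iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm); the k centers act as visual words."""
    rng = random.Random(seed)
    # Initialise the centers with k distinct descriptors (assumed scheme).
    centers = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[quantize(d, centers)].append(d)
        for j, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def bow_histogram(descriptors, vocabulary):
    """L1-normalised bag-of-visual-words histogram for one image."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[quantize(d, vocabulary)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

# Toy data: 300 random 8-D vectors stand in for local features from many images.
rng = random.Random(1)
features = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(300)]
vocab = build_vocabulary(features, k=16)
# Pretend the first 40 descriptors all come from a single image.
hist = bow_histogram(features[:40], vocab)
```

Because each descriptor is assigned to a single nearest center regardless of image content or spatial layout, quantization of this kind is one source of the noisy visual words that the surveyed methods try to address.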
References
Agarwal S, Roth D (2002) Learning a sparse representation for object detection. ECCV
Battiato S, Farinella G, Gallo G, Ravi D (2009) Spatial hierarchy of textons distribution for scene classification. Proc. Eurocom Multimedia Modeling, pp 333–342
Blei D, Ng A, Jordan M (2003) Latent Dirichlet allocation. J MLR 3:993–1022
Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. WWW
Chum O, Philbin J, Sivic J, Isard M, Zisserman A (2007) Total recall: automatic query expansion with a generative feature model for object retrieval. ICCV
Deerwester S, Dumais S, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Duygulu P, Barnard K, Freitas J, Forsyth D (2002) Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. ECCV
Fischler M, Bolles R (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm ACM 24:381–395
Gemert V, Veenman C, Smeulders A, Geusebroek J (2010) Visual word ambiguity. T-PAMI 32(7):1271–1283
Grauman K, Darrell T (2007) Approximate correspondences in high dimensions. NIPS
Globerson A, Roweis S (2006) Metric learning by collapsing classes. Adv In Neu Info Proce Sys 18:451–458
Hofmann T (1999) Probabilistic latent semantic indexing. ACM SIGIR
Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. ML 41:177–196
Indyk P, Thaper N (1998) Fast image retrieval via embeddings. Symposium on Theory of Computing
Jegou H, Harzallah H, Schmid C (2007) A contextual dissimilarity measure for accurate and efficient image search. CVPR
Jegou H, Douze M, Schmid C (2008) Hamming embedding and weak geometric consistency for large scale image search. ECCV
Ji R, Xie X, Yao H, Wu Y, Ma W (2008) Incremental indexing of visual vocabulary for scalable retrieval. ICME
Ji R, Xie X, Yao H, Ma W (2009) Vocabulary hierarchy optimization for effective and transferable retrieval. CVPR
Ji R, Yao H, Sun X, Zhong B, Gao W (2010) Towards semantic embedding in visual vocabulary. CVPR
Jing Y, Baluja S (2008) VisualRank: applying pagerank to large-scale image search. IEEE Trans on PAMI 30:1877–1890
Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. IJCV, pp 604–610
Kim G, Faloutsos C, Hebert M (2008) Unsupervised modeling of object categories using link analysis techniques. CVPR
Kim G, Faloutsos C, Hebert M (2008) Unsupervised modeling and recognition of object categories with combination of visual contents and geometric similarity links. ACM MIR
Kohonen T (1986) Learning vector quantization for pattern recognition. Tech. Rep. TKK-F-A601, Helsinki Institute of Technology
Kohonen T (2000) Self-organizing maps, 3rd edition, Springer-Verlag
Lazebnik S, Raginsky M (2009) Supervised learning of quantizer codebook by information loss minimization. PAMI 31(7):1294–1309
Leibe B, Leonardis A, Schiele B (2004) Combined object categorization and segmentation with an implicit shape model. ECCV
Leordeanu M, Hebert M (2005) A spectral technique for correspondence problems using pairwise constraints. ICCV
Leung T, Malik J (2001) Representing and recognizing the visual appearance of materials using 3-d textons. IJCV
Li F, Pietro P (2007) A bayesian hierarchical model for learning natural scene categories. ICCV
Li T, Mei T, Kweon I, Hua X (2010) Contextual bag-of-words for visual categorization. IEEE Transactions on Circuits and Systems for Video Technology
Liu D, Hua G, Viola P, Chen T (2008) Integrated feature selection and higher-order spatial feature extraction for object categorization. CVPR, pp 1–8
Liu C, Yuen J, Torralba A (2009) Dense scene alignment using SIFT flow for object recognition. CVPR
Liu J, Yang Y, Shah M (2009) Learning semantic visual vocabularies using diffusion distance. CVPR
Liu D, Hua X, Yang L, Wang M, Zhang H (2009) Tag ranking. WWW
Lowe D (2004) Distinctive image features from scale-invariant keypoints. IJCV 60(2):91–110
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. 5th Berkeley Symposium on Mathematical Statistics and Probability, pp 281–297
Marszalek M, Schmid C (2006) Spatial weighting for bag-of-features. CVPR, pp 2118–2125
Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A (2008) Supervised dictionary learning. NIPS
Marszalek M, Schmid C (2007) Semantic hierarchies for visual object recognition. CVPR
Matas J, Chum O, Urban M, Pajdla T (2002) Robust wide baseline stereo from maximally stable extremal regions. BMVC
Moosmann F, Triggs B, Jurie F (2006) Fast discriminative visual codebooks using randomized clustering forests. NIPS
Moosmann F, Nowak E, Jurie F (2008) Randomized clustering forests for image classification. PAMI 30(9):1632–1646
Ni B, Tian Q, Yang L, Yan S (2010) Query-log aware content based image retrieval. To be submitted
Nister D, Stewenius H (2006) Scalable recognition with a vocabulary tree. CVPR, pp 2161–2168
Perronnin F (2008) Universal and adapted vocabularies for generic visual categorization. PAMI 30(7):1243–1256
Perronnin F, Dance C, Csurka G, Bressan M (2006) Adapted vocabularies for generic visual categorization. ECCV
Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. CVPR
Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: improving particular object retrieval in large scale image databases. CVPR
Rao A, Miller D, Rose K, Gersho A (1996) A generalized VQ method for combined compression and estimation. ICASSP
Salton G, McGill MJ (1983) Introduction to modern information retrieval. McGraw-Hill, New York
Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manage 24(5):513–523
Savarese S, Winn J, Criminisi A (2006) Discriminative object class models of appearance and shape by correlatons. CVPR, pp 2033–2040
Schindler G, Brown M (2007) City-scale location recognition. CVPR
Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. ICCV, pp 1470–1477
Viola P, Jones M (2001) Robust real-time face detection. ICCV, pp 7–14
Wang L (2007) Toward a discriminative codebook: codeword selection across multi-resolution. CVPR
Wang F, Jiang Y, Ngo C (2008) Video event detection using motion relativity and visual relatedness. ACM Multimedia, pp 239–248
Wang S, Huang Q, Jiang S, Qin L, Tian Q (2009) Visual context rank for web image re-ranking. ACM workshop on LSMRM
Wu Z, Ke Q, Sun J (2009) Bundling features for large-scale partial-duplicate web image search. CVPR
Wu L, Hoi S, Yu N (2009) Semantic-preserving bag-of-words models for efficient image annotation. ACM workshop on LSMRM, pp 19–26
Xu D, Chang S (2008) Video event recognition using kernel methods with multilevel temporal alignment. PAMI 30(11):1985–1997
Yang J (2007) Evaluating bag-of-visual-words representations in scene classification. ACM Multimedia
Yang L, Meer P, Foran D (2007) Multiple class segmentation using a unified framework over mean-shift patches. CVPR, pp 1–8
Yates R, Neto B (1999) Modern information retrieval, Addison Wesley Longman Publishing Co. Inc
Yuan J, Wu Y, Yang M (2007) Discovery of collocation patterns: from visual words to visual phrases. CVPR, pp 1–8
Zhang J, Marszalek M, Lazebnik S, Schmid C (2007) Local features and kernels for classification of texture and object categories: A comprehensive review. IJCV
Zhang S, Tian Q, Hua G, Huang Q, Li S (2009) Descriptive visual words and visual phrases for image applications. ACM Multimedia
Zhang S, Huang Q, Hua G, Jiang S, Gao W, Tian Q (2010) Building contextual visual vocabulary for large-scale image applications. ACM Multimedia
Zhang S, Huang Q, Lu Y, Gao W, Tian Q (2010) Building pair-wise visual word tree for efficient image re-ranking. ICASSP
Zheng Y, Zhao M, Neo S, Chua T, Tian Q (2008) Visual synset: a higher-level visual representation. CVPR, pp 1–8
Zhou W, Li H, Lu Y, Tian Q (2010) Large scale partial-duplicate image retrieval with bi-space quantization and geometric consistency. ICASSP
Zhou W, Lu Y, Song Y, Li H, Tian Q (2010) Spatial coding for large-scale partial-duplicate web image search. ACM Multimedia
Zhou W, Tian Q, Yang L, Li H (2010) Latent visual context analysis for image re-ranking. ACM International Conference on Image and Video Retrieval (CIVR), Xi’an, China
Acknowledgement
This work is supported in part by NSF IIS 1052851 and by Akiira Media Systems, Inc. The work of Nicu Sebe has been supported by the FP7 IP GLOCAL European project and by the FIRB S-PATTERN project.
Cite this article
Tian, Q., Zhang, S., Zhou, W. et al. Building descriptive and discriminative visual codebook for large-scale image applications. Multimed Tools Appl 51, 441–477 (2011). https://doi.org/10.1007/s11042-010-0636-6
DOI: https://doi.org/10.1007/s11042-010-0636-6