Building descriptive and discriminative visual codebook for large-scale image applications

Multimedia Tools and Applications

Abstract

Inspired by the success of textual words in large-scale textual information processing, researchers are trying to extract visual words from images that function similarly to textual words. Visual words are commonly generated by clustering a large number of local image features and taking the cluster centers as visual words. This approach is simple and scalable, but results in noisy visual words. Many works have been reported that try to improve the descriptive and discriminative ability of visual words. This paper gives a comprehensive survey of visual vocabulary and details several state-of-the-art algorithms. We first present a comprehensive review and summary of related work on visual vocabulary. We then introduce our recent algorithms for descriptive and discriminative visual word generation, i.e., latent visual context analysis for descriptive visual word identification [74], descriptive visual words and visual phrases generation [68], a contextual visual vocabulary that combines semantic and spatial contexts [69], and visual vocabulary hierarchy optimization [18]. Additionally, we introduce two post-processing strategies that further improve the performance of visual vocabulary: spatial coding [73], which efficiently removes mismatched visual words between images for more reasonable image similarity computation, and user-preference-based visual word weighting [44], which makes image similarity computed from visual words more consistent with users' preferences or habits.
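
The baseline described in the abstract, clustering local features and taking the cluster centers as visual words, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes plain k-means as the clustering step, uses random vectors as a stand-in for real local descriptors such as SIFT, and the function names `build_codebook`, `quantize`, and `bow_histogram` are hypothetical.

```python
import numpy as np

def quantize(features, centers):
    """Map each descriptor to the index of its nearest visual word."""
    # Squared Euclidean distance between every descriptor and every center.
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_codebook(features, k, iters=20, seed=0):
    """Plain k-means (an assumed choice); the k centers act as visual words."""
    rng = np.random.default_rng(seed)
    # Initialise the centers from k distinct descriptors.
    picks = rng.choice(len(features), size=k, replace=False)
    centers = features[picks].astype(float)
    for _ in range(iters):
        labels = quantize(features, centers)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(features, centers):
    """L1-normalised bag-of-visual-words histogram for one image."""
    counts = np.bincount(quantize(features, centers), minlength=len(centers))
    return counts / max(counts.sum(), 1)

# Synthetic stand-in for local descriptors (hypothetical data, 8-D for brevity).
rng = np.random.default_rng(42)
training = rng.normal(size=(500, 8))
codebook = build_codebook(training, k=16)

image_descriptors = rng.normal(size=(40, 8))
hist = bow_histogram(image_descriptors, codebook)
```

Two images can then be compared through their histograms. Because each descriptor is hard-assigned to its nearest center regardless of how far away it is, unrelated features may share a word, which is one source of the noise the abstract mentions.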

References

  1. Agarwal S, Roth D (2002) Learning a sparse representation for object detection. ECCV

  2. Battiato S, Farinella G, Gallo G, Ravi D (2009) Spatial hierarchy of textons distribution for scene classification. Proc. Eurocom Multimedia Modeling, pp 333–342

  3. Blei D, Ng A, Jordan M (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  4. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. WWW

  5. Chum O, Philbin J, Sivic J, Isard M, Zisserman A (2007) Total recall: automatic query expansion with a generative feature model for object retrieval. ICCV

  6. Deerwester S, Dumais S, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407

  7. Duygulu P, Barnard K, Freitas J, Forsyth D (2002) Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. ECCV

  8. Fischler M, Bolles R (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm ACM 24:381–395

  9. van Gemert J, Veenman C, Smeulders A, Geusebroek J (2010) Visual word ambiguity. PAMI 32(7):1271–1283

  10. Grauman K, Darrell T (2007) Approximate correspondences in high dimensions. NIPS

  11. Globerson A, Roweis S (2006) Metric learning by collapsing classes. Adv Neural Inf Process Syst 18:451–458

  12. Hofmann T (1999) Probabilistic latent semantic indexing. ACM SIGIR

  13. Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. ML 41:177–196

  14. Indyk P, Thaper N (1998) Fast image retrieval via embeddings. Symposium on Theory of Computing

  15. Jegou H, Harzallah H, Schmid C (2007) A contextual dissimilarity measure for accurate and efficient image search. CVPR

  16. Jegou H, Douze M, Schmid C (2008) Hamming embedding and weak geometric consistency for large scale image search. ECCV

  17. Ji R, Xie X, Yao H, Wu Y, Ma W (2008) Incremental indexing of visual vocabulary for scalable retrieval. ICME

  18. Ji R, Xie X, Yao H, Ma W (2009) Vocabulary hierarchy optimization for effective and transferable retrieval. CVPR

  19. Ji R, Yao H, Sun X, Zhong B, Gao W (2010) Towards semantic embedding in visual vocabulary. CVPR

  20. Jing Y, Baluja S (2008) VisualRank: applying PageRank to large-scale image search. PAMI 30:1877–1890

  21. Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. ICCV, pp 604–610

  22. Kim G, Faloutsos C, Hebert M (2008) Unsupervised modeling of object categories using link analysis techniques. CVPR

  23. Kim G, Faloutsos C, Hebert M (2008) Unsupervised modeling and recognition of object categories with combination of visual contents and geometric similarity links. ACM MIR

  24. Kohonen T (1986) Learning vector quantization for pattern recognition. Tech. Rep. TKK-F-A601, Helsinki Institute of Technology

  25. Kohonen T (2000) Self-organizing maps, 3rd edition, Springer-Verlag

  26. Lazebnik S, Raginsky M (2009) Supervised learning of quantizer codebook by information loss minimization. PAMI 31(7):1294–1309

  27. Leibe B, Leonardis A, Schiele B (2004) Combined object categorization and segmentation with an implicit shape model. ECCV

  28. Leordeanu M, Hebert M (2005) A spectral technique for correspondence problems using pairwise constraints. ICCV

  29. Leung T, Malik J (2001) Representing and recognizing the visual appearance of materials using 3-d textons. IJCV

  30. Li F, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. CVPR

  31. Li T, Mei T, Kweon I, Hua X (2010) Contextual bag-of-words for visual categorization. IEEE Transactions on Circuits and Systems for Video Technology

  32. Liu D, Hua G, Viola P, Chen T (2008) Integrated feature selection and higher-order spatial feature extraction for object categorization. CVPR, pp 1–8

  33. Liu C, Yuen J, Torralba A (2009) Dense scene alignment using SIFT flow for object recognition. CVPR

  34. Liu J, Yang Y, Shah M (2009) Learning semantic visual vocabularies using diffusion distance. CVPR

  35. Liu D, Hua X, Yang L, Wang M, Zhang H (2009) Tag ranking. WWW

  36. Lowe D (2004) Distinctive image features from scale-invariant keypoints. IJCV 60(2):91–110

  37. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. 5th Berkeley Symposium on Mathematical Statistics and Probability, pp 281–297

  38. Marszalek M, Schmid C (2006) Spatial weighting for bag-of-features. CVPR, pp 2118–2125

  39. Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A (2008) Supervised dictionary learning. NIPS

  40. Marszalek M, Schmid C (2007) Semantic hierarchies for visual object recognition. CVPR

  41. Matas J, Chum O, Urban M, Pajdla T (2002) Robust wide baseline stereo from maximally stable extremal regions. BMVC

  42. Moosmann F, Triggs B, Jurie F (2006) Fast discriminative visual codebooks using randomized clustering forests. NIPS

  43. Moosmann F, Nowak E, Jurie F (2008) Randomized clustering forests for image classification. PAMI 30(9):1632–1646

  44. Ni B, Tian Q, Yang L, Yan S (2010) Query-log aware content based image retrieval. To be submitted

  45. Nister D, Stewenius H (2006) Scalable recognition with a vocabulary tree. CVPR, pp 2161–2168

  46. Perronnin F (2008) Universal and adapted vocabularies for generic visual categorization. PAMI 30(7):1243–1256

  47. Perronnin F, Dance C, Csurka G, Bressan M (2006) Adapted vocabularies for generic visual categorization. ECCV

  48. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. CVPR

  49. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: improving particular object retrieval in large scale image databases. CVPR

  50. Rao A, Miller D, Rose K, Gersho A (1996) A generalized VQ method for combined compression and estimation. ICASSP

  51. Salton G, McGill MJ (1983) Introduction to modern information retrieval. McGraw-Hill, New York

  52. Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manage 24(5):513–523

  53. Savarese S, Winn J, Criminisi A (2006) Discriminative object class models of appearance and shape by correlatons. CVPR, pp 2033–2040

  54. Schindler G, Brown M (2007) City-scale location recognition. CVPR

  55. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. ICCV, pp 1470–1477

  56. Viola P, Jones M (2001) Robust real-time face detection. ICCV, pp 7–14

  57. Wang L (2007) Toward a discriminative codebook: codeword selection across multi-resolution. CVPR

  58. Wang F, Jiang Y, Ngo C (2008) Video event detection using motion relativity and visual relatedness. ACM Multimedia, pp 239–248

  59. Wang S, Huang Q, Jiang S, Qin L, Tian Q (2009) Visual context rank for web image re-ranking. ACM workshop on LSMRM

  60. Wu Z, Ke Q, Sun J (2009) Bundling features for large-scale partial-duplicate web image search. CVPR

  61. Wu L, Hoi S, Yu N (2009) Semantic-preserving bag-of-words models for efficient image annotation. ACM workshop on LSMRM, pp 19–26

  62. Xu D, Chang S (2008) Video event recognition using kernel methods with multilevel temporal alignment. PAMI 30(11):1985–1997

  63. Yang J (2007) Evaluating bag-of-visual-words representations in scene classification. ACM Multimedia

  64. Yang L, Meer P, Foran D (2007) Multiple class segmentation using a unified framework over mean-shift patches. CVPR, pp 1–8

  65. Yates R, Neto B (1999) Modern information retrieval, Addison Wesley Longman Publishing Co. Inc

  66. Yuan J, Wu Y, Yang M (2007) Discovery of collocation patterns: from visual words to visual phrases. CVPR, pp 1–8

  67. Zhang J, Marszalek M, Lazebnik S, Schmid C (2007) Local features and kernels for classification of texture and object categories: A comprehensive review. IJCV

  68. Zhang S, Tian Q, Hua G, Huang Q, Li S (2009) Descriptive visual words and visual phrases for image applications. ACM Multimedia

  69. Zhang S, Huang Q, Hua G, Jiang S, Gao W, Tian Q (2010) Building contextual visual vocabulary for large-scale image applications. ACM Multimedia

  70. Zhang S, Huang Q, Lu Y, Gao W, Tian Q (2010) Building pair-wise visual word tree for efficient image re-ranking. ICASSP

  71. Zheng Y, Zhao M, Neo S, Chua T, Tian Q (2008) Visual synset: a higher-level visual representation. CVPR, pp 1–8

  72. Zhou W, Li H, Lu Y, Tian Q (2010) Large scale partial-duplicate image retrieval with bi-space quantization and geometric consistency. ICASSP

  73. Zhou W, Lu Y, Song Y, Li H, Tian Q (2010) Spatial coding for large-scale partial-duplicate web image search. ACM Multimedia

  74. Zhou W, Tian Q, Yang L, Li H (2010) Latent visual context analysis for image re-ranking. ACM International Conference on Image and Video Retrieval (CIVR), Xi’an, China

Acknowledgement

This work is supported in part by NSF IIS 1052851 and by Akiira Media Systems, Inc. The work of Nicu Sebe has been supported by the FP7 IP GLOCAL European project and by the FIRB S-PATTERN project.

Corresponding author

Correspondence to Qi Tian.

Cite this article

Tian, Q., Zhang, S., Zhou, W. et al. Building descriptive and discriminative visual codebook for large-scale image applications. Multimed Tools Appl 51, 441–477 (2011). https://doi.org/10.1007/s11042-010-0636-6
