Encyclopedia of Database Systems

Living Edition
| Editors: Ling Liu, M. Tamer Özsu

Annotation-Based Image Retrieval

  • Xin-Jing Wang
  • Lei Zhang
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4899-7993-3_17-2



Given (i) a textual query and (ii) a set of images and their annotations (phrases or keywords), annotation-based image retrieval systems retrieve images according to the matching score of the query and the corresponding annotations. There are three levels of queries according to Eakins [18]:
  • Level 1: Retrieval by primitive features such as color, texture, shape, or the spatial location of image elements, typically querying by an example, i.e., “find pictures like this.”

  • Level 2: Retrieval by derived features, with some degree of logical inference. For example, “find a picture of a flower.”

  • Level 3: Retrieval by abstract attributes, involving a significant amount of high-level reasoning about the purpose of the objects or scenes depicted. This includes retrieval of named events, of pictures with emotional or religious significance, etc., e.g., “find pictures of a joyful crowd.”

Together, levels 2 and 3 are referred to as semantic image retrieval, which can also be regarded as annotation-based image retrieval. A comprehensive review and analysis of image search over the past 20 years is given by Zhang and Rui [2], covering the system framework, feature extraction and image representation, indexing, and the potential of big data.

Historical Background

There are two frameworks of image retrieval [3]: annotation based (or, more popularly, text based) and content based. The annotation-based approach can be traced back to the 1970s. In such systems, images are manually annotated with text descriptors, which are then used by a database management system (DBMS) to perform image retrieval. This approach has two disadvantages. The first is that a considerable amount of human labor is required for manual annotation. The second is that, because of the subjectivity of human perception, the manually assigned annotations may not converge. To overcome these disadvantages, content-based image retrieval (CBIR) was introduced in the early 1980s. In CBIR, images are indexed by their visual content, such as color, texture, and shape. In the past decade, several commercial products and experimental prototype systems were developed, such as QBIC, Photobook, Virage, VisualSEEK, Netra, and SIMPLIcity. Comprehensive surveys of CBIR can be found in [1, 4].

However, the discrepancy between the limited descriptive power of low-level image features and the richness of user semantics, referred to as the “semantic gap,” limits the performance of CBIR. On the other hand, due to the explosive growth of visual data (both online and offline) and the phenomenal success of Web search, expectations for image search technologies have been rising. For these reasons, the main challenge of image retrieval is understanding media by bridging the semantic gap between the bit stream and the human interpretation of visual content [5]. Hence, the focus is on automatic image annotation techniques.


The state-of-the-art image auto-annotation techniques fall into four main categories [3, 5]: (i) using machine learning tools to map low-level features to concepts, (ii) exploring the relations between image content and the textual terms in the associated metadata, (iii) generating semantic templates (STs) to support high-level image retrieval, and (iv) using both the visual content of images and textual information obtained from the Web to learn the annotations.

Machine Learning Approaches

A typical approach uses a support vector machine (SVM) as a discriminative classifier over low-level image features. Though straightforward, it has proven effective in detecting a number of visual concepts.
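As a rough illustration of this idea, the sketch below trains a linear SVM by Pegasos-style subgradient descent to separate two toy concepts from two-dimensional color features. The features, concepts, and hyperparameters are invented for illustration and do not come from any system described in this entry.

```python
# Minimal sketch of SVM-based concept detection, trained with
# Pegasos-style subgradient descent on the hinge loss. The two features
# (fractions of red and green pixels) and the "sunset" vs. "forest"
# concepts are toy assumptions.

def train_linear_svm(data, lam=0.01, epochs=200):
    """data: list of (feature_vector, label) pairs with label in {+1, -1}."""
    w = [0.0] * len(data[0][0])
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)                       # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [(1 - eta * lam) * wi for wi in w]      # regularization shrink
            if margin < 1:                              # hinge-loss subgradient
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Toy training set: +1 = "sunset" (red-dominated), -1 = "forest" (green).
train = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.7, 0.1], 1),
         ([0.1, 0.9], -1), ([0.2, 0.8], -1), ([0.1, 0.7], -1)]
w = train_linear_svm(train)
```

In a real system, one such binary classifier would be trained per concept over far richer features (color, texture, shape descriptors), and an image would be annotated with every concept whose classifier fires.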

Recently there has been a surge of interest in leveraging relational data, e.g., images and their surrounding texts. Blei and Jordan [6] extended the latent Dirichlet allocation (LDA) model to a mix of words and images and proposed a correlation LDA model. This model assumes a hidden layer of topics, a set of latent factors drawn from a Dirichlet distribution; words and regions are conditionally independent given the topics, i.e., both are generated by them. This work used 7,000 Corel photos and a vocabulary of 168 words for annotation.
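The core assumption, that words and regions interact only through the latent topic layer, can be illustrated with hand-set toy probability tables. The topics, regions, words, and numbers below are all invented; this is not the correlation LDA inference procedure, only a demonstration of the conditional-independence structure it relies on.

```python
# Toy illustration of the correlation-LDA assumption: given a latent
# topic z, region features r and words w are conditionally independent,
# so P(w | r) can be computed by marginalizing over topics:
#   P(w | r) = sum_z P(w | z) P(z | r)
# All topics, regions, words, and probabilities are invented.

topic_prior = {"nature": 0.5, "urban": 0.5}
p_region_given_topic = {
    "nature": {"green_patch": 0.7, "blue_patch": 0.3},
    "urban":  {"gray_patch": 0.8, "blue_patch": 0.2},
}
p_word_given_topic = {
    "nature": {"tree": 0.6, "sky": 0.4},
    "urban":  {"building": 0.7, "sky": 0.3},
}

def annotate(region):
    # Posterior over topics given the observed region: P(z | r) ∝ P(z) P(r | z)
    post = {z: topic_prior[z] * p_region_given_topic[z].get(region, 0.0)
            for z in topic_prior}
    norm = sum(post.values())
    post = {z: p / norm for z, p in post.items()}
    # Marginalize the topic out to score each candidate annotation word.
    words = {}
    for z, pz in post.items():
        for w, pw in p_word_given_topic[z].items():
            words[w] = words.get(w, 0.0) + pw * pz
    return max(words, key=words.get)
```

In the actual model the topic proportions are sampled per image from a Dirichlet distribution and the conditional tables are learned from data, rather than fixed by hand as here.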

Relation Exploring Approaches

Another notable direction for annotating image content is to exploit the relations between image content and the textual terms in the associated metadata. Such metadata are abundant but often incomplete and noisy. By exploiting the co-occurrence relations among images and words, the initial labels may be filtered and propagated from the initially labeled images to additional relevant ones in the same collection [5].

Jeon et al. [7] proposed a cross-media relevance model that learns the joint probability distribution of the words and the visual tokens in each image, which is then used to estimate the likelihood of a specific semantic concept appearing in a new image.
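A toy sketch in the spirit of this approach is shown below: each training image pairs discrete visual tokens ("blobs") with words, and a new image's blobs are scored against every training image to rank candidate annotation words. The training pairs and the smoothing weight are invented, and the scoring is a simplified stand-in for the relevance-model estimation in [7].

```python
# Toy cross-media scoring: rank annotation words for a new image by
# summing, over training images, the probability of the query blobs
# times the probability of the word in that image. Jelinek-Mercer
# smoothing with collection frequencies avoids zero probabilities.
# All data and the smoothing weight are invented.
from collections import Counter

train = [
    (["grass", "sky"],   ["field", "sky"]),
    (["water", "sky"],   ["sea", "sky"]),
    (["grass", "tiger"], ["tiger", "grass"]),
]

blob_coll = Counter(b for blobs, _ in train for b in blobs)
word_coll = Counter(w for _, words in train for w in words)
n_blobs = sum(blob_coll.values())
n_words = sum(word_coll.values())

def smoothed(item, items, coll, total, alpha=0.5):
    # Mix the per-image maximum-likelihood estimate with the
    # collection-wide frequency.
    return alpha * items.count(item) / len(items) + (1 - alpha) * coll[item] / total

def score_words(query_blobs):
    scores = {w: 0.0 for w in word_coll}
    for blobs, words in train:              # uniform prior over training images
        p_blobs = 1.0
        for b in query_blobs:
            p_blobs *= smoothed(b, blobs, blob_coll, n_blobs)
        for w in scores:
            scores[w] += p_blobs * smoothed(w, words, word_coll, n_words)
    return scores

s = score_words(["grass", "tiger"])
```

Here an unseen image containing grass-like and tiger-like tokens scores "tiger" and "grass" above unrelated words such as "sea", because the training image that shares its blobs also carries those words.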

Semantic Template Approaches

A semantic template (ST) is a map between a high-level concept and low-level visual features. Though not yet widely used in the abovementioned techniques, it is a promising approach to annotation-based image retrieval.

Chang et al. [8] give a typical example of an ST, in which a visual template is a set of icons or example scenes/objects denoting a personalized view of concepts such as meetings or sunsets. The generation of an ST is based on a user definition: for a concept, the objects, their spatial and temporal constraints, and the weight of each feature of each object are specified. This initial query scenario is provided to the system, and then, through interaction with users, the system converges to a small set of exemplar queries that “best” match (maximize the recall of) the concept in the user’s mind.

In contrast, Zhuang et al. [9] generate ST automatically in the process of relevance feedback, whose basic idea is to refine retrieval outputs based on interactions with the user. A semantic lexicon called WordNet is used in this system to construct a network of ST. During the retrieval process, once the user submits a query concept (keyword), the system can find a corresponding ST and thus target similar images.
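In its simplest form, retrieval via semantic templates reduces to mapping a concept keyword to a representative low-level feature vector and ranking images by their distance to it. The sketch below shows that minimal form; the templates, images, and 3-bin "color descriptor" values are all hypothetical.

```python
# Minimal sketch of retrieval via semantic templates: each template maps
# a concept keyword to a representative low-level feature vector (here a
# toy 3-bin color descriptor); images are ranked by Euclidean distance
# to the queried template. All templates, images, and values are invented.
import math

templates = {
    "sunset": [0.8, 0.2, 0.1],
    "forest": [0.1, 0.8, 0.2],
}
images = {
    "img1": [0.75, 0.25, 0.10],
    "img2": [0.15, 0.70, 0.30],
    "img3": [0.50, 0.50, 0.50],
}

def retrieve(concept):
    t = templates[concept]
    # Rank images from most to least similar to the concept's template.
    return sorted(images, key=lambda name: math.dist(t, images[name]))

ranking = retrieve("sunset")
```

The systems described above enrich exactly this skeleton: [8] builds the templates interactively from user-specified objects and constraints, while [9] derives them automatically from relevance feedback and links them through WordNet.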

Large-Scale Web Data-Supported Approaches

Good scalability to a large set of concepts is required to make image annotation practical. Meanwhile, images from Web repositories, e.g., Web search engines or photo-sharing sites, come with free but less reliable labels. In [10], a novel search-based annotation framework was proposed to exploit such Web resources. Fundamentally, the idea is to automatically expand the text labels of an image of interest using its initial keyword and its visual content.

The process of [10] is shown in Fig. 1. It contains three stages, the text-based search stage, the content-based search stage, and the annotation learning stage, which are differentiated by color (black, brown, blue) and label (A, B, C). When a user submits a query image together with a query keyword, the system first uses the keyword to search a large-scale Web image database (2.4 million images crawled from several Web photo forums), in which images are associated with meaningful but noisy descriptions, as tagged by “A” in Fig. 1. The intention of this step is to select a semantically relevant image subset from the original pool. Visual feature-based search is then applied to further filter the subset, keeping only the visually similar images (the path labeled “B” in Fig. 1). By these means, a group of image search results that are both semantically and visually similar to the query image is obtained. To speed up the visual feature-based search, a hash encoding algorithm maps the visual features into hash codes, so that the inverted indexing technique from text retrieval can be applied for fast retrieval.

Finally, based on the search results, the system collects the associated textual descriptions and applies the search result clustering (SRC) algorithm to group the images into clusters. The reasons for using the SRC algorithm are that (i) it has proven significantly effective in grouping documents semantically, and (ii) more attractively, it is capable of learning a name for each cluster that best represents the common topics of the cluster’s member documents. By ranking these clusters according to a ranking function and applying a threshold, the system selects a group of clusters and merges their names into the final learnt annotations for the query image, which ends the entire process (“C” in Fig. 1).
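The three stages can be walked through end to end on toy data. In the sketch below, a per-dimension threshold hash stands in for the hash encoding of [10], and plain word-frequency ranking stands in for SRC clustering; the database, descriptions, and features are all invented.

```python
# Toy walk-through of the search-based annotation pipeline:
# (A) text-based filtering on the query keyword,
# (B) hash-based visual filtering of the candidates, and
# (C) learning annotations from the surviving descriptions.
# A simple threshold hash replaces the paper's hash encoding, and
# word-frequency ranking replaces SRC clustering; all data are invented.
from collections import Counter

db = [
    ("beach sand sunny",  [0.85, 0.15, 0.75]),
    ("beach sunny ocean", [0.80, 0.20, 0.90]),
    ("beach crowd",       [0.10, 0.90, 0.20]),
    ("mountain snow",     [0.90, 0.10, 0.80]),
]

def hash_code(feat):
    # Threshold each dimension at 0.5 to obtain a short binary code.
    return tuple(int(x > 0.5) for x in feat)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def annotate(keyword, query_feat, max_dist=1):
    # Stage A: keep only images whose description contains the keyword.
    candidates = [(d, f) for d, f in db if keyword in d.split()]
    # Stage B: keep only visually similar images via Hamming distance.
    q = hash_code(query_feat)
    survivors = [d for d, f in candidates if hamming(q, hash_code(f)) <= max_dist]
    # Stage C: rank the remaining description words as annotations.
    counts = Counter(w for d in survivors for w in d.split() if w != keyword)
    return [w for w, _ in counts.most_common()]

anns = annotate("beach", [0.9, 0.1, 0.8])
```

For the query keyword "beach" with a bright, red-dominated query feature, the crowd photo is rejected visually and the mountain photo textually, so the learnt annotations come from the two surviving beach descriptions.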
Fig. 1

Framework of the search-based annotation system

The technique has since been advanced in several aspects. First, [11] discussed the special case where the query is only an image, with no accompanying text. In this case no text-based search stage is available, which makes image representation, image indexing (to support real-time search), the image retrieval metric, and annotation detection (i.e., mining salient words or phrases from the surrounding texts of retrieved images) more challenging. Dai et al. [12] proposed a Bayesian model as a better annotation detection model than that in [10], and [13] presented novel image representation and indexing techniques. Second, annotating images that have duplicate versions on the Web has been shown to have great research and commercial potential [14]; such images are called duplicate images. The reasons are that (1) duplicate image detection is a well-defined research problem compared to general image similarity retrieval, since there is no clear semantic definition of image similarity; (2) effectively and efficiently discovering duplicate images (i.e., with both high retrieval precision and recall) from a Web-scale dataset is a challenging research problem [15]; and (3) duplicate images tend to belong to categories of interest to Web users, such as celebrities, famous locations, movie posters, and funny pictures, so annotating them has great commercial value for image search, e.g., to improve search relevance and user experience and to reduce the image index size [13].

Key Applications

Due to the explosive growth of visual data (both online and offline), effective annotation-based image search has become a key supporting technique for many multimedia applications.

Wang et al. [19] applied the annotation-based image search technique of [10] to understand user images for monetization. The image annotations help ad providers select related advertisements for targeted advertising.

El-Saban et al. [16] adopted annotation-based image search to automatically caption mobile-captured videos. Such a technique saves the human labor of annotating videos, and the auto-captions can be used directly to facilitate video search, which has great commercial value.

Zhang et al. [17] utilized annotation-based image search to collect celebrity face images, an example of how to automatically assemble a large-scale dataset for computer vision research.


Recommended Readings

  1. Long F, Zhang HJ, Feng DD. Fundamentals of content-based image retrieval. In: Feng D, editor. Multimedia information retrieval and management: technological fundamentals and applications. Berlin/Heidelberg: Springer; 2013. Part I, p. 1–26. ISSN: 1860-4862.
  2. Zhang L, Rui Y. Image search from thousands to billions in 20 years. ACM TOMCCAP, Special Issue on the 20th Anniversary of the ACM MM Conference; 2013; New York.
  3. Liu Y, Zhang D, Lu G, Ma W-Y. A survey of content-based image retrieval with high-level semantics. Pattern Recogn. 2007;40(1):262–82.
  4. Rui Y, Huang TS, Chang S-F. Image retrieval: current techniques, promising directions, and open issues. J Vis Commun Image Represent. 1999;10(4):39–62.
  5. Chang S-F, Ma W-Y, Smeulders A. Recent advances and challenges of semantic image/video search. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing; 2007; Honolulu, USA. p. 1205–8.
  6. Blei D, Jordan MI. Modeling annotated data. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2003; Toronto, Canada. p. 127–34.
  7. Jeon J, Lavrenko V, Manmatha R. Automatic image annotation and retrieval using cross-media relevance models. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2003; Toronto, Canada. p. 119–26.
  8. Chang S-F, Chen W, Sundaram H. Semantic visual templates: linking visual features to semantics. Proceedings of the International Conference on Image Processing, Vol. 3; 1998; Chicago, Illinois. p. 531–4.
  9. Zhuang Y, Liu X, Pan Y. Apply semantic template to support content-based image retrieval. Proceedings of the SPIE, Storage and Retrieval for Media Databases, Vol. 3972; 1999; Washington, USA. p. 442–9.
  10. Wang X-J, Zhang L, Jing F, Ma W-Y. AnnoSearch: image auto-annotation by search. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition; 2006; New York, USA. p. 1483–90.
  11. Wang X-J, Zhang L, Li X, Ma W-Y. Annotating images by mining image search results. IEEE Trans Pattern Anal Mach Intell. 2008;30(11):1919–32.
  12. Dai LC, Wang X-J, Zhang L, Yu NH. Efficient tag mining via mixture modeling for real-time search-based image annotation. IEEE International Conference on Multimedia and Expo; 2012; Los Alamitos, USA.
  13. Wang X-J, Zhang L, Ma W-Y. Duplicate-search-based image annotation using web-scale data. Proc IEEE. 2012;100(9):2705–21.
  14. Wang X-J, Zhang L, Liu M, Li Y, Ma W-Y. ARISTA – image search to annotation on billions of web photos. IEEE Conference on Computer Vision and Pattern Recognition; 2010; San Francisco, USA.
  15. Wang X-J, Zhang L, Liu C. Duplicate discovery on 2 billion internet images. IEEE Conference on Computer Vision and Pattern Recognition Workshops; 2013; Portland, USA.
  16. El-Saban M, Wang X-J, Hasan N, Bassiouny M, Refaat M. Seamless annotation and enrichment of mobile captured video streams in real-time. IEEE International Conference on Multimedia and Expo; 2011; Barcelona, Spain.
  17. Zhang X, Zhang L, Wang X-J, Shum H-Y. Finding celebrities in billions of webpages. IEEE Trans Multimedia. 2012;14(4):995–1007.
  18. Eakins J, Graham M. Content-based image retrieval. Technical report. Newcastle upon Tyne: University of Northumbria at Newcastle; 1999.
  19. Wang X-J, Yu M, Zhang L, Ma W-Y. Advertising based on users’ photos. IEEE International Conference on Multimedia and Expo; 2009; Piscataway, USA.

Copyright information

© Springer Science+Business Media LLC 2017

Authors and Affiliations

  1. Facebook, CA, USA
  2. Microsoft Research, WA, USA

Section editors and affiliations

  • Jeffrey Xu Yu
  1. The Chinese University of Hong Kong, Hong Kong, China