Annotation-Based Image Retrieval
Level 1: Retrieval by primitive features such as color, texture, shape, or the spatial location of image elements, typically querying by an example, i.e., “find pictures like this.”
Level 2: Retrieval by derived features, with some degree of logical inference. For example, “find a picture of a flower.”
Level 3: Retrieval by abstract attributes, involving a significant amount of high-level reasoning about the purpose of the objects or scenes depicted. This includes retrieval of named events, of pictures with emotional or religious significance, etc., e.g., “find pictures of a joyful crowd.”
Together, levels 2 and 3 are referred to as semantic image retrieval, which can also be regarded as annotation-based image retrieval. A comprehensive review and analysis of image search over the past 20 years was written by Zhang and Rui [2], detailing the system framework, feature extraction and image representation, indexing, and the potential of big data.
There are two frameworks of image retrieval: annotation based (or, more popularly, text based) and content based. The annotation-based approach can be traced back to the 1970s. In such systems, images are manually annotated with text descriptors, which a database management system (DBMS) then uses to perform image retrieval. This approach has two disadvantages. The first is that a considerable amount of human labor is required for manual annotation. The second is that, because of the subjectivity of human perception, annotations produced by different people for the same image may be inconsistent. To overcome these disadvantages, content-based image retrieval (CBIR) was introduced in the early 1980s. In CBIR, images are indexed by their visual content, such as color, texture, and shape. In the past decade, several commercial products and experimental prototype systems have been developed, such as QBIC, Photobook, Virage, VisualSEEK, Netra, and SIMPLIcity. Comprehensive surveys of CBIR can be found in [1, 4].
However, the discrepancy between the limited descriptive power of low-level image features and the richness of user semantics, referred to as the “semantic gap,” limits the performance of CBIR. On the other hand, owing to the explosive growth of visual data (both online and offline) and the phenomenal success of Web search, expectations for image search technologies have kept rising. For these reasons, the main challenge of image retrieval is understanding media by bridging the semantic gap between the bit stream and the visual content interpretation by humans. Hence, the focus is on automatic image annotation techniques.
The state-of-the-art image auto-annotation techniques fall into four main categories [3, 5]: (i) using machine learning tools to map low-level features to concepts, (ii) exploring the relations between image content and the textual terms in the associated metadata, (iii) generating semantic templates (STs) to support high-level image retrieval, and (iv) making use of both the visual content of images and textual information obtained from the Web to learn the annotations.
Machine Learning Approaches
A typical approach uses a support vector machine (SVM) as a discriminative classifier over low-level image features. Though straightforward, this approach has proven effective in detecting a number of visual concepts.
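The idea can be sketched as a per-concept binary classifier trained on low-level features. The following is a minimal illustration with scikit-learn, using synthetic 8-bin “color histograms” in place of real image features; the concept name and data are invented for demonstration only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic 8-bin "color histograms": positives (e.g., a "sunset" concept)
# skew toward the warm bins; negatives are drawn uniformly.
pos = rng.dirichlet(np.array([5, 4, 3, 1, 1, 1, 1, 1]), size=100)
neg = rng.dirichlet(np.ones(8), size=100)
X = np.vstack([pos, neg])
y = np.array([1] * 100 + [0] * 100)  # 1 = concept present

# One discriminative SVM per concept over the low-level features.
clf = SVC(kernel="rbf", probability=True).fit(X, y)

# Score a new image's histogram against the learned concept.
query = rng.dirichlet(np.array([5, 4, 3, 1, 1, 1, 1, 1]))
score = clf.predict_proba(query.reshape(1, -1))[0, 1]
```

In a real system, one such detector is trained per vocabulary concept, and an image is annotated with the concepts whose detectors score above a threshold.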
Recently there has been a surge of interest in leveraging and handling relational data, e.g., images and their surrounding texts. Blei et al. [6] extended the latent Dirichlet allocation (LDA) model to a mix of words and images and proposed a correlation LDA model. This model assumes a hidden layer of topics, a set of latent factors drawn from a Dirichlet distribution, given which words and image regions are conditionally independent, i.e., both are generated by the topics. The work used 7,000 Corel photos and a vocabulary of 168 words for annotation.
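The generative assumption can be made concrete with a toy forward simulation: a per-image topic mixture is drawn from a Dirichlet prior, and words and regions are then sampled independently given topics drawn from that mixture. All sizes and distributions below are toy values, not the model parameters of [6].

```python
import numpy as np

rng = np.random.default_rng(1)
K, V_words, V_regions = 3, 6, 4       # topics, word vocabulary, region codebook

alpha = np.ones(K)                     # Dirichlet prior over topic mixtures
phi_w = rng.dirichlet(np.ones(V_words), size=K)    # per-topic word distributions
phi_r = rng.dirichlet(np.ones(V_regions), size=K)  # per-topic region distributions

# Generate one annotated image: hidden topic mixture, then words and regions.
theta = rng.dirichlet(alpha)
words = [rng.choice(V_words, p=phi_w[rng.choice(K, p=theta)])
         for _ in range(5)]
regions = [rng.choice(V_regions, p=phi_r[rng.choice(K, p=theta)])
           for _ in range(5)]
```

Annotation then works in the reverse direction: given the observed regions of a new image, the posterior over topics is inferred and used to score candidate words.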
Relation Exploring Approaches
Another notable direction for annotating image visual content is exploring the relations between image content and the textual terms in the associated metadata. Such metadata are abundant but often incomplete and noisy. By exploring the co-occurrence relations among the images and the words, the initial labels may be filtered and propagated from initially labeled images to additional relevant ones in the same collection.
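A minimal sketch of such propagation: an unlabeled image inherits the label whose labeled neighbors are visually closest to it, weighted by cosine similarity. The two-dimensional features and tags below are invented stand-ins for real visual features and metadata terms.

```python
import numpy as np

feats = np.array([[1.0, 0.0],   # image 0
                  [0.9, 0.1],   # image 1 (unlabeled)
                  [0.0, 1.0]])  # image 2
labels = {0: {"beach"}, 2: {"forest"}}  # noisy initial labels

def propagate(i, feats, labels):
    # Cosine similarity of image i to every image in the collection.
    sims = feats @ feats[i] / (np.linalg.norm(feats, axis=1)
                               * np.linalg.norm(feats[i]))
    votes = {}
    for j in range(len(feats)):
        if j == i or j not in labels:
            continue
        for tag in labels[j]:
            votes[tag] = votes.get(tag, 0.0) + sims[j]
    # Return the label with the strongest similarity-weighted support.
    return max(votes, key=votes.get)

propagate(1, feats, labels)  # image 1 is visually close to image 0
```

Real systems iterate this filtering-and-propagation step over the whole collection rather than labeling one image at a time.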
Jeon et al. [7] proposed a cross-media relevance model to learn the joint probability distribution of the words and the visual tokens in each image, which is then used to estimate the likelihood of a specific semantic concept appearing in a new image.
Semantic Template Approaches
Though not yet widely used in the abovementioned techniques, the semantic template (ST), a map between high-level concepts and low-level visual features, is a promising approach to annotation-based image retrieval.
Chang and Chen [8] give a typical example of an ST, in which a visual template is a set of icons or example scenes/objects denoting a personalized view of concepts such as meetings or sunsets. The generation of an ST is based on user definition: for a concept, the user specifies the objects, their spatial and temporal constraints, and the weights of each feature of each object. This initial query scenario is provided to the system; then, through interaction with the user, the system converges to a small set of exemplar queries that “best” match (maximize the recall of) the concept in the user’s mind.
In contrast, Zhuang et al. [9] generate STs automatically in the process of relevance feedback, whose basic idea is to refine retrieval results through interaction with the user. The semantic lexicon WordNet is used in this system to construct a network of STs. During retrieval, once the user submits a query concept (keyword), the system can find the corresponding ST and thus target similar images.
Large-Scale Web Data-Supported Approaches
Good scalability to a large set of concepts is required to make image annotation practical. On the other hand, images from Web repositories, e.g., Web search engines or photo-sharing sites, come with free but less reliable labels. In [10], a novel search-based annotation framework was proposed to exploit such Web-based resources. Fundamentally, the idea is to automatically expand the text labels of an image of interest using its initial keyword and image content.
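The pipeline can be sketched in three steps: retrieve Web images by the initial keyword, re-rank them by visual similarity to the query image, and mine salient terms from the surrounding texts of the top results. The “Web index” below is a stand-in dictionary, not a real search API, and the similarity scores and texts are invented.

```python
from collections import Counter

# Stand-in for a keyword-indexed Web image collection:
# keyword -> [(visual similarity to the query image, surrounding text), ...]
web_index = {
    "paris": [
        (0.9, "eiffel tower paris night"),
        (0.8, "eiffel tower paris tourism"),
        (0.2, "paris hilton photo"),       # textually relevant, visually not
    ]
}

def annotate(keyword, top_k=2):
    # Step 1-2: text search, then re-rank hits by visual similarity.
    hits = sorted(web_index.get(keyword, []), reverse=True)[:top_k]
    # Step 3: mine frequent terms from the surrounding texts of the top hits.
    terms = Counter(t for _, text in hits for t in text.split() if t != keyword)
    return [t for t, _ in terms.most_common(2)]

annotate("paris")  # expands the initial keyword with mined labels
```

Note how the visual re-ranking step filters out the textually matching but visually unrelated hit before term mining, which is what makes the free Web labels usable despite their noise.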
The technique has since been advanced in two main aspects. Firstly, later work discussed the special case in which the query is only an image, rather than an image plus text. In this case, no text-based search stage is available, which makes image representation, image indexing (to support real-time search), the image retrieval metric, and annotation detection (i.e., mining salient words or phrases from the surrounding texts of retrieved images) more challenging. Dai et al. [12] proposed a Bayesian model as an improved annotation detection model, and subsequent work presented novel image representation and image indexing techniques.

Secondly, annotating images that have duplicated versions on the Web has been shown to have great research and commercial potential. Such images are called duplicate images. The reasons are that (1) duplicate image detection is a well-defined research problem, unlike general image similarity retrieval, for which there is no clear semantic definition of similarity; (2) effectively and efficiently discovering duplicate images (i.e., with both high retrieval precision and recall) in a Web-scale dataset is a challenging research problem [15]; and (3) duplicate images tend to belong to categories of interest to Web users, such as celebrities, famous locations, movie posters, and funny pictures, and annotating such images has great commercial value for image search, e.g., improving search relevance and user experience and reducing the size of the image index.
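One common way to make duplicate discovery scale, sketched below, is a tiny difference hash (dHash): each image is reduced to the sign pattern of adjacent-pixel differences, so exact-hash grouping finds duplicates without pairwise comparison. This is a generic illustration of hashing-based duplicate detection, not the specific method of [15]; the 8x9 grayscale patches are toy data.

```python
import numpy as np

def dhash_bits(img):
    # Sign of adjacent-pixel differences, row-wise: a compact binary
    # fingerprint that is invariant to uniform brightness changes.
    return tuple(bool(v) for v in (img[:, 1:] > img[:, :-1]).flatten())

rng = np.random.default_rng(2)
a = rng.random((8, 9))      # an image (toy grayscale patch)
b = a + 0.3                 # a brightness-shifted re-encoding of the same image
c = rng.random((8, 9))      # an unrelated image

# Duplicates collide on the hash; distinct images almost surely do not,
# so Web-scale grouping reduces to bucketing by hash value.
dhash_bits(a) == dhash_bits(b)
```

At Web scale the hashes are bucketed (e.g., in a distributed key-value store), and only images sharing a bucket need any further verification.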
Due to the explosive growth of visual data (both online and offline), effective annotation-based image search has become a key supporting technique for many multimedia applications.
Long et al. applied the annotation-based image search technique to understand user images for monetization. The image annotations help ad providers select related advertisements for targeted advertising.
El-Saban et al. [16] adopted annotation-based image search to automatically caption mobile-captured videos. Such a technique saves users the labor of annotating their videos, and the auto-captions can be used directly to facilitate video search, which has great commercial value.
Zhang et al. [17] utilized annotation-based image search to collect celebrity face images, showing how a large-scale dataset for computer vision research can be collected automatically.
- 1. Long F, Zhang HJ, Feng DD. Fundamentals of content-based image retrieval. In: Feng D, editor. Multimedia information retrieval and management: technological fundamentals and applications. Berlin/Heidelberg: Springer; 2013; Part I. p. 1–26. ISSN: 1860-4862.
- 2. Zhang L, Rui Y. Image search from thousands to billions in 20 years. ACM TOMCCAP 2013, Special Issue on the 20th Anniversary of the ACM MM Conference; 2013; New York.
- 5. Chang S-F, Ma W-Y, Smeulders A. Recent advances and challenges of semantic image/video search. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing; 2007; Honolulu, USA. p. 1205–8.
- 6. Blei D, Jordan MI. Modeling annotated data. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2003; Toronto, Canada. p. 127–34.
- 7. Jeon J, Lavrenko V, Manmatha R. Automatic image annotation and retrieval using cross-media relevance models. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2003; Toronto, Canada. p. 119–26.
- 8. Chang S-F, Chen W, Sundaram H. Semantic visual templates: linking visual features to semantics. Proceedings of the International Conference on Image Processing, Vol. 3; 1998; Chicago, Illinois. p. 531–4.
- 9. Zhuang Y, Liu X, Pan Y. Apply semantic template to support content-based image retrieval. Proceedings of the SPIE, Storage and Retrieval for Media Databases, Vol. 3972; 1999; Washington, USA. p. 442–9.
- 10. Wang X-J, Zhang L, Jing F, Ma W-Y. AnnoSearch: image auto-annotation by search. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition; 2006; New York, USA. p. 1483–90.
- 12. Dai LC, Wang X-J, Zhang L, Yu NH. Efficient tag mining via mixture modeling for real-time search-based image annotation. IEEE International Conference on Multimedia and Expo; 2012; Los Alamitos, USA.
- 14. Wang X-J, Zhang L, Liu M, Li Y, Ma W-Y. ARISTA – image search to annotation on billions of web photos. IEEE Conference on Computer Vision and Pattern Recognition; 2010; San Francisco, USA.
- 15. Wang X-J, Zhang L, Liu C. Duplicate discovery on 2 billion internet images. IEEE Conference on Computer Vision and Pattern Recognition Workshops; 2013; Portland, USA.
- 16. El-Saban M, Wang X-J, Hasan N, Bassiouny M, Refaat M. Seamless annotation and enrichment of mobile captured video streams in real-time. IEEE International Conference on Multimedia and Expo; 2011; Barcelona, Spain.
- 17. Zhang X, Zhang L, Wang X-J, Shum H-Y. Finding celebrities in billions of webpages. IEEE Transactions on Multimedia. 2012;14(4):995–1007.
- 18. Eakins J, Graham M. Content-based image retrieval. Technical report. Tyne: University of Northumbria at Newcastle; 1999.
- 19. Wang X-J, Yu M, Zhang L, Ma W-Y. Advertising based on users’ photos. IEEE International Conference on Multimedia and Expo; 2009; Piscataway, USA.