1 Introduction

Since the beginning of the World Wide Web (WWW) and the development of cheap digital recording and storage devices, the amount of on-line digital images available has been increasing continuously. The growing popularity of online social networks built on visual information, such as Instagram, pushes this tendency further. As a result, effective and efficient web image indexing and retrieval schemes are of high importance, and a lot of research has been devoted towards this end.

In general, image retrieval research efforts fall into two broad categories: content-based and text-based. Content-based methods retrieve images by analyzing and comparing their content against a given image example used as a starting point. Text-based methods are similar to document retrieval and retrieve images using keywords. The latter is the approach of preference for both ordinary users and search engine engineers. Besides the fact that the majority of users are familiar with text-based queries, content-based image retrieval lacks semantic meaning. Furthermore, image examples that have to be given as a query are rarely available. From the search engine perspective, text-based image retrieval methods take advantage of the well-established techniques for document indexing and are integrated into a unified document retrieval framework. However, for text-based image retrieval to be feasible, images must somehow be associated with specific keywords or a textual description. In contemporary search engines this kind of textual description is usually obtained from the web page, or the document, containing the corresponding images and includes HTML alternative text, the file names of the images, captions, metadata tags and surrounding text [1, 2]. Text metadata are not always available, and in most cases they are not accurate (i.e., they do not fully describe the visual content of the image). In addition, disambiguating different visual aspects of the same term is very difficult using text metadata without taking the context into consideration.

Surrounding text is the text that surrounds a Web image inside an HTML document. This text is, indeed, a very important source of semantic information for the image. However, automatic localization of surrounding text is by no means easy, mainly due to modern web-page layout formatting techniques, which are based on external files (stylesheets). As a result, visual segmentation (parsing) of the rendered web page is required in order to identify the surrounding text of an image. The need to automatically extract the textual blocks semantically related to an image and assign them to that image led to what we call Web Image Context Extraction (WICE). In these terms, WICE is the process of automatically assigning the textual blocks of a web document to the images of the same document they refer to [3].

In content-based image retrieval, features such as color, shape or texture are used for indexing and searching web images. The user provides a target image and the system retrieves the best-ranked images based on their similarity to the user's query. Although it has been a long time since researchers working in this area defined the semantic gap [4], i.e., the inability of a system to interpret images based on automatically extracted low-level features, a solution still does not exist. WICE methods may be used as a means of bridging this gap. For instance, in [5] the authors propose an auto-annotation system which combines a content-based search stage with a text-based stage in order to leverage the image dataset in learning from similar annotations.

Although image context identification and text-based image indexing are very important per se, the huge number of images which do not appear in web pages, or do not have a clearly related context, either as surrounding text or as specific keywords, poses another challenge. Recently, the idea of visual concept modeling [6, 7] was proposed as a possible solution to this problem. In these approaches keywords are modeled via low-level features, and non-annotated images are passed through these models in order to identify possible matches with the visual representation of the models and to be assigned the corresponding keywords. The approach is promising, but the keyword models require proper training, usually approached as a learning-by-example procedure, and thus they depend heavily on the selected corpus and on keyword identification. So far the training set was created using manually annotated data and, for quality assurance on the keyword selection, crowdsourcing methods were adopted [8, 9]. Recently, Giannoulakis and Tsapatsoulis [10] investigated the case of Instagram as a source of annotated images, concluding that overall 26% of the hashtags accompanying photos on Instagram are related to the photos' visual content. Thus, filtering approaches are still required for a proper training set to be created.

In this paper we investigate the automatic extraction of keywords from a web image's context, which can be used either for image indexing or for the automatic creation of training datasets for the visual modeling of keywords mentioned above. Our method deviates from existing approaches in a very important aspect: it does not require any sort of training, since it is based on an a priori fixed English language model to identify the importance of a keyword in a text fragment. The latter is identified through a computationally efficient visual-based HTML parsing algorithm. The fact that image context is in most cases concise leads traditional approaches, such as probabilistic, tf-idf based and clustering-based ones, to failure. Thus, image context needs to be extended; but in this case the correlation of the selected text with a specific image in the web page decreases, and all images in the web page tend to share the same indexes (which are also similar to the indexes of the web page itself). Finally, probabilistic, tf-idf and clustering-based approaches require training. As a result the problem is recycled: in order to automatically index non-explicitly annotated images you need training examples, but in order to automatically create the training examples you need to train the indexing extraction algorithms!

2 Related Work

2.1 WICE Methods

In text-based web image retrieval, image file names, anchor texts, surrounding paragraphs or even the whole text of the hosting web page are traditionally used. The user provides keywords or key phrases, and text retrieval techniques are used to retrieve the best-ranked images. Early examples of these methods include [11–13]. However, it turned out very soon that the text fragment of the hosting web page relevant to each image must be extracted for better retrieval accuracy. This is the well-known WICE problem [3], already mentioned in the introduction. The high diversity of design patterns in web pages, the noisy environment (advertisements, graphics, navigational objects, etc.), and the existence of too much textual and visual information in single documents are prohibiting factors a WICE system must overcome.

Several WICE methods have been proposed in the literature. A first category of approaches, such as [14, 15], makes use of the DOM tree structure of the hosting web page. In general these methods are not adaptive, as they are designed for specific design patterns.

Web page visual segmentation is a second category of approaches to the WICE problem. This kind of approach was initially proposed in [16], where the authors use Vision-based Page Segmentation (VIPS) [17] in order to extract blocks which contain both image and text, and construct an image graph using link structures. Web page segmentation is indeed a more adequate solution to the problem of text extraction, since it is adaptable to different web page styles and depends on the visual cues that form each web page. However, most of the algorithms falling within this approach are computationally heavy [18] and are not designed specifically for the problem of image indexing [19]; therefore, they often deliver poor results [20]. In addition, the creators of VIPS stopped its maintenance as early as 2011.

In the proposed approach, the WICE problem is tackled through HTML code parsing of the rendered web page. This approach is computationally light and easily applicable, and it is quite effective on modern web pages that include several CSS files, for styling purposes, as well as dynamic elements (such as PHP code). It executes HTML parsing by combining the ideas and the tool presented in [21] with the open source code of the Mozilla web browser.

2.2 Web Image Indexing from Concise Text Fragments

The text fragments identified by WICE methods are usually very concise; as a result, traditional keyword extraction (tf-idf-like methods) does not apply. Web image retrieval based on clickthrough data is more relevant and effective. Clickthrough data are usually collected from search logs and include text queries and data from relevance feedback [22]. These methods [23–26] are quite effective, but they are based on machine learning; thus, they suffer from the scalability problem and are inappropriate for large-scale web image retrieval.

The proposed method uses a learning-free language model for web image indexing using text fragments located by a new WICE method explained next.

3 The Proposed Method

The overall architecture of the proposed method is shown in Fig. 1 along with an illustrative example. It consists of three main steps: (a) HTML parsing, obtained with the aid of the lxml parser, (b) the WICE algorithm, and (c) an English language model accompanied by a keyword extraction algorithm.

Fig. 1. The architecture of the proposed method through an example

3.1 The WICE Algorithm

The aim of the WICE algorithm is to identify the context (text fragment) of an image given its URL. First, the position of the image and its (rendered) dimensions are computed with the aid of the Mozilla open source code. Small images and graphic types (e.g., gif) corresponding to logos and banners are discarded at this stage. Next, the nearby sentences of the image (text belonging to the same level as the image in the DOM tree and lying within a radius equal to 0.3 of the web page's rendered height from the image's center) are selected, along with the caption, the alternative text (if it exists) and the hyperlink text. All these text data are merged together to form the text fragment (referred to as the image context in the following) related to the given image.
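The context-merging step above can be sketched as follows. This is a minimal illustration, not the actual implementation: it assumes the rendered positions and DOM levels of text blocks have already been obtained (e.g., from the browser rendering engine), and the `Block` structure and function names are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class Block:
    text: str
    center_y: float   # vertical center of the rendered block, in pixels
    dom_level: int    # depth of the block in the DOM tree

def image_context(image_center_y, image_level, page_height, blocks,
                  caption="", alt_text="", link_text="", radius=0.3):
    """Merge nearby sentences with the caption, alt text and hyperlink text.

    A block counts as 'nearby' when it sits at the same DOM level as the
    image and within radius * page_height of the image's center, as
    described in Sect. 3.1.
    """
    limit = radius * page_height
    nearby = [b.text for b in blocks
              if b.dom_level == image_level
              and abs(b.center_y - image_center_y) <= limit]
    parts = nearby + [caption, alt_text, link_text]
    return " ".join(p for p in parts if p)
```

For example, with a 1000-pixel-high page, only blocks within 300 pixels of the image's center (and at the same DOM level) contribute to the context.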

Table 1. Examples of the performance of language model. P(w) is the actual probability of word w

3.2 An English Language Model for Image Retrieval

The relative frequency \(f_w\) of appearance of a word ‘w’ in natural (human) languages follows the well-known Zipf law [27]; that is, the product of \(f_w\) with the ranking \(R_w\) of word ‘w’ is approximately constant:

$$\begin{aligned} f_w\cdot R_w =c \end{aligned}$$
(1)

Given that the number of words in web pages is very large and continuously increasing, due to new documents, misspelling, slang, etc., the probability of appearance P(w) of a word ‘w’ in a web page can be approximated by considering \(P(w)=f_w\) as follows:

$$\begin{aligned} P(w) = \frac{c}{R_w} \end{aligned}$$
(2)

In text-based retrieval a common keyword identification method involves the well-known tf-idf score. Words or, more generally, tokens [27] with a high tf-idf score in a given document or text fragment are considered important (keywords) for its description and can be used as indices. However, while the tf (term frequency) value depends only on the specific document, the idf (inverse document frequency) value is computed based on a relatively large number of relevant documents. Thus, in order to compute tf-idf we need a training corpus.

In this paper, we argue that web images appear in every type of document on the web, and, as a result, it is not necessary to collect a specific domain corpus to compute idf and find the keywords of a text fragment. Therefore, we approximate the idf value with \(\frac{1}{P(w)}\) and arrive at a learning-free indexing method for text fragments related to web images. Equation 2 gives a rough approximation of P(w). After a little experimentation (see Table 1 for some examples) we found that a more accurate language model can be obtained by:

$$\begin{aligned} P(w) = \frac{c}{R_w+k^{R_w}} \end{aligned}$$
(3)

where \(c=0.12\) and \(k=1.1\).

In any case, in order for a language model as given above to be useful, the ranking \(R_w\) of every word should be available. In this work we take advantage of the ‘Wordcount’ project for this purpose.
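The language model of Eq. 3 is a one-line function of a word's frequency rank. The sketch below, assuming the stated constants \(c=0.12\) and \(k=1.1\), shows how the probability estimate decays as the rank grows; the function name is illustrative.

```python
def p_word(rank, c=0.12, k=1.1):
    """Approximate probability of appearance of the word with
    frequency rank `rank`, per Eq. (3): P(w) = c / (R_w + k^R_w)."""
    return c / (rank + k ** rank)
```

Because of the exponential term \(k^{R_w}\), the estimate falls off much faster than the plain Zipf approximation of Eq. 2 for rare (high-rank) words, which is what makes rare terms, such as named entities, stand out in the scoring step described next.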

Keyword extraction is facilitated by the language model and involves a series of steps: (a) text fragment segmentation into sentences, (b) stopword removal, (c) part-of-speech (POS) tagging, (d) noun and proper noun selection as candidate indices, (e) ranking of the selected terms based on the adopted language model, and (f) final selection of the indexing terms (keywords) based on the score \(S = \frac{tf}{P(w)}\) (terms whose score exceeds an empirically derived threshold T are kept as keywords).
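The final scoring and selection steps can be sketched as below. This is a simplified illustration, not the actual implementation: it assumes the candidate nouns have already been produced by the sentence-splitting, stopword-removal and POS-tagging steps, and that a word-to-rank table (e.g., built from the ‘Wordcount’ data) is available; all names and the threshold value are illustrative.

```python
from collections import Counter

def extract_keywords(nouns, rank_of, threshold, c=0.12, k=1.1):
    """Score each candidate noun by S = tf / P(w) and keep those
    whose score exceeds the empirically derived threshold.

    `nouns`   -- noun tokens left after the earlier pipeline steps
    `rank_of` -- mapping from word to its frequency rank R_w
    """
    n = len(nouns)
    tf = {w: count / n for w, count in Counter(nouns).items()}

    def p(word):  # language model of Eq. (3)
        r = rank_of[word]
        return c / (r + k ** r)

    scored = {w: tf[w] / p(w) for w in tf}
    # keep terms above threshold, highest score first
    return sorted((w for w, s in scored.items() if s > threshold),
                  key=lambda w: -scored[w])
```

Note that a rare word (large \(R_w\)) has a tiny \(P(w)\), so even a single occurrence can yield a very large score, which is how named entities end up being promoted as keywords.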

Table 2. Evaluation results for context extraction. GS denotes gold standard

4 Experimental Evaluation

In order to evaluate the proposed method, 978 web images taken from 243 web pages were used. Three annotators (students of the Cyprus University of Technology) were asked, independently, to: (a) identify and copy the context (text fragment) of each image, and (b) select the keywords from the context that best describe the image. The contexts and keywords from the three annotators were merged and used as the gold standard for the evaluation of the proposed method. Indicative examples of images, their context, and the keywords chosen by the annotators and found by the algorithm are available online.

Table 2 summarizes the results of context extraction. By TP we denote the ‘True Positive’ rate, that is, the tokens that were in the gold standard and found by the algorithm. Similarly FP denotes ‘False Positive’ rate, i.e., tokens found by the algorithm but not in the gold standard and FN denotes ‘False Negative’ rate, that is, tokens in the gold standard that were not found. Recall (R), Precision (P) and F-score (F) values are computed as usual by:

$$\begin{aligned} R = \frac{TP}{TP+FN} \end{aligned}$$
(4)
$$\begin{aligned} P = \frac{TP}{TP+FP} \end{aligned}$$
(5)
$$\begin{aligned} F = \frac{2\cdot P\cdot R}{P+R} \end{aligned}$$
(6)
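Equations 4–6 translate directly into code; the small helper below (an illustrative sketch) computes all three measures from the TP, FP and FN counts.

```python
def prf(tp, fp, fn):
    """Recall, precision and F-score from true/false positive
    and false negative counts, per Eqs. (4)-(6)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score
```

For instance, with 8 true positives, 2 false positives and 2 false negatives, all three measures come out to 0.8.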

We can see in Table 2 that the algorithm tends to include in the image context more tokens than those identified by the annotators. This is mainly caused by our decision to include in the image context not only the nearby sentences but also the tokens in the image caption and the image's hyperlink. The latter was typically never selected by the annotators, although from several studies we know that the information in hyperlinks is of utmost importance in information retrieval. Overall, the results are satisfactory given the simplicity of the proposed method. For comparison, see the results reported by Alcic and Conrad [15].

Table 3. Evaluation results for keyword identification

Table 3 shows the results of keyword identification. We observe that the recall rate is close to 50%, which is very promising compared to similar methods (see for instance [26]). Human annotators tend to use more ‘emotional’ words to describe images, even in cases where the visual content does not clearly correspond to these terms. On the other hand, the proposed algorithm promotes nouns as keywords (a choice made during the design of the algorithm) and especially named entities (as a result of the use of the proposed language model). Similarly to the context extraction case, the algorithm in general identifies more keywords than the annotators, causing recall to become higher than precision. Nevertheless, in information retrieval the tendency is to pursue higher recall rather than precision (irrelevant results are better than no results).

5 Conclusion and Further Work

In this paper we have presented a method for web page image indexing which is based on language models. The method can be applied to identify keywords for any web image without training on a particular corpus. It is based on raw HTML parsing of web pages to identify the text block nearest to the image; then a metric which combines the frequency of terms in the block with their frequency ranking in the corresponding language is used. Preliminary results, on an especially designed and annotated web image database, are promising and show the effectiveness of the proposed method. However, there are also some limitations that need to be overcome for the method to be widely applied, while some improvements are planned for the (near) future. The effectiveness of the proposed system on ‘carousel’-type web images needs to be tested. Furthermore, the algorithm is currently applied only to English web pages, since we take advantage of the ranking of English words to create our language model. However, once such a ranking exists for any other natural language, the extension of the proposed method is straightforward.

An improvement of the proposed method is to consider the structure of the surrounding text in order to further weight the terms. Thus, terms that appear in headers, subheaders, weblinks, etc., will be given higher importance than terms in the plain text. Finally, the method will be used in the context of visual concept modeling [8] for the automatic creation of the image–keyword pairs that are required for training purposes. In this context, the other basic limitation of this work, namely the inability to apply it to non-web images, will be overcome. For more information on this, please see [6, 7].