
1 Introduction

With the development of visual technology, more and more attention has been paid to image retrieval in both industry and the research community. Image retrieval techniques can be broadly classified into two categories: text-based and content-based. The text-based approach indexes images in a database by keywords obtained from manually added annotations. However, manual annotation is an imprecise and time-consuming job. Content-based image retrieval (CBIR), introduced in the early 1990s [15], searches images by their own visual contents. However, extracting meaningful features from large collections of image data remains a challenging problem because of the deviation in semantic understanding between humans and computers [9]. A semantic gap exists between low-level handcrafted features and high-level human perception [8]: the highly evolved human brain can transform visual signals into concrete concepts, whereas computers still cannot do so with high accuracy.

Much research has been devoted to semantic image retrieval; comprehensive reviews can be found in [12, 15]. The bag-of-words (BOW) framework, initially proposed in [17], is the most widely used approach among them. The key idea of BOW is to quantize each local feature into one or more so-called visual words, so that each image is represented as a set of unordered visual words. Our work also builds partially on BOW. Furthermore, we note that the semantic understanding of an image can generally be divided into two parts: the foreground object and the background regions. This means that retrieval results closer to high-level perception can be obtained if the foreground and background are represented individually by different features. In our previous work [1], image segmentation based on RC-saliency [3] was used to separate the foreground from the background, which improved semantic image retrieval performance. Although RC-saliency works well when the object is distinguished by color, it has limitations with respect to texture and shape, so it performs well only on particular kinds of images and is far from the generality of human vision. Segmentation based on a single visual saliency model therefore lacks universality, which can decrease retrieval performance. Obviously, better retrieval results can be obtained if the segmentation is improved, so a saliency fusion based multi-feature model is considered in our proposal.

In this paper, a novel semantic image retrieval method named saliency fusion based multi-feature (SFMF) is proposed. Firstly, seven saliency maps generated by different methods are computed and fused by the double low rank matrix recovery method (DLRMR) [7]. Secondly, SaliencyCut [3] applied to the fused saliency map segments each image into a foreground object and a background region. Finally, local and global features extracted from both the foreground objects and the background regions, combined with different weights in the similarity fusion, are used for retrieval.

The remainder of this paper is organized as follows. Section 2 describes SFMF in detail. Experimental results and analysis are given in Sect. 3, followed by the conclusion in Sect. 4.

2 The Proposed SFMF

There are three phases in SFMF: the fusion stage, offline processing and online retrieval. In the fusion stage, segmentation based on DLRMR saliency fusion is performed on each database image, so that each image is divided into two parts: a foreground object and a background region. After segmentation by the fused saliency maps, images are represented by a multi-feature representation. Furthermore, the foreground objects and background regions are described by different features and receive different weights in the similarity computation during retrieval.

Fig. 1. Segmentation based on saliency fusion. Segmentation is executed by SaliencyCut using the saliency map generated by saliency fusion via double low rank matrix recovery. After segmentation, the background region and foreground object are obtained and represented by different features.

2.1 Fusion Stage

The framework of segmentation based on saliency fusion is illustrated in Fig. 1. Unlike a single saliency map, saliency fusion combines various saliency detection methods so that the fused result is better than any individual method: it exploits the strengths of several algorithms while avoiding their weaknesses, so the final fused saliency map outperforms each of the input maps. For these reasons, double low rank matrix recovery (DLRMR) [7] is used to cast the fusion as an object and background decomposition problem. Furthermore, different features are used to represent foreground objects and background regions, which makes the SFMF representation more discriminative than a single-feature representation of the whole image.

First of all, seven saliency detection methods, AMC [6], BL [18], BSCA [16], HC [3], MR [20], MS [19] and ST [13], are chosen to obtain the saliency maps \( \{ S_k|1 \le k \le 7 \}\).

Then the mean shift algorithm [4] segments the image into regions \(\{P_i \}_{i=1,...,n}\), where n is the number of super-pixels. Each saliency map \(S_k\) can then be represented by an n-dimensional vector \( \mathbf X_k=[x_{1k},x_{2k},...,x_{nk}]^T\), whose \(i\)-th element is the mean of the saliency values of the pixels in super-pixel \(P_i\). By stacking the vectors \(\mathbf X_k\) column-wise, we obtain the combined matrix representation of the individual saliency maps, \(\mathbf X = [\mathbf X_1,\mathbf X_2,...,\mathbf X_7] \in R^{n \times 7} \). Using super-pixels rather than pixels as the smallest unit of computation is more consistent with the semantic content of the image.
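As a concrete illustration, the following minimal sketch builds \(\mathbf X\) from a set of saliency maps and a super-pixel label image; the variable names are assumptions, and the super-pixel segmentation itself (mean shift in our method) is taken as given.

```python
import numpy as np

def build_saliency_matrix(saliency_maps, labels):
    """Stack per-super-pixel mean saliency values into X (n x 7).

    saliency_maps: list of 7 H x W arrays in [0, 1] (S_1 ... S_7).
    labels:        H x W array of super-pixel ids in {0, ..., n-1},
                   here assumed to come from a mean shift segmentation.
    """
    n = labels.max() + 1
    X = np.zeros((n, len(saliency_maps)))
    for k, s_map in enumerate(saliency_maps):
        for i in range(n):
            mask = labels == i
            X[i, k] = s_map[mask].mean()   # x_ik: mean saliency of P_i in S_k
    return X
```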

Generally, a natural image \(\mathbf I\) could be decomposed as:

$$\begin{aligned} \varPhi (\mathbf I) = \mathbf A + \mathbf E, \end{aligned}$$
(1)

where \(\varPhi \) indicates a certain transformation, and \(\mathbf A\) and \(\mathbf E\) denote matrices corresponding to the background and the foreground, respectively. Treating the matrix \(\mathbf X\) as a feature representation of the image \(\mathbf I\) in the saliency feature space, with each row representing a super-pixel feature vector [7], Eq. (1) can be rewritten as:

$$\begin{aligned} \mathbf X = \mathbf A + \mathbf E, \end{aligned}$$
(2)

Therefore, saliency fusion can be cast as a low rank affinity pursuit. Given the matrix \(\mathbf X = [\mathbf X_1,\mathbf X_2,...,\mathbf X_7] \in R^{n \times 7} \), the low rank matrix recovery problem can be formulated as:

$$\begin{aligned} \min \limits _{\mathbf A,\mathbf E}\, rank(\mathbf A)+ \lambda (rank(\mathbf E))\quad s.t.\ \mathbf X =\mathbf A + \mathbf E, \end{aligned}$$
(3)

where the parameter \(\lambda > 0 \) balances the two rank terms [7].

Equation (3) is solved by the alternating direction method of multipliers (ADMM) [2]; the resulting low-rank matrix \(\mathbf E\) measures the contribution of each saliency method and yields an adaptive combination of the maps. The values of \(\mathbf E\) for every region are accumulated across the saliency maps, and the object area is separated from the background with a suitable threshold, which is recommended to be about 1.4 times the average value.
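For illustration, the following is a minimal sketch of one standard way to solve a convex relaxation of Eq. (3) by ADMM, replacing each rank term with the nuclear norm and applying singular value thresholding, followed by the 1.4×-mean threshold. It is an illustrative surrogate under our reading of the text above, not necessarily the exact solver of [7]; all parameter values are assumptions.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def fuse_saliency(X, lam=0.5, mu=1.0, rho=1.1, n_iter=200):
    """ADMM sketch for min ||A||_* + lam * ||E||_*  s.t.  X = A + E.

    The nuclear norm is used as the usual convex surrogate of rank;
    lam, mu, rho and n_iter are illustrative values, not those of [7].
    """
    A = np.zeros_like(X)
    E = np.zeros_like(X)
    Y = np.zeros_like(X)                      # Lagrange multiplier
    for _ in range(n_iter):
        A = svt(X - E + Y / mu, 1.0 / mu)     # background update
        E = svt(X - A + Y / mu, lam / mu)     # foreground update
        Y = Y + mu * (X - A - E)              # dual update
        mu *= rho
    return A, E

def foreground_superpixels(E):
    """Fused saliency per super-pixel and the 1.4 x mean threshold."""
    fused = np.abs(E).sum(axis=1)             # accumulate E over the 7 maps
    return fused > 1.4 * fused.mean()         # True for object super-pixels
```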

Some examples comparing segmentation using the saliency maps fused by DLRMR with segmentation using a single visual saliency model, such as RC-saliency, are shown in Fig. 2. From these examples we can observe that saliency map fusion achieves better segmentation than a single visual saliency model.

Fig. 2. Given the input image and the ground truth, segmentation using the saliency maps fused by DLRMR is better than segmentation using one saliency model such as RC-saliency, the method used in our previous work.

2.2 Offline Processing

After the images have been segmented into foreground objects and background regions, different features are extracted according to their different characteristics, which distinguishes SFMF from traditional image retrieval methods.

For the background regions, features are extracted in HSV color space. Since background regions generally contain large areas of similar colors and textures, local binary patterns (LBP) on the V channel and color histograms on the H and S channels are extracted as the texture and color features. These two features are chosen because they are both simple and efficient.

For the foreground objects, besides the texture and color features extracted as for the background, local features should also be considered. The scale-invariant feature transform (SIFT) [14] is a natural choice because of its success in object retrieval tasks. The SIFT features are embedded in the BOW framework for retrieval: SIFT descriptors extracted from an image are matched against a visual-word vocabulary clustered by K-means, and the frequencies of the visual words appearing in the image are used as its representation.
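As an illustration, the following sketch computes such a BOW histogram with OpenCV SIFT and a K-means vocabulary. The vocabulary size of 30 matches the SIFT vector length reported in Sect. 3.4; all other parameter values are assumptions.

```python
import numpy as np
import cv2

def train_vocabulary(gray_images, n_words=30):
    """Cluster SIFT descriptors of the training images into visual words."""
    sift = cv2.SIFT_create()
    descriptors = []
    for img in gray_images:
        _, des = sift.detectAndCompute(img, None)
        if des is not None:
            descriptors.append(des)
    descriptors = np.vstack(descriptors).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 1e-3)
    # args: data, K, bestLabels, criteria, attempts, flags
    _, _, centers = cv2.kmeans(descriptors, n_words, None, criteria,
                               5, cv2.KMEANS_PP_CENTERS)
    return centers                              # n_words x 128 visual words

def bow_histogram(gray_image, vocabulary):
    """Frequency of visual words in one image (the SIFT_g term of Eq. (4))."""
    sift = cv2.SIFT_create()
    _, des = sift.detectAndCompute(gray_image, None)
    hist = np.zeros(len(vocabulary))
    if des is None:
        return hist
    # assign each descriptor to its nearest visual word
    d = np.linalg.norm(des[:, None, :] - vocabulary[None, :, :], axis=2)
    for w in d.argmin(axis=1):
        hist[w] += 1
    return hist
```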

The features extracted from the images are defined as formula (4),

$$\begin{aligned} F\left\{ \begin{array}{ll} F_f=(H_h,H_s,LBP_v,SIFT_g), \\ F_b=(H_h,H_s,LBP_v), \end{array} \right. \end{aligned}$$
(4)

where \(F_f\) denotes the features of the foreground and \(F_b\) the features of the background; \(H_h\) and \(H_s\) are histogram features of the hue (H) and saturation (S) channels of the HSV color space; \(LBP_v\) is the histogram of local binary patterns computed on the value (V) channel; and \(SIFT_g\) is the histogram of visual words computed on the gray-level image. The weights are set in advance, with the weight of \(F_b\) smaller than that of \(F_f\).
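A minimal sketch of assembling \(F_b\) (and the color/texture part of \(F_f\)) might look as follows; the bin counts (40+40 color bins, 256 LBP bins) are chosen to match the vector lengths reported in Sect. 3.4 and are otherwise assumptions.

```python
import numpy as np
import cv2
from skimage.feature import local_binary_pattern

def color_texture_features(image_bgr, mask):
    """H and S color histograms plus an LBP histogram on V, over `mask`.

    This covers F_b entirely; F_f additionally appends the 30-bin SIFT
    bag-of-words histogram sketched above.
    """
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    h_hist = np.histogram(h[mask], bins=40, range=(0, 180))[0]   # OpenCV hue range
    s_hist = np.histogram(s[mask], bins=40, range=(0, 256))[0]
    lbp = local_binary_pattern(v, P=8, R=1)                      # 256 possible codes
    lbp_hist = np.histogram(lbp[mask], bins=256, range=(0, 256))[0]
    return np.concatenate([h_hist, s_hist, lbp_hist]).astype(float)
```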

2.3 Online Retrieval

The query image is also segmented based on saliency fusion and represented by different features as described in Sect. 2.2. A fusion of z-score normalized chi-square distances is proposed to measure the similarity between the query and the images in the database.

The chi-square distance between the histogram of the query image \(H_{Q}\) and the histogram of a database image \(H_{I}\) is defined as:

$$\begin{aligned} DS(Q,I)=\sum \limits _{j=1}^K\frac{(H_\textit{Q}(j)-H_{I}(j))^2}{H_\textit{Q}(j)+H_{I}(j)}, \end{aligned}$$
(5)

where K is the number of bins in the histogram.
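A direct sketch of Eq. (5) follows; the small constant added to the denominator is an assumption to avoid division by zero when a bin is empty in both histograms.

```python
import numpy as np

def chi_square_distance(h_q, h_i, eps=1e-10):
    """Eq. (5): chi-square distance between two K-bin histograms."""
    h_q = np.asarray(h_q, dtype=float)
    h_i = np.asarray(h_i, dtype=float)
    return np.sum((h_q - h_i) ** 2 / (h_q + h_i + eps))
```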

Since several histograms are constructed for each image, the similarity between images is measured by fusing the distances of the histograms with different weights. However, normalization before fusion is necessary because the histograms are built from different feature vectors. Additionally, in order to reduce the influence of outliers, the distances are normalized as follows: given a query image, computing one type of histogram distance between this query and all images in the database yields one set of distances \(\{DS_A(Q,I_i)\}\), where \(i=\{1,2,...,P\}\) and P is the number of images in the database. The normalized distance is then defined as:

$$\begin{aligned} DS_A^N(Q,I_i)=\frac{DS_A(Q,I_i)-\mu {_{A_Q}}}{\sigma {_{A_Q}}}, \end{aligned}$$
(6)

where \(\mu {_{A_Q}}\) and \(\sigma {_{A_Q}}\) are the mean value and the standard deviation of the distances set \(\{DS_A(Q,I_i)\}\) respectively.

Finally, all the distances are fused into one distance that decides the similarity between a database image and the query. In the fusion stage, not all distances receive the same weight: more weight is given to the LBP distances of the background and the foreground. In particular, weights of 3 on the foreground LBP distance and 2 on the background LBP distance were found experimentally to be a good choice.
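A minimal sketch of Eq. (6) and the weighted fusion is given below, assuming the per-histogram distance vectors have already been computed; the 3x/2x LBP weights follow the statement above, while the unit weights on the remaining histograms are an assumption.

```python
import numpy as np

def z_normalize(distances):
    """Eq. (6): z-score normalize one set of distances {DS_A(Q, I_i)}."""
    d = np.asarray(distances, dtype=float)
    return (d - d.mean()) / d.std()

def fuse_distances(distance_sets, weights=None):
    """Weighted sum of z-normalized distances over histogram types.

    distance_sets: dict mapping a histogram type (e.g. 'lbp_fg') to the
                   vector of chi-square distances to all P database images.
    """
    if weights is None:
        # 3x on foreground LBP, 2x on background LBP (Sect. 2.3);
        # unit weights on the remaining histograms are an assumption.
        weights = {'lbp_fg': 3.0, 'lbp_bg': 2.0,
                   'color_fg': 1.0, 'color_bg': 1.0, 'sift_fg': 1.0}
    fused = None
    for name, dists in distance_sets.items():
        term = weights.get(name, 1.0) * z_normalize(dists)
        fused = term if fused is None else fused + term
    return fused   # smaller value = more similar to the query
```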

The SFMF algorithm framework is summarized in Algorithm 1.


3 Experiments

3.1 Image Dataset

The experiments were executed on three publicly available and widely used benchmark databases: Corel 5k [10], VOC 2006 [5] and Corel 10k [11].

Corel 5k contains 50 themes with 100 images each, of size \(192\times 128\) or \(128\times 192\), in JPEG format. As in the compared methods, 8 themes with an obvious object in the image were selected: “bear”, “pyramid”, “building”, “plane”, “snowberg”, “horse”, “tiger” and “train”. Among these images, 200 are used to train the visual-word dictionary, and 800 are used for testing.

In Pascal VOC 2006, ten themes were selected: “sheep”, “motorbike”, “cow”, “horse”, “dog”, “bus”, “car”, “cat”, “person” and “bicycle”, and 1000 images covering all ten themes were selected randomly. As with Corel 5k, 200 images are used to train the visual-word dictionary, and the rest are used for testing.

To demonstrate the versatility of the algorithm, a larger image database, Corel 10k, is chosen. Corel 10k expands the number of pictures in the B set to reach 10000 images in total, and contains 100 categories with diverse contents such as sunset, beach, flower, building, car, horses, mountains, fish, food, door, etc.

Some example queries from these databases and their retrieval results are shown in Fig. 3. From these examples we can see that the proposed algorithm obtains very good retrieval results for images containing salient objects against a clear background, such as ‘Poker’ and ‘Gun’ in the figure, because the salient objects are first segmented from the background regions. Furthermore, the proposed algorithm also achieves good results for images containing salient objects against a complicated background, such as ‘Hippo’ and ‘Terraces’ in the figure.

Fig. 3. Examples of queries and their top 12 retrieved images.

3.2 Experimental Measurement

We compute the bounded mean average precision (mAP) to count how many positive images appear among the top K retrieved results. The mAP at bound K is defined as follows:

$$\begin{aligned} mAP_K = \frac{\sum _{i=1}^N AP_K^{(i)}}{N}, \qquad 1 \leqslant K \leqslant N, \end{aligned}$$
(7)

where N is the number of query topics, K is the number of top retrieval results considered, and \(AP_K^{(i)}\) is the AP at bound K for the i-th query, defined as follows:

$$\begin{aligned} AP_K=\frac{\sum _{i=1}^K \rho _iP_i}{N_{GT}}, \end{aligned}$$
(8)

where K is the computing depth of the AP and \(\rho _i\) is a Boolean indicator defined as follows:

$$\begin{aligned} \rho _i=\left\{ \begin{array}{ll} 1,\ the\ i-th\ query\ result\ is\ correct, \\ 0,\ the\ i-th\ query\ result\ is\ incorrect, \end{array} \right. \end{aligned}$$
(9)

and \(P_i\) is the precision of the top i results, while \(N_{GT}\) is the number of positive samples among the top K query results according to the ground truth [21].
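A minimal sketch of Eqs. (7)-(9), assuming each query is given as a Boolean relevance list over its ranked results:

```python
import numpy as np

def average_precision_at_k(relevant, k):
    """Eqs. (8)-(9): AP at bound k from a ranked Boolean relevance list."""
    rho = np.asarray(relevant[:k], dtype=float)               # rho_i
    precisions = np.cumsum(rho) / np.arange(1, len(rho) + 1)  # P_i
    n_gt = rho.sum()                                          # positives in top k
    if n_gt == 0:
        return 0.0
    return float(np.sum(rho * precisions) / n_gt)

def mean_average_precision(relevance_lists, k):
    """Eq. (7): mAP at bound k over N query topics."""
    return float(np.mean([average_precision_at_k(r, k) for r in relevance_lists]))
```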

Two other metrics are precision and recall. These two metrics are often combined as a weighted harmonic mean, namely the F-measure, which is an overall performance measure [11]. It is defined as follows:

$$\begin{aligned} \left\{ \begin{array}{ll} F = \frac{(1+\beta ^2) \times P \times R}{(\beta ^2 \times P) + R}, \\ P = \frac{I_N}{N}, \\ R = \frac{I_N}{M}, \\ \end{array} \right. \end{aligned}$$
(10)

In image retrieval experiments, precision (P) is the ratio of the number of retrieved similar images to the number of retrieved images, while recall (R) is the ratio of the number of retrieved similar images to the total number of similar images. In Eq. (10), \(I_N\) is the number of retrieved similar images, N is the total number of images retrieved and M is the total number of similar images in the database. The coefficient \(\beta \) allows one to weight either precision or recall more heavily; the two are balanced when \(\beta = 1\). If there is no particular reason to favor precision or recall, \(\beta = 1\) is commonly used in image or information retrieval. The parameters are set as in [11]: \(N = 12\), \(M = 100\) and \(\beta = 1\) on the Corel 10k dataset, so the F-measure becomes the \(F_1\)-measure.
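A direct sketch of Eq. (10), with the Corel 10k settings from [11] as default values:

```python
def precision_recall_f(i_n, n_retrieved=12, m_relevant=100, beta=1.0):
    """Eq. (10): precision, recall and F-measure.

    i_n: number of retrieved similar images; with beta = 1 this is the F1-measure.
    """
    p = i_n / n_retrieved
    r = i_n / m_relevant
    if p == 0 and r == 0:
        return p, r, 0.0
    f = (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    return p, r, f
```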

3.3 Comparison

BOW can be seen as the baseline for content-based image retrieval, especially for semantic image retrieval. RoI-BOW [21] is a recently reported method that improves the BOW model with region segmentation; with this segmentation, images can be seen as the combination of regions of interest (RoI) and non-RoI regions. SSH and MSD were also recently reported by Liu [10, 11]; these algorithms also use saliency models and obtain good results in image retrieval. SBMF is our earlier work [1]. We therefore selected BOW, RoI-BOW, SSH, MSD and SBMF for comparison.

Tables 1 and 2 show the mAP at the top ten and top twenty retrieved results for each theme in the Corel 5K and VOC 2006 databases, respectively. From the tables, we can conclude that the proposed SFMF outperforms the other methods on most of the selected themes and achieves the best overall performance.

To further verify the improvement, experiments are also conducted on the Corel 10k database; the results are shown in Table 3. From this table, we can see that the proposed SFMF algorithm achieves better performance than the BOW baseline, MSD, SSH and SBMF.

Table 1. Comparison of the mAP on Corel 5K
Table 2. Comparison of the mAP on VOC 2006
Table 3. The precision, recall, F-measure of five methods on Corel 10k

3.4 Running Efficiency

Although the proposed SFMF uses multiple features in retrieval, the scheme is efficient in terms of feature vector size and runtime. The feature vector of the foreground objects has length 366: 256 for LBP, 30 for SIFT and 80 for the color histograms. The feature vector of the background regions has length 336: 256 for LBP and 80 for the color histograms. The total length of the feature vector used for retrieval is therefore 702.

The running time evaluation experiments were implemented in Matlab 2013b on a laptop with a Core i7-4720 processor and 8 GB memory. Segmentation and visual-word training take about 2.6 s and 1.4 s per image respectively, and retrieval takes about 0.5 s per query. It should be noted that the segmentation and training can be done offline prior to retrieval.

4 Conclusion and Perspectives

A novel semantic image retrieval method named SFMF was proposed, which integrates saliency fusion with traditional image feature representations. Images are segmented into foreground objects and background regions using saliency maps generated by visual saliency fusion, and then different features are extracted according to the different characteristics of the foreground and the background. Experiments on three widely used databases show that the improved segmentation based on visual saliency fusion leads to better retrieval performance.

However, it should also be noted that the computational load of the segmentation and the visual saliency fusion is still heavy. Furthermore, a more compact feature vector needs to be found in the future, since the LBP histogram constitutes a large proportion of the feature vector. These two issues will be explored in future work.