1 Introduction

Salient object detection aims to identify the most attention-grabbing object in a scene. Its value lies mainly in a wide range of applications such as object detection and recognition [1, 2], image and video compression [3, 4], and content-aware image retargeting [5, 6], to name a few. Consequently, numerous salient object detection models have been developed in recent years [7, 8]. These models can be categorized as either bottom-up or top-down approaches. Bottom-up saliency models rely on pre-assumed priors (e.g., the contrast prior, the central-bias prior, the background prior, and so on), whereas top-down models usually use high-level information to guide the detection. We focus only on bottom-up models in this work.

For bottom-up salient object detection models, the priors play a critical role. The most widely used is the contrast prior, which is typically measured in either a local [9-11] or a global [12-14] fashion. Motivated by early primate vision, Itti et al. [11] regard visual attention as local center-surround difference and present a pioneering saliency model based on multi-scale image features. Goferman et al. [9] combine multiple cues, including local low-level features, high-level features, and global considerations, to segment out salient objects along with their contexts. In [10], Jiang et al. utilize shape information to find regions of distinct color by computing the difference between the color histogram of a region and those of its adjacent regions. Due to the lack of higher-level information about the object, these local contrast based models tend to produce higher saliency values near edges instead of uniformly highlighting the whole salient object.

In contrast, global contrast based methods take holistic rarity over the complete image into account. The model of Achanta et al. [12] works on a per-pixel basis by computing color dissimilarities to the mean image color and achieves globally consistent results; Gaussian blurring is used to reduce the influence of noise and high-frequency patterns. Cheng et al. [13] define a regional contrast based method by building 3D color histograms over segmented regions, which evaluates not only global contrast differences but also spatial coherence. The model in [14] measures global contrast based saliency via spatially weighted feature dissimilarities. However, global contrast based methods may highlight background regions as salient because they do not account for any spatial relationships within the image.

The central-bias prior rests on the well-known observation that when humans take photos they tend to frame the objects of interest near the center of the image. Judd et al. [15] accordingly present a saliency model that computes the distance between each pixel and the image center, and it predicts the salient object better than many earlier saliency models. Later, both Goferman et al. [9] and Jiang et al. [10] enhance their intermediate saliency maps with a weight implemented as a 2D Gaussian falloff positioned at the image center. This prior usually improves saliency performance on most natural images. However, the central-bias prior does not hold when the photographer captures a large scene or the objects of interest are not located near the image center.

Besides the above two commonly used priors, several recent models also utilize the background prior, i.e., the assumption that the image boundary should be treated as background. Wei et al. [16] propose a novel saliency measure called geodesic saliency, which uses two priors about common backgrounds in natural images, namely the boundary and connectivity priors, to help remove background clutter and in turn obtain better salient object detection. Later, Yang et al. [17] combine this background prior with graph-based manifold ranking to detect the salient object and obtain promising results. However, they assume that all four image sides are background, which does not hold when the image is cropped. Recently, Zhu et al. [18] propose a novel and reliable background measure, called boundary connectivity, together with a principled optimization framework to integrate multiple low-level cues. They do not treat all image boundaries as background; however, their method is rather complicated. Unlike these methods, our model not only adaptively treats the image boundaries as background or non-background, but is also very easy to implement. Figure 1 gives an overview of our framework.

Fig. 1.

Overview of our model. Given an input image, we first over-segment it into superpixels. Then we adaptively select the background superpixels and compute saliency via multi-feature enhanced graph-based manifold ranking. Finally, we segment the salient object.

The contributions of this paper are three-fold:

  • We propose a simple but effective method that adaptively treats the four image boundaries as background: boundary superpixels are classified as background or non-background rather than being uniformly assumed to be background.

  • Beyond color information, we utilize variance and histogram features in multiple color spaces (Lab and RGB) to enhance detection performance.

  • We present a simple but effective salient object segmentation method based on the computed saliency map.

The rest of this paper is organized as follows. In Sect. 2, we first give a detailed description of graph-based manifold ranking and then present our proposed model. In Sect. 3, we provide qualitative and quantitative comparisons with previous methods. In Sect. 4, we present an application of salient object detection: salient object segmentation. Finally, we conclude with a short summary and discussion in Sect. 5.

2 Robust Salient Object Detection

In 2004, Zhou et al. [19, 20] proposed a graph-based manifold ranking model, a semi-supervised method that exploits the intrinsic manifold structure of data. Building on it, we present a robust salient object detection method via adaptive background selection and multi-feature enhancement. We first give a brief introduction to graph-based manifold ranking and then present the details of our proposed method.

2.1 Graph-Based Manifold Ranking

Given a set of n data points \(X=\{x_1,x_2,...,x_q,...,x_n\}\) with each \(x_i\in R^{m}\), the first q points \(\{x_1,x_2,...,x_q\}\) are labelled as queries and the remaining points \(\{x_{q+1},...,x_n\}\) are unlabelled. The ranking algorithm aims to rank the remaining points according to their relevance to the labelled queries. Let \(f:X\rightarrow R\) denote a ranking function which assigns each data point \(x_i\) a ranking value \(f_i\); we can view f as a vector \(f=[f_1,f_2,...,f_n]^T\). We also define an indicator vector \(y=[y_1,y_2,...,y_n]^T\), in which \(y_i=1\) if \(x_i\) is a query and \(y_i=0\) otherwise.

Next, we define a graph \(G=(V,E)\) on these data points, where the nodes V are the data points in X and the edges E are weighted by an affinity matrix \(W=[w_{ij}]_{n\times n}\). Given G, the degree matrix is \(D=diag\{d_{11},d_{22},...,d_{nn}\}\), where \(d_{ii}=\sum _{j=1}^{n}w_{ij}\).

According to Zhou et al. [20], the cost function associated with the ranking function f is defined as

$$\begin{aligned} Q(f)=\frac{1}{2}\left( \sum _{i,j=1}^{n}w_{ij}\left\| \frac{f_i}{\sqrt{d_{ii}}}-\frac{f_j}{\sqrt{d_{jj}}}\right\| ^2+\mu \sum _{i=1}^{n}\left\| f_i-y_i\right\| ^2\right) \end{aligned}$$
(1)

where the regularization parameter \(\mu >0\) controls the balance between the first term (the smoothness constraint) and the second term (the fitting constraint, covering labelled as well as unlabelled data). The optimal ranking \(f^*\) is then computed by solving the following optimization problem:

$$\begin{aligned} f^*=\arg \underset{f}{\min }\,Q(f) \end{aligned}$$
(2)

The trade-off between these two competing constraints is captured by the positive parameter \(\mu \); in practice, the equivalent parameter \(\alpha \) defined below is usually set to 0.99, putting more emphasis on the smoothness constraint [17, 20]. The solution of Eq. (2) can be written as

$$\begin{aligned} f^*=(I-\alpha S)^{-1}y \end{aligned}$$
(3)

where I is the identity matrix, \(S=D^{-\frac{1}{2}}WD^{-\frac{1}{2}}\) is the symmetrically normalized affinity matrix, and \(\alpha =1/(1+\mu )\). The detailed derivation can be found in [20].
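To make Eq. (3) concrete, the following is a minimal NumPy sketch assuming a dense, symmetric, non-negative affinity matrix W and a 0/1 indicator vector y; the function name and inputs are ours, not from the original.

```python
import numpy as np

def manifold_ranking(W, y, alpha=0.99):
    """Solve f* = (I - alpha * S)^{-1} y with S = D^{-1/2} W D^{-1/2} (Eq. 3)."""
    d = W.sum(axis=1)                          # degrees d_ii
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-12)      # guard against empty rows
    S = W * np.outer(d_inv_sqrt, d_inv_sqrt)   # symmetric normalization
    n = W.shape[0]
    # Solving the linear system is cheaper and more stable than inverting.
    return np.linalg.solve(np.eye(n) - alpha * S, y)
```

Since the nodes are superpixels, n is typically only a few hundred, so this single n-by-n solve is fast.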

This ranking algorithm indicates that the salient object detection model should consist of two parts: graph construction and ranking with queries. In Sect. 2.2, we present our multi-features enhanced graph construction and then in Sect. 2.3, we give the details of our adaptive background selection and saliency ranking.

2.2 Multi-Features Enhanced Graph Construction

To better exploit the intrinsic relationships between data points, two aspects must be carefully handled in graph construction: the graph structure and the edge weights. We over-segment the input image into small homogeneous regions using the SLIC algorithm [21] and regard each superpixel as a node in the graph G.

For the graph structure, we take into account the local smoothness cue (i.e., neighboring superpixels are more likely to belong to the same object) and follow two rules. First, each node is connected not only to its directly adjacent nodes, but also to the nodes sharing common boundaries with its neighboring nodes. Second, the nodes on the four image sides are all connected to each other. Figure 2 gives an illustration of the graph construction.
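The sketch below illustrates the two rules on SLIC superpixels from scikit-image; n_segments=200 and compactness=10 are illustrative settings, not values reported here.

```python
import numpy as np
from skimage.segmentation import slic

def build_graph_structure(image, n_segments=200):
    """Two-ring superpixel adjacency plus mutually connected boundary nodes."""
    labels = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    n = labels.max() + 1
    A = np.zeros((n, n), dtype=bool)

    # Rule 1a: connect superpixels that share a pixel border (4-connectivity).
    h = labels[:, :-1] != labels[:, 1:]
    v = labels[:-1, :] != labels[1:, :]
    A[labels[:, :-1][h], labels[:, 1:][h]] = True
    A[labels[:-1, :][v], labels[1:, :][v]] = True
    A |= A.T

    # Rule 1b: also connect each node to its neighbours' neighbours.
    A = ((A.astype(int) @ A.astype(int)) > 0) | A

    # Rule 2: connect all nodes lying on the four image sides to each other.
    border = np.unique(np.concatenate([labels[0], labels[-1],
                                       labels[:, 0], labels[:, -1]]))
    A[np.ix_(border, border)] = True
    np.fill_diagonal(A, False)
    return labels, A, border
```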

Fig. 2.

An illustration of graph construction.

After modelling the graph structure, the core problem is how to compute the edge weight between any pair of nodes given the input data. Color information has been shown to be effective in saliency detection [7, 12], so most models adopt only color information to generate the edge weights. However, other features can also be exploited to improve performance. We employ three features: color, variance, and histogram. The edge weight is defined as

$$\begin{aligned} w_{ij}=e^{-(\frac{c_c(r_i,r_j)}{\sigma _c^2} + \frac{c_v(r_i,r_j)}{\sigma _v^2} + \frac{c_h(r_i,r_j)}{\sigma _h^2})} \end{aligned}$$
(4)

where \(r_i\) and \(r_j\) denote superpixel regions i and j respectively. \(c_c(r_i,r_j)\), \(c_v(r_i,r_j)\) and \(c_h(r_i,r_j)\) represent the color, variance and histogram feature differences between regions \(r_i\) and \(r_j\) respectively. \(\sigma _c\), \(\sigma _v\) and \(\sigma _h\) are parameters controlling the strength of the corresponding terms; we set them to 5, 2 and 2 in all experiments. The color feature difference is defined as

$$\begin{aligned} c_c(r_i,r_j)=\left\| c_c(r_i)-c_c(r_j)\right\| \end{aligned}$$
(5)

where \(c_c(r_i)\) and \(c_c(r_j)\) denote the mean colors of regions \(r_i\) and \(r_j\) respectively in the Lab color space.

Generally speaking, the color distributions of different image regions are independent of each other and exhibit different variances, so we also take advantage of variance information. We define the variance feature difference as

$$\begin{aligned} c_v(r_i,r_j)=\frac{\left| n(r_i)\,\sigma _v(r_i)-n(r_j)\,\sigma _v(r_j)\right| }{n(r_i)+n(r_j)+\epsilon } \end{aligned}$$
(6)

where \(\sigma _v(r_i)\) and \(\sigma _v(r_j)\) are the corresponding regional variances, \(n(r_i)\) and \(n(r_j)\) are the numbers of pixels in regions \(r_i\) and \(r_j\) respectively, and \(\epsilon \) is a small constant that avoids arithmetic errors. Note that the region size is thus taken into account. This feature is computed in the RGB color space.

For the histogram feature, we utilize the \(\chi ^2\) distance instead of the simple Euclidean distance to measure the disparity between two histograms, as suggested in [22]. The histogram feature difference is defined as

$$\begin{aligned} c_h(r_i,r_j)=\frac{1}{2}\sum _{k=1}^d\frac{(h_k(r_i)-h_k(r_j))^2}{h_k(r_i)+h_k(r_j)} \end{aligned}$$
(7)

where \(h_k(r_i)\) denotes the k-th bin of the color histogram of region \(r_i\) and d denotes the number of bins. We take \(d=256\) in this work for simplicity; d can be made much smaller to improve computational efficiency. This feature is also computed in the RGB color space.

All three feature differences are normalized to [0, 1]. Keeping all other parameters unchanged, we add the features one by one when computing the saliency map and give a comparative example in Fig. 3. The two additional features both improve saliency detection performance.
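The following sketch assembles Eqs. (4)-(7). It assumes float RGB (values in [0, 1]) and Lab images and, for brevity, omits the global normalization of each feature difference to [0, 1] mentioned above; the names, the histogram bin count, and the exact form of Eq. (6) are our assumptions.

```python
import numpy as np

def region_features(lab, rgb, labels, n_regions, n_bins=64):
    """Per-superpixel mean Lab colour, RGB variance, RGB histogram and size."""
    mean_lab = np.zeros((n_regions, 3))
    var_rgb = np.zeros(n_regions)
    hist = np.zeros((n_regions, n_bins))
    size = np.bincount(labels.ravel(), minlength=n_regions).astype(float)
    for r in range(n_regions):
        mask = labels == r
        mean_lab[r] = lab[mask].mean(axis=0)            # used in Eq. (5)
        var_rgb[r] = rgb[mask].var()                    # pooled over channels
        h, _ = np.histogram(rgb[mask], bins=n_bins, range=(0.0, 1.0))
        hist[r] = h / max(h.sum(), 1)
    return mean_lab, var_rgb, hist, size

def edge_weight(i, j, mean_lab, var_rgb, hist, size,
                sigma_c=5.0, sigma_v=2.0, sigma_h=2.0, eps=1e-6):
    """Eq. (4) from the three feature differences."""
    c_c = np.linalg.norm(mean_lab[i] - mean_lab[j])                  # Eq. (5)
    c_v = abs(size[i] * var_rgb[i] - size[j] * var_rgb[j]) \
          / (size[i] + size[j] + eps)                                # Eq. (6), assumed form
    c_h = 0.5 * np.sum((hist[i] - hist[j]) ** 2
                       / (hist[i] + hist[j] + eps))                  # Eq. (7)
    return np.exp(-(c_c / sigma_c ** 2 + c_v / sigma_v ** 2
                    + c_h / sigma_h ** 2))
```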

Fig. 3.

An illustration of the effectiveness of multiple features. (a) Input image, (b) Ground-truth, (c) Result of GMR [17], (d) Result with color and histogram features, (e) Result with color and variance features, (f) Our result with all three features.

Fig. 4.

Visual comparison. (a) Input image, (b) Ground-truth, (c) Saliency map of GMR [17], (d) Bi-segmentation of (c), (e) Our saliency map, (f) Bi-segmentation of (e). Note that our saliency map is more robust than that of GMR [17].

2.3 Saliency Ranking via Adaptive Background Selection

Most background prior based models treat all four image sides as background, assuming that photographers do not crop salient objects along the view frame. However, this is not always true. Figure 4 shows such a case and visually compares our model with [17]. When the salient object touches the image border, the detection result of [17] degrades noticeably, whereas our proposed method handles this case. See Algorithm 1 and Algorithm 2 for our adaptive background selection and saliency ranking respectively.

Algorithm 1. Adaptive background selection.
Algorithm 2. Saliency ranking with adaptively selected background queries.
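Algorithms 1 and 2 appear as figures in the original, so the Python sketch below is only our hypothetical reading of the surrounding text: boundary superpixels that deviate strongly from the typical boundary appearance are excluded from the background query set, and the remaining queries drive the manifold ranking of Sect. 2.1. All names and the threshold tau are assumptions, not values from this paper.

```python
import numpy as np

def adaptive_background_queries(mean_lab, border, tau=0.5):
    """Hypothetical sketch of Algorithm 1: keep a boundary superpixel as a
    background query only if its colour is close to the typical boundary
    appearance; tau is an illustrative threshold."""
    ref = np.median(mean_lab[border], axis=0)       # typical boundary colour
    dist = np.linalg.norm(mean_lab[border] - ref, axis=1)
    dist /= max(dist.max(), 1e-12)                  # scale to [0, 1]
    return border[dist < tau]                       # likely background nodes

def rank_saliency(W, bg_queries):
    """Sketch of Algorithm 2: rank w.r.t. background queries, then invert."""
    y = np.zeros(W.shape[0])
    y[bg_queries] = 1.0
    f = manifold_ranking(W, y)                      # from the Sect. 2.1 sketch
    f = (f - f.min()) / (f.max() - f.min() + 1e-12)
    return 1.0 - f                                  # low background-ness = salient
```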

3 Experiments

In this section, we extensively evaluate our model, both quantitatively and qualitatively, on three widely used datasets: SOD [23], ECSSD [24] and ASD [12].

We compare our approach with twenty state-of-the-art salient object detection models: CA [9], CB [10], CHM [25], FES [26], FT [12], GMR [17], GS [16], HDCT [27], HS [24], MC [28], MSS [29], PCA [30], SF [31], SVO [32], SWD [14], BM [33], LRMR [34], GB [35], SR [36] and IT [11].

Fig. 5.

(a), (b): precision-recall curves of different methods. (c), (d): precision, recall and F-measure using an adaptive threshold. (e), (f): MAE. All results are computed on the SOD dataset. The proposed method performs well in all these metrics.

Fig. 6.

(a), (b): precision-recall curves of different methods. (c), (d): precision, recall and F-measure using an adaptive threshold. (e), (f): MAE. All results are computed on the ECSSD dataset. The proposed method performs well for all these metrics.

Fig. 7.

(a), (b): precision-recall curves of different methods. (c), (d): precision, recall and F-measure using an adaptive threshold. (e), (f): MAE. All results are computed on the ASD dataset. The proposed method performs very well.

Fig. 8.

Visual comparison of proposed model and twenty other methods. From top to bottom and left to right are input, ground truth and results of BM [33], CA [9], CB [10], CHM [25], FES [26], FT [12], GB [35], GMR [17], GS [16], HDCT [27], HS [24], IT [11], LRMR [34], MC [28], MSS [29], PCA [30], SF [31], SR [36], SVO [32], SWD [14] and ours.

3.1 Quantitative Evaluation

For quantitative evaluation, we measure performance with three commonly used metrics: the precision-recall (PR) curve, the F-measure, and the mean absolute error (MAE).

The PR curve is based on the overlap between the pixel-wise annotation and the saliency prediction, while the F-measure jointly considers precision and recall. We also include the MAE because PR curves are limited in that they only consider whether the object saliency is higher than the background saliency. The MAE is the average per-pixel difference between the pixel-wise annotation and the computed saliency map; it directly measures how close a saliency map is to the ground truth and is therefore complementary to PR curves.
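For reference, these two measures take the standard forms below, with \(\beta ^2=0.3\) as is customary in this literature (the exact value is not stated above and is our assumption):

$$\begin{aligned} F_{\beta }=\frac{(1+\beta ^2)\cdot Precision\cdot Recall}{\beta ^2\cdot Precision+Recall},\qquad MAE=\frac{1}{mn}\sum _{i=1}^{m}\sum _{j=1}^{n}\left| S(i,j)-GT(i,j)\right| \end{aligned}$$

where S is the computed saliency map and GT is the ground-truth mask, both with values in [0, 1].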

Figures 5, 6 and 7 show the PR curves, F-measures and MAEs of the compared models and ours on the three datasets. The PR curve of the proposed method outperforms those of all other methods on the SOD dataset, and on the ECSSD and ASD datasets our model is among the best-performing models. For the F-measure, our model achieves the best performance on all datasets. For the MAE, our model has the smallest value on all three datasets, indicating that our saliency maps are closest to the ground-truth masks.

3.2 Qualitative Evaluation

For qualitative evaluation, the results of applying the various algorithms to representative images from SOD, ECSSD and ASD are shown in Fig. 8. The proposed algorithm uniformly highlights the salient regions and preserves object boundaries better than the other methods. It is also worth pointing out that our algorithm performs well when the background is cluttered.

4 Salient Object Segmentation

Cheng et al. [37] propose an iterative version of GrabCut, named SaliencyCut, to cut out the salient object. However, their method relies on a predefined fixed threshold and is somewhat time-consuming. We instead use an adaptive threshold to segment the salient object. We first define the average saliency value as

$$\begin{aligned} sal_{mean}=\frac{1}{mn}\sum _{i=1}^m\sum _{j=1}^nS(i,j) \end{aligned}$$
(8)

where m and n denote image rows and columns respectively. Then the salient object mask is denoted as

$$\begin{aligned} Sal_{mask}(i,j)=\left\{ \begin{aligned} 1&,\quad S(i,j)\ge sal_{mean}&\\ 0&,\quad S(i,j)< sal_{mean}&\\ \end{aligned} \right. \end{aligned}$$
(9)

The final segmented salient object is defined as

$$\begin{aligned} S_{obj}=I.*Sal_{mask} \end{aligned}$$
(10)

where \(.*\) denotes pixel-wise multiplication. See Fig. 9 for some segmentation examples.
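Eqs. (8)-(10) amount to only a few lines of code; below is a minimal sketch assuming a float image and a saliency map of the same spatial size (the function name is ours):

```python
import numpy as np

def segment_salient_object(image, sal):
    """Sketch of Eqs. (8)-(10): threshold at the mean saliency and mask."""
    sal_mask = (sal >= sal.mean()).astype(image.dtype)   # Eqs. (8)-(9)
    return image * sal_mask[..., None]                   # Eq. (10), per channel
```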

Fig. 9.
figure 9figure 9

Examples of salient object segmentation. (a) input images, (b) saliency maps, (c) segmented salient objects.

5 Conclusion

In this paper, we address the salient object detection problem using a semi-supervised method. We tackle the failure case in which the salient object touches the image border via adaptive background selection, and we exploit additional features to better capture the intrinsic relationships between image regions. Evaluations on three widely used datasets demonstrate promising results in comparison with twenty state-of-the-art methods. Finally, we present a simple but effective salient object segmentation method.