1 Introduction

There are two important questions to address regarding the computation of salient objects: what is a salient object, and which object is most salient in a real scene [4, 37]? The first question corresponds to bottom-up visual saliency models, while the second corresponds to top-down salient object models. Both bottom-up and top-down models are instrumental in human cognitive processes [2, 22, 27, 33, 38, 43, 44]. The bottom-up visual model is a task-independent, unconscious visual process driven by low-level features, while the top-down visual model simulates the prioritizing mechanisms that determine the salience of scene regions based on high-level features.

For the task of salient object detection, the top-down model directs human attention to specific salient objects, which the bottom-up model cannot do. Salient object detection with top-down models has therefore attracted considerable interest in visual attention research [5]. Salient object detection has several applications, including image cropping [26], video summarization [3, 25], object-aware retargeting [7, 32], and object segmentation [8, 30]. However, most recent top-down methods only utilize object information and do not take advantage of background information. Since objects and backgrounds have different properties, the contrast in appearance between object and background regions is high; saliency detection methods built on this observation are contrast prior approaches [1, 13, 23]. In addition to contrast prior methods, Wei et al. [36] used a boundary prior to compute salient regions in images. However, both the contrast prior and the boundary prior methods simply regard all image boundaries as background, and therefore lose some object information in the saliency computation. To address this problem, Zhu et al. [46] used boundary connectivity to detect background regions and treated saliency computation as a global optimization problem.

In addition to background measure information, several recent top-down models exploit conditional random fields (CRFs) and a class-specific dictionary [20, 42]. These approaches originated from sparse coding methods based on local features or superpixels, whose performance depends greatly on the discriminative codebook [31]. Yang et al. [42] and Kocak et al. [20] combined CRFs with a discriminative codebook to learn a top-down saliency model. The central idea of these approaches is to use the sparse codes as CRF latent variables while simultaneously using the CRF to learn the discriminative codebook. However, in real scenes, cluttered background information still degrades object saliency.

Inspired by Zhu et al. [46], Yang et al. [42] and Kocak et al. [20], we propose a framework for computing object saliency that combines saliency optimization based on a robust background measure with top-down visual saliency learned from CRFs and a discriminative codebook. Our approach uses boundary connectivity to acquire background measure information, from which salient regions can be computed. Meanwhile, we learn a top-down object salience model using CRFs and a class-specific codebook. In contrast to the methods of Yang et al. [42] and Kocak et al. [20], we use locality instead of sparsity to generate our top-down saliency model. More specifically, we treat locality-constrained linear codes as CRF latent variables and train the discriminative codebook modulated by the CRF. Our approach not only reduces the influence of cluttered backgrounds but also enhances object saliency in real scenes. We evaluated our approach on the Graz-02 [28] and PASCAL VOC 2007 [10] datasets by measuring the quality of the saliency maps with mean absolute error (MAE) and standard precision-recall (PR) curves. Experimental results demonstrate that our approach performs better than current state-of-the-art saliency algorithms [20, 29, 36, 39, 40, 42, 46]. For instance, our method acquires salient object regions in cluttered real scenes containing two or three different object categories (Fig. 1). We generated the object saliency maps by learning three different top-down object models: person, car, and bike. Using these object saliency maps, our method can distinguish the class-specific object in complex scenes.

Fig. 1

Object saliency maps in complex scenes. a Original image; b Object saliency region map acquired using the person top-down model; c Object saliency region map acquired using the car top-down model

In the next section, we review related work on object saliency computation. Section 3 describes our work in detail, while Section 4 shows experimental results on several general datasets. Finally, we draw conclusions in Section 5.

2 Related work

The earliest visual saliency model was a bottom-up model proposed by Itti et al. [14]. It is an unconscious visual processing method rooted in neuroscience and computer vision, and is computed by center-surround mechanisms based on low-level features. Recently, object saliency detection has been viewed as a binary segmentation task [1, 31], where 1 denotes the foreground region and 0 denotes the background region. In this work, we are interested in object prior methods for salient object detection based on robust background prior information, and in coding methods based on discriminative codebooks.

2.1 Background measure methods

Object prior methods mainly comprise the center prior approach and the boundary prior approach. The center prior approach is typically a Gaussian fall-off map combined with a center-region contrast method [6, 15, 17, 23, 39]. In contrast, the boundary prior approach regards image boundary regions as background, measured by saliency optimization over boundary patches [36, 40]; background regions tend to be connected to image boundaries, while foreground regions do not [36]. The method proposed by Yang et al. [40] used boundary patches as background queries for computing object saliency. Lastly, Zhu et al. [46] proposed boundary connectivity and saliency optimization to estimate the contrast between background and foreground.

2.2 Top-down models

In computer vision, task-oriented top-down visual saliency models involve saliency computation and feature learning [9, 12, 18, 19, 34, 42]. Gao et al. [12] used discriminant features to estimate top-down saliency based on a pre-defined filter bank; in the training images, these discriminant local features were extracted from each image to denote target object presence or absence. Kanan et al. [18] used independent component analysis to construct a top-down model by learning a support vector machine over local features. In another study [18], top-down saliency maps were computed from contextual priors of object location and appearance. In another direction, CRFs have been introduced for learning top-down models from local features, since a CRF can incorporate various features for object recognition and segmentation. For example, Yang et al. [42] combined CRF parameters with sparse coding to generate object salience models: the discriminative codebook was trained by the CRF, and the sparse codes were viewed as CRF latent variables. In our work, we use locality-constrained linear codes as CRF latent variables and, similar to the approach of Yang et al. [42], learn a discriminative codebook modulated by the CRF. Our approach not only improves accuracy but also significantly reduces computational complexity.

2.3 Coding methods

Coding methods have recently been widely used for image classification based on a codebook [11, 21, 35, 41]. In vector quantization (VQ) [11], each code has only one nonzero element, while in sparse coding (SC) [21, 41] multiple nonzero coefficients are obtained by imposing sparsity over the codebook. Observing that the nonzero coefficients of SC tend to be local, Yu et al. [45] proposed local coordinate coding (LCC) to improve SC and showed that, theoretically, locality is more essential than sparsity. However, both LCC and SC require solving an optimization problem, which is computationally expensive. To address this, Wang et al. [35] presented locality-constrained linear coding (LLC), a fast approximation of LCC. In our work, we combine a CRF model based on LLC codes with a robust background measure based on saliency optimization to compute object salience.

3 Our algorithm

To acquire more effective object saliency maps, we combined saliency computation based on a robust background measure with a CRF model and discriminative codebook learning. Specifically, our approach employs both background and object information to generate salient object regions. An overview of our system framework is summarized in Fig. 2. In the training phase, we used a set of training images and corresponding ground truth annotations to learn a class-specific object model via a discriminative codebook and CRF parameters. In the testing phase, we first extracted superpixels from the test image to compute background information using the robust background measure; second, we generated the object saliency map using the class-specific object model; and finally, we acquired the salient object region by combining the robust background information and the class-specific object model.

Fig. 2

Overview of our system framework. Given a set of training images and corresponding ground truth annotations, we learned the class-specific object model using a discriminative codebook and CRF parameters. For a test image, we first extracted superpixels to compute background information using the robust background measure; then, we computed the object saliency map using the class-specific object model; and finally, we generated the salient object region by combining the robust background information with the class-specific object model

3.1 Saliency computation by robust background measure

Object and background regions have very different properties in natural images; in particular, background regions are much more connected to image boundaries than object regions. Therefore, saliency can be estimated from boundary connectivity and background-weighted contrast. Boundary connectivity is defined as follows:

$$ BndCon(R) = \frac{|\{p \mid p\in R,\, p\in Bnd\}|}{\sqrt{|\{p \mid p\in R\}|}} $$
(1)

where \(Bnd\) denotes the set of image boundary patches and \(p\) denotes an image patch obtained from superpixels. However, (1) is difficult to compute directly because obtaining the regions \(R\) requires an explicit segmentation, which is itself problematic, and undesirable discontinuous boundary regions may be introduced. In practice, Zhu et al. [46] constructed an undirected weighted graph over adjacent patches \((p, q)\), with the edge weight \(d_{app}(p,q)\) defined as the Euclidean distance between their average colors in the CIE-Lab color space. The geodesic distance \(d_{geo}(p,q)\) between any two patches is then defined as follows:

$$ d_{geo}(p,q) = \min\limits_{p_{1}=p,\,p_{2},\ldots,\,p_{n}=q}\sum\limits_{i=1}^{n-1}d_{app}(p_{i},p_{i+1}). $$
(2)

The spanning area of each patch p is defined as follows:

$$ Area(p) = \sum\limits_{i=1}^{N}\exp\left( -\frac{d_{geo}^{2}(p,p_{i})}{2\sigma_{clr}^{2}}\right)=\sum\limits_{i=1}^{N}S(p,p_{i}), $$
(3)

where \(N\) denotes the number of patches and \(d_{geo}(p,p)=0\). In practice, performance is stable when \(\sigma_{clr}\) is within \([5,15]\). The length along the boundary is defined as follows:

$$ Len_{bnd}(p)=\sum\limits_{i=1}^{N}S(p,p_{i})\cdot\delta(p_{i}\in Bnd), $$
(4)

where \(\delta(\cdot)=1\) for patches on the image boundary and \(\delta(\cdot)=0\) otherwise. Finally, the boundary connectivity is computed by the following formula:

$$ BndCon(p) = \frac{Len_{bnd}(p)}{\sqrt{Area(p)}}. $$
(5)

Using the boundary connectivity above, the background-weighted contrast is defined as follows:

$$ wCtr(p) = \sum\limits_{i=1}^{N}d_{app}(p,p_{i})w_{spa}(p,p_{i})w_{i}^{bg}, $$
(6)

where \(d_{spa}(p,p_{i})\) denotes the distance between the centers of patches \(p\) and \(p_{i}\), \(w_{spa}(p,p_{i})=\exp\left(-\frac{d_{spa}^{2}(p,p_{i})}{2\sigma_{spa}^{2}}\right)\) with \(\sigma_{spa}=0.25\), and \(w_{i}^{bg}=1-\exp\left(-\frac{BndCon^{2}(p_{i})}{2\sigma_{bndCon}^{2}}\right)\) is the background weighting term with \(\sigma_{bndCon}=1\). Finally, we can use (6) to compute salient regions: object regions receive high contrast from background regions, while background regions receive low contrast from object regions.
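To make the computation concrete, the following is a minimal Python sketch of (2)-(6) over a superpixel graph. It assumes that the superpixels (e.g., from SLIC), their mean Lab colors, normalized patch centers, adjacency edges, and boundary flags have already been extracted, and it uses \(\sigma_{clr}=10\) (the text above only states that any value in [5, 15] is stable). It illustrates the formulas, not the authors' exact implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def background_weighted_contrast(lab_mean, centers, edges, is_bnd,
                                 sigma_clr=10.0, sigma_spa=0.25,
                                 sigma_bndcon=1.0):
    """Sketch of (2)-(6): boundary connectivity and background-weighted
    contrast on a superpixel graph.

    lab_mean : (N, 3) mean Lab color of each patch (superpixel)
    centers  : (N, 2) patch centers, normalized to [0, 1]
    edges    : iterable of (i, j) index pairs of spatially adjacent patches
    is_bnd   : (N,) bool, True if the patch touches the image boundary
    """
    lab_mean = np.asarray(lab_mean, dtype=float)
    centers = np.asarray(centers, dtype=float)
    is_bnd = np.asarray(is_bnd, dtype=float)
    n = lab_mean.shape[0]

    # Edge weights d_app: Euclidean distance between mean Lab colors
    rows, cols, w = [], [], []
    for i, j in edges:
        d = np.linalg.norm(lab_mean[i] - lab_mean[j])
        rows += [i, j]; cols += [j, i]; w += [d, d]
    graph = csr_matrix((w, (rows, cols)), shape=(n, n))

    # (2) geodesic distance: shortest path along the adjacency graph
    d_geo = shortest_path(graph, directed=False)

    # (3) spanning area and (4) length along the boundary
    s = np.exp(-d_geo ** 2 / (2 * sigma_clr ** 2))
    area = s.sum(axis=1)
    len_bnd = (s * is_bnd[None, :]).sum(axis=1)

    # (5) boundary connectivity and the background weight w_bg
    bnd_con = len_bnd / np.sqrt(area)
    w_bg = 1.0 - np.exp(-bnd_con ** 2 / (2 * sigma_bndcon ** 2))

    # (6) background-weighted contrast wCtr(p)
    d_app = np.linalg.norm(lab_mean[:, None] - lab_mean[None, :], axis=2)
    d_spa = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    w_spa = np.exp(-d_spa ** 2 / (2 * sigma_spa ** 2))
    return (d_app * w_spa * w_bg[None, :]).sum(axis=1)
```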

3.2 CRF model based on LLC codes

We extracted dense-SIFT [24] features from local image patches and assigned each patch a binary label to indicate the presence or absence of the target object. In practice, \(X = \{x_{1}, x_{2}, \ldots, x_{m}\}\) denotes the set of local patches from an image, \(Y = \{y_{1}, y_{2}, \ldots, y_{m}\}\) denotes the corresponding labels, and \(D = \{d_{1}, d_{2}, \ldots, d_{K}\}\) is a codebook learned from the training image patches. We solved the following optimization problem:

$$ w =\arg\min\limits_{w}\Vert{x-Dw}{\Vert^{2}_{2}}+\lambda\sum\limits_{j=1}^{K}\left( w(j)\cdot\exp\left( \frac{\Vert{x-d_{j}} \Vert_{2}}{\sigma}\right)\right)^{2},\\ s.t.\sum\limits_{j=1}^{K}{w(j)}=1, $$
(7)

where \(w(j)\) denotes the \(j\)-th element of \(w_{i}\), \(i = 1,\ldots,m\), and \(w_{i}\) denotes the code of the \(i\)-th local image patch, which is viewed as a vector of latent variables of the CRF model. The reconstruction is \(x_{i} \approx D w_{i}\), and \(\lambda\) is the penalty parameter controlling the regularization term. We use \(W(X,D)=[w(x_{1},D), w(x_{2},D),\ldots,w(x_{m},D)]\) to denote the latent variables of all nodes of the CRF model. Solving (7) exactly requires an iterative optimization procedure, so we used LLC to speed up the encoding process: the \(K\) nearest neighbors (\(Knn\)) of \(x_{i}\) in the codebook serve as a local basis, which replaces \(\lambda\) in controlling the locality constraint [35].
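The following sketch shows the approximated LLC encoding of Wang et al. [35] that produces the latent variables W(X, D): for each descriptor, the Knn nearest codewords form a local basis and a small constrained least-squares problem is solved analytically. The regularization constant beta is our assumption and is not specified in the text above.

```python
import numpy as np

def llc_encode(X, D, knn=20, beta=1e-4):
    """Approximated LLC encoding used as CRF latent variables W(X, D).

    X : (m, d) dense-SIFT descriptors of the local patches of one image
    D : (K, d) codebook
    Returns W : (m, K); each row has at most `knn` nonzero entries and
    sums to one (the constraint in Eq. 7).
    """
    m, K = X.shape[0], D.shape[0]
    W = np.zeros((m, K))
    # Squared Euclidean distances between descriptors and codewords
    dist = (X ** 2).sum(1)[:, None] + (D ** 2).sum(1)[None, :] - 2 * X @ D.T
    for i in range(m):
        idx = np.argsort(dist[i])[:knn]           # Knn nearest codewords
        z = D[idx] - X[i]                         # shift to the origin
        C = z @ z.T                               # local covariance
        C += beta * np.trace(C) * np.eye(knn)     # small regularizer (assumed)
        w = np.linalg.solve(C, np.ones(knn))      # analytic solution
        W[i, idx] = w / w.sum()                   # enforce sum-to-one
    return W
```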

Then, we constructed a 4-connected graph \(\Gamma = \langle\nu, \varepsilon\rangle\) over the local image patches based on their spatial adjacency, where \(\nu\) and \(\varepsilon\) denote the nodes and edges of the graph, respectively. We used the labels \(Y\) and latent variables \(W(X, D)\) on the graph \(\Gamma\) to build a CRF model with the following formula:

$$ P((Y|W(X,D)),\alpha) = \frac{1}{Z}e^{-E(W(X,D),Y,\alpha)}, $$
(8)

where \(Z\) denotes the partition function, \(\alpha\) is the weight vector of the CRF model, and \(E(W(X,D),Y,\alpha)\) denotes the CRF energy function. Through this equation, we simultaneously learn the supervised dictionary \(D\) and the CRF parameters \(\alpha\). The marginal probability of node \(i\in\nu\) is:

$$ p(y_{i}\mid w_{i},\alpha) = \sum\limits_{y_{\aleph(i)}}p(y_{i},y_{\aleph(i)}\mid w_{i},\alpha), $$
(9)

where \(\aleph(i)\) denotes the neighbors of node \(i\) on the graph \(\Gamma\). The saliency value of local image patch \(x_{i}\) is defined as follows:

$$ u(w_{i},\alpha) = p(y_{i}=1|w_{i},\alpha). $$
(10)

Thus, the saliency map \(U(W,\alpha)=\{u_{1}, u_{2}, \ldots, u_{m}\}\) can be computed; this probabilistic definition of the top-down saliency map preserves both appearance and local contextual information. Finally, for a test image \(X = \{x_{1}, x_{2}, \ldots, x_{m}\}\), we compute the top-down saliency map \(U\) in two steps (a short sketch follows the list):

  A. Learn the latent variables W(X, D) using (7).

  B. Compute the top-down saliency map U(W, α) using (9) and (10).
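As a hedged illustration, the two test-time steps can be strung together as below. Here crf_marginals stands for any per-node marginal inference routine (e.g., loopy belief propagation on the 4-connected graph) supplied by the caller, since the text above does not fix a particular inference algorithm, and llc_encode refers to the sketch given earlier in this section.

```python
import numpy as np

def topdown_saliency(X, D, alpha, crf_marginals, knn=20):
    """Test-time steps A and B: encode the patches, then read off the
    per-patch marginals p(y_i = 1 | w_i, alpha) as the top-down map U.

    crf_marginals(W, alpha) is a caller-supplied (hypothetical) inference
    routine returning the foreground marginal of every node.
    """
    W = llc_encode(X, D, knn=knn)   # step A, Eq. (7) via approximated LLC
    U = crf_marginals(W, alpha)     # step B, Eqs. (9) and (10)
    return np.asarray(U)
```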

In the training process, the set of training images is denoted as \(\textbf{X} = \{X^{(1)},X^{(2)},\ldots,X^{(n)}\}\) and the corresponding pixel-level ground truth labels as \(\textbf{Y} = \{Y^{(1)},Y^{(2)},\ldots,Y^{(n)}\}\). The optimal codebook \(\hat{D}\) and CRF weight vector \(\hat{\alpha}\) are then calculated by the equation:

$$ (\hat{D},\hat{\alpha})=\arg\max\limits_{D,\alpha}\prod\limits_{j=1}^{n}P(Y^{(j)}\mid W(X^{(j)},D),\alpha). $$
(11)

Maximizing (11) yields optimal parameters \(\alpha\) and \(D\) such that, for all \(Y \neq Y^{(j)}\), \(j = 1,\ldots,n\),

$$ P(Y^{(j)}\mid W(X^{(j)},D),\alpha)\geq P(Y\mid W(X^{(j)},D),\alpha). $$
(12)

According to (8), we obtained the following formula:

$$ E(Y^{(j)},W^{(j)},\alpha)\leq E(Y,W^{(j)},\alpha), $$
(13)

Then, using the cutting plane algorithm [16], the most violated labeling is obtained by the following equation:

$$ \hat{Y^{(j)}}=\arg\min\limits_{Y}E(Y,W^{(j)},\alpha), $$
(14)

Therefore, the following objective function is minimized to learn the optimal weight vector α and the codebook D:

$$ \min\limits_{\alpha,D}\frac{\gamma}{2}\|\alpha\|^{2}+\sum\limits_{j=1}^{n}\ell^{j}(\alpha,D), $$
(15)

where \(\ell^{j}(\alpha,D)=E(\hat{Y}^{(j)},W^{(j)},\alpha)-E(Y^{(j)},W^{(j)},\alpha)\) is the loss function and \(\gamma\) controls the regularization of the weight vector \(\alpha\). We used an iterative procedure that alternates between codebook and weight parameter updates [20, 42]. Once the CRF weight vector \(\hat{\alpha}\) and codebook \(\hat{D}\) are obtained, the saliency map of a test image can be calculated using (10). The iterative learning procedure is summarized in Algorithm 1.

Algorithm 1 Iterative learning of the codebook D and the CRF weight vector α
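The sketch below outlines the alternation in Algorithm 1 under our reading of the text above. The labeling and gradient routines are hypothetical placeholders supplied by the caller (e.g., graph-cut inference for (14) and subgradients of the loss in (15)), and llc_encode is the sketch from earlier in this section; this is a schematic outline rather than the authors' implementation.

```python
import numpy as np

def learn_model(train_images, train_labels, D0, alpha0,
                most_violated_labeling, grad_alpha, grad_codebook,
                gamma=1.0, lr=0.01, n_iters=10):
    """Schematic sketch of Algorithm 1: alternately update the CRF weight
    vector alpha and the codebook D to reduce the objective (15).

    most_violated_labeling(W, alpha) : solves Eq. (14), e.g. by graph cuts
    grad_alpha, grad_codebook        : (sub)gradients of the loss ell^j
    All three are caller-supplied, hypothetical callables.
    """
    D, alpha = np.array(D0, dtype=float), np.array(alpha0, dtype=float)
    for _ in range(n_iters):
        # 1) fix D, take a (sub)gradient step on alpha for each image
        for X, Y in zip(train_images, train_labels):
            W = llc_encode(X, D)                      # latent variables, Eq. (7)
            Y_hat = most_violated_labeling(W, alpha)  # Eq. (14)
            alpha -= lr * (gamma * alpha + grad_alpha(Y, Y_hat, W, alpha))
        # 2) fix alpha, take a (sub)gradient step on the codebook D
        for X, Y in zip(train_images, train_labels):
            W = llc_encode(X, D)
            Y_hat = most_violated_labeling(W, alpha)
            D -= lr * grad_codebook(Y, Y_hat, X, D, alpha)
    return D, alpha
```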

To improve saliency map quality and reduce the impact of cluttered background information, we combine the robust background measure with the top-down saliency derived from the CRF model based on LLC codes. The final object saliency map is obtained as follows:

$$ crfwopt_{Map} = wCtr(p) + U(w_{i},\alpha), $$
(16)

where \(wCtr(p)\), computed by (6), provides high contrast against background regions and \(p\) denotes an image patch obtained from superpixels; \(U\) denotes the category-specific saliency map computed by (10), with \(w_{i}\) the locality-constrained linear codes used as CRF latent variables and \(\alpha\) the weight vector of the CRF model; crfwopt denotes our proposed method.
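A minimal sketch of (16): assuming both maps have already been rendered at the same pixel resolution, each is min-max normalized to [0, 1] before summing. The normalization step is our assumption; the equation itself is a plain sum of the two terms.

```python
import numpy as np

def fuse_saliency(wctr_map, u_map):
    """Combine the background-measure map wCtr and the top-down CRF map U
    as in (16), after min-max normalizing each map to [0, 1]."""
    def norm(m):
        m = np.asarray(m, dtype=float)
        rng = m.max() - m.min()
        return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)
    return norm(wctr_map) + norm(u_map)
```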

4 Experiments

We evaluated our method on two widely used datasets (Graz-02 and PASCAL VOC 2007). These datasets are more challenging than others because their ground-truth images contain large intra-class object variations, severe occlusions, and cluttered backgrounds. All experiments were carried out on a Dell T7610 workstation with 32 GB of memory.

In our experiments, we used standard precision-recall (PR) curves to evaluate the performance of our algorithm. Each curve is computed by comparing binary masks generated from the saliency map against the ground truth annotations. However, PR curves are limited in practice because they only consider object versus background saliency. Therefore, we also used the mean absolute error (MAE) to measure the per-pixel difference between the saliency map and the binary ground truth. Together, these two measures identify salient objects that are more meaningful for applications such as image cropping or object segmentation.
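For reference, both measures can be computed as follows; the threshold sweep used to binarize the saliency map for the PR curve is the usual protocol and is our assumption here.

```python
import numpy as np

def mae(saliency, gt):
    """Mean absolute per-pixel error between a saliency map in [0, 1]
    and a binary ground-truth mask."""
    return np.mean(np.abs(saliency.astype(float) - gt.astype(float)))

def pr_curve(saliency, gt, n_thresholds=256):
    """Precision and recall obtained by binarizing the saliency map at a
    sweep of thresholds and comparing against the ground-truth mask."""
    gt = gt.astype(bool)
    precisions, recalls = [], []
    for t in np.linspace(0.0, 1.0, n_thresholds, endpoint=False):
        pred = saliency >= t
        tp = np.logical_and(pred, gt).sum()
        precisions.append(tp / max(pred.sum(), 1))   # guard empty masks
        recalls.append(tp / max(gt.sum(), 1))
    return np.array(precisions), np.array(recalls)
```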

4.1 Graz-02

The Graz-02 dataset is widely used to evaluate salient object detection methods. It contains three object categories (person, car, bicycle) and an additional background category. Each category contains 300 images with corresponding pixel-level object annotations, for a total of 1,200 images, each of size 640 × 480 pixels. For the experiments, we sampled 64 × 64-pixel image patches with a step size of 16 pixels, giving 999 patches per image, and extracted dense-SIFT descriptors from each patch. We followed the standard experimental setup to compare fairly with other methods. An image patch was labeled a positive sample if object pixels occupied at least one quarter of its total pixels; otherwise, it was a negative sample. We used the 150 odd-numbered images of each object category plus 150 odd-numbered images from the background class as training samples, and the remaining even-numbered images as test samples. We extracted all dense-SIFT descriptors from the training images and used the K-means clustering algorithm [21] to initialize the codebook, and then used locality-constrained linear coding to generate the CRF latent variables. Two important parameters had to be evaluated: the number of codewords K in the codebook and the locality constraint parameter Knn. A larger codebook captures more object appearance variations; to keep the computational cost of learning the codebook moderate, we selected 512 codewords. As defined by Wang et al. [35], Knn controls the locality constraint: the greater Knn, the more codewords are used to represent an image patch. We selected a moderate Knn = 20 for our experiments. A sketch of the patch sampling and labeling step follows.
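The sketch below illustrates the patch grid and the quarter-coverage labeling rule; for a 640 × 480 image, a 64 × 64 patch with a 16-pixel step gives 37 × 27 = 999 patches, matching the count above. Dense-SIFT extraction itself is assumed to be done separately.

```python
import numpy as np

def sample_patch_labels(mask, patch=64, step=16, pos_frac=0.25):
    """Grid-sample patch locations and label each patch positive if
    object pixels cover at least `pos_frac` of its area.

    mask : (H, W) binary ground-truth object mask
    Returns the patch top-left coordinates and their binary labels.
    """
    h, w = mask.shape
    coords, labels = [], []
    for y in range(0, h - patch + 1, step):
        for x in range(0, w - patch + 1, step):
            coords.append((y, x))
            frac = mask[y:y + patch, x:x + patch].mean()
            labels.append(1 if frac >= pos_frac else 0)
    return coords, np.array(labels)
```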

To illustrate our method's performance, we compared it with crfsc [42], wCtropt [46], SF [29], GS [36], and MR [40] using PR curves and MAE on the Graz-02 dataset. Among these methods, SF combines low-level cues in a straightforward way; GS and MR utilize boundary priors; wCtropt uses a robust background measure and global optimization; and crfsc uses a CRF model and a discriminative codebook. To show the influence of the locality constraint Knn, Fig. 9 reports precision rates at equal error rate (EER) for different Knn values with a codebook size of K = 512.

Figures 3a, 5a and 7a report PR curves for the different methods on the Graz-02 dataset. The PR curves show that, at the same precision and recall rates, our method (crfwopt) consistently performed better than the other state-of-the-art methods. Figures 3b, 5b, and 7b report MAE results for the various methods on the Graz-02 dataset. To verify the effectiveness, the object saliency maps of all methods (GS, MR, SF, wCtropt, crfwopt) are shown in Figs. 4, 6 and 8.

Fig. 3

Comparison of PR curves and MAE on the Graz-02 (person) dataset. a PR curves show that our method (crfwopt) achieved better performance than other state-of-the-art methods; b MAE shows that our object saliency map (crfwopt) is close to the ground truth annotations

Fig. 4

Examples of object saliency map results from our method (crfwopt) and other methods on the Graz-02 (person) dataset. a source image; b GS; c MR; d SF; e wCtropt; f crfsc; g crfwopt

Fig. 5

Comparison of PR curves and MAE on the Graz-02 (car) dataset. a Under the same precision recall rate, the PR curves show that our method (crfwopt) achieved better performance than other state-of-the-art methods; b MAE shows that our object saliency map (crfwopt) is close to the ground truth annotations

Fig. 6

Examples of object saliency maps from our method (crfwopt) and other methods on the Graz-02 (car) dataset. a source image; b GS; c MR; d SF; e wCtropt; f crfsc; g crfwopt

Fig. 7

Comparison of PR curves and MAE on the Graz-02 (bicycle) dataset. a Under the same precision recall rate, the PR curves show that our method (crfwopt) achieved better performance than other state-of-the-art methods; b The MAE shows that our object saliency map (crfwopt) is close to the ground truth annotations

Fig. 8

Examples of object saliency maps from our method (crfwopt) and other methods on the Graz-02 (bicycle) dataset. a source image; b GS; c MR; d SF; e wCtropt; f crfsc; g crfwopt

4.2 PASCAL VOC 2007

The PASCAL VOC 2007 dataset is more challenging than the Graz-02 dataset; it consists of 9,963 images from 20 object categories plus a background class. Only 632 images have pixel-level ground truth segmentation annotations, so each object category contains too few images for learning our model. To address this, we used the existing bounding box annotations to label object presence or absence with the GrabCut [30] segmentation method. In the experiments, the numbers of images used from the person, car, and bike categories were 4,192, 1,542, and 505, respectively. Following the Graz-02 experimental setup, we used the odd-numbered images for training, with randomly selected images from the other object categories as the background class, and used the even-numbered images as test samples. To learn the robust background information and train our salient object model, we extracted superpixels and dense-SIFT descriptors from the training images. Specifically, we used the K-means clustering algorithm to initialize the codebook that modulates the CRF model. The number of codewords in the codebook and the locality constraint Knn were set to 512 and 20, respectively (Fig. 9).

Fig. 9

Precision rates at equal error rate (EER) for different Knn parameters on the Graz-02 dataset. Codebook size K = 512

As with the Graz-02 dataset, Figs. 10a, 12a, and 14a report PR curves for the different methods on the PASCAL VOC 2007 dataset. The PR curves show that, at the same precision and recall rates, our method (crfwopt) performed better than the other state-of-the-art methods except in the car category (Fig. 12a). Figures 10b, 12b and 14b report MAE results for the various methods on the PASCAL VOC 2007 dataset. To verify effectiveness, the object saliency maps of all methods (GS, MR, SF, wCtropt, crfwopt) are shown in Figs. 11, 13 and 15. Lastly, to show the influence of the locality constraint parameter Knn, Fig. 16 reports precision rates at equal error rate (EER) for different Knn values with a codebook size of K = 512.

Fig. 10

Comparison of PR curves and MAE on the PASCAL VOC 2007 (people) dataset. a Under the same precision recall rate, PR curves show that our method (crfwopt) achieved better performance than other state-of-the-art methods; b MAE shows that our object saliency map (crfwopt) is close to the ground truth annotations

Fig. 11

Examples of object saliency map results from our method (crfwopt) and other methods on the PASCAL VOC 2007 (people) dataset. a source image; b GS; c MR; d SF; e wCtropt; f crfsc; g crfwopt

Fig. 12

Comparison of PR curves and MAE on the PASCAL VOC 2007 (car) dataset. a Under the same precision recall rate, PR curves show that our method (crfwopt) did not perform better than other state-of-the-art methods; b MAE shows that our object saliency map (crfwopt) is close to the ground truth annotations

Fig. 13

Examples of object saliency maps from our method (crfwopt) and other methods on the PASCAL VOC 2007 (car) dataset. a source image; b GS; c MR; d SF; e wCtropt; f crfsc; g crfwopt

Fig. 14

Comparison of PR curves and MAE on the PASCAL VOC 2007 (bike) dataset. a Under the same precision recall rate, PR curves show that our method (crfwopt) achieved better performance than other state-of-the-art methods; b MAE shows that our object saliency map (crfwopt) is close to the ground truth annotations

Fig. 15

Examples of object saliency maps from our method (crfwopt) and other methods on the PASCAL VOC 2007 (bike) dataset. a source image, b GS, c MR, d SF, e wCtropt, f crfsc, g crfwopt

Fig. 16

Precision rates at equal error rate (EER) for different Knn parameters on the PASCAL VOC 2007 dataset. Codebook size K = 512

5 Conclusions

In this paper, we presented a framework for learning object saliency maps that combines a robust background measure, computed through saliency optimization, with a category-specific top-down model. The background measure is useful for constraining background regions and for saliency estimation. Furthermore, we learned the top-down model using a class-specific codebook and CRFs based on dense-SIFT features. In contrast to other methods, we used LLC codes as CRF latent variables. Experimental results on the Graz-02 and PASCAL VOC 2007 datasets show that our approach generates more accurate salient object regions than competing methods.