1 Introduction

There are two important questions to address regarding the computation of salient objects: what is a salient object, and which object is most salient in a real scene [4, 37]? The first question corresponds to bottom-up visual saliency models, while the second corresponds to top-down salient object models. Both bottom-up and top-down models are instrumental in human cognitive processes [2, 22, 27, 33, 38, 43, 44]. The bottom-up visual model is a task-independent, unconscious visual process driven by low-level features, while the top-down visual model simulates the prioritizing mechanisms that determine the salience of scene regions based on high-level features.

For the task of salient object detection, the top-down model directs human attention to specific salient objects, which the bottom-up model cannot do. Salient object detection with top-down models has therefore attracted considerable interest in visual attention research [5]. Salient object detection has several applications, including image cropping [26], video summarization [3, 25], object-aware retargeting [7, 32], and object segmentation [8, 30]. However, most recent top-down methods only utilize object information and do not take advantage of background information. Since objects and backgrounds have different properties, the contrast in appearance between object and background regions is high; saliency detection methods built on this observation are contrast prior approaches [1, 13, 23]. In addition to contrast prior methods, Wei et al. [36] used a boundary prior to compute salient regions in images. However, both the contrast prior and the boundary prior methods simply regard all image boundaries as background, and therefore lose some object information in the saliency computation. To address this problem, Zhu et al. [46] used boundary connectivity to detect background regions and treated saliency computation as a global optimization problem.

In addition to background measure information, several recent top-down models exploit conditional random fields (CRFs) and a class-specific dictionary [20, 42]. These approaches originated from sparse coding methods based on local features or superpixels, whose performance depends greatly on the discriminative codebook [31]. Yang et al. [42] and Kocak et al. [20] combined CRFs with a discriminative codebook to learn a top-down saliency model. The central idea of these approaches is to use the sparse codes as CRF latent variables while simultaneously using the CRF to learn the discriminative codebook. However, in real scenes, cluttered background information still degrades object saliency.

Inspired by Zhu et al. [46], Yang et al. [42] and Kocak et al. [20], we propose a framework for computing object saliency that combines saliency optimization based on a robust background measure with top-down visual saliency learned from CRFs and a discriminative codebook. Our approach uses boundary connectivity to acquire background measure information, from which salient regions can be computed. Meanwhile, we learn a top-down object salience model using CRFs and a class-specific codebook. In contrast to the methods of Yang et al. [42] and Kocak et al. [20], we use locality instead of sparsity to generate our top-down saliency model. More specifically, we treat locality-constrained linear codes as CRF latent variables and train the discriminative codebook modulated by the CRF. Our approach not only reduces the influence of cluttered backgrounds but also enhances object saliency in real scenes. We evaluated our approach on the Graz-02 [28] and PASCAL VOC 2007 [10] datasets by measuring the quality of the saliency maps with mean absolute error (MAE) and standard precision-recall (PR) curves. Experimental results demonstrate that our approach performs better than current state-of-the-art saliency algorithms [20, 29, 36, 39, 40, 42, 46]. For instance, our method acquires salient object regions in cluttered real scenes containing two or three different object categories (Fig. 1). We generated the object saliency maps by learning three different top-down object models: person, car, and bike. Using these object saliency maps, our method can distinguish the class-specific object in complex scenes.

Fig. 1

Object saliency maps in complex scenes. a Original image; b Object saliency region map acquired using the person top-down model; c Object saliency region map acquired using the car top-down model

In the next section, we review related work on object saliency computation. Section 3 describes our work in detail, while Section 4 shows experimental results on several general datasets. Finally, we draw conclusions in Section 5.

2 Related work

The earliest visual saliency model was a bottom-up model proposed by Itti et al. [14]. It is an unconscious visual processing method rooted in neuroscience and computer vision, and is computed by center-surround mechanisms based on low-level features. Recently, object saliency detection has been viewed as a binary segmentation task [1, 31], where 1 denotes the foreground region and 0 denotes the background region. In this work, we are interested in object prior methods for salient object detection based on robust background prior information, and in coding methods based on discriminative codebooks.

2.1 Background measure methods

Object prior methods mainly comprise the center prior approach and the boundary prior approach. The center prior approach is typically a Gaussian fall-off map combined with a center-region contrast method [6, 15, 17, 23, 39]. In contrast, the boundary prior approach regards image boundary regions as background, measured by saliency optimization over boundary patches [36, 40]; background regions tend to be connected to image boundaries, while foreground regions do not [36]. The method proposed by Yang et al. [40] used boundary patches as background queries for computing object saliency. Lastly, Zhu et al. [46] proposed boundary connectivity and saliency optimization to estimate the contrast between background and foreground.

2.2 Top-down models

In computer vision, task-oriented top-down visual saliency models involve saliency computation and feature learning [9, 12, 18, 19, 34, 42]. Gao et al. [12] used discriminant features to estimate top-down saliency based on a pre-defined filter bank; in the training images, these discriminant local features were extracted from each image to denote target object presence or absence. Kanan et al. [18] used independent component analysis to construct a top-down model by learning a support vector machine over local features. In another study [18], top-down saliency maps were computed from contextual priors of object location and appearance. In another direction, CRFs have been introduced for learning top-down models from local features, since a CRF can incorporate various features for object recognition and segmentation. For example, Yang et al. [42] combined CRF parameters with sparse coding to generate object salience models: the discriminative codebook was trained by the CRF, and the sparse codes were viewed as CRF latent variables. In our work, we use locality-constrained linear codes as CRF latent variables and, similar to the approach of Yang et al. [42], learn a discriminative codebook modulated by the CRF. Our approach not only improves accuracy but also significantly reduces computational complexity.

2.3 Coding methods

Coding methods have recently been widely used for image classification based on a codebook [11, 21, 35, 41]. In vector quantization (VQ) [11], each code has only one nonzero element, while in sparse coding (SC) [21, 41] multiple nonzero coefficients are obtained by imposing sparsity over the codebook. Observing that the nonzero coefficients of SC tend to be local, Yu et al. [45] proposed local coordinate coding (LCC) to improve SC and showed that, theoretically, locality is more essential than sparsity. However, both LCC and SC require solving an optimization problem, which is computationally expensive. To address this, Wang et al. [35] presented locality-constrained linear coding (LLC), a fast approximation of LCC. In our work, we combine a CRF model based on LLC codes with a robust background measure based on saliency optimization to compute object salience.

3 Our algorithm

To acquire more effective object saliency maps, we combined saliency computation based on a robust background measure with a CRF model and discriminative codebook learning. Specifically, our approach employs both background and object information to generate salient object regions. An overview of our system framework is summarized in Fig. 2. In the training phase, we used a set of training images and corresponding ground truth annotations to learn a class-specific object model via a discriminative codebook and CRF parameters. In the testing phase, we first extracted superpixels from the test image to compute background information using the robust background measure; second, we generated the object saliency map using the class-specific object model; and finally, we acquired the salient object region by combining the robust background information and the class-specific object model.

Fig. 2

Overview of our system framework. Given a set of training images and corresponding ground truth annotations, we learned the class-specific object model using a discriminative codebook and CRF parameters. For a test image, we first extracted superpixels to compute background information using the robust background measure; then, we computed the object saliency map using the class-specific object model; and finally, we generated the salient object region by combining the robust background information with the class-specific object model

3.1 Saliency computation by robust background measure

Object and background regions have very different properties in natural images; in particular, background regions are much more connected to image boundaries than object regions. Therefore, saliency can be estimated from boundary connectivity and background-weighted contrast. Boundary connectivity is defined as follows:

$$ BndCon(R) = \frac{|\{p \mid p\in R,\, p\in Bnd\}|}{\sqrt{|\{p \mid p\in R\}|}} $$
(1)

where \(Bnd\) denotes the set of image boundary patches and \(p\) denotes an image patch obtained from superpixels. However, (1) is difficult to compute directly because obtaining the regions \(R\) requires an explicit segmentation, which is itself problematic, and undesirable discontinuous boundary regions may be introduced. In practice, Zhu et al. [46] constructed an undirected weighted graph over adjacent patches \((p, q)\), with the edge weight \(d_{app}(p,q)\) defined as the Euclidean distance between their average colors in the CIE-Lab color space. The geodesic distance \(d_{geo}(p,q)\) between any two patches is then defined as follows:

$$ d_{geo}(p,q) = \min\limits_{p_{1}=p,\,p_{2},\ldots,\,p_{n}=q}\sum\limits_{i=1}^{n-1}d_{app}(p_{i},p_{i+1}). $$
(2)

The spanning area of each patch p is defined as follows:

$$ Area(p) = \sum\limits_{i=1}^{N}\exp\left( -\frac{d_{geo}^{2}(p,p_{i})}{2\sigma_{clr}^{2}}\right)=\sum\limits_{i=1}^{N}S(p,p_{i}), $$
(3)

where \(N\) denotes the number of patches and \(d_{geo}(p,p)=0\). In practice, performance is stable when \(\sigma_{clr}\) is within \([5,15]\). The length along the boundary is defined as follows:

$$ Len_{bnd}(p)=\sum\limits_{i=1}^{N}S(p,p_{i})\cdot\delta(p_{i}\in Bnd), $$
(4)

where \(\delta(\cdot)=1\) for patches on the image boundary and \(\delta(\cdot)=0\) otherwise. Finally, the boundary connectivity is computed by the following formula:

$$ BndCon(p) = \frac{Len_{bnd}(p)}{\sqrt{Area(p)}}. $$
(5)

Using the boundary connectivity above, the background-weighted contrast is defined as follows:

$$ wCtr(p) = \sum\limits_{i=1}^{N}d_{app}(p,p_{i})w_{spa}(p,p_{i})w_{i}^{bg}, $$
(6)

where \(d_{spa}(p,p_{i})\) denotes the distance between the centers of patches \(p\) and \(p_{i}\), \(w_{spa}(p,p_{i})=\exp\left(-\frac{d_{spa}^{2}(p,p_{i})}{2\sigma_{spa}^{2}}\right)\) with \(\sigma_{spa}=0.25\), and \(w_{i}^{bg}=1-\exp\left(-\frac{BndCon^{2}(p_{i})}{2\sigma_{bndCon}^{2}}\right)\) is the background weighting term with \(\sigma_{bndCon}=1\). Finally, we can use (6) to compute salient regions: object regions receive high contrast from background regions, while background regions receive low contrast from object regions.
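To make the computation concrete, the following is a minimal Python sketch of (2)-(6) over a superpixel graph. It assumes that the superpixels (e.g., from SLIC), their mean Lab colors, normalized patch centers, adjacency edges, and boundary flags have already been extracted, and it uses \(\sigma_{clr}=10\) (the text above only states that any value in [5, 15] is stable). It illustrates the formulas, not the authors' exact implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def background_weighted_contrast(lab_mean, centers, edges, is_bnd,
                                 sigma_clr=10.0, sigma_spa=0.25,
                                 sigma_bndcon=1.0):
    """Sketch of (2)-(6): boundary connectivity and background-weighted
    contrast on a superpixel graph.

    lab_mean : (N, 3) mean Lab color of each patch (superpixel)
    centers  : (N, 2) patch centers, normalized to [0, 1]
    edges    : iterable of (i, j) index pairs of spatially adjacent patches
    is_bnd   : (N,) bool, True if the patch touches the image boundary
    """
    lab_mean = np.asarray(lab_mean, dtype=float)
    centers = np.asarray(centers, dtype=float)
    is_bnd = np.asarray(is_bnd, dtype=float)
    n = lab_mean.shape[0]

    # Edge weights d_app: Euclidean distance between mean Lab colors
    rows, cols, w = [], [], []
    for i, j in edges:
        d = np.linalg.norm(lab_mean[i] - lab_mean[j])
        rows += [i, j]; cols += [j, i]; w += [d, d]
    graph = csr_matrix((w, (rows, cols)), shape=(n, n))

    # (2) geodesic distance: shortest path along the adjacency graph
    d_geo = shortest_path(graph, directed=False)

    # (3) spanning area and (4) length along the boundary
    s = np.exp(-d_geo ** 2 / (2 * sigma_clr ** 2))
    area = s.sum(axis=1)
    len_bnd = (s * is_bnd[None, :]).sum(axis=1)

    # (5) boundary connectivity and the background weight w_bg
    bnd_con = len_bnd / np.sqrt(area)
    w_bg = 1.0 - np.exp(-bnd_con ** 2 / (2 * sigma_bndcon ** 2))

    # (6) background-weighted contrast wCtr(p)
    d_app = np.linalg.norm(lab_mean[:, None] - lab_mean[None, :], axis=2)
    d_spa = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    w_spa = np.exp(-d_spa ** 2 / (2 * sigma_spa ** 2))
    return (d_app * w_spa * w_bg[None, :]).sum(axis=1)
```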

3.2 CRF model based on LLC codes

We extracted dense-SIFT [24] features from local image patches and assigned each patch a binary label to indicate the presence or absence of the target object. In practice, \(X = \{x_{1}, x_{2}, \ldots, x_{m}\}\) denotes the set of local patches from an image, \(Y = \{y_{1}, y_{2}, \ldots, y_{m}\}\) denotes the corresponding labels, and \(D = \{d_{1}, d_{2}, \ldots, d_{K}\}\) is a codebook learned from the training image patches. We solved the following optimization problem:

$$ w =\arg\min\limits_{w}\Vert{x-Dw}{\Vert^{2}_{2}}+\lambda\sum\limits_{j=1}^{K}\left( w(j)\cdot\exp\left( \frac{\Vert{x-d_{j}} \Vert_{2}}{\sigma}\right)\right)^{2},\\ s.t.\sum\limits_{j=1}^{K}{w(j)}=1, $$
(7)

where \(w(j)\) denotes the \(j\)-th element of \(w_{i}\), \(i = 1,\ldots,m\), and \(w_{i}\) denotes the code of the \(i\)-th local image patch, which is viewed as a vector of latent variables of the CRF model. The reconstruction is \(x_{i} \approx D w_{i}\), and \(\lambda\) is the penalty parameter controlling the regularization term. We use \(W(X,D)=[w(x_{1},D), w(x_{2},D),\ldots,w(x_{m},D)]\) to denote the latent variables of all nodes of the CRF model. Solving (7) exactly requires an iterative optimization procedure, so we used LLC to speed up the encoding process: the \(K\) nearest neighbors (\(Knn\)) of \(x_{i}\) in the codebook serve as a local basis, which replaces \(\lambda\) in controlling the locality constraint [35].
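The following sketch shows the approximated LLC encoding of Wang et al. [35] that produces the latent variables W(X, D): for each descriptor, the Knn nearest codewords form a local basis and a small constrained least-squares problem is solved analytically. The regularization constant beta is our assumption and is not specified in the text above.

```python
import numpy as np

def llc_encode(X, D, knn=20, beta=1e-4):
    """Approximated LLC encoding used as CRF latent variables W(X, D).

    X : (m, d) dense-SIFT descriptors of the local patches of one image
    D : (K, d) codebook
    Returns W : (m, K); each row has at most `knn` nonzero entries and
    sums to one (the constraint in Eq. 7).
    """
    m, K = X.shape[0], D.shape[0]
    W = np.zeros((m, K))
    # Squared Euclidean distances between descriptors and codewords
    dist = (X ** 2).sum(1)[:, None] + (D ** 2).sum(1)[None, :] - 2 * X @ D.T
    for i in range(m):
        idx = np.argsort(dist[i])[:knn]           # Knn nearest codewords
        z = D[idx] - X[i]                         # shift to the origin
        C = z @ z.T                               # local covariance
        C += beta * np.trace(C) * np.eye(knn)     # small regularizer (assumed)
        w = np.linalg.solve(C, np.ones(knn))      # analytic solution
        W[i, idx] = w / w.sum()                   # enforce sum-to-one
    return W
```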

Then, we constructed a 4-connected graph \(\Gamma = \langle\nu, \varepsilon\rangle\) over the local image patches based on their spatial adjacency, where \(\nu\) and \(\varepsilon\) denote the nodes and edges of the graph, respectively. We used the labels \(Y\) and latent variables \(W(X, D)\) on the graph \(\Gamma\) to build a CRF model with the following formula:

$$ P((Y|W(X,D)),\alpha) = \frac{1}{Z}e^{-E(W(X,D),Y,\alpha)}, $$
(8)

where \(Z\) denotes the partition function, \(\alpha\) is the weight vector of the CRF model, and \(E(W(X,D),Y,\alpha)\) denotes the CRF energy function. Through this equation, we simultaneously learn the supervised dictionary \(D\) and the CRF parameters \(\alpha\). The marginal probability of node \(i\in\nu\) is:

$$ p(y_{i}\mid w_{i},\alpha) = \sum\limits_{y_{\aleph(i)}}p(y_{i},y_{\aleph(i)}\mid w_{i},\alpha), $$
(9)

where \(\aleph(i)\) denotes the neighbors of node \(i\) on the graph \(\Gamma\). The saliency value of local image patch \(x_{i}\) is defined as follows:

$$ u(w_{i},\alpha) = p(y_{i}=1|w_{i},\alpha). $$
(10)

Thus, the saliency map \(U(W,\alpha)=\{u_{1}, u_{2}, \ldots, u_{m}\}\) can be computed; this probabilistic definition of the top-down saliency map preserves both appearance and local contextual information. Finally, for a test image \(X = \{x_{1}, x_{2}, \ldots, x_{m}\}\), we compute the top-down saliency map \(U\) in two steps (a short sketch follows the list):

  A. Learn the latent variables W(X, D) using (7).

  B. Compute the top-down saliency map U(W, α) using (9) and (10).
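As a hedged illustration, the two test-time steps can be strung together as below. Here crf_marginals stands for any per-node marginal inference routine (e.g., loopy belief propagation on the 4-connected graph) supplied by the caller, since the text above does not fix a particular inference algorithm, and llc_encode refers to the sketch given earlier in this section.

```python
import numpy as np

def topdown_saliency(X, D, alpha, crf_marginals, knn=20):
    """Test-time steps A and B: encode the patches, then read off the
    per-patch marginals p(y_i = 1 | w_i, alpha) as the top-down map U.

    crf_marginals(W, alpha) is a caller-supplied (hypothetical) inference
    routine returning the foreground marginal of every node.
    """
    W = llc_encode(X, D, knn=knn)   # step A, Eq. (7) via approximated LLC
    U = crf_marginals(W, alpha)     # step B, Eqs. (9) and (10)
    return np.asarray(U)
```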

In the training process, the set of training images is denoted as \(\textbf{X} = \{X^{(1)},X^{(2)},\ldots,X^{(n)}\}\) and the corresponding pixel-level ground truth labels as \(\textbf{Y} = \{Y^{(1)},Y^{(2)},\ldots,Y^{(n)}\}\). The optimal codebook \(\hat{D}\) and CRF weight vector \(\hat{\alpha}\) are then calculated by the equation:

$$ (\hat{D},\hat{\alpha})=\arg\max\limits_{D,\alpha}\prod\limits_{j=1}^{n}P(Y^{(j)}\mid W(X^{(j)},D),\alpha). $$
(11)

Maximizing (11) yields optimal parameters \(\alpha\) and \(D\) such that, for all \(Y \neq Y^{(j)}\), \(j = 1,\ldots,n\),

$$ P(Y^{(j)}\mid W(X^{(j)},D),\alpha)\geq P(Y\mid W(X^{(j)},D),\alpha). $$
(12)

According to (8), we obtained the following formula:

$$ E(Y^{(j)},W^{(j)},\alpha)\leq E(Y,W^{(j)},\alpha), $$
(13)

Then, using the cutting plane algorithm [16], the most violated labeling is obtained by the following equation:

$$ \hat{Y^{(j)}}=\arg\min\limits_{Y}E(Y,W^{(j)},\alpha), $$
(14)

Therefore, the following objective function is minimized to learn the optimal weight vector α and the codebook D:

$$ \min\limits_{\alpha,D}\frac{\gamma}{2}\|\alpha\|^{2}+\sum\limits_{j=1}^{n}\ell^{j}(\alpha,D), $$
(15)

where \(\ell^{j}(\alpha,D)=E(\hat{Y}^{(j)},W^{(j)},\alpha)-E(Y^{(j)},W^{(j)},\alpha)\) is the loss function and \(\gamma\) controls the regularization of the weight vector \(\alpha\). We used an iterative procedure that alternates between codebook and weight parameter updates [20, 42]. Once the CRF weight vector \(\hat{\alpha}\) and codebook \(\hat{D}\) are obtained, the saliency map of a test image can be calculated using (10). The iterative learning procedure is summarized in Algorithm 1.

Algorithm 1 Iterative learning of the codebook D and the CRF weight vector α
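The sketch below outlines the alternation in Algorithm 1 under our reading of the text above. The labeling and gradient routines are hypothetical placeholders supplied by the caller (e.g., graph-cut inference for (14) and subgradients of the loss in (15)), and llc_encode is the sketch from earlier in this section; this is a schematic outline rather than the authors' implementation.

```python
import numpy as np

def learn_model(train_images, train_labels, D0, alpha0,
                most_violated_labeling, grad_alpha, grad_codebook,
                gamma=1.0, lr=0.01, n_iters=10):
    """Schematic sketch of Algorithm 1: alternately update the CRF weight
    vector alpha and the codebook D to reduce the objective (15).

    most_violated_labeling(W, alpha) : solves Eq. (14), e.g. by graph cuts
    grad_alpha, grad_codebook        : (sub)gradients of the loss ell^j
    All three are caller-supplied, hypothetical callables.
    """
    D, alpha = np.array(D0, dtype=float), np.array(alpha0, dtype=float)
    for _ in range(n_iters):
        # 1) fix D, take a (sub)gradient step on alpha for each image
        for X, Y in zip(train_images, train_labels):
            W = llc_encode(X, D)                      # latent variables, Eq. (7)
            Y_hat = most_violated_labeling(W, alpha)  # Eq. (14)
            alpha -= lr * (gamma * alpha + grad_alpha(Y, Y_hat, W, alpha))
        # 2) fix alpha, take a (sub)gradient step on the codebook D
        for X, Y in zip(train_images, train_labels):
            W = llc_encode(X, D)
            Y_hat = most_violated_labeling(W, alpha)
            D -= lr * grad_codebook(Y, Y_hat, X, D, alpha)
    return D, alpha
```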

To improve saliency map quality and reduce the impact of cluttered background information, we combine the robust background measure with the top-down saliency derived from the CRF model based on LLC codes. The final object saliency map is obtained as follows:

$$ crfwopt_{Map} = wCtr(p) + U(w_{i},\alpha), $$
(16)

where \(wCtr(p)\), computed by (6), provides high contrast against background regions and \(p\) denotes an image patch obtained from superpixels; \(U\) denotes the category-specific saliency map computed by (10), with \(w_{i}\) the locality-constrained linear codes used as CRF latent variables and \(\alpha\) the weight vector of the CRF model; crfwopt denotes our proposed method.
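A minimal sketch of (16): assuming both maps have already been rendered at the same pixel resolution, each is min-max normalized to [0, 1] before summing. The normalization step is our assumption; the equation itself is a plain sum of the two terms.

```python
import numpy as np

def fuse_saliency(wctr_map, u_map):
    """Combine the background-measure map wCtr and the top-down CRF map U
    as in (16), after min-max normalizing each map to [0, 1]."""
    def norm(m):
        m = np.asarray(m, dtype=float)
        rng = m.max() - m.min()
        return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)
    return norm(wctr_map) + norm(u_map)
```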

4 Experiments

We evaluated our method on two widely used datasets (Graz-02 and PASCAL VOC 2007). These datasets are more challenging than others because their ground-truth images contain large intra-class object variations, severe occlusions, and cluttered backgrounds. All experiments were carried out on a Dell T7610 workstation with 32 GB of memory.

In our experiments, we used standard precision-recall (PR) curves to evaluate the performance of our algorithm. Each curve is computed by comparing binary masks generated from the saliency map against the ground truth annotations. However, PR curves are limited in practice because they only consider object versus background saliency. Therefore, we also used the mean absolute error (MAE) to measure the per-pixel difference between the saliency map and the binary ground truth. Together, these two measures identify salient objects that are more meaningful for applications such as image cropping or object segmentation.
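For reference, both measures can be computed as follows; the threshold sweep used to binarize the saliency map for the PR curve is the usual protocol and is our assumption here.

```python
import numpy as np

def mae(saliency, gt):
    """Mean absolute per-pixel error between a saliency map in [0, 1]
    and a binary ground-truth mask."""
    return np.mean(np.abs(saliency.astype(float) - gt.astype(float)))

def pr_curve(saliency, gt, n_thresholds=256):
    """Precision and recall obtained by binarizing the saliency map at a
    sweep of thresholds and comparing against the ground-truth mask."""
    gt = gt.astype(bool)
    precisions, recalls = [], []
    for t in np.linspace(0.0, 1.0, n_thresholds, endpoint=False):
        pred = saliency >= t
        tp = np.logical_and(pred, gt).sum()
        precisions.append(tp / max(pred.sum(), 1))   # guard empty masks
        recalls.append(tp / max(gt.sum(), 1))
    return np.array(precisions), np.array(recalls)
```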

4.1 Graz-02

The Graz-02 dataset is widely used to evaluate salient object detection methods. It contains three object categories (person, car, bicycle) and an additional background category. Each category contains 300 images with corresponding pixel-level object annotations, for a total of 1,200 images, each of size 640 × 480 pixels. For the experiments, we sampled 64 × 64-pixel image patches with a step size of 16 pixels, giving 999 patches per image, and extracted dense-SIFT descriptors from each patch. We followed the standard experimental setup to compare fairly with other methods. An image patch was labeled a positive sample if object pixels occupied at least one quarter of its total pixels; otherwise, it was a negative sample. We used the 150 odd-numbered images of each object category plus 150 odd-numbered images from the background class as training samples, and the remaining even-numbered images as test samples. We extracted all dense-SIFT descriptors from the training images and used the K-means clustering algorithm [21] to initialize the codebook, and then used locality-constrained linear coding to generate the CRF latent variables. Two important parameters had to be evaluated: the number of codewords K in the codebook and the locality constraint parameter Knn. A larger codebook captures more object appearance variations; to keep the computational cost of learning the codebook moderate, we selected 512 codewords. As defined by Wang et al. [35], Knn controls the locality constraint: the greater Knn, the more codewords are used to represent an image patch. We selected a moderate Knn = 20 for our experiments. A sketch of the patch sampling and labeling step follows.
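The sketch below illustrates the patch grid and the quarter-coverage labeling rule; for a 640 × 480 image, a 64 × 64 patch with a 16-pixel step gives 37 × 27 = 999 patches, matching the count above. Dense-SIFT extraction itself is assumed to be done separately.

```python
import numpy as np

def sample_patch_labels(mask, patch=64, step=16, pos_frac=0.25):
    """Grid-sample patch locations and label each patch positive if
    object pixels cover at least `pos_frac` of its area.

    mask : (H, W) binary ground-truth object mask
    Returns the patch top-left coordinates and their binary labels.
    """
    h, w = mask.shape
    coords, labels = [], []
    for y in range(0, h - patch + 1, step):
        for x in range(0, w - patch + 1, step):
            coords.append((y, x))
            frac = mask[y:y + patch, x:x + patch].mean()
            labels.append(1 if frac >= pos_frac else 0)
    return coords, np.array(labels)
```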

To illustrate our method's performance, we compared it with crfsc [42], wCtropt [46], SF [29], GS [36], and MR [40] using PR curves and MAE on the Graz-02 dataset. Among these methods, SF combines low-level cues in a straightforward way; GS and MR utilize boundary priors; wCtropt uses a robust background measure and global optimization; and crfsc uses a CRF model and a discriminative codebook. To show the influence of the locality constraint Knn, Fig. 9 reports precision rates at equal error rate (EER) for different Knn values with a codebook size of K = 512.

Figures 3a, 5a and 7a report PR curves for the different methods on the Graz-02 dataset. The PR curves show that, at the same precision and recall rates, our method (crfwopt) consistently performed better than the other state-of-the-art methods. Figures 3b, 5b, and 7b report MAE results for the various methods on the Graz-02 dataset. To verify the effectiveness, the object saliency maps of all methods (GS, MR, SF, wCtropt, crfwopt) are shown in Figs. 4, 6 and 8.

Fig. 3

Comparison of PR curves and MAE on the Graz-02 (person) dataset. a PR curves show that our method (crfwopt) achieved better performance than other state-of-the-art methods; b MAE shows that our object saliency map (crfwopt) is close to the ground truth annotations

Fig. 4

Examples of object saliency map results from our method (crfwopt) and other methods on the Graz-02 (person) dataset. a source image; b GS; c MR; d SF; e wCtropt; f crfsc; g crfwopt

Fig. 5

Comparison of PR curves and MAE on the Graz-02 (car) dataset. a Under the same precision recall rate, the PR curves show that our method (crfwopt) achieved better performance than other state-of-the-art methods; b MAE shows that our object saliency map (crfwopt) is close to the ground truth annotations

Fig. 6

Examples of object saliency maps from our method (crfwopt) and other methods on the Graz-02 (car) dataset. a source image; b GS; c MR; d SF; e wCtropt; f crfsc; g crfwopt

Fig. 7

Comparison of PR curves and MAE on the Graz-02 (bicycle) dataset. a Under the same precision recall rate, the PR curves show that our method (crfwopt) achieved better performance than other state-of-the-art methods; b The MAE shows that our object saliency map (crfwopt) is close to the ground truth annotations

Fig. 8

Examples of object saliency maps from our method (crfwopt) and other methods on the Graz-02 (bicycle) dataset. a source image; b GS; c MR; d SF; e wCtropt; f crfsc; g crfwopt

4.2 PASCAL VOC 2007

The PASCAL VOC 2007 dataset is more challenging than the Graz-02 dataset; it consists of 9,963 images from 20 object categories plus a background class. Only 632 images have pixel-level ground truth segmentation annotations, so each object category contains too few images for learning our model. To address this, we used the existing bounding box annotations to label object presence or absence with the GrabCut [30] segmentation method. In the experiments, the numbers of images used from the person, car, and bike categories were 4,192, 1,542, and 505, respectively. Following the Graz-02 experimental setup, we used the odd-numbered images for training, with randomly selected images from the other object categories as the background class, and used the even-numbered images as test samples. To learn the robust background information and train our salient object model, we extracted superpixels and dense-SIFT descriptors from the training images. Specifically, we used the K-means clustering algorithm to initialize the codebook that modulates the CRF model. The number of codewords in the codebook and the locality constraint Knn were set to 512 and 20, respectively (Fig. 9).

Fig. 9

Precision rates at equal error rate (EER) for different Knn parameters on the Graz-02 dataset. Codebook size K = 512

As with the Graz-02 dataset, Figs. 10a, 12a, and 14a report PR curves for the different methods on the PASCAL VOC 2007 dataset. The PR curves show that, at the same precision and recall rates, our method (crfwopt) performed better than the other state-of-the-art methods except in the car category (Fig. 12a). Figures 10b, 12b and 14b report MAE results for the various methods on the PASCAL VOC 2007 dataset. To verify effectiveness, the object saliency maps of all methods (GS, MR, SF, wCtropt, crfwopt) are shown in Figs. 11, 13 and 15. Lastly, to show the influence of the locality constraint parameter Knn, Fig. 16 reports precision rates at equal error rate (EER) for different Knn values with a codebook size of K = 512.

Fig. 10

Comparison of PR curves and MAE on the PASCAL VOC 2007 (people) dataset. a Under the same precision recall rate, PR curves show that our method (crfwopt) achieved better performance than other state-of-the-art methods; b MAE shows that our object saliency map (crfwopt) is close to the ground truth annotations

Fig. 11

Examples of object saliency map results from our method (crfwopt) and other methods on the PASCAL VOC 2007 (people) dataset. a source image; b GS; c MR; d SF; e wCtropt; f crfsc; g crfwopt

Fig. 12

Comparison of PR curves and MAE on the PASCAL VOC 2007 (car) dataset. a Under the same precision recall rate, PR curves show that our method (crfwopt) did not perform better than other state-of-the-art methods; b MAE shows that our object saliency map (crfwopt) is close to the ground truth annotations

Fig. 13

Examples of object saliency maps from our method (crfwopt) and other methods on the PASCAL VOC 2007 (car) dataset. a source image; b GS; c MR; d SF; e wCtropt; f crfsc; g crfwopt

Fig. 14

Comparison of PR curves and MAE on the PASCAL VOC 2007 (bike) dataset. a Under the same precision recall rate, PR curves show that our method (crfwopt) achieved better performance than other state-of-the-art methods; b MAE shows that our object saliency map (crfwopt) is close to the ground truth annotations

Fig. 15

Examples of object saliency maps from our method (crfwopt) and other methods on the PASCAL VOC 2007 (bike) dataset. a source image, b GS, c MR, d SF, e wCtropt, f crfsc, g crfwopt

Fig. 16

Precision rates at equal error rate (EER) for different Knn parameters on the PASCAL VOC 2007 dataset. Codebook size K = 512

5 Conclusions

In this paper, we presented a framework for learning object saliency maps that combines a robust background measure, computed through saliency optimization, with a category-specific top-down model. The background measure is useful for constraining background regions and for saliency estimation. Furthermore, we learned the top-down model using a class-specific codebook and CRFs based on dense-SIFT features. In contrast to other methods, we used LLC codes as CRF latent variables. Experimental results on the Graz-02 and PASCAL VOC 2007 datasets show that our approach generates more accurate salient object regions than competing methods.