1 Introduction

Salient Object Detection (SOD) refers to the perceptual selection process by which the human visual system highlights the most distinctive regions of a scene. In practice, object-level saliency detection is broadly applied as a pre-processing technique for various computer vision tasks, such as image and video segmentation [27], video compression [6], image cropping [28], video summarization [23], image fusion [7], etc.

Due to the prevalence of convolutional neural networks (CNNs), the performance of salient object detection has been largely improved [5, 9, 13, 16,17,18,19, 24, 32, 33]. Further, the integration of fully convolutional neural networks (FCNs) has moved salient object detection to an end-to-end phase [12, 18, 30, 34, 42]. However, current CNN-based SOD approaches still face significant challenges due to their network structures. Firstly, as salient object detection is a pixel-level labelling task, the outputs from convolutional layers with large receptive fields can be rather rough after being restructured back to pixel-level labelled maps [21]. Hence, the output saliency maps from CNNs may suffer from blob-like defects. Secondly, salient object detection emphasizes recognizing salient regions at the object level. As CNNs lack smoothness constraints for label agreement, the output saliency maps may contain poor object delineation and spurious regions [25]. In summary, the output maps from CNNs are relatively coarse, and further refinement is needed to improve object boundary delineation and saliency density smoothness.

One prevalent approach adopted by recent state-of-the-art saliency models is to introduce a Conditional Random Field (CRF) layer, i.e. the dense CRF [11], fully connected to the FCNs [9, 14, 17] for coarse saliency map refinement. The fully connected CRF layer does not participate in finetuning the front-end FCNs. Instead, it acts as a post-processing layer that reconciles the spatial and appearance coherence of the coarse saliency maps, with its parameters tuned through cross-validation. On one hand, the CRF layer efficiently enhances the accuracy of the saliency maps in practice; on the other hand, it keeps the training stage of the front-end deep neural networks compact and efficient.

However, existing saliency models connect only one CRF layer to the end of the pre-trained deep neural networks for refinement. In this paper, we extend the CRF layer into a more flexible component that can be attached to any of the side output layers of the FCNs to enhance the quality of the intermediate outputs, and thus further improve the performance of the whole network. To this end, we propose the multi-scale CRFs model (MCRF), built on multi-scale side outputs from FCNs, for salient object detection. Specifically, a fully convolutional neural network based on the encoder-decoder architecture with three scales of side output maps is trained with pixel-wise labels. Then, a CRF layer is connected to each side output layer to refine the delineation and smoothness of the side output maps. Finally, the refined side output maps are fused and then refined by another CRF layer to produce the final saliency map. The contributions of this paper are twofold:

  • The proposed MCRF model integrates multiple CRF layers to refine the multi-scale side output maps from FCNs, so that the defects of the individual side outputs complement each other in a unified, refined saliency map. The multi-scale CRFs refinement structure is considerably more effective than integrating a single CRF layer at the end of the network.

  • The multi-scale CRFs refinement structure achieves highly competitive performance on top of a simple encoder-decoder network with only three scales of side outputs. Hence, the multi-scale CRFs structure avoids the over-fitting issues that complex deep network architectures face with limited training samples.

The rest of the paper is organized as follows. Section 2 summarizes the related works. Section 3 introduces the framework of the proposed multi-scale CRFs saliency model. Section 4 presents the implementation details and the experimental results, and Sect. 5 concludes the work.

Fig. 1.

Framework of the proposed multi-scale CRFs model. Three scales of side output maps are selected from the encoder-decoder networks. The encoder network is based on the VGG-16 net [31]. The decoder network is connected to the “pool5” layer and gradually unpools the features from the corresponding pooling layers. The decoder convolutional layers are all followed by a BN layer and a ReLU layer. To upsample the three scales of side outputs from “Deconv1”, “Deconv2” and “Deconv3”, a convolutional layer with \(1\times 1\) kernel size computes a one-channel feature map, and a deconvolutional layer followed by a crop layer upsamples each feature map to the image size. To finetune the front-end encoder-decoder networks, each side output map is connected with a side loss (\(L_1\), \(L_2\), \(L_3\)) for optimization. Then, one CRF layer is connected to each side output map for multi-scale refinement, and the refined side output maps are fused by element-wise product. Finally, another CRF layer refines the fused map to produce the final saliency map. The CRF layers are tuned via cross-validation, and all the CRF layers share the same parameter settings.

2 Related Works

This section presents a brief review of representative FCN architectures and previous deep saliency models that adopt the fully connected conditional random field (CRF) for saliency refinement.

2.1 Saliency Detection via FCNs

Previous works [41] suggest that the convolutional layers of CNNs can describe high-level semantic features at different scales while maintaining their spatial information. To put the merits of convolutional layers to full use, Long et al. [21] propose the fully convolutional networks (FCNs) for semantic segmentation. FCNs retain the advantages of the convolutional layers while discarding the large parameter cost of the fully connected layers. Moreover, as the convolutional layers preserve the spatial information in the output maps, the side outputs from different levels are able to produce multi-scale feature maps for recognition tasks [2, 22].

A variety of network architectures have been proposed to compute multi-scale feature maps from FCNs. The encoder-decoder network [40] provides a simple yet efficient fully convolutional and unpooling structure for object contour detection. The seminal FCNs [21] construct skip connections to generate end-to-end predictions at multiple scales. Similarly, the Hypercolumns FCNs [8] and the U-Net [29] apply multiple skip connections through concatenation to capture features from multiple scales for precise localization. The holistically-nested edge detector (HED) model [35] employs skip-layer connections to construct even deeper supervised structures, and fuses side outputs at various scales to resolve the ambiguity in edge and object boundary detection. Further, the DSS model [9] introduces short connections to the skip layers to construct an enhanced HED structure that combines both deeper and shallower side outputs for multi-scale contexts. Apparently, deeper network architectures are able to learn richer semantic features for more accurate predictions. However, complex network architectures may lead to time-consuming training and over-fitting problems. Thus, constructing a compact yet efficient FCN structure for targeted tasks is crucial in balancing accuracy and efficiency.

2.2 CRFs for Saliency Refinement

Prior to the pervasive application of CNNs, most of the best-performing traditional saliency methods first compute a coarse saliency map and then refine it with handcrafted features from the input image. Such refinement is based on common context-aware assumptions and theories from graphical models. As the conditional random field (CRF) is a flexible framework for incorporating various features and is capable of accommodating inference functions for graphical models, it has been frequently adopted for labeling refinement tasks. For instance, Qiu et al. [26] take advantage of handcrafted image features and spatially weighted distance to infer a CRF model that refines coarse saliency maps.

DeepLab [3] first applies the dense CRF framework to deep neural networks to refine semantic segmentation results, based on the unary and pairwise potentials proposed by [11]. The dense CRF is fully connected to the CNNs as a post-processing step for end-to-end refinement. Later, several works [36, 37, 44] unroll the CRF inference of [11] into end-to-end trainable feed-forward networks.

For efficient computation, existing saliency models tend to integrate the fully connected CRF on top of the deep neural networks for end-to-end post-processing. The MDF saliency model [15] employs the CRF from DeepLab [3] to integrate multiple output saliency maps from CNNs with inputs of different contexts. Later, the DCL model [16] incorporates the dense CRF [3] to improve spatial coherence and contour localization for the fused result from two streams of CNNs. The MSRNet model [14] and the DSS model [9] both integrate the fully connected dense CRF [11] to refine the fused output maps from the CNNs. In this work, a more flexible and efficient incorporation of dense CRFs on top of pre-trained CNNs will be explored.

3 Multi-scale CRFs Model

The multi-scale CRFs model is based on a simple yet effective encoder-decoder architecture. Firstly, multiple scales of side output feature maps are computed from the pre-trained encoder-decoder networks. Then, each side output map is refined by a fully connected CRF layer to enhance its delineation and smoothness. Finally, the enhanced feature maps are fused and refined by another CRF layer to produce the final saliency map.

3.1 The Encoder-Decoder Networks

Given the input image \({I}=\{I_i, i=1,\cdots ,|I|\}\) with three-dimensional size \(H\times W\times 3\), and the ground truth \({G}=\{G_i, i=1,\cdots ,|{G}|\}\), \(G_i\in \{0,1\}\), with size \(H\times W\times 1\), the encoder-decoder network \(\mathcal {F}\) is adopted to produce M scales of side output feature maps, indexed by \(m=1,\cdots ,M\) and denoted as \(s_m\), as follows:

$$\begin{aligned} s_m = \mathcal {F}(W,w_m), \end{aligned}$$
(1)

where W denotes the generic weights of the encoder-decoder networks and \(w_m\) denotes the scale specific weights. In the training phase, the cross-entropy loss is utilized as the side objective function \(L_m(W,w_m)\) to train the network weights:

$$\begin{aligned} \displaystyle \begin{array}{l} L_m(W,w_m) \\ = -\sum \limits _{j} [G^j \log P(\mathcal {S}_m^j=1|\mathcal {I};W,w_m)+(1-G^j) \log P(\mathcal {S}_m^j=0|\mathcal {I};W,w_m)] \end{array} \end{aligned}$$
(2)

where \(\mathcal {I}=\{\mathcal {I}^j, j=1,\cdots ,|\mathcal {I}|\}\) denotes all the pixels in the training image set, \(G^j\in \{0,1\}\) denotes the ground-truth label of pixel j, and \(\mathcal {S}_m=\{\mathcal {S}_m^j, j=1,\cdots ,|\mathcal {S}_m|\}\) denotes all the saliency values from the side output layer at the m-th scale of the encoder-decoder networks. \(P(\mathcal {S}_m^j=1|\mathcal {I};W,w_m)\) represents the probability that the activation at location j of the m-th scale side output map is salient.
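
As a concrete illustration, the side loss of Eq. (2) amounts to a per-pixel sigmoid cross-entropy against the ground truth. The following minimal NumPy sketch shows the computation for one scale of one image; the function name side_loss and the numerical-stability constant are ours (the actual implementation uses Caffe's sigmoid cross-entropy loss layer, see Sect. 4.1):

```python
import numpy as np

def side_loss(logits, gt):
    """Sigmoid cross-entropy side loss of Eq. (2) for one scale of one image.

    logits: raw side-output activations, shape (H, W)
    gt:     binary ground-truth map G, shape (H, W), values in {0, 1}
    """
    # P(S_m^j = 1 | I; W, w_m), the per-pixel saliency probability
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12  # guards against log(0)
    return -np.sum(gt * np.log(p + eps) + (1.0 - gt) * np.log(1.0 - p + eps))
```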

3.2 Multi-scale CRFs Refinement

Through the encoder-decoder networks, M scales of side output maps are computed to coarsely locate the salient objects. In order to further improve the prediction accuracy, a fully connected CRF [11] layer is attached to each side output layer for refinement as follows:

$$\begin{aligned} \hat{s}_m = \mathcal {C}_m(s_m, {I}, \varTheta _m), \end{aligned}$$
(3)

where \(\mathcal {C}_m(\cdot )\) refers to the CRF layer at the m-th scale, \(\varTheta _m\) refers to all the parameters for the m-th CRF layer, and \(\hat{s}_m\) represents the refined side output map at the m-th scale.

For each side output map \(s_m\), the energy function of the CRF is

$$\begin{aligned} E(s_m) = \sum \limits _{i}\phi _u(s_m^i)+\sum \limits _{i<k}\phi _p(s_m^i,s_m^k). \end{aligned}$$
(4)

\(\phi _u(s_m^i)\) refers to the unary term, where the side output maps are directly regarded as the input. \(\phi _p(s_m^i,s_m^k)\) is the pairwise term, which accounts for the coherence of the saliency information and image features between the current pixel and its neighbors. Thus, the pairwise term is defined as:

$$\begin{aligned} \phi _p(s_m^i,s_m^k)=\mu (s_m^i,s_m^k)[\nu _1\exp (-\frac{\Vert p_i-p_k\Vert ^2}{2\sigma _\alpha ^2}-\frac{\Vert I_i-I_k\Vert ^2}{2\sigma _\beta ^2})+ \nu _2\exp (-\frac{\Vert p_i-p_k\Vert ^2}{2\sigma _\gamma ^2})], \end{aligned}$$
(5)

where \(\mu (s_m^i,s_m^k)=1\) if \(s_m^i\ne s_m^k\) and 0 otherwise, following the Potts model of [11]. \(I_i\) represents the RGB features of the i-th pixel, while \(p_i\) is its position. The Gaussian kernel \(\exp (-\frac{\Vert p_i-p_k\Vert ^2}{2\sigma _\alpha ^2}-\frac{\Vert I_i-I_k\Vert ^2}{2\sigma _\beta ^2})\) measures the appearance coherence, which encourages nearby pixels with similar features to take similar saliency scores, while the Gaussian kernel \(\exp (-\frac{\Vert p_i-p_k\Vert ^2}{2\sigma _\gamma ^2})\) measures the spatial coherence, which reconciles close pixels towards similar saliency scores. Parameters \(\nu _1\) and \(\nu _2\) control the contributions of the two Gaussian kernels respectively.

The energy minimization is based on the mean field approximation to the CRF distribution proposed by [11], and high-dimensional filtering can be utilized to speed up the computation.
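
To make the refinement step concrete, the sketch below applies the fully connected CRF of [11] to one side output map using the PyDenseCRF library cited in Sect. 4.1. The kernel parameters follow the settings reported there (\(\nu _1=\nu _2=3\), \(\sigma _\alpha =60\), \(\sigma _\beta =5\), \(\sigma _\gamma =3\), 3 mean-field iterations); the wrapper function crf_refine is our naming, not part of the library:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(saliency, image, iters=3):
    """Refine one side output map with the fully connected CRF of [11].

    saliency: side-output saliency probabilities in [0, 1], shape (H, W)
    image:    uint8 RGB image, shape (H, W, 3)
    """
    h, w = saliency.shape
    # Two-class problem: background vs. salient; Eq. (4)'s unary term.
    probs = np.stack([1.0 - saliency, saliency]).astype(np.float32)
    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs))
    # Spatial (smoothness) kernel of Eq. (5): sigma_gamma = 3, weight nu_2 = 3.
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance (bilateral) kernel of Eq. (5): sigma_alpha = 60,
    # sigma_beta = 5, weight nu_1 = 3.
    d.addPairwiseBilateral(sxy=60, srgb=5,
                           rgbim=np.ascontiguousarray(image), compat=3)
    q = d.inference(iters)  # 3 mean-field iterations (Sect. 4.1)
    return np.array(q)[1].reshape(h, w)  # marginal P(salient) per pixel
```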

Then, the refined saliency maps from all scales of CRF layers are fused by element-wise product:

$$\begin{aligned} \tilde{s} = \prod \limits _{m=1}^{M}\hat{s}_m. \end{aligned}$$
(6)

Finally, another CRF layer is connected to further refine the fused map as the final saliency map:

$$\begin{aligned} \bar{s}_{final} = \mathcal {C}_{final}(\tilde{s}, {I}, \varTheta _{final}). \end{aligned}$$
(7)
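
Given a per-scale refinement routine, the fusion of Eq. (6) and the final refinement of Eq. (7) reduce to a few lines. A sketch, reusing the hypothetical crf_refine helper from the previous block:

```python
import numpy as np

def fuse_and_refine(side_maps, image):
    """Fuse CRF-refined side outputs by element-wise product (Eq. (6))
    and refine the fused map with a final CRF layer (Eq. (7)).

    side_maps: list of refined side output maps, each of shape (H, W)
    image:     uint8 RGB image, shape (H, W, 3)
    """
    fused = np.prod(np.stack(side_maps, axis=0), axis=0)  # Eq. (6)
    return crf_refine(fused, image)                       # Eq. (7)
```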

4 Experiment

4.1 Implementation

In this work, a fully convolutional encoder-decoder network is adopted to obtain the multi-scale side output maps. The network architecture is illustrated in Fig. 1 with detailed layer descriptions. The encoder network is based on the VGG-16 net [31]. The decoder network first unpools the features from the corresponding max-pooling layers, then upsamples and crops the side output maps to the image size. All the decoder convolutional layers are followed by batch normalization and ReLU activation functions. We also add a dropout layer after each ReLU layer in the decoder network.
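
For readers who prefer code to diagrams, the sketch below outlines one decoder stage and the side output path in PyTorch. This is our choice for brevity; the paper's implementation uses Caffe, the channel widths are left as parameters, and bilinear upsampling stands in for the deconvolution-plus-crop path of Fig. 1:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One decoder stage: unpool with the indices stored by the matching
    encoder max-pooling layer, then conv + BN + ReLU + dropout, mirroring
    the layer ordering described above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.drop = nn.Dropout2d(p=0.5)

    def forward(self, x, pool_indices, output_size):
        x = self.unpool(x, pool_indices, output_size=output_size)
        return self.drop(F.relu(self.bn(self.conv(x))))

class SideOutput(nn.Module):
    """Side output path: 1x1 conv to a single channel, then upsampling to
    the image size (the paper uses a deconvolution followed by a crop
    layer; bilinear upsampling is used here for simplicity)."""
    def __init__(self, in_ch):
        super().__init__()
        self.score = nn.Conv2d(in_ch, 1, kernel_size=1)

    def forward(self, x, image_size):
        return F.interpolate(self.score(x), size=image_size,
                             mode='bilinear', align_corners=False)
```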

The hyper-parameters for finetuning the encoder-decoder networks are set as follows: a fixed learning rate (1e−8), weight decay (0.0005), momentum (0.9), and a loss weight of 1 for each side output. The batch size is set to 12, and 100 epochs are performed to tune the encoder-decoder network. Sigmoid cross-entropy loss layers are used for model optimization.

The fully connected dense CRF layers share the same parameter settings and are tuned via cross-validation on the validation set: \(\nu _1\), \(\nu _2\), \(\sigma _\alpha \), \(\sigma _\beta \), and \(\sigma _\gamma \) are set to 3.0, 3.0, 60.0, 5.0, and 3.0, respectively. Only 3 iterations of the mean field approximation are performed in each CRF layer.

All the implementation is based on the public Caffe library [10]. The CRF is based on the PyDenseCRF implementation [11]. Training is accelerated on an Nvidia Tesla P100 GPU with 16 GB memory. In total, 100 epochs are performed to train the encoder-decoder networks, which takes about 16 h. In the testing phase, computing the final saliency map takes 1.68 s per image on average.

Table 1. Evaluation results over four datasets, with models including MDF [15], RFCN [34], DHS [18], Amulet [42], UCF [43], DCL [16], MSR [14], DSS [9], RA [4] and the proposed MCRF model. “\(+\)” marks the models utilizing the dense CRF [11] for post-processing. “−” means that the corresponding dataset is used as training data. The evaluation on MSRA-B is performed on the testing set. The best performances are in bold while the second best results are underlined.
Table 2. Comparisons of mean F-measure when implementing multi-scale CRFs versus a single-scale CRF. “\(s_1\), \(s_2\), \(s_3\)” refer to the three scales of side output maps from the encoder-decoder networks. “\(s_{123}\)+\(\texttt {CRF}^1\)” fuses the maps \(s_1\), \(s_2\), \(s_3\) by element-wise product and then connects a single CRF layer with 3 mean-field iterations to compute the saliency maps. “\(s_{123}\)+\(\texttt {CRF}^2\)” also fuses the side output maps but connects a single CRF layer with 10 mean-field iterations. Note that the CRF parameter settings for “\(s_{123}\)+\(\texttt {CRF}^2\)” are the same as those of the DSS model. The evaluations are performed on the ECSSD dataset.

4.2 Datasets

We follow the training protocol of [9, 16] by using the MSRA-B dataset [20] as the training data for fair comparison. The MSRA-B dataset consists of 2,500 training images, 500 validation images and 2,000 testing images. The images are resized to \(240 \times 320\) as the input to the data layer. Horizontal flipping is used for data augmentation, so that the number of training samples is twice the original number.
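
A minimal sketch of this preprocessing, assuming PIL for image handling (the resize target and flip follow the description above; the function name and file-path arguments are placeholders):

```python
import numpy as np
from PIL import Image

def load_training_pair(img_path, gt_path):
    """Resize an image/ground-truth pair to 240x320 (H x W) and return
    the original and horizontally flipped versions, doubling the count."""
    img = Image.open(img_path).convert('RGB').resize((320, 240))  # PIL uses (W, H)
    gt = Image.open(gt_path).convert('L').resize((320, 240), Image.NEAREST)
    pairs = [(np.asarray(img), np.asarray(gt))]
    pairs.append((np.asarray(img.transpose(Image.FLIP_LEFT_RIGHT)),
                  np.asarray(gt.transpose(Image.FLIP_LEFT_RIGHT))))
    return pairs
```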

The proposed model is evaluated over four datasets: MSRA-B [20], ECSSD [38], DUT-OMRON [39], and HKU-IS [17]. MSRA-B is the training dataset. ECSSD contains 1,000 images with even more complex salient objects in the scenes. The DUT-OMRON dataset contains 5,168 more difficult and challenging images. HKU-IS consists of 4,447 challenging images with pixel-wise saliency annotations.

4.3 Evaluation Metrics

We employ two types of evaluation metrics to evaluate the performance of the saliency maps: mean F-measure and mean absolute error (MAE). When a given saliency map is thresholded at every integer level from 0 to 255, a precision-recall (PR) curve can be computed against the ground truth. The F-measure jointly accounts for precision and recall, rewarding saliency maps that score highly on both:

$$\begin{aligned} F = \frac{\left( 1+{\beta }^2\right) \cdot precision\cdot recall}{\beta ^2 \cdot precision + recall}, \end{aligned}$$
(8)

where \(\beta ^2 = 0.3\) [1] to emphasize precision. In this paper, the mean F-measure is chosen for evaluation, and the saliency maps are thresholded at twice the mean saliency value.

MAE measures the overall pixel-wise difference between the saliency map sal and the ground truth gt as follows:

$$\begin{aligned} MAE = \frac{1}{H}\sum _{h=1}^{H} {\left| sal(h)-gt(h)\right| }, \end{aligned}$$
(9)

where H is the number of pixels on the map.
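
Both metrics are straightforward to compute. A sketch under the conventions above, with saliency and ground truth normalized to [0, 1] (the small epsilon terms guarding against empty predictions are our addition):

```python
import numpy as np

def mean_f_measure(sal, gt, beta2=0.3):
    """F-measure of Eq. (8) with the adaptive threshold of this section
    (twice the mean saliency value, clamped to 1)."""
    thresh = min(2.0 * sal.mean(), 1.0)
    pred = sal >= thresh
    gt = gt > 0.5
    tp = np.logical_and(pred, gt).sum()
    eps = 1e-12
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return (1.0 + beta2) * precision * recall / (beta2 * precision + recall + eps)

def mae(sal, gt):
    """Mean absolute error of Eq. (9), averaged over all H pixels."""
    return np.abs(sal - gt).mean()
```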

4.4 Experimental Results

We compare the proposed MCRF model with nine state-of-the-art deep saliency models: MDF [15], RFCN [34], DHS [18], Amulet [42], UCF [43], DCL [16], MSR [14], DSS [9], and RA [4]. All the models are CNN-based approaches. All the implementations are based on public code and the settings suggested by the corresponding authors (Fig. 2).

Fig. 2.

Comparisons of the mean precision and mean recall on MSRA-B testing set, DUT-OMRON, HKU-IS and ECSSD datasets respectively.

Fig. 3.

Examples of saliency maps from DHS [18], Amulet [42], UCF [43], DCL [16], MSR [14], DSS [9], RA [4] and the proposed MCRF model.

Table 1 lists the mean F-measure and MAE of the nine saliency models and the proposed MCRF model over four datasets. It is clearly observed that the MCRF model surpasses most of the existing saliency models by a considerable margin. Compared to MDF [15], DCL [16] and MSR [14], which apply a single CRF [11] layer, the multi-scale CRFs model yields superior performance. Moreover, the proposed MCRF model achieves performance comparable to the DSS [9] model. Whereas DSS [9] uses the enhanced HED architecture with five scales of side outputs (53 convolutional and deconvolutional layers in total), the proposed MCRF model is based on the simple encoder-decoder architecture and fuses only three scales of side outputs (31 convolutional and deconvolutional layers in total) for multi-scale integration. Thus, the multi-scale CRFs structure proves efficient. We also evaluate embedding multi-scale CRFs versus a single-scale CRF into the pre-finetuned model, as reported in Table 2. Clearly, the multi-scale CRFs model achieves the best performance. Figure 3 presents saliency maps from the compared models and the proposed MCRF model.

5 Conclusion

This paper proposes to efficiently refine the side outputs from multiple scales of FCNs by embedding multi-scale CRF layers. Firstly, the front-end FCN is based on the simple yet efficient encoder-decoder networks, which involve far fewer convolutional layers and parameters, so that the front-end network is easy to train. Secondly, only three scales of side outputs from the FCNs are integrated, yet competitive performance is achieved. In the future, side output refinement based on CRF inference with upper-level side outputs from the FCNs will be further explored towards a hierarchical refinement architecture.