
1 Introduction

Scene parsing is one of the core problems in computer vision and is the focus of this paper. It aims at assigning to each pixel in an image a label denoting the semantic category it belongs to. The task is very challenging because it subsumes several problems, e.g., object category detection, segmentation, and multi-label recognition. Over the last decade, scene parsing has been studied comprehensively by many researchers.

For many problems in computer vision, including scene parsing, it is necessary to extract features from each image. Features are representations of images that make computation and classification easier, so extracting good features is key to computer vision tasks. For image representation, conventional approaches use carefully designed hand-crafted features, e.g., HOG [3], SIFT [14], and GIST [16]. Since their introduction by LeCun [11], deep CNNs have been applied to a wide range of computer vision tasks such as hand-written digit classification and face detection. Recently, the latest generation of CNNs has substantially outperformed traditional approaches in many recognition tasks [2, 5, 10, 17, 18]. Deep learning is advantageous for large image regions with complex variations, because deep architectures can automatically learn hierarchies of feature representations.

A central issue in scene parsing is improving the recognition rate of the object classes. Some previous methods [4, 12, 21] had low recognition rates for small objects that occupy only a few pixels. To increase the accuracy on "thing" samples, Tighe and Lazebnik [22] proposed per-exemplar detectors, training a linear SVM classifier for every exemplar in the training set. In addition, Yang et al. [23] expanded the retrieval set with rare class exemplars to achieve more balanced superpixel classification and combined global and local semantic context information to boost the performance of their system. Our approach is inspired by the work of [22]: we use exemplar-SVMs [15], in the same way as Tighe and Lazebnik [22], to improve the accuracy on rare classes.

In this work, we present a scene parsing approach that combines multi-context deep convolutional features with exemplar-SVMs. The method improves overall accuracy and recognizes rare classes better. Our system proceeds in the following steps. We first retrieve the images most similar to the query image, as shown in Fig. 1, from an annotated dataset using global features obtained from the GoogLeNet convolutional neural network [20]. Then, we segment the images into superpixels and compute the likelihood scores of the superpixels of the query image using local features, incorporating exemplar Support Vector Machines for classification. Finally, our system integrates multiple cues into a Markov Random Field framework to infer the parsing result.

Fig. 1. For a query image (a), our system retrieves the most similar images (c) (three are shown here) from the training set using global features obtained from the convolutional neural network. The ground-truth annotations of (a) and (c) are shown in (b) and (d), respectively.

The rest of this paper is organized as follows. In Sect. 2, we review relevant work on scene parsing. Section 3 presents our approach, which combines multi-context deep convolutional features with exemplar-SVMs for scene parsing. Section 4 provides the experimental results and compares performance with other approaches, followed by the conclusion in Sect. 5.

2 Related Work

Most recently, with the development of artificial intelligence, many researchers have used deep learning, one family of machine learning algorithms, to solve many problems in computer vision, including object recognition, object detection, and simultaneous image segmentation and labeling. Studies have shown that convolutional neural networks achieve good results on these tasks. In recent years, many methods for scene parsing have been proposed; the goal is to assign a semantic label to each pixel of the image. Grangier et al. [8] presented a method that did not use any graphical model for scene parsing; a deep convolutional network was used to model complex scene label structures, relying on a supervised greedy learning approach. Farabet et al. [5] learned multi-level features with a multi-scale convolutional neural network operating on raw pixels; their system relied on these feature representations to take context into account for each decision. Another contribution of that work was the use of an optimal cover to obtain the most consistent regions from a segmentation tree, in place of graph cuts and other inference approaches. Pinheiro and Collobert [17] applied recurrent convolutional neural networks to scene labeling; compared with other approaches, their method needed no segmentation or task-specific features, was trained end-to-end from raw pixels, and had a low inference cost. In order to improve overall performance, Bu et al. [2] combined a convolutional neural network architecture with a graphical model; their network learns hybrid features that provide both hierarchical information and spatial relationship information.

As discussed above, deep learning methods applied to scene parsing have achieved impressive results, but training a deep network model requires a large amount of data. We now introduce some traditional methods that do not rely on deep learning. For instance, Liu et al. [12] proposed a nonparametric approach to scene parsing that uses dense scene alignment for label transfer. The SIFT flow algorithm was used to match structures between two images. This system required no training, but inference via SIFT flow was complex and computationally expensive. With this in mind, Tighe and Lazebnik [21] presented a scalable nonparametric image parsing method based on superpixels; it also required no training, was easy to implement, and established a new benchmark for scene parsing. Zhang et al. [24] proposed an ANN bilateral matching scheme to match test images with retrieved training images. Eigen and Fergus [4] used per-descriptor weights to minimize the classification error and, to improve performance on rare classes, a context-driven adaptation of the training set for each query image; unlike other work, this system did not use SVMs for classification. Tighe and Lazebnik [22] combined region-level features with the outputs of per-exemplar sliding window detectors to achieve broad coverage across hundreds of object categories. Yang et al. [23] extended the retrieval set with "thing" class exemplars to obtain more balanced superpixel classifications and combined global and local semantic context information to refine image retrieval and superpixel matching; their experimental results improved on previous methods. To boost the label likelihood estimates at superpixels, George [6] integrated likelihood scores from different probabilistic classifiers and incorporated semantic context into the parsing process through global label costs. More recently, Zhao et al. [25] presented a large, open vocabulary for scene parsing aimed at parsing images in the wild, merging image pixels and word concepts connected by semantic relations into the parsing framework. To parse a 2D image and recover the 3D scene structure, Liu et al. [13] presented an attribute grammar containing a set of production rules, each of which describes a spatial relation among planar surfaces in 3D scenes.

3 Approach

In this section, we give an overview of our scene parsing system, which combines a convolutional neural network with exemplar-SVMs. Our approach builds on the framework of [22]. First, GoogLeNet [20] is used to extract global image features for finding the training images most similar to the query image. Then we divide the image into superpixels and extract superpixel features to obtain the class label probabilities for each query image. Finally, we combine the superpixel parsing results with the exemplar-SVM parsing results and integrate them into a Markov Random Field framework to obtain the final parsing. The overall architecture of our system is shown in Fig. 2.

Fig. 2. The general framework of our system. Given a query image, we extract superpixels using the SLIC method. We then use multi-context deep convolutional features, including global and local deep features, to describe the superpixels and compute superpixel likelihood ratio scores to obtain the superpixel parsing. We use exemplar detectors to obtain the exemplar parsing. Finally, we combine the exemplar parsing results with the superpixel parsing results in an MRF framework to obtain the final parsing.

3.1 Feature Extraction with CNN

The deep Convolutional Neural Network (CNN) [11], which has recently shown its power in extracting high-level feature representations [7], can solve many previously difficult problems well. A deep CNN mimics the functions of the neocortex in the human brain as a hierarchy of filters and nonlinear operations. The major goal of deep learning is to learn feature representations, and a suitable network structure should be selected to learn them. In this paper, we choose the GoogLeNet convolutional neural network [20]. The multi-context deep convolutional features introduced in this paper comprise global features and local features.

Global Features. For each query image \(I_{q}\), we extract the 1024-dimensional pool5 layer output of GoogLeNet [20] and use it to find training images with similar classes and spatial layouts. As in previous work, we rank all training images in increasing order of the Euclidean distance between their features and the query's, and select the top K images to form the retrieval set. Because the retrieval set contains far fewer images than the training set, processing it instead of the full training set reduces the computational cost. The process of extracting image features with GoogLeNet is presented in Fig. 3; a minimal code sketch of the retrieval step follows the figure.

Fig. 3. The process of extracting image features using the GoogLeNet convolutional neural network. For a query image, the network input size is \(224\times 224 \times 3\). Through convolution and pooling operations, the visualization results of several layers are shown in the figure. We extract the pool5 layer output of GoogLeNet as our 1024-dimensional deep global feature.
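
To make the retrieval step concrete, the following is a minimal sketch, assuming the 1024-dimensional pool5 features have already been extracted (by GoogLeNet or any other framework); the random vectors in the usage lines merely stand in for real features.

```python
import numpy as np

def build_retrieval_set(query_feature, training_features, K=200):
    """Rank training images by Euclidean distance to the query in the
    global (pool5) feature space and keep the K nearest as the retrieval set."""
    distances = np.linalg.norm(training_features - query_feature, axis=1)
    order = np.argsort(distances)            # increasing distance = decreasing similarity
    return order[:K]

# Toy usage: random vectors stand in for real 1024-d pool5 features.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(2488, 1024))  # one row per training image
query_feat = rng.normal(size=1024)
retrieval_idx = build_retrieval_set(query_feat, train_feats, K=200)
print(retrieval_idx[:5])                     # indices of the 5 most similar images
```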

Local Features. Superpixels are becoming increasingly popular in computer vision applications. To reduce the problem space and complexity, we also use superpixels as the computing elements; they provide better spatial support than pixels for aggregating image features that pertain to a single object. The image superpixels are extracted by the SLIC [1] method, which is easy to implement. We describe each superpixel with features extracted by the GoogLeNet convolutional network. For each query image, we pad the image and select a superpixel-centered context window; the input of the local-context network is this window, upsampled to \(224\times 224 \times 3\). The local feature shares the same deep structure as the global feature, although other deep structures could also be used to extract local image features. The process is shown in Fig. 4 (a code sketch of this preparation follows the figure). These local features produce a log-likelihood ratio score \(L\left( r_{i}, c \right) \) for class label c at superpixel \(r_{i}\):

$$\begin{aligned} L\left( r_{i}, c \right) = \mathrm{log} \frac{P\left( r_{i} | c \right) }{P\left( r_{i} | \overline{c} \right) } = \sum _{k} \mathrm{log} \frac{P\left( f_i^k | c \right) }{P\left( f_i^k | \overline{c} \right) } \end{aligned}$$
(1)

where \(f_i^k\) is the k-th feature of superpixel \(r_{i}\); \(P\left( f_i^k | c \right) \) (resp. \(P\left( f_i^k | \overline{c} \right) \)) is the likelihood of feature type k for region \(r_{i}\) given class c (resp. all classes but c); here \(k = 1\).
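
As a minimal illustration of Eq. (1), the sketch below evaluates the log-likelihood ratio given class-conditional likelihoods that are assumed to be already estimated (the estimator itself is not detailed here); the numbers in the usage line are purely illustrative.

```python
import numpy as np

def likelihood_ratio_score(p_given_c, p_given_not_c, eps=1e-12):
    """L(r_i, c) of Eq. (1): sum over feature types k of
    log P(f_i^k | c) - log P(f_i^k | not c).  With a single deep local
    feature (k = 1) the sum reduces to one term."""
    p_given_c = np.maximum(np.asarray(p_given_c, dtype=float), eps)
    p_given_not_c = np.maximum(np.asarray(p_given_not_c, dtype=float), eps)
    return float(np.sum(np.log(p_given_c) - np.log(p_given_not_c)))

# One feature type: likelihood 0.6 under class c and 0.2 under all other classes.
print(likelihood_ratio_score([0.6], [0.2]))  # positive score favours class c
```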

Fig. 4. Process of extracting local features using GoogLeNet. For a query image, we pad regions exceeding the image boundaries with the mean pixel value. A superpixel-centered large context window is then selected and warped to \(224 \times 224 \times 3\) as the network input. We extract the pool5 layer output of GoogLeNet as our deep local feature.
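
For illustration, here is a minimal sketch of the superpixel and context-window preparation described above and in Fig. 4, using SLIC from scikit-image. The 128-pixel window size is an assumption made for the example (the actual window size is not stated here), and feeding the warped windows through GoogLeNet is omitted.

```python
import numpy as np
from skimage.data import astronaut
from skimage.segmentation import slic
from skimage.transform import resize

def superpixel_context_windows(image, n_segments=200, window=128, out_size=224):
    """SLIC superpixels, then a fixed-size context window centered on each
    superpixel, padded with the mean pixel value and warped to the network
    input size (out_size x out_size x 3)."""
    image = image.astype(float) / 255.0
    labels = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    pad = window  # pad generously so every window stays inside the padded image
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)),
                    mode="constant", constant_values=image.mean())
    crops = []
    for sp in np.unique(labels):
        ys, xs = np.nonzero(labels == sp)
        cy, cx = int(ys.mean()) + pad, int(xs.mean()) + pad   # superpixel centroid
        crop = padded[cy - window // 2: cy + window // 2,
                      cx - window // 2: cx + window // 2]
        crops.append(resize(crop, (out_size, out_size, 3)))   # warp to 224 x 224 x 3
    return labels, np.stack(crops)

labels, windows = superpixel_context_windows(astronaut())
print(labels.max() + 1, windows.shape)   # roughly 200 windows of shape (224, 224, 3)
```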

3.2 Exemplar-SVMs

In order to improve the recognition rate of rare classes, we incorporate the per-exemplar framework of [15], similarly to the previous work [22]. A per-exemplar detector is trained for each labeled class instance in our dataset. Following the detector training procedure of [15], all objects, including both "thing" and "stuff" classes, are trained. Each exemplar-SVM generates detection windows in a sliding-window fashion, and an exemplar co-occurrence-based mechanism is applied to suppress redundant responses.

The leave-one-out method is employed to train the SVM classifiers on the training set. Each feature dimension is normalized by its standard deviation, and the regularization constant is chosen by fivefold cross-validation. An RBF kernel is used for SVM training. At each pixel, the SVM outputs produce Y responses, and \(D_{SVM}\left( c_{i}, r_{i}\right) \) denotes the response of the SVM for class \(c_{i}\) at superpixel \(r_{i}\). A sketch of this training recipe follows.
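
As a hedged sketch of the training recipe (per-dimension scaling, fivefold cross-validation for the regularization constant, RBF kernel), the example below trains a one-vs-all classifier with scikit-learn. The candidate C grid and the toy data are assumptions for illustration, and neither the leave-one-out construction of training features nor the full per-exemplar detector pipeline of [15] is reproduced here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_class_svm(features, labels, target_class):
    """Train a one-vs-all RBF SVM for one class: each feature dimension is
    scaled by its standard deviation and C is chosen by fivefold CV."""
    y = (labels == target_class).astype(int)
    pipeline = make_pipeline(
        StandardScaler(with_mean=False),          # divide each dimension by its std
        SVC(kernel="rbf"),
    )
    search = GridSearchCV(pipeline, {"svc__C": [0.1, 1, 10, 100]}, cv=5)
    search.fit(features, y)
    return search.best_estimator_

# Toy usage: 200 superpixel features of dimension 1024 with 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1024))
y = rng.integers(0, 3, size=200)
model = train_class_svm(X, y, target_class=1)
responses = model.decision_function(X)           # per-superpixel SVM responses
print(responses[:5])
```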

3.3 Inference

To assign labels to all superpixels \(r_{i}\) in the query image \(I_{q}\), a Markov Random Field model is incorporated into our system. The scene parsing problem is formulated as the minimization of a standard MRF energy function, which combines a unary term with a smoothness term to obtain the final parsing result. We define the MRF energy of the superpixel class labels c as:

$$\begin{aligned} \mathbf E \left( \mathbf c \right) = \sum _{r_{i} \in I_{q}} D \left( r_{i}, \mathbf c \right) + \sum _{r_{i} \in I_{q}} \mathrm{max} \left[ 0, M-D_{SVM}\left( c_{i}, r_{i}\right) \right] + \lambda \sum _{\left( r_{i}, r_{j} \right) \in N} \psi \left( c_{r_{i}}, c_{r_{j}} \right) \end{aligned}$$
(2)

where \(D \left( r_{i}, \mathbf c \right) \) is the data term; once we obtain the likelihood \(L\left( r_{i}, c \right) \), \(D \left( r_{i}, \mathbf c \right) = - \omega _{i} \mathrm{log} L\left( r_{i}, c \right) \); M is the highest expected value of the SVM response; N is the set of adjacent superpixel pairs; and \(\lambda \) is the smoothing constant. The smoothness term \(\psi \), similar to that defined in [21, 23], is based on the probability of labels \(c_{i}\) and \(c_{j}\) co-occurring in the training images:

$$\begin{aligned} \psi \left( c_{r_{i}}, c_{r_{j}} \right) = -\mathrm{log} \left[ \left( \mathbf P \left( c_{r_{i}}| c_{r_{j}} \right) + \mathbf P \left( c_{r_{j}}| c_{r_{i}} \right) \right) /2 \right] \delta \left[ c_{r_{i}} \ne c_{r_{j}} \right] \end{aligned}$$
(3)

where \(\mathbf P \left( c_{r_{i}}| c_{r_{j}} \right) \) is the conditional probability of superpixel \(r_{i}\) having class label \(c_{r_{i}}\) given that its neighbor \(r_{j}\) has class label \(c_{r_{j}}\). We use the \(\alpha \)-expansion algorithm [9] to minimize the MRF energy function. A small sketch of how the energy terms can be evaluated is given below.
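
The following sketch evaluates the energy of Eqs. (2) and (3) for a candidate labeling, assuming the data term, SVM responses, and label co-occurrence probabilities have already been computed; the actual minimization with \(\alpha \)-expansion (e.g., via a graph-cut library) is not shown, and the toy numbers are illustrative only.

```python
import numpy as np

def smoothness(c_i, c_j, P_cond, eps=1e-12):
    """Pairwise term of Eq. (3): neighbouring superpixels whose labels rarely
    co-occur in the training images are penalized; equal labels cost nothing.
    P_cond[a, b] holds the conditional probability P(a | b)."""
    if c_i == c_j:
        return 0.0
    return -float(np.log(max((P_cond[c_i, c_j] + P_cond[c_j, c_i]) / 2.0, eps)))

def mrf_energy(labels, data_term, svm_resp, edges, P_cond, M=10.0, lam=1.0):
    """Energy of Eq. (2) for a candidate labeling.
    labels:    (R,) class label per superpixel
    data_term: (R, C) array with D(r_i, c) = -w_i * log L(r_i, c)
    svm_resp:  (R, C) array with the SVM responses D_SVM(c, r_i)
    edges:     list of (i, j) pairs of adjacent superpixels"""
    unary = sum(data_term[i, labels[i]] + max(0.0, M - svm_resp[i, labels[i]])
                for i in range(len(labels)))
    pairwise = sum(smoothness(labels[i], labels[j], P_cond) for i, j in edges)
    return unary + lam * pairwise

# Toy usage: 3 superpixels, 2 classes, a chain of adjacencies.
rng = np.random.default_rng(0)
data = rng.uniform(0, 5, size=(3, 2))
svm = rng.uniform(-2, 12, size=(3, 2))
P = np.array([[0.9, 0.3], [0.4, 0.8]])           # illustrative co-occurrence probabilities
print(mrf_energy(np.array([0, 0, 1]), data, svm, [(0, 1), (1, 2)], P))
```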

4 Experiments

4.1 Dataset

We report results on two datasets. The first is the SIFT Flow dataset [12], which contains 2688 images (2488 for training and 200 for testing) with 33 classes; the per-pixel frequency counts of the object categories in the test images are shown on the left of Fig. 5. The second is a video image dataset collected from an actual video surveillance scene, which contains 500 images (450 for training and 50 for testing) with 20 classes; its per-pixel frequency counts are shown on the right of Fig. 5. As can be seen from the figure, some classes do not appear in the test set.

Fig. 5. The per-pixel frequency counts of the object categories on the SIFT Flow dataset (left) and on the video image dataset we collected from an actual video surveillance scene (right).

4.2 The Experimental Results

We use two metrics to evaluate our scene parsing system: the per-pixel classification accuracy (the percentage of test-image pixels that are correctly labeled) and the per-class recognition rate (the average of the recognition rates over all classes); a small sketch of both computations follows. For the SIFT Flow dataset, the size of the retrieval set is set to 200, i.e., \(K=200\); for the video images we collected ourselves, it is set to 50. The smoothing constant \(\lambda \) is 1, and the highest expected SVM response M is approximately 10. The per-class recognition rates on the SIFT Flow dataset and on our video image dataset are presented on the left and right of Fig. 6, respectively. On the SIFT Flow dataset, the five categories with the highest recognition rates are sun, sky, building, sea, and mountain.
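
Here is a minimal sketch of the two metrics, assuming the predicted and ground-truth label maps are available as integer arrays; the ignore_label convention for unlabeled pixels is an assumption of the example.

```python
import numpy as np

def parsing_accuracy(pred, gt, num_classes, ignore_label=-1):
    """Per-pixel accuracy (fraction of correctly labeled pixels) and per-class
    accuracy (mean of the individual class recognition rates). Pixels whose
    ground truth equals ignore_label are skipped."""
    valid = gt != ignore_label
    per_pixel = float(np.mean(pred[valid] == gt[valid]))
    rates = []
    for c in range(num_classes):
        mask = valid & (gt == c)
        if mask.any():                            # only count classes present in the test set
            rates.append(float(np.mean(pred[mask] == c)))
    return per_pixel, float(np.mean(rates))

# Toy usage on a 2x3 label map with 3 classes.
gt = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
print(parsing_accuracy(pred, gt, num_classes=3))  # per-pixel 4/6, per-class mean of (0.5, 1.0, 0.5)
```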

Fig. 6. Left: the per-class recognition rates on the SIFT Flow dataset. Right: the per-class recognition rates on the video image dataset we collected from an actual video surveillance scene.

Fig. 7. Visual results of our approach on the SIFT Flow dataset. Column (a) shows the test images; the ground truth of (a) is shown in (b); (c) shows the superpixel parsing results; the exemplar parsing results are displayed in (d); and (e) shows the final parsing results, which combine (c) with (d).

Table 1. A comparison of our system with state-of-the-art methods in terms of per-pixel and per-class classification accuracy \((\%)\) on the SIFT Flow dataset.
Fig. 8. Visual results of our approach on the video images. Column (a) shows the test images; the ground truth of (a) is shown in (b); (c) shows the superpixel parsing results; the exemplar parsing results are displayed in (d); and (e) shows the final parsing results, which combine (c) with (d).

Table 1 compares the results of our system with state-of-the-art methods in terms of per-pixel and per-class recognition rates on the SIFT Flow dataset. Although the recognition rates of the presented method are not the highest, the per-pixel accuracy increases by around \(4\%\) and the per-class accuracy by about \(17\%\) over the method of [21]. The per-pixel accuracy of our approach is about \(6\%\) lower than that of George [6], and the per-class recognition rate is approximately \(3\%\) lower.

Representative visual results of our method on the SIFT Flow dataset and on the video images are shown in Figs. 7 and 8, respectively. In Fig. 7, (a) shows the test images; the ground truth of (a) is shown in (b); (c) shows the superpixel parsing results, with the recognition rates displayed under the images; the exemplar parsing results are shown in (d); and the final parsing results are displayed in (e). Figure 8 is organized like Fig. 7 but without the per-image accuracies. As can be seen from Fig. 8, the parsing results on the video images are impressive.

Our system runs in MATLAB 2012b on a four-core 3.4 GHz Intel Core i7-6700 CPU with 8 GB of RAM. Training time and parsing time are influenced by the feature dimension: the higher the dimension and the larger the number of features, the longer training takes.

5 Conclusions

This paper has proposed a scene parsing method that combines deep convolutional features with exemplar-SVMs to improve the quality of the retrieval sets and the class recognition rates. Our approach takes advantage of a retrieval set of similar objects and scenes to reduce computational complexity, and the deep features we extract describe the images well. Furthermore, multiple cues are integrated into an MRF framework to segment and parse the test image. On the SIFT Flow dataset, the experimental results show that the method achieves promising results. The system has also been applied to actual surveillance scene images and obtained satisfactory parsing results.