
1 Introduction

Weakly Supervised Learning (WSL) has been successfully applied to many tasks, such as object localization [5, 6, 11, 13, 26, 35, 44], relation detection [40] and semantic segmentation [32,33,34, 36, 37]. WSL attracts extensive attention from researchers and practitioners because it is less dependent on massive pixel-level annotations. In this paper, we focus on the Weakly Supervised Object Localization (WSOL) problem.

Existing WSOL methods locate target object regions using convolutional classification networks. Classification networks recognize various kinds of objects by identifying discriminative regions of an object. Fully convolutional networks [17], which avoid fully connected layers, can preserve the relative positions of pixels. Therefore, the discovered discriminative regions can indicate the exact location of the target objects. Zhou et al. revisited classification networks (e.g. AlexNet [12], VGG [25] and GoogleNet [27, 28]) and proposed the Class Activation Maps (CAM) approach to find regions of interest using only image-level supervision. Following [14], CAM replaces the top fully connected layers with convolutional layers to keep the object positions and can discover the spatial distribution of discriminative regions for different classes. The key weakness of the localization maps generated by CAM is that only the most discriminative regions are highlighted, so only a small part of each target object can be located. To cope with this weakness, Wei et al. [32] proposed to apply additional networks for enriching object-related regions, given images in which the most discriminative regions are erased according to the attention maps from a pre-trained network. Moreover, Zhang et al. [43] proved that the CAM method can be simplified to enable end-to-end training. Armed with this proof, an Adversarial Complementary Learning approach was proposed in [43], incorporating one additional classifier to mine complementary object regions and finally produce accurate object localization maps. However, all these methods fail to explore the correlations among pixels.
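For concreteness, the following is a minimal sketch of how a CAM-style attention map can be computed from the final convolutional features and the classifier weights that follow global average pooling; the function and variable names are illustrative and not taken from any released code.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx):
    """Minimal CAM sketch.

    features : (C, H, W) feature maps from the last convolutional layer.
    fc_weight: (num_classes, C) weights of the classification layer
               that follows global average pooling.
    Returns an (H, W) localization map for `class_idx`, normalized to [0, 1].
    """
    cam = torch.einsum('c,chw->hw', fc_weight[class_idx], features)  # class-weighted sum
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam
```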

Fig. 1.

Learning process of Self-produced guidance. Given an input image, we first generate the corresponding attention map with a classification network. The attention map is then roughly split, following the rule that regions with high confidence should be object, whereas regions with low confidence should be background. Regions with medium confidence remain undefined. These three regions constitute the seed. Self-produced guidance is defined as the multi-stage pixel-level object mask supervised by the seed.

We observe that images can be roughly divided into foreground and background regions. The foreground pixels usually constitute the object(s) of interest. We find that attention maps inferred from classification networks [32, 43, 45] can effectively provide the probability of each pixel being foreground or background. Although pixels with high foreground/background probabilities may not cover the entire target object/background, they still provide important cues for capturing common patterns of target objects. Based on this, we can simply leverage these reliable foreground/background seeds as supervision to encourage the network to sense the distributions of foreground objects and background regions. Since correlated pixels (e.g. those within the same object or the same background region) often share similar appearance, more reliable foreground/background pixels can easily be discovered by learning from the discovered seeds. With more reliable pixels serving as supervision, the entire foreground object can gradually be distinguished from the background, which finally benefits weakly supervised object localization.

Inspired by the above motivation, in this paper we propose a Self-produced Guidance (SPG) approach for learning better attention maps and obtaining precise object positions. We leverage attention maps to produce guidance masks of foreground and background regions in a stagewise manner. The foreground/background seeds of each stage are generated following a simple rule: (1) regions with highly confident scores are considered foreground; (2) regions with very low scores are background seeds; (3) regions with medium confidence remain undefined. The undefined regions are meant to be resolved using intermediate features. We adopt a top-down mechanism, using the upper layers' output as supervision of the lower layers to learn better object localization. The upper layers contain more abstract semantic information, whereas the lower layers contain more specific pixel-level information. We leave ambiguous areas undefined until more regions can be labeled as foreground/background using upper-layer features. The more regions are defined, the stronger the ability to resolve harder regions. After obtaining the guidance masks of foreground and background, we use them as auxiliary supervision. This supervision is expected to enable the classification network to learn pixel correlations. Consequently, attention maps can clearly indicate class-specific object regions. Figure 1 illustrates the learning process of self-produced guidance. Given an input image, we first generate the corresponding attention maps through a classification network following the convenient method in [43]. The attention map is then roughly split into foreground/background seeds and ignored regions. The self-produced guidance is learned from these seeds, taking intermediate features as input, in a stagewise manner. Finally, the SPG masks of multiple layers are fused to give a more precise and integral indication of target objects.

To sum up, our main contributions are:

  • We propose a stagewise approach to learn high-quality Self-produced Guidance masks which exhibit the foreground and background of a given image.

  • We present a weakly supervised object localization method incorporating self-produced supervision, which encourages the classification network to discover pixel correlations and thereby improves localization performance.

  • The proposed method achieves a new state-of-the-art Top-1 localization error rate of 43.83% on the ILSVRC dataset with only image-level supervision.

We discuss the proposed SPG approach in detail in Sect. 3. In Sect. 4, we empirically evaluate the proposed method on the ILSVRC 2016 dataset, showing the superiority of SPG in the object localization task with only image-level supervision. We also discuss further insights into the proposed SPG algorithm through additional experiments.

2 Related Work

Convolutional neural networks have been widely used in object detection and localization tasks [3, 8, 10, 18, 25, 42]. One of the earliest deep networks to detect objects in a one-stage manner is OverFeat [23], which employs a multiscale sliding-window approach to predict object boundaries. These boundaries are then accumulated into bounding boxes. SSD [16] and YOLO [20] use a similar one-stage method, and these detectors are specifically designed to speed up the detection process. Faster-RCNN, designed by Ren et al. [21], has achieved great success in the object detection task. It generates region proposals and predicts highly reliable object locations in a unified network in real time. Lin et al. [15] showed that the performance of Faster-RCNN can be significantly improved by constructing feature pyramids at marginal extra cost.

Although these approaches are considerably successful in detecting objects of interest in images, the vast number of annotations is unaffordable for training such networks with a limited budget. Weakly supervised methods alleviate this problem by using much cheaper annotations such as image-level labels. Jie et al. [11] proposed a self-taught learning framework that first selects some high-response proposals and then finetunes the network on the selected regions to progressively improve its detection capacity. This method relies heavily on region proposals pre-processed by algorithms such as Selective Search [30]. Such general-purpose proposal algorithms may not be robust enough to produce accurate bounding boxes. Dong et al. [5] adopted two separate networks to jointly refine the region proposals and select positive regions. High-quality attention maps are also critical for object detection and segmentation [19]. Diba et al. [4] proposed that attention maps can be leveraged to produce region proposals. With the assistance of these proposals, more detailed information can easily be detected.

However, these methods introduce extra computational cost as a result of using pre-processed region proposals and multiple networks. Zhou et al. [44] discovered that the localization map for each class can be produced by aggregating top-level feature maps with a class-specific fully connected layer. Zhang et al. [41] introduced a different backpropagation scheme to produce contrastive response maps by passing top-down signals downwards. However, this method, supervised solely by image labels, tends to discover only a small part of the target objects. Wei et al. [32] applied a similar but more efficient approach to hide discriminative regions under the guidance of a pre-trained network, and then trained on the processed images to discover more regions of interest. These methods increase the number of training images and thus need considerably more computation and training time. Zhang et al. [43] provided a theoretical proof that class-specific attention maps can be produced during the forward pass by simply selecting from the feature maps of the last layer, which enables end-to-end attention learning. They also proposed the ACoL approach [43] to efficiently mine the integral target object in an enhanced classification network.

Fig. 2.

Overview of the proposed SPG approach. The input images are processed by Stem to extract mid-level feature maps, which are then fed into SPG-A for classification. The attention map is then inferred from the classification network. Self-produced guidance maps are gradually learned under the guidance of the attention map. SPG-C utilizes the self-produced guidance map as an auxiliary supervision to reinforce the quality of the attention map. GAP refers to global average pooling.

3 Self-produced Guidance

3.1 Network Overview

We denote the image set as \(I=\{(I_i, y_i)\}^{N-1}_{i=0}\), where \(y_i \in \{0,1,...,C-1\}\) is the label of image \(I_i\), N is the number of images and C is the number of image classes. Figure 2 illustrates the architecture of the SPG approach, which has four components: Stem, SPG-A, SPG-B and SPG-C. Different components have different structures and functionalities. We use lowercase f to denote functions and capital F to denote output feature maps. Stem is a fully convolutional network denoted as \(f^{Stem}(I_i,\theta ^{Stem})\), where \(\theta ^{Stem}\) denotes its parameters. The output feature maps of \(f^{Stem}\) are denoted as \(F^{Stem}\). \(f^{Stem}\) acts as a feature extractor, which takes RGB images as input and produces high-level position-aware feature maps of multiple channels. The extracted feature maps \(F^{Stem}\) are then fed into the following component, SPG-A. We denote the SPG-A component as \(f^{A}(F^{Stem}, \theta ^{A})\), which is a network for image-level classification. \(f^{A}(F^{Stem}, \theta ^{A})\) consists of four convolutional blocks (i.e. A1, A2, A3 and A4), a global average pooling (GAP) layer [14] and a softmax layer. A4 has one convolutional layer with kernel size \(1 \times 1\) and C filters. These filters correspond to the attention maps of the individual classes, so attention maps can be generated during the forward pass [43]. SPG-B is leveraged to learn Self-produced Guidance masks using the foreground and background seeds generated from attention maps. The highly confident regions within the attention maps are extracted to act as supervision for learning better object regions. SPG-B leverages the intermediate feature maps of the classification network SPG-A to predict Self-produced Guidance masks. In particular, the output feature maps \(F^{A1}\) and \(F^{A2}\) of A1 and A2 are fed into the two blocks of SPG-B, respectively. Each block of SPG-B contains three convolutional layers followed by a sigmoid layer, where the first layer adapts the different numbers of channels in the feature maps \(F^{A1}\) and \(F^{A2}\). The outputs of SPG-B are denoted as \(F^{B1}\) and \(F^{B2}\) for the two branches, respectively. The component SPG-C uses the auxiliary SPG supervision to encourage SPG-A to learn pixel-level correlations. SPG-C contains two convolutional layers with \(3 \times 3\) and \(1 \times 1\) kernels, followed by a sigmoid layer.
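To make the data flow concrete, the following is a simplified PyTorch sketch of the four components. The channel widths, the use of plain convolutions in place of Inception blocks, and the attachment point of SPG-C are assumptions of this sketch, not the exact configuration described in Sect. 3.3.

```python
import torch
import torch.nn as nn

class SPG(nn.Module):
    """Simplified sketch of the SPG architecture (illustrative channel sizes)."""

    def __init__(self, num_classes=1000, c1=288, c2=768, c_top=1024):
        super().__init__()
        # Stem: mid-level feature extractor
        self.stem = nn.Sequential(nn.Conv2d(3, c1, 3, padding=1), nn.ReLU(inplace=True))
        # SPG-A: classification blocks A1-A4
        self.a1 = nn.Sequential(nn.Conv2d(c1, c1, 3, padding=1), nn.ReLU(inplace=True))
        self.a2 = nn.Sequential(nn.Conv2d(c1, c2, 3, padding=1), nn.ReLU(inplace=True))
        self.a3 = nn.Sequential(nn.Conv2d(c2, c_top, 3, padding=1), nn.ReLU(inplace=True))
        self.a4 = nn.Conv2d(c_top, num_classes, 1)      # 1x1 conv -> per-class maps
        # SPG-B: first layers adapt channels of F^{A1}/F^{A2}; later layers are shared
        self.b1_adapt = nn.Conv2d(c1, 512, 3, padding=1)
        self.b2_adapt = nn.Conv2d(c2, 512, 3, padding=1)
        self.b_shared = nn.Sequential(nn.Conv2d(512, 512, 3, padding=1),
                                      nn.ReLU(inplace=True), nn.Conv2d(512, 1, 1))
        # SPG-C: auxiliary head supervised by the fused self-produced guidance
        self.spg_c = nn.Sequential(nn.Conv2d(c_top, 512, 3, padding=1),
                                   nn.ReLU(inplace=True), nn.Conv2d(512, 1, 1))

    def forward(self, x):
        f_stem = self.stem(x)
        f_a1 = self.a1(f_stem)
        f_a2 = self.a2(f_a1)
        f_top = self.a3(f_a2)
        cls_maps = self.a4(f_top)                       # per-class attention maps
        logits = cls_maps.mean(dim=(2, 3))              # global average pooling
        b1 = torch.sigmoid(self.b_shared(self.b1_adapt(f_a1)))
        b2 = torch.sigmoid(self.b_shared(self.b2_adapt(f_a2)))
        c_out = torch.sigmoid(self.spg_c(f_top))
        return logits, cls_maps, b1, b2, c_out
```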

3.2 Self-produced Guidance Learning

Attention maps generated from classification networks can only exhibit the most discriminative parts of target objects. We propose to generate Self-produced Guidance (SPG) masks which separate the foreground, i.e. the object of interest, from the background to provide the classification networks with spatial correlation information of pixels. The generated SPG masks are then leveraged as auxiliary supervision to encourage the networks to learn correlations between pixels. Thus, pixels within the same object will have the same responses in the feature maps. As detailed information (i.e. object edges and boundaries) is usually very abstract in the top-level feature maps, we employ intermediate features to produce precise SPG masks. Indeed, some previous works use low-level feature maps to learn object regions [9, 38], but these approaches require pixel-level ground-truth labels as supervision. Differently, we propose to use self-produced guidance by incorporating highly confident object regions within attention maps. In detail, for any image \(I_i\), we first extract its attention map O directly from a classification network. We observe that the attention maps usually highlight the most discriminative regions of the object. The initial object and background seeds can easily be obtained according to the scores in the attention maps. In particular, regions with very low scores are considered background, while regions with very high scores are foreground. The remaining regions are ignored during the learning process. We initialize the SPG learning process with these seeds. B2 is supervised by the seed map, from which it learns the patterns of foreground and background. In this way, the pixels within the ignored regions are gradually recognized. Then, we use the same strategy to find the foreground and background seeds in the output map of B2, which are used to train the B1 branch. In such a stagewise way, the intermediate information of the neural network is employed to learn the Self-produced Guidance.

We formally define this process as follows. Given an input image of size \(W \times H\), we denote the binarized SPG mask as \(M\in \{0,1,255\}^{W\times H}\), where \(M_{x,y} = 0\) if the pixel at the \(x\)-th row and \(y\)-th column belongs to a background region, \(M_{x,y} = 1\) if it belongs to an object region, and \(M_{x,y} = 255\) if it is ignored. We denote the attention map as O. The produced guidance masks can be calculated by

$$\begin{aligned} M_{x,y} = {\left\{ \begin{array}{ll} 0 &{} \text {if } O_{x,y} < \delta _l,\ 0<\delta _l<1 \\ 1 &{} \text {if } O_{x,y} > \delta _h,\ 0<\delta _h<1 \\ 255 &{} \text {if } \delta _l \le O_{x,y} \le \delta _h,\ 0<\delta _l<\delta _h<1 \end{array}\right. } \end{aligned}$$
(1)

where \(\delta _l\) and \(\delta _h\) are thresholds to identify regions in localization maps as background and foreground, respectively.
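A minimal sketch of the seeding rule in Eq. (1), assuming the attention map O has already been normalized to [0, 1]; the function name is illustrative:

```python
import torch

def generate_seed_mask(attn, delta_l, delta_h, ignore_value=255):
    """Eq. (1): split a normalized attention map O in [0, 1] into
    background (0), foreground (1) and ignored (255) pixels."""
    mask = torch.full_like(attn, float(ignore_value))
    mask[attn < delta_l] = 0.0   # confident background seed
    mask[attn > delta_h] = 1.0   # confident foreground seed
    return mask
```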

We adopt a stagewise approach to gradually learn high-quality self-produced supervision maps. B2 is applied to learn better self-produced maps supervised by the seed map \(M^{A}\). In training, only the positions labeled as 0 or 1 in the self-produced maps serve as pixel-level supervision. Pixels with the value 255 are temporarily ignored: they do not contribute to the loss and their gradients are not back-propagated. The network learns patterns from the already labeled pixels, and more regions are then recognized, because pixels belonging to the background or to objects usually share strong correlations. For example, regions belonging to the same object usually share a similar appearance. The output of B2 is then further used as an attention map, and a better self-produced supervision mask can be calculated using the same policy as in Eq. (1). After obtaining the output maps of B1 and B2, the two maps are fused to generate our final self-produced supervision map. In particular, we compute the average of the two maps and then generate the self-produced guidance \(M^{fuse}\) according to Eq. (1).
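In a PyTorch implementation, ignoring the pixels labeled 255 can be realized by masking them out of a binary cross-entropy loss. The sketch below illustrates this; apart from the 255 ignore label defined above, the names and the zero-loss fallback are assumptions.

```python
import torch
import torch.nn.functional as F

def spg_loss(pred, guidance, ignore_value=255):
    """Binary cross-entropy over pixels labeled 0/1 in the self-produced
    guidance; pixels marked `ignore_value` contribute no loss or gradient."""
    valid = guidance != ignore_value
    if valid.sum() == 0:
        return pred.sum() * 0.0          # no labeled pixels: zero loss, keep graph
    return F.binary_cross_entropy(pred[valid], guidance[valid].float())
```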

The generated self-produced guidance is leveraged as pixel-level supervision for the classification network SPG-A. Thereby, the classification network learns the correlations among pixels, and we obtain better localization maps. The entire network is trained in an end-to-end manner. We adopt the cross-entropy loss function for both the classification learning and the self-produced guidance learning. Algorithm 1 illustrates the training procedure of the proposed SPG approach.

Algorithm 1. Training procedure of the proposed SPG approach.
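Since Algorithm 1 is presented as a figure, the sketch below reconstructs one plausible training iteration from the description above, reusing generate_seed_mask and spg_loss from the earlier sketches. The plain sum of losses, the points where guidance maps are detached, and the assignment of the threshold pairs to particular stages are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def normalize_per_image(x, eps=1e-8):
    """Min-max normalize each map in the batch to [0, 1]."""
    mins = x.amin(dim=(2, 3), keepdim=True)
    maxs = x.amax(dim=(2, 3), keepdim=True)
    return (x - mins) / (maxs - mins + eps)

def resize_like(src, ref):
    """Nearest-neighbour resize of a mask to the spatial size of `ref`."""
    return F.interpolate(src, size=ref.shape[2:], mode='nearest')

def train_step(model, images, labels, optimizer):
    """One SPG training iteration (sketch)."""
    optimizer.zero_grad()
    logits, cls_maps, b1, b2, c_out = model(images)

    # image-level classification loss
    loss_cls = F.cross_entropy(logits, labels)

    # attention map of the ground-truth class, normalized and frozen as a target
    attn = cls_maps[torch.arange(images.size(0)), labels].unsqueeze(1)
    attn = normalize_per_image(attn).detach()

    # stagewise self-produced guidance: attention seeds -> B2, B2 seeds -> B1
    seed_a = generate_seed_mask(attn, delta_l=0.1, delta_h=0.7)           # supervises B2
    loss_b2 = spg_loss(b2, resize_like(seed_a, b2))
    seed_b2 = generate_seed_mask(b2.detach(), delta_l=0.05, delta_h=0.5)  # supervises B1
    loss_b1 = spg_loss(b1, resize_like(seed_b2, b1))

    # fused guidance supervises SPG-C as auxiliary pixel-level supervision
    fused = generate_seed_mask((b1.detach() + b2.detach()) / 2,
                               delta_l=0.05, delta_h=0.5)
    loss_c = spg_loss(c_out, resize_like(fused, c_out))

    loss = loss_cls + loss_b1 + loss_b2 + loss_c
    loss.backward()
    optimizer.step()
    return loss.item()
```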

During testing, we extract the attention map corresponding to the class with the highest predicted score, and then resize it to the same size as the original image by bilinear interpolation. For a fair comparison, we apply the same strategy as [44] to produce object bounding boxes from the generated object localization maps. In particular, we first segment the foreground and background with a fixed threshold. Then, we seek the tight bounding box covering the largest connected area among the foreground pixels. The thresholds for generating bounding boxes are adjusted to their optimal values by grid search. For more details please refer to [44].
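A small sketch of the box-extraction step, using SciPy's connected-component labeling; the fallback to the full image when no pixel passes the threshold is an assumption, not part of the protocol in [44].

```python
import numpy as np
from scipy import ndimage

def bbox_from_map(loc_map, threshold):
    """Threshold a 2D localization map (values in [0, 1]) and return the tight
    box (x_min, y_min, x_max, y_max) around the largest connected foreground
    component, following the CAM-style protocol."""
    fg = loc_map > threshold
    labeled, num = ndimage.label(fg)
    if num == 0:
        h, w = loc_map.shape
        return (0, 0, w - 1, h - 1)          # fall back to the whole image
    sizes = ndimage.sum(fg, labeled, index=range(1, num + 1))
    largest = int(np.argmax(sizes)) + 1      # label of the largest component
    ys, xs = np.where(labeled == largest)
    return (xs.min(), ys.min(), xs.max(), ys.max())
```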

3.3 Implementation Details

We evaluate the proposed SPG approach by modifying the Inception-v3 network [29]. In particular, we remove the layers after the second Inception block, i.e., the third Inception block, pooling and linear layer. For a fair comparison, we build a plain version of the network, named SPG-plain. We add two convolutional layers of kernel size \(3 \times 3\), stride 1, pad 1 with 1024 filters and a convolutional layer of size \(1 \times 1\), stride 1 with 1000 units (200 for CUB-200-2011). Finally, a GAP layer and a softmax layer are added on top. We extend the plain network by adding two components (SPG-B and SPG-C). The first layers of B1 and B2 are convolutional layers of kernel size \(3 \times 3\) with 288 and 768 filters, respectively. The second layers are convolutional layers with 512 filters, followed by a \(1 \times 1\) convolutional output layer. The second and third layers share parameters between B1 and B2. The stride is 1 for all convolutional layers. To keep the resolution of the feature maps, we set the pad to 1 for filters with kernel size \(3 \times 3\). SPG-C consists of two convolutional layers of kernel size \(3 \times 3\) with 512 filters and an output convolutional layer with kernel size \(1 \times 1\). All branches in SPG-B and SPG-C connect to an output sigmoid layer. We use weights pre-trained on ILSVRC [22]. Following the baseline methods [26, 44], input images are randomly cropped to \(224 \times 224\) pixels after being resized to \(256 \times 256\). During testing, we directly resize the input images to \(224 \times 224\). For classification results, we average the class scores from the softmax layer over 10 crops (4 corners plus the center, each with its horizontal flip).
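The 10-crop averaging for classification can be sketched with standard torchvision transforms; resizing to 256 before taking the ten 224-pixel crops is an assumption of this sketch, and `model` is the SPG module from the earlier sketch.

```python
import torch
from torchvision import transforms

# Ten-crop evaluation: 4 corners + center, each with its horizontal flip.
ten_crop = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(c) for c in crops])),   # (10, 3, 224, 224)
])

def classify_ten_crop(model, pil_image):
    crops = ten_crop(pil_image)                 # 10 crops of one image
    with torch.no_grad():
        logits, *_ = model(crops)
        probs = torch.softmax(logits, dim=1)
    return probs.mean(dim=0)                    # average class scores over crops
```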

We implement the networks using PyTorch. We finetune the networks with an initial learning rate of 0.001 (0.01 for the added layers) on ILSVRC, decreased by a factor of 10 after every epoch. The batch size is 30 and the weight decay is 0.0005. The momentum of the SGD optimizer is set to 0.9. To set the thresholds, we randomly sample some images and visualize their localization maps. We adjust \(\delta _h\) to mine object seeds: the object seeds should include as many object pixels as possible while excluding background pixels. Similarly, \(\delta _l\) is adjusted so that the background seeds are as large as possible while excluding object regions. We choose \(\delta _h=0.5\) and \(\delta _l=0.05\) for B1, and \(\delta _h=0.7\) and \(\delta _l=0.1\) for B2. We train the networks on an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory. Code is available at https://github.com/xiaomengyc/SPG.
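The reported optimization settings map onto PyTorch parameter groups roughly as follows; which parameters count as pre-trained versus newly added (here keyed on the `stem` prefix of the earlier sketch) is an assumption.

```python
import torch

# Pre-trained layers use lr=0.001, newly added layers use lr=0.01;
# the learning rate decays by a factor of 10 every epoch.
pretrained_params = [p for n, p in model.named_parameters() if n.startswith('stem')]
new_params = [p for n, p in model.named_parameters() if not n.startswith('stem')]

optimizer = torch.optim.SGD(
    [{'params': pretrained_params, 'lr': 0.001},
     {'params': new_params, 'lr': 0.01}],
    momentum=0.9, weight_decay=0.0005)

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
```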

4 Experiments

4.1 Experiment Setup

Dataset and Evaluation. We evaluate the Top-1 and Top-5 localization accuracy of the proposed approach. We mainly compare our approach with other baseline methods on the ILSVRC 2016 dataset, as it has more than 1.2 million images of 1,000 classes for training. We report the accuracy on the validation set of 50,000 images. We also test our algorithm on the bird dataset CUB-200-2011 [31], which contains 11,788 images of 200 categories, with 5,994 images for training and 5,794 for testing. We adopt the localization metric suggested by [22]: an image has a correctly predicted bounding box if (1) the image label is predicted correctly, and (2) the predicted bounding box has more than 50% overlap with a ground-truth box.
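The localization criterion of [22] can be written down directly; the box format (x_min, y_min, x_max, y_max) and the +1 pixel convention in the area computation are assumptions of this sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1 + 1) * max(0, iy2 - iy1 + 1)
    area_a = (box_a[2] - box_a[0] + 1) * (box_a[3] - box_a[1] + 1)
    area_b = (box_b[2] - box_b[0] + 1) * (box_b[3] - box_b[1] + 1)
    return inter / float(area_a + area_b - inter)

def correct_localization(pred_label, gt_label, pred_box, gt_boxes):
    """Top-1 localization: the class must be right and the predicted box must
    overlap some ground-truth box of that image by more than 50% IoU."""
    return pred_label == gt_label and any(iou(pred_box, b) > 0.5 for b in gt_boxes)
```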

Table 1. Localization error on ILSVRC validation set (* indicates methods which improve the Top-5 performance only using predictions with high scores).
Table 2. Localization error on CUB-200-2011 test set (* indicates methods which improve the Top-5 performance only using predictions with high scores).

4.2 Comparison with the State-of-the-Arts

We compare the proposed SPG approach with the state-of-the-art methods on ILSVRC validation set and CUB-200-2011 test set.

Localization: Table 1 shows the localization error of various baseline algorithms on the ILSVRC validation set. Our baseline SPG-plain model achieves Top-1 and Top-5 localization errors of 53.71% and 41.81%, respectively. On top of the SPG-plain network, the SPG strategy further reduces the localization error to 51.40% (Top-1) and 40.00% (Top-5). Table 2 shows the results on CUB-200-2011, where the SPG approach achieves a Top-1 localization error of 53.36%. The results on both ILSVRC and CUB outperform the state-of-the-art ACoL approach [43], which applies two classifier branches to discover complementary object regions. Following the baseline methods [43, 44], we improve the Top-5 localization error by repeatedly using the predicted bounding boxes with high classification scores: we select two bounding boxes from the top-1 and top-2 predicted classes, and one from the third class. In this way, the Top-5 localization error (indicated by *) on ILSVRC is improved to 35.05%, and that on CUB-200-2011 to 40.62%. To summarize, the improvement of the plain network is mainly attributed to the structure of the Inception-v3 network, which can capture larger object regions. The improvement of the SPG network is attributed to the use of auxiliary supervision: SPG encourages the classification network to learn more pixel-level correlations, and as a result the localization performance increases.

Localization performance is limited by classification accuracy, because the localization overlap is only computed on images whose image-level labels are predicted correctly. In order to break this limitation, we further improve the localization performance by combining our localization results with state-of-the-art classification results, i.e., ResNet [7] and DPN [2]. As shown in Table 3, the localization performance improves consistently as the classification results get better. When we use the classification results of the ensemble DPN method (an ensemble of DPN-92, DPN-98 and DPN-131), which has very low classification errors of 15.47% (Top-1) and 2.70% (Top-5), the localization error decreases to 43.83% (Top-1) and 29.36% (Top-5).

Table 3. Localization/Classification error on ILSVRC validation set with the state-of-the-art classification results.
Fig. 3.

Illustration of the attention maps and the predicted bounding boxes of SPG on ILSVRC and CUB-200-2011. The predicted bounding boxes are in green and the ground-truth boxes are in red. Best viewed in color.

Fig. 4.

Output maps of the proposed SPG approach. The localization maps usually highlight only a small region of the object. We extract the seeds of the self-produced guidance by segmenting the confident regions of the localization maps into foreground (white) and background (black), and ignoring the remaining regions (grey). These seeds are used as supervision to learn better self-produced guidance maps. Finally, the learned maps are leveraged to encourage the network to improve the quality of the localization maps.

Figure 3 shows the attention maps as well as the bounding boxes predicted by SPG on ILSVRC and CUB-200-2011. Our approach can highlight nearly the entire object region and produce precise bounding boxes. Figure 4 visualizes the outputs of the multiple branches involved in generating the self-produced guidance. The attention maps generated from the classification network are leveraged to produce the seeds of foreground and background. We can observe that the seeds usually cover a small region of the object and background pixels. The produced seed masks (Mask-A) are then utilized as supervision for the B2 branch. With such supervision, B2 learns more confident patterns of foreground and background pixels and precisely predicts the remaining foreground/background regions that are left undefined in Mask-A. B1 leverages lower-level feature maps and the supervision from B2 to learn more detailed regions. Finally, the self-produced guidance is obtained by fusing the outputs of B1 and B2. This guidance is used as auxiliary supervision to encourage the classification network to learn better attention maps.

4.3 Ablation Study

Limitation of the Localization Accuracy

The localization error rate is affected by the network's classification performance. To eliminate the influence of classification accuracy, we compare localization performance using ground-truth labels. As shown in Table 4, the proposed SPG outperforms the other approaches. The Top-1 error of SPG-plain is 37.32%, which is better than the other baseline approaches. With the assistance of the auxiliary supervision, the localization error with ground-truth labels reduces to 35.31%. This reveals the superiority of the attention maps generated by our method, and shows that the proposed self-produced guidance maps successfully encourage the network to learn better object regions.

Table 4. Localization error on ILSVRC validation data with ground-truth labels.

Effect of the Cascade Learning Strategy

In the proposed method, we learn the self-produced guidance maps in a two-stage way. The branch B2 is supervised by the guidance maps generated from the localization maps of SPG-A, while the branch B1 is supervised by the self-produced guidance derived from the output of B2. To verify the effectiveness of this two-stage design, we break this structure and use the initial seed masks as supervision for both branches. As a result, we obtain a higher Top-1 error rate of 35.58% when ground-truth classification labels are provided. We therefore conclude that the two-stage structure in SPG-B is useful for generating better self-produced guidance maps, and in turn more effective for generating better attention maps. We also find it helpful to share the second and third layers between B1 and B2: removing this sharing increases the localization error rate from 35.31% to 36.31%.

Effect of the Auxiliary Supervision

We propose to use the self-produced guidance maps as pixel-level auxiliary supervision to encourage the classification network to learn better localization maps through SPG-C. Thus, we remove SPG-C to test how it influences the classification network. After removing SPG-C, the performance becomes worse, with a Top-1 error rate of 36.06% on the ILSVRC validation set when ground-truth labels are provided. This reveals that the proposed self-produced guidance maps are effective for improving the quality of the localization maps when used as auxiliary supervision through SPG-C. Notably, the localization performance using only SPG-B is still better than that of the plain version, so the branches in SPG-B also contribute to the improvement of localization accuracy.

5 Conclusions

In this paper, we proposed the Self-produced Guidance approach for locating target object regions given only image-level labels. The proposed approach generates high-quality self-produced guidance maps that encourage the classification network to learn pixel-level correlations. Thereby, the network can detect many more object regions for localization. Extensive experiments show that the proposed method detects more object regions and outperforms the state-of-the-art localization methods.