
1 Introduction

Weakly Supervised Learning (WSL) has been successfully applied to many tasks, such as object localization [5, 6, 11, 13, 26, 35, 44], relation detection [40] and semantic segmentation [32,33,34, 36, 37]. WSL attracts extensive attention from researchers and practitioners because it is less dependent on massive pixel-level annotations. In this paper, we focus on the Weakly Supervised Object Localization (WSOL) problem.

Existing WSOL methods locate target object regions using convolutional classification networks. Classification networks recognize various kinds of objects by identifying discriminative regions of an object. Fully convolutional networks [17], which avoid fully connected layers, can preserve the relative positions of pixels. Therefore, the discovered discriminative regions can indicate the exact location of the target objects. Zhou et al. revisited classification networks (e.g. AlexNet [12], VGG [25] and GoogleNet [27, 28]) and proposed the Class Activation Maps (CAM) approach to find regions of interest using only image-level supervision. Following [14], CAM replaces the top fully connected layers with convolutional layers to keep the object positions and can discover the spatial distribution of discriminative regions for different classes. The key weakness of the localization maps generated by CAM is that only the most discriminative regions are highlighted, so only a small part of each target object can be located. To cope with this weakness, Wei et al. [32] proposed to apply additional networks for enriching object-related regions, given images in which the most discriminative regions are erased according to the attention maps from a pre-trained network. Moreover, Zhang et al. [43] proved that the CAM method can be simplified to enable end-to-end training. Armed with this proof, an Adversarial Complementary Learning approach was proposed in [43], incorporating one additional classifier to mine complementary object regions and finally produce accurate object localization maps. However, all these methods fail to explore the correlations among pixels.
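For concreteness, the following is a minimal sketch of how a CAM-style attention map can be computed from the final convolutional features and the classifier weights that follow global average pooling; the function and variable names are illustrative and not taken from any released code.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx):
    """Minimal CAM sketch.

    features : (C, H, W) feature maps from the last convolutional layer.
    fc_weight: (num_classes, C) weights of the classification layer
               that follows global average pooling.
    Returns an (H, W) localization map for `class_idx`, normalized to [0, 1].
    """
    cam = torch.einsum('c,chw->hw', fc_weight[class_idx], features)  # class-weighted sum
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam
```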

Fig. 1.

Learning process of Self-produced guidance. Given an input image, we first generate the corresponding attention map with a classification network. The attention map is then roughly split, following the rule that regions with high confidence should be object, whereas regions with low confidence should be background. Regions with medium confidence remain undefined. These three regions constitute the seed. Self-produced guidance is defined as the multi-stage pixel-level object mask supervised by the seed.

We observe that images can be roughly divided into foreground and background regions. The foreground pixels usually constitute the object(s) of interest. We find that attention maps inferred from classification networks [32, 43, 45] can effectively provide the probability of each pixel being foreground or background. Although pixels with high foreground/background probabilities may not cover the entire target object/background, they still provide important cues for capturing common patterns of target objects. Based on this, we can simply leverage these reliable foreground/background seeds as supervision to encourage the network to sense the distributions of foreground objects and background regions. Since correlated pixels (e.g. those within the same object or the same background region) often share similar appearance, more reliable foreground/background pixels can easily be discovered by learning from the discovered seeds. With more reliable pixels serving as supervision, the entire foreground object can gradually be distinguished from the background, which finally benefits weakly supervised object localization.

Inspired by the above motivation, in this paper we propose a Self-produced Guidance (SPG) approach for learning better attention maps and obtaining precise object positions. We leverage attention maps to produce guidance masks of foreground and background regions in a stagewise manner. The foreground/background seeds of each stage are generated following a simple rule: (1) regions with highly confident scores are considered foreground; (2) regions with very low scores are background seeds; (3) regions with medium confidence remain undefined. The undefined regions are meant to be resolved using intermediate features. We adopt a top-down mechanism, using the upper layers' output as supervision of the lower layers to learn better object localization. The upper layers contain more abstract semantic information, whereas the lower layers contain more specific pixel-level information. We leave ambiguous areas undefined until more regions can be labeled as foreground/background using upper-layer features. The more regions are defined, the stronger the ability to resolve harder regions. After obtaining the guidance masks of foreground and background, we use them as auxiliary supervision. This supervision is expected to enable the classification network to learn pixel correlations. Consequently, attention maps can clearly indicate class-specific object regions. Figure 1 illustrates the learning process of self-produced guidance. Given an input image, we first generate the corresponding attention maps through a classification network following the convenient method in [43]. The attention map is then roughly split into foreground/background seeds and ignored regions. The self-produced guidance is learned from these seeds, taking intermediate features as input, in a stagewise manner. Finally, the SPG masks of multiple layers are fused to give a more precise and integral indication of target objects.

To sum up, our main contributions are:

  • We propose a stagewise approach to learn high-quality Self-produced Guidance masks which exhibit the foreground and background of a given image.

  • We present a weakly supervised object localization method incorporating self-produced supervision, which encourages the classification network to discover pixel correlations and thereby improves localization performance.

  • The proposed method achieves a new state-of-the-art Top-1 localization error rate of 43.83% on the ILSVRC dataset with only image-level supervision.

We discuss the proposed SPG approach in detail in Sect. 3. In Sect. 4, we empirically evaluate the proposed method on the ILSVRC 2016 dataset, showing the superiority of SPG in the object localization task with only image-level supervision. We also discuss further insights into the proposed SPG algorithm through additional experiments.

2 Related Work

Convolutional neural networks have been widely used in object detection and localization tasks [3, 8, 10, 18, 25, 42]. One of the earliest deep networks to detect objects in a one-stage manner is OverFeat [23], which employs a multiscale sliding-window approach to predict object boundaries. These boundaries are then accumulated into bounding boxes. SSD [16] and YOLO [20] use a similar one-stage method, and these detectors are specifically designed to speed up the detection process. Faster-RCNN, designed by Ren et al. [21], has achieved great success in the object detection task. It generates region proposals and predicts highly reliable object locations in a unified network in real time. Lin et al. [15] showed that the performance of Faster-RCNN can be significantly improved by constructing feature pyramids at marginal extra cost.

Although these approaches are considerably successful in detecting objects of interest in images, the vast number of annotations is unaffordable for training such networks with a limited budget. Weakly supervised methods alleviate this problem by using much cheaper annotations such as image-level labels. Jie et al. [11] proposed a self-taught learning framework that first selects some high-response proposals and then finetunes the network on the selected regions to progressively improve its detection capacity. This method relies heavily on region proposals pre-processed by algorithms such as Selective Search [30]. Such general-purpose proposal algorithms may not be robust enough to produce accurate bounding boxes. Dong et al. [5] adopted two separate networks to jointly refine the region proposals and select positive regions. High-quality attention maps are also critical for object detection and segmentation [19]. Diba et al. [4] proposed that attention maps can be leveraged to produce region proposals. With the assistance of these proposals, more detailed information can easily be detected.

However, these methods introduce extra computational cost as a result of using pre-processed region proposals and multiple networks. Zhou et al. [44] discovered that the localization map for each class can be produced by aggregating top-level feature maps with a class-specific fully connected layer. Zhang et al. [41] introduced a different backpropagation scheme to produce contrastive response maps by passing top-down signals downwards. However, this method, supervised solely by image labels, tends to discover only a small part of the target objects. Wei et al. [32] applied a similar but more efficient approach to hide discriminative regions under the guidance of a pre-trained network, and then trained on the processed images to discover more regions of interest. These methods increase the number of training images and thus need considerably more computation and training time. Zhang et al. [43] provided a theoretical proof that class-specific attention maps can be produced during the forward pass by simply selecting from the feature maps of the last layer, which enables end-to-end attention learning. They also proposed the ACoL approach [43] to efficiently mine the integral target object in an enhanced classification network.

Fig. 2.

Overview of the proposed SPG approach. The input images are processed by Stem to extract mid-level feature maps, which are then fed into SPG-A for classification. The attention map is then inferred from the classification network. Self-produced guidance maps are gradually learned under the guidance of the attention map. SPG-C utilizes the self-produced guidance map as an auxiliary supervision to reinforce the quality of the attention map. GAP refers to global average pooling.

3 Self-produced Guidance

3.1 Network Overview

We denote the image set as \(I=\{(I_i, y_i)\}^{N-1}_{i=0}\), where \(y_i \in \{0,1,...,C-1\}\) is the label of image \(I_i\), N is the number of images and C is the number of image classes. Figure 2 illustrates the architecture of the SPG approach, which has four components: Stem, SPG-A, SPG-B and SPG-C. Different components have different structures and functionalities. We use lowercase f to denote functions and capital F to denote output feature maps. Stem is a fully convolutional network denoted as \(f^{Stem}(I_i,\theta ^{Stem})\), where \(\theta ^{Stem}\) denotes its parameters. The output feature maps of \(f^{Stem}\) are denoted as \(F^{Stem}\). \(f^{Stem}\) acts as a feature extractor, which takes RGB images as input and produces high-level position-aware feature maps of multiple channels. The extracted feature maps \(F^{Stem}\) are then fed into the following component, SPG-A. We denote the SPG-A component as \(f^{A}(F^{Stem}, \theta ^{A})\), which is a network for image-level classification. \(f^{A}(F^{Stem}, \theta ^{A})\) consists of four convolutional blocks (i.e. A1, A2, A3 and A4), a global average pooling (GAP) layer [14] and a softmax layer. A4 has one convolutional layer with kernel size \(1 \times 1\) and C filters. These filters correspond to the attention maps of the individual classes, so attention maps can be generated during the forward pass [43]. SPG-B is leveraged to learn Self-produced Guidance masks using the foreground and background seeds generated from attention maps. The highly confident regions within the attention maps are extracted to act as supervision for learning better object regions. SPG-B leverages the intermediate feature maps of the classification network SPG-A to predict Self-produced Guidance masks. In particular, the output feature maps \(F^{A1}\) and \(F^{A2}\) of A1 and A2 are fed into the two blocks of SPG-B, respectively. Each block of SPG-B contains three convolutional layers followed by a sigmoid layer, where the first layer adapts the different numbers of channels in the feature maps \(F^{A1}\) and \(F^{A2}\). The outputs of SPG-B are denoted as \(F^{B1}\) and \(F^{B2}\) for the two branches, respectively. The component SPG-C uses the auxiliary SPG supervision to encourage SPG-A to learn pixel-level correlations. SPG-C contains two convolutional layers with \(3 \times 3\) and \(1 \times 1\) kernels, followed by a sigmoid layer.
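To make the data flow concrete, the following is a simplified PyTorch sketch of the four components. The channel widths, the use of plain convolutions in place of Inception blocks, and the attachment point of SPG-C are assumptions of this sketch, not the exact configuration described in Sect. 3.3.

```python
import torch
import torch.nn as nn

class SPG(nn.Module):
    """Simplified sketch of the SPG architecture (illustrative channel sizes)."""

    def __init__(self, num_classes=1000, c1=288, c2=768, c_top=1024):
        super().__init__()
        # Stem: mid-level feature extractor
        self.stem = nn.Sequential(nn.Conv2d(3, c1, 3, padding=1), nn.ReLU(inplace=True))
        # SPG-A: classification blocks A1-A4
        self.a1 = nn.Sequential(nn.Conv2d(c1, c1, 3, padding=1), nn.ReLU(inplace=True))
        self.a2 = nn.Sequential(nn.Conv2d(c1, c2, 3, padding=1), nn.ReLU(inplace=True))
        self.a3 = nn.Sequential(nn.Conv2d(c2, c_top, 3, padding=1), nn.ReLU(inplace=True))
        self.a4 = nn.Conv2d(c_top, num_classes, 1)      # 1x1 conv -> per-class maps
        # SPG-B: first layers adapt channels of F^{A1}/F^{A2}; later layers are shared
        self.b1_adapt = nn.Conv2d(c1, 512, 3, padding=1)
        self.b2_adapt = nn.Conv2d(c2, 512, 3, padding=1)
        self.b_shared = nn.Sequential(nn.Conv2d(512, 512, 3, padding=1),
                                      nn.ReLU(inplace=True), nn.Conv2d(512, 1, 1))
        # SPG-C: auxiliary head supervised by the fused self-produced guidance
        self.spg_c = nn.Sequential(nn.Conv2d(c_top, 512, 3, padding=1),
                                   nn.ReLU(inplace=True), nn.Conv2d(512, 1, 1))

    def forward(self, x):
        f_stem = self.stem(x)
        f_a1 = self.a1(f_stem)
        f_a2 = self.a2(f_a1)
        f_top = self.a3(f_a2)
        cls_maps = self.a4(f_top)                       # per-class attention maps
        logits = cls_maps.mean(dim=(2, 3))              # global average pooling
        b1 = torch.sigmoid(self.b_shared(self.b1_adapt(f_a1)))
        b2 = torch.sigmoid(self.b_shared(self.b2_adapt(f_a2)))
        c_out = torch.sigmoid(self.spg_c(f_top))
        return logits, cls_maps, b1, b2, c_out
```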

3.2 Self-produced Guidance Learning

Attention maps generated from classification networks can only exhibit the most discriminative parts of target objects. We propose to generate Self-produced Guidance (SPG) masks which separate the foreground, i.e. the object of interest, from the background to provide the classification networks with spatial correlation information of pixels. The generated SPG masks are then leveraged as auxiliary supervision to encourage the networks to learn correlations between pixels. Thus, pixels within the same object will have the same responses in the feature maps. As detailed information (i.e. object edges and boundaries) is usually very abstract in the top-level feature maps, we employ intermediate features to produce precise SPG masks. Indeed, some previous works use low-level feature maps to learn object regions [9, 38], but these approaches require pixel-level ground-truth labels as supervision. Differently, we propose to use self-produced guidance by incorporating highly confident object regions within attention maps. In detail, for any image \(I_i\), we first extract its attention map O directly from a classification network. We observe that the attention maps usually highlight the most discriminative regions of the object. The initial object and background seeds can easily be obtained according to the scores in the attention maps. In particular, regions with very low scores are considered background, while regions with very high scores are foreground. The remaining regions are ignored during the learning process. We initialize the SPG learning process with these seeds. B2 is supervised by the seed map, from which it learns the patterns of foreground and background. In this way, the pixels within the ignored regions are gradually recognized. Then, we use the same strategy to find the foreground and background seeds in the output map of B2, which are used to train the B1 branch. In such a stagewise way, the intermediate information of the neural network is employed to learn the Self-produced Guidance.

We formally define this process as follows. Given an input image of size \(W \times H\), we denote the binarized SPG mask as \(M\in \{0,1,255\}^{W\times H}\), where \(M_{x,y} = 0\) if the pixel at the \(x\)-th row and \(y\)-th column belongs to a background region, \(M_{x,y} = 1\) if it belongs to an object region, and \(M_{x,y} = 255\) if it is ignored. We denote the attention map as O. The produced guidance masks can be calculated by

$$\begin{aligned} M_{x,y} = {\left\{ \begin{array}{ll} 0 &{} \text {if } O_{x,y} < \delta _l,\ 0<\delta _l<1 \\ 1 &{} \text {if } O_{x,y} > \delta _h,\ 0<\delta _h<1 \\ 255 &{} \text {if } \delta _l \le O_{x,y} \le \delta _h,\ 0<\delta _l<\delta _h<1 \end{array}\right. } \end{aligned}$$
(1)

where \(\delta _l\) and \(\delta _h\) are thresholds to identify regions in localization maps as background and foreground, respectively.
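A minimal sketch of the seeding rule in Eq. (1), assuming the attention map O has already been normalized to [0, 1]; the function name is illustrative:

```python
import torch

def generate_seed_mask(attn, delta_l, delta_h, ignore_value=255):
    """Eq. (1): split a normalized attention map O in [0, 1] into
    background (0), foreground (1) and ignored (255) pixels."""
    mask = torch.full_like(attn, float(ignore_value))
    mask[attn < delta_l] = 0.0   # confident background seed
    mask[attn > delta_h] = 1.0   # confident foreground seed
    return mask
```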

We adopt a stagewise approach to gradually learn high-quality self-produced supervision maps. B2 is applied to learn better self-produced maps supervised by the seed map \(M^{A}\). In training, only the positions labeled as 0 or 1 in the self-produced maps serve as pixel-level supervision. Pixels with the value 255 are temporarily ignored: they do not contribute to the loss and their gradients are not back-propagated. The network learns patterns from the already labeled pixels, and more regions are then recognized, because pixels belonging to the background or to objects usually share strong correlations. For example, regions belonging to the same object usually share a similar appearance. The output of B2 is then further used as an attention map, and a better self-produced supervision mask can be calculated using the same policy as in Eq. (1). After obtaining the output maps of B1 and B2, the two maps are fused to generate our final self-produced supervision map. In particular, we compute the average of the two maps and then generate the self-produced guidance \(M^{fuse}\) according to Eq. (1).
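In a PyTorch implementation, ignoring the pixels labeled 255 can be realized by masking them out of a binary cross-entropy loss. The sketch below illustrates this; apart from the 255 ignore label defined above, the names and the zero-loss fallback are assumptions.

```python
import torch
import torch.nn.functional as F

def spg_loss(pred, guidance, ignore_value=255):
    """Binary cross-entropy over pixels labeled 0/1 in the self-produced
    guidance; pixels marked `ignore_value` contribute no loss or gradient."""
    valid = guidance != ignore_value
    if valid.sum() == 0:
        return pred.sum() * 0.0          # no labeled pixels: zero loss, keep graph
    return F.binary_cross_entropy(pred[valid], guidance[valid].float())
```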

The generated self-produced guidance is leveraged as pixel-level supervision for the classification network SPG-A. Thereby, the classification network learns the correlations among pixels, and we obtain better localization maps. The entire network is trained in an end-to-end manner. We adopt the cross-entropy loss function for both the classification learning and the self-produced guidance learning. Algorithm 1 illustrates the training procedure of the proposed SPG approach.

Algorithm 1. Training procedure of the proposed SPG approach.
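Since Algorithm 1 is presented as a figure, the sketch below reconstructs one plausible training iteration from the description above, reusing generate_seed_mask and spg_loss from the earlier sketches. The plain sum of losses, the points where guidance maps are detached, and the assignment of the threshold pairs to particular stages are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def normalize_per_image(x, eps=1e-8):
    """Min-max normalize each map in the batch to [0, 1]."""
    mins = x.amin(dim=(2, 3), keepdim=True)
    maxs = x.amax(dim=(2, 3), keepdim=True)
    return (x - mins) / (maxs - mins + eps)

def resize_like(src, ref):
    """Nearest-neighbour resize of a mask to the spatial size of `ref`."""
    return F.interpolate(src, size=ref.shape[2:], mode='nearest')

def train_step(model, images, labels, optimizer):
    """One SPG training iteration (sketch)."""
    optimizer.zero_grad()
    logits, cls_maps, b1, b2, c_out = model(images)

    # image-level classification loss
    loss_cls = F.cross_entropy(logits, labels)

    # attention map of the ground-truth class, normalized and frozen as a target
    attn = cls_maps[torch.arange(images.size(0)), labels].unsqueeze(1)
    attn = normalize_per_image(attn).detach()

    # stagewise self-produced guidance: attention seeds -> B2, B2 seeds -> B1
    seed_a = generate_seed_mask(attn, delta_l=0.1, delta_h=0.7)           # supervises B2
    loss_b2 = spg_loss(b2, resize_like(seed_a, b2))
    seed_b2 = generate_seed_mask(b2.detach(), delta_l=0.05, delta_h=0.5)  # supervises B1
    loss_b1 = spg_loss(b1, resize_like(seed_b2, b1))

    # fused guidance supervises SPG-C as auxiliary pixel-level supervision
    fused = generate_seed_mask((b1.detach() + b2.detach()) / 2,
                               delta_l=0.05, delta_h=0.5)
    loss_c = spg_loss(c_out, resize_like(fused, c_out))

    loss = loss_cls + loss_b1 + loss_b2 + loss_c
    loss.backward()
    optimizer.step()
    return loss.item()
```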

During testing, we extract the attention map corresponding to the class with the highest predicted score, and then resize it to the same size as the original image by bilinear interpolation. For a fair comparison, we apply the same strategy as [44] to produce object bounding boxes from the generated object localization maps. In particular, we first segment the foreground and background with a fixed threshold. Then, we seek the tight bounding box covering the largest connected area among the foreground pixels. The thresholds for generating bounding boxes are adjusted to their optimal values by grid search. For more details please refer to [44].
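A small sketch of the box-extraction step, using SciPy's connected-component labeling; the fallback to the full image when no pixel passes the threshold is an assumption, not part of the protocol in [44].

```python
import numpy as np
from scipy import ndimage

def bbox_from_map(loc_map, threshold):
    """Threshold a 2D localization map (values in [0, 1]) and return the tight
    box (x_min, y_min, x_max, y_max) around the largest connected foreground
    component, following the CAM-style protocol."""
    fg = loc_map > threshold
    labeled, num = ndimage.label(fg)
    if num == 0:
        h, w = loc_map.shape
        return (0, 0, w - 1, h - 1)          # fall back to the whole image
    sizes = ndimage.sum(fg, labeled, index=range(1, num + 1))
    largest = int(np.argmax(sizes)) + 1      # label of the largest component
    ys, xs = np.where(labeled == largest)
    return (xs.min(), ys.min(), xs.max(), ys.max())
```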

3.3 Implementation Details

We evaluate the proposed SPG approach by modifying the Inception-v3 network [29]. In particular, we remove the layers after the second Inception block, i.e., the third Inception block, pooling and linear layer. For a fair comparison, we build a plain version of the network, named SPG-plain. We add two convolutional layers of kernel size \(3 \times 3\), stride 1, pad 1 with 1024 filters and a convolutional layer of size \(1 \times 1\), stride 1 with 1000 units (200 for CUB-200-2011). Finally, a GAP layer and a softmax layer are added on top. We extend the plain network by adding two components (SPG-B and SPG-C). The first layers of B1 and B2 are convolutional layers of kernel size \(3 \times 3\) with 288 and 768 filters, respectively. The second layers are convolutional layers with 512 filters, followed by a \(1 \times 1\) convolutional output layer. The second and third layers share parameters between B1 and B2. The stride is 1 for all convolutional layers. To keep the resolution of the feature maps, we set the pad to 1 for filters with kernel size \(3 \times 3\). SPG-C consists of two convolutional layers of kernel size \(3 \times 3\) with 512 filters and an output convolutional layer with kernel size \(1 \times 1\). All branches in SPG-B and SPG-C connect to an output sigmoid layer. We use weights pre-trained on ILSVRC [22]. Following the baseline methods [26, 44], input images are randomly cropped to \(224 \times 224\) pixels after being resized to \(256 \times 256\). During testing, we directly resize the input images to \(224 \times 224\). For classification results, we average the class scores from the softmax layer over 10 crops (4 corners plus the center, each with its horizontal flip).
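The 10-crop averaging for classification can be sketched with standard torchvision transforms; resizing to 256 before taking the ten 224-pixel crops is an assumption of this sketch, and `model` is the SPG module from the earlier sketch.

```python
import torch
from torchvision import transforms

# Ten-crop evaluation: 4 corners + center, each with its horizontal flip.
ten_crop = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(c) for c in crops])),   # (10, 3, 224, 224)
])

def classify_ten_crop(model, pil_image):
    crops = ten_crop(pil_image)                 # 10 crops of one image
    with torch.no_grad():
        logits, *_ = model(crops)
        probs = torch.softmax(logits, dim=1)
    return probs.mean(dim=0)                    # average class scores over crops
```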

We implement the networks using PyTorch. We finetune the networks with an initial learning rate of 0.001 (0.01 for the added layers) on ILSVRC, decreased by a factor of 10 after every epoch. The batch size is 30 and the weight decay is 0.0005. The momentum of the SGD optimizer is set to 0.9. To set the thresholds, we randomly sample some images and visualize their localization maps. We adjust \(\delta _h\) to mine object seeds: the object seeds should include as many object pixels as possible while excluding background pixels. Similarly, \(\delta _l\) is adjusted so that the background seeds are as large as possible while excluding object regions. We choose \(\delta _h=0.5\) and \(\delta _l=0.05\) for B1, and \(\delta _h=0.7\) and \(\delta _l=0.1\) for B2. We train the networks on an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory. Code is available at https://github.com/xiaomengyc/SPG.
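The reported optimization settings map onto PyTorch parameter groups roughly as follows; which parameters count as pre-trained versus newly added (here keyed on the `stem` prefix of the earlier sketch) is an assumption.

```python
import torch

# Pre-trained layers use lr=0.001, newly added layers use lr=0.01;
# the learning rate decays by a factor of 10 every epoch.
pretrained_params = [p for n, p in model.named_parameters() if n.startswith('stem')]
new_params = [p for n, p in model.named_parameters() if not n.startswith('stem')]

optimizer = torch.optim.SGD(
    [{'params': pretrained_params, 'lr': 0.001},
     {'params': new_params, 'lr': 0.01}],
    momentum=0.9, weight_decay=0.0005)

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
```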

4 Experiments

4.1 Experiment Setup

Dataset and Evaluation. We evaluate the Top-1 and Top-5 localization accuracy of the proposed approach. We mainly compare our approach with other baseline methods on the ILSVRC 2016 dataset, as it has more than 1.2 million images of 1,000 classes for training. We report the accuracy on the validation set of 50,000 images. We also test our algorithm on the bird dataset CUB-200-2011 [31], which contains 11,788 images of 200 categories, with 5,994 images for training and 5,794 for testing. We adopt the localization metric suggested by [22]: an image has a correctly predicted bounding box if (1) the image label is predicted correctly, and (2) the predicted bounding box has more than 50% overlap with a ground-truth box.
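The localization criterion of [22] can be written down directly; the box format (x_min, y_min, x_max, y_max) and the +1 pixel convention in the area computation are assumptions of this sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1 + 1) * max(0, iy2 - iy1 + 1)
    area_a = (box_a[2] - box_a[0] + 1) * (box_a[3] - box_a[1] + 1)
    area_b = (box_b[2] - box_b[0] + 1) * (box_b[3] - box_b[1] + 1)
    return inter / float(area_a + area_b - inter)

def correct_localization(pred_label, gt_label, pred_box, gt_boxes):
    """Top-1 localization: the class must be right and the predicted box must
    overlap some ground-truth box of that image by more than 50% IoU."""
    return pred_label == gt_label and any(iou(pred_box, b) > 0.5 for b in gt_boxes)
```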

Table 1. Localization error on ILSVRC validation set (* indicates methods which improve the Top-5 performance only using predictions with high scores).
Table 2. Localization error on CUB-200-2011 test set (* indicates methods which improve the Top-5 performance only using predictions with high scores).

4.2 Comparison with the State-of-the-Arts

We compare the proposed SPG approach with the state-of-the-art methods on ILSVRC validation set and CUB-200-2011 test set.

Localization: Table 1 shows the localization error of various baseline algorithms on the ILSVRC validation set. Our baseline SPG-plain model achieves Top-1 and Top-5 localization errors of 53.71% and 41.81%, respectively. On top of the SPG-plain network, the SPG strategy further reduces the localization error to 51.40% (Top-1) and 40.00% (Top-5). Table 2 shows the results on CUB-200-2011, where the SPG approach achieves a Top-1 localization error of 53.36%. The results on both ILSVRC and CUB outperform the state-of-the-art ACoL approach [43], which applies two classifier branches to discover complementary object regions. Following the baseline methods [43, 44], we improve the Top-5 localization error by repeatedly using the predicted bounding boxes with high classification scores: we select two bounding boxes from the top-1 and top-2 predicted classes, and one from the third class. In this way, the Top-5 localization error (indicated by *) on ILSVRC is improved to 35.05%, and that on CUB-200-2011 to 40.62%. To summarize, the improvement of the plain network is mainly attributed to the structure of the Inception-v3 network, which can capture larger object regions. The improvement of the SPG network is attributed to the use of auxiliary supervision: SPG encourages the classification network to learn more pixel-level correlations, and as a result the localization performance increases.

Localization performance is limited by classification accuracy, because the localization overlap is only computed on images whose image-level labels are predicted correctly. In order to break this limitation, we further improve the localization performance by combining our localization results with state-of-the-art classification results, i.e., ResNet [7] and DPN [2]. As shown in Table 3, the localization performance improves consistently as the classification results get better. When we use the classification results of the ensemble DPN method (an ensemble of DPN-92, DPN-98 and DPN-131), which has very low classification errors of 15.47% (Top-1) and 2.70% (Top-5), the localization error decreases to 43.83% (Top-1) and 29.36% (Top-5).

Table 3. Localization/Classification error on ILSVRC validation set with the state-of-the-art classification results.
Fig. 3.

Illustration of the attention maps and the predicted bounding boxes of SPG on ILSVRC and CUB-200-2011. The predicted bounding boxes are in green and the ground-truth boxes are in red. Best viewed in color.

Fig. 4.

Output maps of the proposed SPG approach. The localization maps usually highlight only a small region of the object. We extract the seeds of the self-produced guidance by segmenting the confident regions of the localization maps into foreground (white) and background (black), and ignoring the remaining regions (grey). These seeds are used as supervision to learn better self-produced guidance maps. Finally, the learned maps are leveraged to encourage the network to improve the quality of the localization maps.

Figure 3 shows the attention maps as well as the bounding boxes predicted by SPG on ILSVRC and CUB-200-2011. Our approach can highlight nearly the entire object region and produce precise bounding boxes. Figure 4 visualizes the outputs of the multiple branches involved in generating the self-produced guidance. The attention maps generated from the classification network are leveraged to produce the seeds of foreground and background. We can observe that the seeds usually cover a small region of the object and background pixels. The produced seed masks (Mask-A) are then utilized as supervision for the B2 branch. With such supervision, B2 learns more confident patterns of foreground and background pixels and precisely predicts the remaining foreground/background regions that are left undefined in Mask-A. B1 leverages lower-level feature maps and the supervision from B2 to learn more detailed regions. Finally, the self-produced guidance is obtained by fusing the outputs of B1 and B2. This guidance is used as auxiliary supervision to encourage the classification network to learn better attention maps.

4.3 Ablation Study

Limitation of the Localization Accuracy

The localization error rate is affected by the network's classification performance. To eliminate the influence of classification accuracy, we compare localization performance using ground-truth labels. As shown in Table 4, the proposed SPG outperforms the other approaches. The Top-1 error of SPG-plain is 37.32%, which is better than the other baseline approaches. With the assistance of the auxiliary supervision, the localization error with ground-truth labels reduces to 35.31%. This reveals the superiority of the attention maps generated by our method, and shows that the proposed self-produced guidance maps successfully encourage the network to learn better object regions.

Table 4. Localization error on ILSVRC validation data with ground-truth labels.

Effect of the Cascade Learning Strategy

In the proposed method, we learn the self-produced guidance maps in a two-stage way. The branch B2 is supervised by the guidance maps generated from the localization maps of SPG-A, while the branch B1 is supervised by the self-produced guidance derived from the output of B2. To verify the effectiveness of this two-stage design, we break this structure and use the initial seed masks as supervision for both branches. As a result, we obtain a higher Top-1 error rate of 35.58% when ground-truth classification labels are provided. We therefore conclude that the two-stage structure in SPG-B is useful for generating better self-produced guidance maps, and in turn more effective for generating better attention maps. We also find it helpful to share the second and third layers between B1 and B2: removing this sharing increases the localization error rate from 35.31% to 36.31%.

Effect of the Auxiliary Supervision

We propose to use the self-produced guidance maps as pixel-level auxiliary supervision to encourage the classification network to learn better localization maps through SPG-C. Thus, we remove SPG-C to test how it influences the classification network. After removing SPG-C, the performance becomes worse, with a Top-1 error rate of 36.06% on the ILSVRC validation set when ground-truth labels are provided. This reveals that the proposed self-produced guidance maps are effective for improving the quality of the localization maps when used as auxiliary supervision through SPG-C. Notably, the localization performance using only SPG-B is still better than that of the plain version, so the branches in SPG-B also contribute to the improvement of localization accuracy.

5 Conclusions

In this paper, we proposed the Self-produced Guidance approach for locating target object regions given only image-level labels. The proposed approach generates high-quality self-produced guidance maps that encourage the classification network to learn pixel-level correlations. Thereby, the network can detect many more object regions for localization. Extensive experiments show that the proposed method detects more object regions and outperforms the state-of-the-art localization methods.