Bottom-Up Top-Down Cues for Weakly-Supervised Semantic Segmentation

  • Qibin Hou
  • Daniela MassicetiEmail author
  • Puneet Kumar Dokania
  • Yunchao Wei
  • Ming-Ming Cheng
  • Philip H. S. Torr
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10746)


We consider the task of learning a classifier for semantic segmentation using weak supervision in the form of image labels specifying objects present in the image. Our method uses deep convolutional neural networks (cnns) and adopts an Expectation-Maximization (EM) based approach. We focus on the following three aspects of EM: (i) initialization; (ii) latent posterior estimation (E-step) and (iii) the parameter update (M-step). We show that saliency and attention maps, bottom-up and top-down cues respectively, of images with single objects (simple images) provide highly reliable cues to learn an initialization for the EM. Intuitively, given weak supervisions, we first learn to segment simple images and then move towards the complex ones. Next, for updating the parameters (M step), we propose to minimize the combination of the standard softmax loss and the KL divergence between the latent posterior distribution (obtained using the E-step) and the likelihood given by the cnn. This combination is more robust to wrong predictions made by the E step of the EM algorithm. Extensive experiments and discussions show that our method is very simple and intuitive, and outperforms the state-of-the-art method with a very high margin of 3.7% and 3.9% on the PASCAL VOC12 train and test sets respectively, thus setting new state-of-the-art results.



Qibin Hou, Yunchao Wei and Ming-Ming Cheng were sponsored by NSFC (61620106008, 61572264), CAST (YESS20150117), Huawei Innovation Research Program (HIRP), and IBM Global SUR award. Daniela Massiceti, Punnet K. Dokania and Philip H.S. Torr were sponsored by ERC grant ERC-2012-AdG 321162-HELIOS. Ms Massiceti was also sponsored by the Skye Foundation. We thank all sponsors for their support.


  1. 1.
    Ahmed, F., Tarlow, D., Batra, D.: Optimizing expected intersection-over-union with candidate-constrained CRFs. In: ICCV (2015)Google Scholar
  2. 2.
    Alexe, B., Deselares, T., Ferrari, V.: Measuring the objectness of image windows. PAMI 34(11), 2189–2202 (2012)CrossRefGoogle Scholar
  3. 3.
    Arbelaez, P., Pont-Tuset, J., Barron, J., Marques, F., Malik, J.: Multiscale combinatorial grouping. In: CVPR (2014)Google Scholar
  4. 4.
    Bearman, A., Russakovsky, O., Ferrari, V., Fei-Fei, L.: What’s the point: semantic segmentation with point supervision. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 549–565. Springer, Cham (2016). Google Scholar
  5. 5.
    Chandra, S., Kokkinos, I.: Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 402–418. Springer, Cham (2016). Google Scholar
  6. 6.
    Chen, L.-G., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected. In: ICLR (2015)Google Scholar
  7. 7.
    Cheng, M., Zhang, Z., Lin, W., Torr, P.H.S.: BING: binarized normed gradients for objectness estimation at 300fps. In: CVPR (2014)Google Scholar
  8. 8.
    Cheng, M.-M., Mitra, N.J., Huang, X., Torr, P.H.S., Hu, S.-M.: Global contrast based salient region detection. IEEE TPAMI 37(3), 569–582 (2015)CrossRefGoogle Scholar
  9. 9.
    Cogswell, M., Lin, X., Purushwalkam, S., Batra, D.: Combining the best of graphical models and convnets for semantic segmentation (2014). arXiv:1412.4313
  10. 10.
    Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. 39(1), 1–38 (1977)MathSciNetzbMATHGoogle Scholar
  11. 11.
    Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)Google Scholar
  12. 12.
    Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C., Winn, J., Zisserman, A.: The Pascal visual object classes challenge a retrospective. IJCV 111(1), 98–136 (2015)CrossRefGoogle Scholar
  13. 13.
    Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph based image segmentation. IJCV 59(2), 167–181 (2004)CrossRefGoogle Scholar
  14. 14.
    Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: ICCV (2011)Google Scholar
  15. 15.
    Hou, Q., Cheng, M.-M., Hu, X.-W., Borji, A., Tu, Z., Torr, P.: Deeply supervised salient object detection with short connections. In: IEEE CVPR (2017)Google Scholar
  16. 16.
    Jeff Wu, C.F.: On the convergence properties of the EM algorithm. Ann. Stat. 11(1), 95–103 (1983)MathSciNetCrossRefzbMATHGoogle Scholar
  17. 17.
    Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: ACM International Conference on Multimedia (2014)Google Scholar
  18. 18.
    Jiang, H., Wang, J., Yuan, Z., Wu, Y., Zheng, N., Li, S.: Salient object detection: a discriminative regional feature integration approach. In: CVPR (2013)Google Scholar
  19. 19.
    Kolesnikov, A., Lampert, C.H.: Seed, expand and constrain: three principles for weakly-supervised image segmentation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 695–711. Springer, Cham (2016). CrossRefGoogle Scholar
  20. 20.
    Krahenbuhl P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: NIPS (2011)Google Scholar
  21. 21.
    Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)MathSciNetCrossRefzbMATHGoogle Scholar
  22. 22.
    Lin, D., Dai, J., Jia, J., He, K., Sun, J.: ScribbleSup: scribble-supervised convolutional networks for semantic segmentation. In: CVPR (2016)Google Scholar
  23. 23.
    Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). Google Scholar
  24. 24.
    Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)Google Scholar
  25. 25.
    McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions. Wiley, Hoboken (1997)zbMATHGoogle Scholar
  26. 26.
    Nowozin, S.: Optimal decisions from probabilistic models: the intersection-over-union case. In: CVPR (2014)Google Scholar
  27. 27.
    Papandreou, G., Chen, L.-C., Murphy, K.P., Yuille, A.L.: Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In: ICCV (2015)Google Scholar
  28. 28.
    Pathak, D., Krahenbuhl, P., Darrell, T.: Constrained convolutional neural networks for weakly supervised segmentation. In: ICCV (2015)Google Scholar
  29. 29.
    Pathak, D., Shelhamer, E., Long, J., Darrell, T.: Fully convolutional multi-class multiple instance learning. In: ICLR (2014)Google Scholar
  30. 30.
    Pinheiro, P.O., Collobert, R.: From image-level to pixel-level labeling with convolutional networks. In: CVPR (2015)Google Scholar
  31. 31.
    Qi, X., Liu, Z., Shi, J., Zhao, H., Jia, J.: Augmented feedback in semantic segmentation under image level supervision. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 90–105. Springer, Cham (2016). CrossRefGoogle Scholar
  32. 32.
    Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. In: ICLR (2014)Google Scholar
  33. 33.
    Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. IJCV 104(2), 154–171 (2013)CrossRefGoogle Scholar
  34. 34.
    Wei, Y., Feng, J., Liang, X., Cheng, M., Zhao, Y., Yan, S.: Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In: CVPR (2017)Google Scholar
  35. 35.
    Xu, J., Schwing, A., Urtasun, R.: Learning to segment under various forms of weak supervision. In: CVPR (2015)Google Scholar
  36. 36.
    Yunchao, W., Xiaodan, L., Yunpeng, C., Xiaohui, S., Cheng, M.-M., Yao, Z., Shuicheng, Y.: STC: a simple to complex framework for weakly-supervised semantic segmentation (2015). arXiv:1509.03150
  37. 37.
    Zhang, J., Lin, Z., Brandt, J., Shen, X., Sclaroff, S.: Top-down neural attention by excitation backprop. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 543–559. Springer, Cham (2016). CrossRefGoogle Scholar
  38. 38.
    Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.: Conditional random fields as recurrent neural networks. In: ICCV (2015)Google Scholar
  39. 39.
    Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR (2016)Google Scholar

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. 1.Nankai UniversityTianjinChina
  2. 2.University of OxfordOxfordUK
  3. 3.NUSSingaporeSingapore

Personalised recommendations