Multimedia Tools and Applications, Volume 77, Issue 3, pp 3261–3277

Weakly supervised detection with decoupled attention-based deep representation

Abstract

Training object detectors with only image-level annotations is an important problem with a variety of applications. However, because objects are deformable, a target object delineated by a bounding box inevitably includes irrelevant context and occlusions, which causes large intra-class variation and ambiguity in distinguishing objects from background. For this reason, identifying the object of interest among a substantial amount of cluttered background is very challenging. In this paper, we propose a decoupled attention-based deep model to optimize region-based object representation. Unlike existing approaches, which model object representation in a single tower, our proposed network decouples object representation into two separate modules: image representation and attention localization. The image representation module captures a content-based semantic representation, while the attention localization module regresses an attention map that simultaneously highlights the locations of discriminative object parts and down-weights the irrelevant background present in the image. The combined representation alleviates the impact of noisy context and occlusions inside an object bounding box. As a result, object-background ambiguity is largely reduced and background regions are suppressed effectively. In addition, the proposed object representation model can be seamlessly integrated into a state-of-the-art weakly supervised detection framework, and the entire model can be trained end-to-end. We extensively evaluate detection performance on the PASCAL VOC 2007, VOC 2010 and VOC 2012 datasets. Experimental results demonstrate that our approach effectively improves weakly supervised object detection.
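
The abstract does not specify how the two modules are combined, so the following is only a minimal sketch of one common way to do it, not the authors' implementation. It assumes the attention map is a softmax over spatial locations inside a region and that the combined representation is the attention-weighted sum of per-location features. The function names (`attention_pool`, `average_pool`) and the toy data are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Pool per-location features using a normalized attention map.

    features: list of N feature vectors, one per spatial location in a region
              (stands in for the image representation module's output).
    scores:   list of N raw attention scores, one per location
              (stands in for the attention localization module's output).
    Returns the pooled feature vector and the normalized attention weights.
    """
    if len(features) != len(scores):
        raise ValueError("need exactly one attention score per location")
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [0.0] * dim
    for w, f in zip(weights, features):
        for k in range(dim):
            pooled[k] += w * f[k]
    return pooled, weights


def average_pool(features):
    """Baseline without attention: every location counts equally."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[k] for f in features) / n for k in range(dim)]


# Toy region with 4 locations and 2 channels (assumed data):
# channel 0 responds to the object, channel 1 to background clutter.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# High scores on the two object locations, low scores on the background.
scores = [2.0, 2.0, -2.0, -2.0]

pooled, weights = attention_pool(features, scores)
baseline = average_pool(features)
```

In this toy case the attention-weighted vector is dominated by the object channel, while plain average pooling mixes object and background equally, which illustrates the background suppression the abstract describes.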

Keywords

Weak supervision · Object detection · Deep learning · Attention model

Notes

Acknowledgements

This work is supported by the Chinese National Natural Science Foundation under Grants 61471049, 61372169 and 61532018.

Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China
  2. Beijing Key Laboratory of Network System and Network Culture, Beijing University of Posts and Telecommunications, Beijing, China
