Multimedia Tools and Applications, Volume 78, Issue 10, pp 13247–13261

Instance-level object retrieval via deep region CNN

  • Shuhuan Mei
  • Weiqing Min
  • Hua Duan
  • Shuqiang Jiang


Instance retrieval is a fundamental problem in the multimedia field with various applications. Since relevancy is defined at the instance level, it is more challenging than traditional image retrieval. Recent advances show that Convolutional Neural Networks (CNNs) offer an attractive approach to image feature representation. However, a CNN typically extracts features from the whole image, so the extracted features contain a large amount of background noise, which degrades retrieval performance. To address this problem, this paper proposes a deep region CNN method with object detection for instance-level object retrieval, which has two phases: offline Faster R-CNN training and online instance retrieval. First, we train a Faster R-CNN model to better locate the regions of objects. Second, we extract CNN features from the detected object region and retrieve relevant images based on the visual similarity of these features. Furthermore, we utilize three different strategies for fusing features based on the detected object region candidates from Faster R-CNN. We conduct experiments on a large dataset, INSTRE, with 23,070 object images and an additional one million distractor images. Qualitative and quantitative evaluation results demonstrate the advantage of our proposed method. In addition, extensive experiments on the Oxford dataset further validate its effectiveness.
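The online retrieval phase described above reduces to ranking database images by the visual similarity of region-level descriptors. The sketch below illustrates that ranking step with NumPy; it is a minimal illustration, not the paper's implementation, and it assumes the region descriptors have already been extracted from detected object regions (e.g., by a Faster R-CNN detector). The feature values here are synthetic placeholders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Normalize descriptors so cosine similarity becomes a plain dot product.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rank_by_similarity(query_feat, db_feats):
    # query_feat: (d,) region descriptor of the query object
    # db_feats:   (n, d) region descriptors of the database images
    q = l2_normalize(query_feat)
    db = l2_normalize(db_feats)
    sims = db @ q                 # cosine similarities, shape (n,)
    order = np.argsort(-sims)     # indices sorted most-similar first
    return order, sims[order]

# Toy example: 4-D descriptors for 3 database images.
query = np.array([1.0, 0.0, 0.0, 0.0])
db = np.array([
    [0.9, 0.1, 0.0, 0.0],   # near-duplicate of the query object
    [0.0, 1.0, 0.0, 0.0],   # unrelated object
    [0.5, 0.5, 0.0, 0.0],   # partial match
])
order, sims = rank_by_similarity(query, db)
print(order.tolist())  # [0, 2, 1]
```

In the paper's setting, each database descriptor would come from a detected object region rather than the whole image, which is precisely what removes the background noise from the similarity computation.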


Keywords: Faster R-CNN, Deep learning, Instance-level object retrieval, INSTRE



This work was supported in part by the National Natural Science Foundation of China (61532018, 61322212, 61602437, 61672497, 61472229 and 61202152), in part by the Beijing Municipal Commission of Science and Technology (D161100001816001), in part by the Beijing Natural Science Foundation (4174106), in part by the Lenovo Outstanding Young Scientists Program, in part by the National Program for Special Support of Eminent Professionals and the National Program for Support of Top-notch Young Professionals, and in part by the China Postdoctoral Science Foundation (2016M590135, 2017T100110). This work was also supported in part by the Science and Technology Development Fund of Shandong Province of China (2016ZDJS02A11 and ZR2017MF027), the Taishan Scholar Climbing Program of Shandong Province, and the SDUST Research Fund (2015TDJH102).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Shuhuan Mei (1, 2)
  • Weiqing Min (2)
  • Hua Duan (1), corresponding author
  • Shuqiang Jiang (2, 3)
  1. College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, China
  2. Key Lab of Intelligent Information Processing, Institute of Computing Technology, CAS, Beijing, China
  3. University of Chinese Academy of Sciences, Beijing, China