
Deep Learning-Based Descriptors for Object Instance Search

  • Jie Lin (corresponding author)
  • Olivier Morère
  • Antoine Veillard
  • Vijay Chandrasekhar
Chapter

Abstract

In the past 5 years, deep learning has achieved remarkable breakthroughs, mainly attributed to the success of convolutional neural networks (CNNs) on vision applications such as ImageNet classification. In this chapter, we are interested in deep learning-based descriptors for object instance search in images. Specifically, we tackle two practical issues of existing CNN models, with a focus on resource-efficient yet effective deep descriptors extracted from CNNs: (1) How can compact image representations (e.g., hundreds of bits) be obtained from deep neural networks in an end-to-end manner? (2) How can scale and rotation invariances be encoded into deep CNN architectures? To address these issues, our approach makes two novel contributions. First, we propose a Restricted Boltzmann Machine with a novel batch-level regularization scheme specifically designed for descriptor hashing (RBMH), which is able to match the performance of the uncompressed descriptor with tiny 32–256 bit hashes. Second, inspired by invariance theory, we propose Nested Invariance Pooling (NIP), a method for computing group-invariant transformations with feed-forward neural networks. We specifically incorporate scale, translation, and rotation invariances, but the scheme can be extended to arbitrary sets of transformations. A thorough empirical comparison with the state of the art shows that the results obtained with both the NIP descriptors and the NIP+RBMH hashes are consistently outstanding across a wide range of datasets.
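
The idea of pooling over groups of transformations can be illustrated with a small sketch. The code below is not the chapter's NIP implementation: the feature maps are toy cyclic gradients standing in for CNN activations, the function names are hypothetical, only 90-degree rotations and cyclic translations are covered (scale is omitted), and the choice of average pooling followed by max pooling is an assumption made for illustration.

```python
def rot90(img):
    """Rotate a square 2-D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def toy_feature_maps(img):
    """Stand-in for CNN feature maps (assumption): cyclic x/y gradients."""
    n = len(img)
    dx = [[img[r][(c + 1) % n] - img[r][c] for c in range(n)] for r in range(n)]
    dy = [[img[(r + 1) % n][c] - img[r][c] for c in range(n)] for r in range(n)]
    return [dx, dy]


def pool_translations(fmap):
    """Inner pooling: average rectified responses over all positions."""
    values = [max(v, 0) for row in fmap for v in row]
    return sum(values) / len(values)


def nested_pooling_descriptor(img):
    """Outer pooling: per-channel max over the four 90-degree rotations."""
    per_rotation = []
    current = img
    for _ in range(4):
        maps = toy_feature_maps(current)
        per_rotation.append([pool_translations(m) for m in maps])
        current = rot90(current)
    return [max(channel) for channel in zip(*per_rotation)]


image = [
    [0, 1, 2, 3],
    [4, 9, 1, 0],
    [2, 7, 5, 1],
    [0, 3, 8, 6],
]
rotated = rot90(image)
shifted = [row[1:] + row[:1] for row in image]  # cyclic shift of columns

print(nested_pooling_descriptor(image) == nested_pooling_descriptor(rotated))
print(nested_pooling_descriptor(image) == nested_pooling_descriptor(shifted))
```

Both comparisons print True. Rotating the input only permutes the set of rotated copies that the outer max ranges over, and a cyclic shift only permutes the positions that the inner average ranges over, so the descriptor is unchanged in either case.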

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Jie Lin (1, corresponding author)
  • Olivier Morère (2)
  • Antoine Veillard (2)
  • Vijay Chandrasekhar (3)

  1. Institute for Infocomm Research, Singapore, Singapore
  2. Université Pierre et Marie Curie, Paris, France
  3. Institute for Infocomm Research and Nanyang Technological University, Singapore, Singapore
