Compact Deep Aggregation for Set Retrieval

  • Yujie Zhong
  • Relja Arandjelović
  • Andrew Zisserman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)


The objective of this work is to learn a compact embedding of a set of descriptors that is suitable for efficient retrieval and ranking, whilst maintaining discriminability of the individual descriptors. We focus on a specific example of this general problem – that of retrieving images containing multiple faces from a large-scale dataset of images. Here the set consists of the face descriptors in each image, and given a query for multiple identities, the goal is to retrieve, in order, images which contain all the identities, all but one, etc.

To this end, we make the following contributions: first, we propose a CNN architecture – SetNet – to achieve the objective: it learns face descriptors and their aggregation over a set to produce a compact fixed-length descriptor designed for set retrieval, and the score of an image is a count of the number of identities that match the query; second, we show that this compact descriptor has minimal loss of discriminability up to two faces per image, and degrades slowly after that – far exceeding a number of baselines; third, we explore the speed vs. retrieval quality trade-off for set retrieval using this compact descriptor; and, finally, we collect and annotate a large dataset of images containing varying numbers of celebrities, which we use for evaluation and which will be publicly released.
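The ranking criterion described above (score an image by how many query identities it contains) can be illustrated with a minimal sketch. This is not SetNet itself, which learns both the face descriptors and their aggregation; it is plain sum aggregation under the idealised assumption that descriptors of different identities are exactly orthonormal, so the dot product between the aggregated query and an aggregated image equals the number of shared identities. The function names (`aggregate`, `rank_images`, `one_hot`) and the toy data are hypothetical and chosen only for illustration.

```python
def aggregate(descriptors):
    """Sum per-face descriptors into one fixed-length set vector."""
    dim = len(descriptors[0])
    return [sum(d[i] for d in descriptors) for i in range(dim)]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_images(query_descriptors, image_vectors):
    """Score each image against the aggregated query and sort by score."""
    q = aggregate(query_descriptors)
    scores = {image_id: dot(q, vec) for image_id, vec in image_vectors.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


def one_hot(index, dim=4):
    """Idealised identity descriptor (assumption: identities are orthonormal)."""
    return [1.0 if i == index else 0.0 for i in range(dim)]


# Four toy identities, each with an orthonormal descriptor.
identities = {name: one_hot(i) for i, name in enumerate("ABCD")}

# Each image is stored as a single vector, regardless of how many faces it has.
images = {
    "none": aggregate([identities["C"], identities["D"]]),
    "one": aggregate([identities["A"], identities["C"]]),
    "both": aggregate([identities["A"], identities["B"]]),
}

# Query for identities A and B together.
ranking, scores = rank_images([identities["A"], identities["B"]], images)
print(ranking)  # ['both', 'one', 'none']
print(scores)   # {'none': 0.0, 'one': 1.0, 'both': 2.0}
```

With real face descriptors the vectors are only approximately orthogonal, so the score approximates the count rather than equalling it, and the approximation worsens as more faces are summed into one vector. This is the loss of discriminability that the abstract says the learned descriptor is designed to limit.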



This work was funded by an EPSRC studentship and EPSRC Programme Grant Seebibyte EP/M013774/1.

Supplementary material

478824_1_En_36_MOESM1_ESM.pdf (4.2 mb)
Supplementary material 1 (pdf 4281 KB)



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Visual Geometry Group, Department of Engineering Science, University of Oxford, Oxford, UK
  2. DeepMind, London, UK
