A Structured Listwise Approach to Learning to Rank for Image Tagging

  • Jorge Sánchez
  • Franco Luque
  • Leandro Lichtensztein
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11134)


With the growing quantity and diversity of publicly available image data, computer vision plays a crucial role in understanding and organizing visual information. Image tagging models are widely used to make this data accessible and useful. Generating image labels and ranking them by their relevance to the visual content remains an open problem. In this work, we use a bilinear compatibility function, inspired by zero-shot learning, that allows us to rank tags according to their relevance to the image content. We propose a novel listwise structured loss formulation to learn this function from data. We leverage captioned image data and propose different “tags from captions” schemes meant to capture user attention and intra-user agreement in a simple and effective manner. We evaluate our method on the COCO-Captions, PASCAL-sentences and MIRFlickr-25k datasets, showing promising results.
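
As a rough illustration of how a bilinear compatibility function can rank tags, the sketch below scores each candidate tag as x^T W y, where x is an image feature vector, y is a tag embedding and W is a compatibility matrix. It is a minimal example under stated assumptions, not the implementation from the paper: the function names, the toy vectors and the hand-set matrix W are all hypothetical, and in practice W would be learned from data (here, with the proposed listwise loss, which is not reproduced).

```python
from typing import Dict, List, Sequence, Tuple


def bilinear_score(x: Sequence[float],
                   W: Sequence[Sequence[float]],
                   y: Sequence[float]) -> float:
    """Return x^T W y for an image feature x (length d),
    a compatibility matrix W (d x e) and a tag embedding y (length e)."""
    return sum(x[i] * W[i][j] * y[j]
               for i in range(len(x))
               for j in range(len(y)))


def rank_tags(x: Sequence[float],
              W: Sequence[Sequence[float]],
              tag_embeddings: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Score every candidate tag against the image and sort by decreasing score."""
    scored = [(tag, bilinear_score(x, W, emb))
              for tag, emb in tag_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): a 2-d image feature, 3-d tag embeddings,
# and a hand-set 2 x 3 matrix standing in for a learned W.
image_feature = [1.0, 0.0]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
tags = {
    "dog": [0.9, 0.1, 0.0],
    "car": [0.1, 0.9, 0.0],
    "grass": [0.5, 0.5, 0.0],
}

ranking = rank_tags(image_feature, W, tags)
for tag, score in ranking:
    print(f"{tag}: {score:.2f}")
```

With these toy values the tags are ordered dog, grass, car, since the image feature only activates the first row of W. Because the score depends on a tag only through its embedding, the same function can in principle score tags that were not seen during training, which is the property borrowed from zero-shot learning.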


Learning to rank · Zero-shot learning · Image tagging · Visual-semantic compatibility · Multimodal embedding



This work was supported in part by grants PICT 2014-1651 and 2016-0118 from ANPCyT, Argentinean Ministry of Education, Culture, Science and Technology. This work used Nabucodonosor Cluster from CCAD-UNC, which is part of SNCAD, Argentina.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Jorge Sánchez (1, 2)
  • Franco Luque (1, 2)
  • Leandro Lichtensztein (3)
  1. CONICET, Córdoba, Argentina
  2. Universidad Nacional de Córdoba, Córdoba, Argentina
  3. Deep Vision AI Inc., Córdoba, Argentina
