Multimedia Tools and Applications

, Volume 77, Issue 17, pp 22385–22406 | Cite as

Neural ranking for automatic image annotation

  • Weifeng Zhang
  • Hua HuEmail author
  • Haiyang HuEmail author


Automatic image annotation aims to predict labels for images according to their semantic contents and has become a research focus in computer vision, as it helps people to edit, retrieve and understand large image collections. In the last decades, researchers have proposed many approaches to solve this task and achieved remarkable performance on several standard image datasets. In this paper, we propose a novel learning to rank approach to address image auto-annotation problem. Unlike typical learning to rank algorithms for image auto-annotation which directly rank annotations for image, our approach consists of two phases. In the first phase, neural ranking models are trained to rank image’s semantic neighbors. Then nearest-neighbor based models propagate annotations from these semantic neighbors to the image. Thus our approach integrates learning to rank algorithms and nearest-neighbor based models, including TagProp and 2PKNN, and inherits their advantages. Experimental results show that our method achieves better or comparable performance compared with the state-of-the-art methods on four challenging benchmarks including Corel5K, ESP Games, IAPR TC-12 and NUS-WIDE.


Automatic image annotation Learning to rank Neural networks Nearest neighbor 



This work is supported by the Natural Science Foundation of China (No. 61572162) and the Zhejiang Provincial Key Science and Technology Project Foundation (No. 2018C01012).

Compliance with ethical standards

Conflict of interests

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.


  1. 1.
    Agrawal A, Lu J, Antol S (2015) Vqa: Visual question answering. Int J Comput Vis 123(1):4–31MathSciNetCrossRefGoogle Scholar
  2. 2.
    Ballan L, Uricchio T, Seidenari L, Bimbo AD (2014) A cross-media model for automatic image annotation. In: ACM ICMR, pp 73–80Google Scholar
  3. 3.
    Blei D, Jordan M (2003) Modeling annotated data. In: ACM SIGIR, pp 127–134Google Scholar
  4. 4.
    Breiman L (2001) Random forests. Mach Learn 45(1):5–32CrossRefzbMATHGoogle Scholar
  5. 5.
    Burges C (2005) Learning to rank using gradient descent. In: ICML, pp 89–96Google Scholar
  6. 6.
    Burges C (2010) From ranknet to lambdarank to lambdamart: An overview. In: Technical report, Microsoft ResearchGoogle Scholar
  7. 7.
    Cai D, He X, Han J (2007) Semi-supervised discriminant analysis. In: ICCVGoogle Scholar
  8. 8.
    Cao Z, Qin T (2007) Learning to rank: from pairwise approach to listwise approach. In: ICML, pp 129–136Google Scholar
  9. 9.
    Carneiro G, Chan A, Moreno P, Vasconcelos N (2007) Supervised learning of semantic classes for image annotation and retrieval. IEEE Trans Pattern Anal Mach Intell 29(3):394–410CrossRefGoogle Scholar
  10. 10.
    Chatfield K, Lempitsky V, Vedaldi A, Zisserman A (2011) The devil is in the details: an evaluation of recent feature encoding methods. In: BMVC, pp 1–12Google Scholar
  11. 11.
    Chopra S, Hadsell R, LeCun Y (2005) Learning a similarity metric discriminatively, with application to face verification. In: CVPR, pp 539–546Google Scholar
  12. 12.
    Dehghani M, Zamani H, Severyn A, Kamps J, Croft WB (2017) Neural ranking models with weak supervision. In: ACM SIGIR, pp 65–74Google Scholar
  13. 13.
    Deng J, Dong W, Socher R, Li L, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database. In: CVPR, pp 248–255Google Scholar
  14. 14.
    Fabian L, Michael J, Nebojsa J (2013) Efficient ranking from pairwise comparisons. In: ICML, pp 109–117Google Scholar
  15. 15.
    Fenga S, Manmatha R, Lavrenko V (2004) Multiple bernoulli relevance models for image and video annotation. In: CVPR, pp 1002–1009Google Scholar
  16. 16.
    Fernando B, Anderson P, Hutter M, Gould S (2016) Discriminative hierarchical rank pooling for activity recognition. In: CVPR, pp 1924–1932Google Scholar
  17. 17.
    Fernando B, Gawes E, Oramas J, Ghodrati J, Tuytelaars T (2017) Rank pooling for action recognition. IEEE Trans Pattern Anal Mach Intell 39(4):773–787CrossRefGoogle Scholar
  18. 18.
    Fu H, Zhang Q, Qiu G (2012) Random forest for image annotation. In: ECCV, pp 86–99Google Scholar
  19. 19.
    Gao Z, Nie W, Liu A (2016) Evaluation of local spatial-temporal features for cross-view action recognition. Neurocomputing 173(1):110–117CrossRefGoogle Scholar
  20. 20.
    Gao Z, Zhang H, Liu A (2016) Human action recognition on depth dataset. Neural Comput Applic 27(7):2047–2054CrossRefGoogle Scholar
  21. 21.
    Gao Z, Zhang L, Chen M (2014) Enhanced and hierarchical structure algorithm for data imbalance problem in semantic extraction under massive video dataset. Multimedia Tools Appl 68(3):641–657CrossRefGoogle Scholar
  22. 22.
    Gong Y, Jia Y, Leung T, Toshev A, Ioffe S (2014) Deep convolutional ranking for multilabel image annotation. arXiv:13124894
  23. 23.
    Gong Y, Ke Q, Isard M, Lazebnik S (2014) A multi-view embedding space for modeling internet images, tags, and their semantics. Int J Comput Vis 106(2):210–233CrossRefGoogle Scholar
  24. 24.
    Gong Y, Wang L, Hodosh M, Hockenmaier J, Lazebnik S (2014) Improving image-setence embeddings using large weakly annotated photo collections. In: ECCV, pp 529–545Google Scholar
  25. 25.
    Gu Y, Xue H, Yang J (2016) Cross-modal saliency correlation for image annotation. Neural Process Lett 45(3):777–789CrossRefGoogle Scholar
  26. 26.
    Guillaumin M, Mensink T, Verbeek J, Schmid C (2009) Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In: ICCV, pp 309–316Google Scholar
  27. 27.
    Hardoon D, Szedmak S, Shawe-Taylor J (2004) Cannonical correlation analysis: An overview with application to learning methods. Neural Comput 16(12):2639–2664CrossRefzbMATHGoogle Scholar
  28. 28.
    He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, pp 770–778Google Scholar
  29. 29.
    Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML, pp 448–456Google Scholar
  30. 30.
    Jeon J, Lavreko V, Manmatha R (2003) Automatic image annotation and retrieval using cross-media relevance models. In: ACM SIGIR, pp 119–126Google Scholar
  31. 31.
    Joachims T (2002) Optimizing search engines using clickthrough data. In: ACM SIGKDD, pp 133–142Google Scholar
  32. 32.
    Johnson J, Ballan L, Fei-Fei L (2015) Love thy neighbors: Image annotation by exploiting image metadata. In: ICCV, pp 4624–4632Google Scholar
  33. 33.
    Kang F, Sukthankar R (2006) Correlated label propagation with application to multi-label learning. In: CVPR, pp 1719–1726Google Scholar
  34. 34.
    Kingma D, Ba J (2014) Adam: A method for stochastic optimization. arXiv:14126980
  35. 35.
    Kiros R, Szepesvari C (2015) Deep representations and codes for image auto-annotation. In: NIPS, pp 917–925Google Scholar
  36. 36.
    Klein B, Lev G, Sadeh G, Wolf L (2015) Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation. arXiv:14117399
  37. 37.
    Krizhevsky A, Sutskever I, Hinton G (2012) Imagenet classification with deep convolutional neural networks. In: NIPS, pp 1106–1114Google Scholar
  38. 38.
    Lavrenko V, Manmatha R, Jeon J (2004) A model for learning the semantics of pictures. In: NIPS, pp 553–560Google Scholar
  39. 39.
    Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR, pp 2169–2178Google Scholar
  40. 40.
    Li X, Snoek C, Worring M (2007) Learning social tag relevance by neighbor voting. IEEE TMM 11(7):1310–1322Google Scholar
  41. 41.
    Li Z, Liu J, Xu C, Lu H (2013) Mlrank: Multi-correlation learning to rank for image annotation. Pattern Recogn 46(10):2700–2710CrossRefzbMATHGoogle Scholar
  42. 42.
    Liu J, Li M, Liu Q, Lu H, Ma S (2009) Image annotation via graph learning. Pattern Recogn 42(2):218–228CrossRefzbMATHGoogle Scholar
  43. 43.
    Liu T (2009) Learning to rank for information retrieval. Found Trends Inf Retr 3(3):225–331CrossRefGoogle Scholar
  44. 44.
    Lowe D (2004) Distinctive image features from scale-invariant keypoints. IJCV 60(2):91–110CrossRefGoogle Scholar
  45. 45.
    Makadia A, Pavlovic V, Kumar S (2008) A new baseline for image annotation. In: ECCV, pp 316–329Google Scholar
  46. 46.
    Makadia A, Pavlovic V, Kumar S (2010) Baselines for image annotation. Int J Comput Vis 90(1):88–105CrossRefGoogle Scholar
  47. 47.
    Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:13013781
  48. 48.
    Montazer G, Giveki D (2017) Scene classification using multi-resolution waholb features and neural network classifier. Neural Process Lett 46(2):681–704CrossRefGoogle Scholar
  49. 49.
    Moran S, Lanvrenko V (2014) Sparse kernel learning for image annotation. In: ACM ICMR, p 113Google Scholar
  50. 50.
    Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV 42(3):145–175CrossRefzbMATHGoogle Scholar
  51. 51.
    Peng X, Zou C, Qiao Y, Peng Q (2010) Action recognition with stacked fisher vectors. In: ECCV, pp 581–595Google Scholar
  52. 52.
    Perronnin F, Sanchez J, Mensink T (2010) Improving the fisher kernel for large scale image classification. In: ECCV, pp 143–156Google Scholar
  53. 53.
    Song Y, Zhuang Z, Li H, Zhao Q, Li J, Lee W, Giles CL (2008) Real-time automatic tag recommendation. In: ACM SIGIR, pp 515–522Google Scholar
  54. 54.
    Thomas D, Andreas K, Joel W (2014) Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. J Mach Learn Res 15(1):3873–3923MathSciNetzbMATHGoogle Scholar
  55. 55.
    Thorsten J (2006) Training linear svms in linear time. In: KDD, pp 217–226Google Scholar
  56. 56.
    Venkatesh N, Subhransu M, Manmatha R (2015) Automatic image annotation using deep learning representations. In: ACM ICMR, pp 603–606Google Scholar
  57. 57.
    Verma Y, Jawahar C (2012) Image annotation using metric learning in semantic neighbourhoods. In: ECCV, pp 836–849Google Scholar
  58. 58.
    Verma Y, Jawahar C (2013) Exploring svm for image annotation in presence of confusing labels. In: British Machine Vision Conference, pp 1–11Google Scholar
  59. 59.
    Wang J, Yang Y, Mao J, Huang Z, Huang C, Xu W (2016) Cnn-rnn: A unified framework for multi-label image classification. In: CVPR, pp 2285–2294Google Scholar
  60. 60.
    Wang L, Liu L, Khan L (2004) Automatic image annotation and retrieval ussing subspace clustering algorithm. In: ACM International Workshop Multimedia Databases, pp 100–108Google Scholar
  61. 61.
    Weston J, Bengio S, Usunier N (2011) Wsabie: Scaling up to large vocabulary image annotation. In: IJCAI, pp 2764–2770Google Scholar
  62. 62.
    Wu F, Jing X, Yue D (2017) Multi-view discriminant dictionary learning via learning view-specific and shared structured dictionaries for image classification. Neural Process Lett 45(2):649–666CrossRefGoogle Scholar
  63. 63.
    Yan X, Su XG (2009) Linear regression analysis: Theory and computing. World Scientfic Publishing Co, Inc, River EdgeCrossRefzbMATHGoogle Scholar
  64. 64.
    Yan Y, Nie F, Li W, Gao C, Yang Y, Xu D (2016) Image classification by cross-media active learning with privileged information. IEEE Trans Multimedia 18(12):2494–2502CrossRefGoogle Scholar
  65. 65.
    Yang C, Dong M, Hua J (2007) Region-based image annotation using asymmetrical support vector machine-based multiple-instance learning. In: CVPR, pp 2057–2063Google Scholar
  66. 66.
    Yang Y, Xu D, Nie F, Yan S, Zhuang Y (2010) Image clustering using local discriminant models and global integration. IEEE Trans Image Process 19(10):2761–2773MathSciNetCrossRefzbMATHGoogle Scholar
  67. 67.
    Yang Y, Nie F, Xu D, Luo J, Zhuang Y, Pan Y (2012) A multimedia retrieval framework based on semi-supervised ranking and relevance feedback. IEEE Trans Pattern Anal Mach Intell 34(4):723–742CrossRefGoogle Scholar
  68. 68.
    Yun H, Raman P, Vishwanathan S (2014) Ranking via robust binary classification. In: NIPS, pp 2582–2590Google Scholar
  69. 69.
    Zhang S, Huang J, Huang Y (2010) Automatic image annotation using group sparsity. In: CVPR, pp 3312–3319Google Scholar
  70. 70.
    Zhu L, Xu Z, Yang Y, Hauptmann AG (2017) Uncovering the temporal context for video question answering. Int J Comput Vis 124(3):409–421MathSciNetCrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. 1.School of Computer Science and TechnologyHangzhou Dianzi UniversityHangzhouChina
  2. 2.Zhejiang Future Technology InstituteJiaxingChina
  3. 3.Science and Technology on Communication Information Security Control LaboratoryJiangnan Electronic Communication InstituteJiaxingChina

Personalised recommendations