Deep Cross-Modal Projection Learning for Image-Text Matching

  • Ying Zhang
  • Huchuan Lu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11205)


The key to image-text matching is accurately measuring the similarity between visual and textual inputs. Despite great progress in learning deep cross-modal embeddings with the bi-directional ranking loss, devising strategies for mining useful triplets and selecting appropriate margins remains a challenge in real applications. In this paper, we propose a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss for learning discriminative image-text embeddings. The CMPM loss minimizes the KL divergence between the projection compatibility distributions and the normalized matching distributions defined with all the positive and negative samples in a mini-batch. The CMPC loss categorizes the vector projection of representations from one modality onto another with an improved norm-softmax loss, further enhancing the feature compactness of each class. Extensive analysis and experiments on multiple datasets demonstrate the superiority of the proposed approach.
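The CMPM idea described in the abstract can be illustrated with a small, dependency-free sketch. Several details here are assumptions that the abstract does not spell out: the compatibility distribution is taken to be a softmax over the scalar projections of an image feature onto the L2-normalized text features, the matching distribution is taken to be uniform over the texts that share the image's label, the divergence is computed as KL(p || q), and a small `eps` is added for numerical stability. The function name `projection_matching_loss` and the toy inputs are hypothetical.

```python
import math


def projection_matching_loss(src_feats, tgt_feats, labels, eps=1e-8):
    """Illustrative one-directional projection matching loss for a mini-batch.

    src_feats, tgt_feats: lists of equal-length feature vectors (lists of floats).
    labels: identity/class label of each (src, tgt) pair in the batch.
    """
    n = len(src_feats)

    # L2-normalize the target features so that a dot product with a source
    # feature equals the scalar projection of the source onto the target.
    tgt_unit = []
    for z in tgt_feats:
        norm = math.sqrt(sum(v * v for v in z)) or 1.0
        tgt_unit.append([v / norm for v in z])

    total = 0.0
    for i in range(n):
        # Scalar projections of sample i onto every target in the batch.
        proj = [sum(a * b for a, b in zip(src_feats[i], zb)) for zb in tgt_unit]

        # Softmax over the batch (assumed form of the compatibility distribution).
        top = max(proj)
        exps = [math.exp(s - top) for s in proj]
        denom = sum(exps)
        p = [e / denom for e in exps]

        # Normalized matching distribution: uniform over same-label targets.
        match = [1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
        match_sum = sum(match)
        q = [m / match_sum for m in match]

        # KL(p || q), skipping terms where p underflowed to zero.
        total += sum(
            pj * math.log(pj / (qj + eps)) for pj, qj in zip(p, q) if pj > 0.0
        )

    return total / n


# Toy batch: two image/text pairs with distinct labels.
images = [[5.0, 0.0], [0.0, 5.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
labels = [0, 1]

print(projection_matching_loss(images, texts_aligned, labels))
print(projection_matching_loss(images, texts_swapped, labels))
```

In this toy batch the aligned pairing should yield a much smaller loss than the swapped one, because every positive and negative pair in the batch contributes to the distribution without any triplet mining or margin. A bidirectional variant could add the text-to-image term by calling the same function with the two feature arguments swapped.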


Keywords: Image-text matching, Cross-modal projection, Joint embedding learning, Deep learning



This work was supported by the Natural Science Foundation of China under Grants 61725202, 61751212, 61771088, 61632006 and 91538201.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Dalian University of Technology, Dalian, China
