Visual-Semantic Alignment Across Domains Using a Semi-Supervised Approach

  • Angelo Carraggi
  • Marcella Cornia
  • Lorenzo Baraldi
  • Rita Cucchiara
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11134)

Abstract

Visual-semantic embeddings have been extensively used as a powerful model for cross-modal retrieval of images and sentences. In this setting, data coming from different modalities can be projected into a common embedding space, in which distances can be used to infer the similarity between pairs of images and sentences. While this approach has shown impressive performance in fully supervised settings, its application to semi-supervised scenarios has rarely been investigated. In this paper we propose a domain adaptation model for cross-modal retrieval, in which the knowledge learned from a supervised dataset can be transferred to a target dataset in which the pairing between images and sentences is unknown, or not useful for training due to the limited size of the set. Experiments are performed on two unsupervised target scenarios, related to the fashion and cultural heritage domains, respectively. Results show that our model can effectively transfer the knowledge learned on ordinary visual-semantic datasets, achieving promising results. As an additional contribution, we collect and release the dataset used for the cultural heritage domain.
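
To make the embedding setup concrete, the following is a minimal PyTorch sketch of a two-branch visual-semantic embedding trained with a hinge-based triplet ranking loss over in-batch negatives, a common objective for this task. The encoder choices, feature dimensions, and margin are illustrative assumptions, not the architecture or the domain adaptation mechanism proposed in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSemanticEmbedding(nn.Module):
    """Two-branch model projecting image features and sentences into a
    shared embedding space. Illustrative only: the dimensions and the
    GRU text encoder are assumptions, not the paper's exact model."""

    def __init__(self, img_dim=2048, word_dim=300, embed_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)                 # image branch
        self.txt_enc = nn.GRU(word_dim, embed_dim, batch_first=True)  # text branch

    def forward(self, img_feats, word_embs):
        v = F.normalize(self.img_proj(img_feats), dim=-1)  # unit-norm image embeddings
        _, h = self.txt_enc(word_embs)                     # last hidden state of the GRU
        t = F.normalize(h.squeeze(0), dim=-1)              # unit-norm sentence embeddings
        return v, t

def ranking_loss(v, t, margin=0.2):
    """Hinge-based triplet ranking loss over in-batch negatives."""
    scores = v @ t.t()                         # cosine similarities (unit vectors)
    pos = scores.diag().unsqueeze(1)           # matched pairs sit on the diagonal
    cost_s = (margin + scores - pos).clamp(min=0)      # image anchored, wrong sentence
    cost_v = (margin + scores - pos.t()).clamp(min=0)  # sentence anchored, wrong image
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_s.masked_fill(mask, 0).sum() + cost_v.masked_fill(mask, 0).sum()

# Retrieval reduces to nearest-neighbor ranking in the shared space.
model = VisualSemanticEmbedding()
imgs = torch.randn(8, 2048)       # e.g. pooled CNN features
words = torch.randn(8, 20, 300)   # e.g. word vectors for 20-token captions
v, t = model(imgs, words)
loss = ranking_loss(v, t)
best_captions = (v[0] @ t.t()).argsort(descending=True)  # sentences ranked for image 0
```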

Keywords

Multi-modal retrieval · Visual-semantic embeddings · Semi-supervised learning

Acknowledgements

This work was supported by the CultMedia project (CTN02_00015_9852246), co-funded by the Italian MIUR. We also acknowledge the support of Facebook AI Research with the donation of the GPUs used for this research.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Angelo Carraggi (1)
  • Marcella Cornia (1)
  • Lorenzo Baraldi (1)
  • Rita Cucchiara (1)

  1. University of Modena and Reggio Emilia, Modena, Italy