Learning to Learn from Web Data Through Deep Semantic Embeddings

  • Raul Gomez
  • Lluis Gomez
  • Jaume Gibert
  • Dimosthenis Karatzas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11134)


In this paper we propose to learn a multimodal image and text embedding from Web and Social Media data, aiming to leverage the semantic knowledge learnt in the text domain and transfer it to a visual model for semantic image retrieval. We demonstrate that the pipeline can learn from images with associated text without supervision, and we perform a thorough analysis of five different text embeddings on three different benchmarks. We show that the embeddings learnt from Web and Social Media data perform competitively with supervised methods in the text-based image retrieval task, and we clearly outperform the state of the art on the MIRFlickr dataset when training on the target data. Furthermore, we demonstrate how semantic multimodal image retrieval can be performed using the learnt embeddings, going beyond classical instance-level retrieval problems. Finally, we present a new dataset, InstaCities1M, composed of Instagram images and their associated texts, which can be used for fair comparison of image-text embeddings.
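As a rough illustration of the text-based image retrieval task mentioned above, the sketch below ranks images by their similarity to a text query once both live in the same embedding space. It is not the authors' implementation: the three-dimensional vectors, the file names, the helper names (cosine, rank_images) and the choice of cosine similarity as the ranking score are all assumptions made for the example. In a real pipeline the image vectors would come from the trained visual model and the query vector from the text embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(query_vec, image_vecs):
    """Return image ids sorted by decreasing similarity to the query."""
    scores = {img_id: cosine(query_vec, vec) for img_id, vec in image_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embeddings (made-up numbers, not data from the paper).
image_vecs = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "skyline.jpg": [0.1, 0.9, 0.2],
    "food.jpg": [0.0, 0.2, 0.9],
}

# Stand-in for the text embedding of a query; also made up.
query_vec = [0.2, 1.0, 0.1]

ranking = rank_images(query_vec, image_vecs)
print(ranking[0])  # the image whose embedding is closest to the query
```

Because the query and the images share one space, a query vector could in principle be built from words, from another image, or from a combination of both, which is the kind of semantic multimodal retrieval the abstract refers to.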


Self-supervised learning · Webly supervised learning · Text embeddings · Multimodal retrieval · Multimodal embeddings



This work was supported by the Doctorats Industrials program from the Generalitat de Catalunya, the Spanish project TIN2017-89779-P, the H2020 Marie Skłodowska-Curie actions of the European Union, grant agreement No 712949 (TECNIOspring PLUS), and the Agency for Business Competitiveness of the Government of Catalonia (ACCIO).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Eurecat, Centre Tecnològic de Catalunya, Unitat de Tecnologies Audiovisuals, Barcelona, Spain
  2. Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain
