“Is This an Example Image?” – Predicting the Relative Abstractness Level of Image and Text

  • Christian Otto
  • Sebastian Holzki
  • Ralph Ewerth
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11437)


Successful multimodal search and retrieval requires the automatic understanding of semantic cross-modal relations, which, however, is still an open research problem. Previous work has suggested the metrics cross-modal mutual information and semantic correlation to model and predict cross-modal semantic relations of image and text. In this paper, we present an approach to predict the (cross-modal) relative abstractness level of a given image-text pair, i.e., whether the image is an abstraction of the text or vice versa. For this purpose, we introduce a new metric, the Abstractness Level (ABS), that captures this specific relationship between image and text. We present a deep learning approach to predict this metric, which relies on an autoencoder architecture that allows us to significantly reduce the required amount of labeled training data. A comprehensive set of publicly available scientific documents has been gathered. Experimental results on a challenging test set demonstrate the feasibility of the approach.
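The abstract only outlines the approach, but the core idea (pretrain modality-specific autoencoders on unlabeled features, then train a small classifier on the fused latent codes with few labels) can be illustrated with a toy sketch. This is not the authors' architecture; the linear autoencoders, dimensions, and the binary label used below are all illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, k, epochs=200, lr=0.05):
    """Train a linear autoencoder X ~ (X @ W) @ V by gradient descent;
    returns the encoder matrix W (d x k)."""
    d = X.shape[1]
    W = rng.normal(0.0, 0.1, (d, k))  # encoder
    V = rng.normal(0.0, 0.1, (k, d))  # decoder
    for _ in range(epochs):
        Z = X @ W                     # latent codes
        R = Z @ V - X                 # reconstruction residual
        gV = Z.T @ R / len(X)         # gradient w.r.t. decoder
        gW = X.T @ (R @ V.T) / len(X) # gradient w.r.t. encoder
        V -= lr * gV
        W -= lr * gW
    return W

# Unlabeled "image" and "text" feature vectors (stand-ins for CNN
# features and aggregated word embeddings). Pretraining uses no labels.
Xi = rng.normal(size=(500, 32))
Xt = rng.normal(size=(500, 48))
Wi = train_autoencoder(Xi, 8)
Wt = train_autoencoder(Xt, 8)

# Small labeled subset: fuse latent codes and fit a logistic classifier
# for a hypothetical binary abstractness label (1 = image more abstract).
Z = np.hstack([Xi[:100] @ Wi, Xt[:100] @ Wt])
y = (Z[:, 0] + Z[:, 8] > 0).astype(float)  # synthetic labels for the demo
w = np.zeros(Z.shape[1])
b = 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))  # sigmoid predictions
    w -= 0.1 * Z.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)
acc = np.mean((p > 0.5) == y)  # training accuracy on the labeled subset
```

The point of the sketch is the division of labor: the autoencoders consume plentiful unlabeled image-text pairs, so only the final, much smaller classifier needs ABS-labeled examples.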


Image-text relations · Multimodal embeddings · Deep learning · Visual-verbal divide



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Leibniz Information Centre for Science and Technology (TIB), Hanover, Germany
  2. L3S Research Center, Leibniz Universität Hannover, Hanover, Germany
