Can Image Captioning Help Passage Retrieval in Multimodal Question Answering?

  • Shurong Sheng
  • Katrien Laenen
  • Marie-Francine Moens
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11438)

Abstract

Passage retrieval for multimodal question answering, which spans natural language processing and computer vision, is a challenging task, particularly when the documents to be searched contain poor punctuation or obsolete word forms and when little labeled training data is available. Here, we introduce a novel approach to passage retrieval for multimodal question answering about ancient artworks, in which a caption of the query image is provided as additional evidence to state-of-the-art retrieval models in the cultural heritage domain that were trained on a small dataset. The query image caption is generated with an advanced image captioning model trained on an external dataset. Consequently, the retrieval model obtains knowledge transferred from the external dataset. Extensive experiments on a benchmark dataset demonstrate the efficiency of this approach compared to state-of-the-art approaches.
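
The idea of supplying a query image caption as additional evidence can be illustrated with a minimal sketch. It is not the authors' retrieval model: a plain term-overlap scorer stands in for the state-of-the-art models mentioned above, the caption is hard-coded rather than generated by a captioning model, and all names and example passages (`rank_passages`, `overlap_score`, the toy texts) are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def overlap_score(query_tokens, passage):
    """Sum the passage frequencies of the distinct query terms."""
    counts = Counter(tokenize(passage))
    return sum(counts[term] for term in set(query_tokens))


def rank_passages(question, passages, caption=None):
    """Return passage indices ordered from best to worst match.

    If a caption is given, its tokens are appended to the question
    tokens, so the caption acts as extra evidence for the scorer.
    """
    query_tokens = tokenize(question)
    if caption is not None:
        query_tokens += tokenize(caption)
    scored = [(overlap_score(query_tokens, p), i) for i, p in enumerate(passages)]
    # Sort by descending score; ties keep the original passage order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored]


# Hypothetical toy data, not taken from the paper's benchmark dataset.
passages = [
    "This vase was made in a workshop near the river.",
    "The bronze cat statue was made for a temple; the cat was a sacred animal.",
]
question = "Who made this?"
caption = "a bronze statue of a cat"  # assumed output of a captioning model

question_only = rank_passages(question, passages)          # [0, 1]
with_caption = rank_passages(question, passages, caption)  # [1, 0]
```

In this toy case the question alone shares more terms with the unrelated passage, while the caption terms move the passage about the depicted object to the top. A real system would replace the overlap scorer with a trained retrieval model and would filter stop words.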

Keywords

Multimodal question answering · Passage retrieval · Query image caption · Markov random field

Notes

Acknowledgments

This work is funded by the KU Leuven BOF/IF/RUN/2015. We additionally thank our anonymous reviewers for the helpful comments.

Supplementary material

Supplementary material 1: 482053_1_En_12_MOESM1_ESM.pdf (PDF, 82 KB)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Shurong Sheng (1)
  • Katrien Laenen (1)
  • Marie-Francine Moens (1)
  1. Department of Computer Science, KU Leuven, Leuven, Belgium