Abstract
Passage retrieval for multimodal question answering, which spans natural language processing and computer vision, is a challenging task, particularly when the documents to be searched contain poor punctuation or obsolete word forms and little labeled training data is available. Here, we introduce a novel approach to passage retrieval for multimodal question answering about ancient artworks, in which a caption of the query image is provided as additional evidence to state-of-the-art retrieval models in the cultural heritage domain that are trained on a small dataset. The caption is generated with an advanced image captioning model trained on an external dataset. Consequently, the retrieval model obtains transferred knowledge from the external dataset. Extensive experiments on a benchmark dataset demonstrate the efficiency of this approach compared to state-of-the-art approaches.
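The core idea, supplying a caption of the query image as extra textual evidence for retrieval, can be illustrated with a minimal sketch. It is not the retrieval model evaluated in the paper: a toy bag-of-words cosine scorer stands in for the state-of-the-art retrieval models, and the passages, the question, the caption string, and the function names (`tokenize`, `cosine`, `rank_passages`) are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_passages(question, passages, caption=""):
    """Return passage indices, best first, for the question plus an optional caption."""
    # Assumption: the caption is simply appended to the textual query.
    query = Counter(tokenize(question + " " + caption))
    scored = [(cosine(query, Counter(tokenize(p))), i) for i, p in enumerate(passages)]
    # Sort by descending score; ties are broken by the original passage order.
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical data, not taken from the benchmark dataset.
passages = [
    "The museum was founded in the nineteenth century.",
    "A gold mask with inlaid eyes covered the face of the king.",
]
question = "What was this object used for?"
caption = "a gold mask with a face"  # stands in for a captioning model's output

ranking_without = rank_passages(question, passages)
ranking_with = rank_passages(question, passages, caption)
```

In this toy setting the question alone shares no content words with the relevant passage, so the caption is what brings that passage to the top of the ranking. This is the intuition behind treating the caption as additional evidence.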
Acknowledgments
This work is funded by the KU Leuven BOF/IF/RUN/2015. We additionally thank our anonymous reviewers for the helpful comments.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sheng, S., Laenen, K., Moens, M.-F. (2019). Can Image Captioning Help Passage Retrieval in Multimodal Question Answering? In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds) Advances in Information Retrieval. ECIR 2019. Lecture Notes in Computer Science, vol 11438. Springer, Cham. https://doi.org/10.1007/978-3-030-15719-7_12
DOI: https://doi.org/10.1007/978-3-030-15719-7_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-15718-0
Online ISBN: 978-3-030-15719-7
eBook Packages: Computer Science, Computer Science (R0)