Abstract
Image captioning, which aims to automatically generate sentences describing images, has been explored in many works. Attention-based methods have achieved impressive performance thanks to their ability to adapt image features to the context dynamically. However, since recurrent neural networks have difficulty retaining information from the distant past, we argue that the attention model may not receive adequate guidance from information generated many steps earlier. In this paper, we propose a memory-enhanced attention model for image captioning, aiming to improve the attention mechanism with previously learned knowledge. Specifically, we store the visual and semantic knowledge exploited in the past into memories, and read from them a global visual or semantic feature that guides the attention model. We verify the effectiveness of the proposed model on two widely used benchmark datasets, MS COCO and Flickr30k. Comparison with state-of-the-art models demonstrates the superiority of the proposed approach.
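To make the idea concrete, the following is a minimal PyTorch sketch of one way such a memory-enhanced attention could look: a standard additive attention over spatial image features, augmented with a global vector soft-read from a memory of past visual states. The class name, dimensions, and the specific memory-read mechanism are illustrative assumptions for exposition, not the authors' exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryEnhancedAttention(nn.Module):
    """Illustrative sketch (not the paper's exact model): spatial attention
    over CNN features, guided by a global vector soft-read from a memory
    of past attended visual features."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)      # project image regions
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.memory_proj = nn.Linear(feat_dim, attn_dim)    # project memory entries
        self.score = nn.Linear(attn_dim, 1)                 # additive attention score

    def read_memory(self, memory, hidden):
        # memory: (B, T, feat_dim) visual knowledge stored at past steps
        # hidden: (B, hidden_dim) current decoder state, used as the query
        keys = self.memory_proj(memory)                          # (B, T, attn_dim)
        query = self.hidden_proj(hidden).unsqueeze(2)            # (B, attn_dim, 1)
        w = F.softmax(torch.bmm(keys, query).squeeze(2), dim=1)  # (B, T)
        return torch.bmm(w.unsqueeze(1), memory).squeeze(1)      # (B, feat_dim) global feature

    def forward(self, feats, hidden, memory):
        # feats: (B, K, feat_dim) spatial image features (e.g. a 7x7 CNN grid)
        g = self.read_memory(memory, hidden)            # global memory readout
        e = self.score(torch.tanh(
            self.feat_proj(feats)                       # (B, K, attn_dim)
            + self.hidden_proj(hidden).unsqueeze(1)     # (B, 1, attn_dim)
            + self.memory_proj(g).unsqueeze(1)          # (B, 1, attn_dim)
        )).squeeze(2)                                   # (B, K) attention logits
        alpha = F.softmax(e, dim=1)                     # attention weights
        context = torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)  # attended feature
        return context, alpha


# Example usage with assumed dimensions (ResNet-style 2048-d grid features):
attn = MemoryEnhancedAttention(feat_dim=2048, hidden_dim=512, attn_dim=512)
feats = torch.randn(4, 49, 2048)   # batch of 7x7 grid features
hidden = torch.randn(4, 512)       # decoder LSTM hidden state
memory = torch.randn(4, 10, 2048)  # 10 stored visual memory slots
context, alpha = attn(feats, hidden, memory)
print(context.shape, alpha.shape)  # torch.Size([4, 2048]) torch.Size([4, 49])

The key design point this sketch illustrates is that the attention logits receive a third term derived from the memory readout, so region weighting is conditioned not only on the current hidden state but also on knowledge accumulated over earlier steps.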