Image Captioning with Memorized Knowledge
Image captioning, which aims to automatically generate a text description of a given image, has received much attention from researchers. Most existing approaches adopt a recurrent neural network (RNN) as a decoder to generate captions conditioned on the input image information. However, traditional RNNs process the sequence recurrently, squeezing the information of all previous words into hidden cells and updating the context by fusing the hidden state with the current word information. As a result, knowledge from words generated many steps earlier may be lost. In this paper, we propose a memory-enhanced model for image captioning. We first introduce an external memory to store past knowledge, i.e., all the information of the generated words. When predicting the next word, the decoder can retrieve knowledge about the past by means of a selective reading mechanism. Furthermore, to better exploit the knowledge stored in the memory, we introduce several variants that consider different types of past knowledge. To verify the effectiveness of the proposed model, we conduct extensive experiments and comparisons on the well-known image captioning dataset MS COCO. Compared with state-of-the-art captioning models, the proposed memory-enhanced captioning model shows a significant improvement in performance (a 3.5% improvement in CIDEr). The experiments demonstrate that the proposed model is more effective than state-of-the-art methods.
Keywords: Image captioning, Attention, Memory, Encoder-decoder
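The idea of writing every generated word's information into an external memory and reading it back selectively can be illustrated with a minimal sketch. This is not the authors' implementation: the scaled dot-product scoring, the softmax weighting, and the names `softmax` and `read_memory` are assumptions chosen for illustration, and the toy two-dimensional vectors stand in for real decoder states.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(memory, query):
    """Soft read over memory slots (hypothetical helper).

    memory: list of vectors, one per previously generated word.
    query:  vector for the current decoding step.
    Returns (weights, read_vector). Scoring by scaled dot product
    is an assumption, not the mechanism confirmed by the paper.
    """
    if not memory:
        # Nothing has been generated yet, so the read is a zero vector.
        return [], [0.0] * len(query)
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, slot)) / scale for slot in memory]
    weights = softmax(scores)
    read = [
        sum(w * slot[i] for w, slot in zip(weights, memory))
        for i in range(len(query))
    ]
    return weights, read


# Toy decoding loop: read from the memory, then write the new state to it.
memory = []
toy_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-ins for word states
for state in toy_states:
    weights, read = read_memory(memory, state)
    memory.append(state)  # the memory keeps every past step, uncompressed
```

Unlike a recurrent hidden state, the memory here grows by one slot per generated word, so information from early steps remains directly addressable; the weights decide how much each past slot contributes to the current read.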
Compliance with Ethical Standards
This article does not contain any studies with human participants or animals performed by any of the authors.