Abstract
Natural language generation from images, also referred to as image or visual captioning, is an emerging deep learning application at the intersection of computer vision and natural language processing. Image captioning also forms the technical foundation for many practical applications. Advances in deep learning technologies have driven significant progress in this area in recent years. In this chapter, we review the key developments in image captioning and their impact on both research and industry deployment. Two major schemes developed for image captioning, both based on deep learning, are presented in detail. A number of examples of natural language descriptions of images produced by two state-of-the-art captioning systems are provided to illustrate the high quality of the systems' outputs. Finally, recent research on generating stylistic natural language from images is reviewed.
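As a concrete illustration of one such scheme, the end-to-end encoder-decoder framework (in the spirit of Vinyals et al.'s "Show and Tell"), the following minimal sketch pairs a projected CNN image feature with an LSTM word decoder. It assumes PyTorch is available; the CaptionModel class, layer sizes, and toy vocabulary are illustrative assumptions, not the chapter's actual configuration.

import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Sketch of an encoder-decoder captioner: CNN feature in, word sequence out."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # project pre-extracted CNN features
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # per-step scores over the vocabulary

    def forward(self, img_feats, captions):
        # Feed the projected image feature as the first input step,
        # then the (shifted) ground-truth caption words during training.
        v = self.img_proj(img_feats).unsqueeze(1)   # (B, 1, E)
        w = self.embed(captions)                    # (B, T, E)
        h, _ = self.lstm(torch.cat([v, w], dim=1))  # (B, T+1, H)
        return self.out(h)                          # (B, T+1, vocab_size) logits

# Toy usage: a batch of 2 images (2048-d CNN features, e.g. from a ResNet)
# and captions of length 5 over a hypothetical 1000-word vocabulary.
model = CaptionModel(vocab_size=1000)
logits = model(torch.randn(2, 2048), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 1000])

At inference time the decoder is instead run step by step, feeding each sampled word back in, typically with beam search; the chapter's second scheme, a compositional pipeline, is not sketched here.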
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
He, X., Deng, L. (2018). Deep Learning in Natural Language Generation from Images. In: Deng, L., Liu, Y. (eds) Deep Learning in Natural Language Processing. Springer, Singapore. https://doi.org/10.1007/978-981-10-5209-5_10
DOI: https://doi.org/10.1007/978-981-10-5209-5_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-5208-8
Online ISBN: 978-981-10-5209-5