
Deep Learning in Natural Language Generation from Images

Chapter in: Deep Learning in Natural Language Processing

Abstract

Natural language generation from images, also referred to as image or visual captioning, is an emerging deep learning application at the intersection of computer vision and natural language processing. Image captioning also forms the technical foundation of many practical applications. Advances in deep learning technologies have driven significant progress in this area in recent years. In this chapter, we review the key developments in image captioning and their impact on both research and industry deployment. Two major schemes developed for image captioning, both based on deep learning, are presented in detail. A number of examples of natural language descriptions of images produced by two state-of-the-art captioning systems are provided to illustrate the high quality of the systems’ outputs. Finally, recent research on generating stylistic natural language from images is reviewed.
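
To give a concrete sense of one widely used captioning scheme, the end-to-end encoder-decoder approach in which a convolutional network encodes the image and a recurrent network generates the word sequence, the following minimal PyTorch sketch shows the overall structure. The class names, layer sizes, and the ResNet-50 backbone are illustrative assumptions, not the specific systems described in this chapter; vocabulary construction, training, and beam-search decoding are omitted.

```python
# Minimal, illustrative CNN-encoder / RNN-decoder captioning sketch (assumes PyTorch).
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Encode an image into a single feature vector."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50()                                     # pretrained weights optional
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])   # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                    # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)  # (B, 2048)
        return self.fc(feats)                     # (B, embed_size)

class RNNDecoder(nn.Module):
    """Generate word logits conditioned on the image embedding."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_embedding, captions):  # captions: (B, T) word ids
        # Feed the image embedding to the LSTM as the first input step,
        # followed by the embedded caption words (teacher forcing).
        inputs = torch.cat([image_embedding.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)               # (B, T+1, hidden_size)
        return self.out(hidden)                     # (B, T+1, vocab_size) word logits

# Example forward pass with dummy data (hypothetical sizes).
encoder = CNNEncoder(embed_size=256)
decoder = RNNDecoder(embed_size=256, hidden_size=512, vocab_size=10000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
logits = decoder(encoder(images), captions)        # (2, 13, 10000)
```

At generation time, the same decoder would be run one step at a time, feeding back the previously generated word (typically with greedy or beam search) rather than the ground-truth caption.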

Author information

Corresponding author

Correspondence to Xiaodong He.


Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

He, X., Deng, L. (2018). Deep Learning in Natural Language Generation from Images. In: Deng, L., Liu, Y. (eds) Deep Learning in Natural Language Processing. Springer, Singapore. https://doi.org/10.1007/978-981-10-5209-5_10


  • DOI: https://doi.org/10.1007/978-981-10-5209-5_10

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-5208-8

  • Online ISBN: 978-981-10-5209-5

  • eBook Packages: Computer Science, Computer Science (R0)
