Generating Diverse and Meaningful Captions

Lindh, Annika; Ross, Robert J.; Mahalunkar, Abhijit; Salton, Giancarlo; Kelleher, John D.

doi:10.1007/978-3-030-01418-6_18

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11139)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

Image captioning is a task that requires models to acquire a multimodal understanding of the world and to express this understanding in natural language text. While the state of the art for this task has rapidly improved in terms of n-gram metrics, these models tend to output the same generic captions for similar images. In this work, we address this limitation and train a model that generates more diverse and specific captions through an unsupervised training approach that incorporates a learning signal from an image retrieval model. We summarize previous results and improve the state of the art on caption diversity and novelty. We make our source code publicly available online (https://github.com/AnnikaLindh/Diverse_and_Specific_Image_Captioning).
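The abstract refers to caption diversity and novelty without defining them here. As a rough illustration only, the sketch below assumes two definitions that are common in the captioning literature: diversity as the share of distinct captions among all generated captions, and novelty as the share of generated captions that do not appear verbatim in the training set. The function names, the normalization step, and the toy captions are hypothetical and are not taken from the paper or its repository.

```python
def normalize(caption):
    """Lowercase and collapse whitespace so trivial variants compare equal (assumed preprocessing)."""
    return " ".join(caption.lower().split())


def distinct_caption_ratio(generated):
    """Share of generated captions that are unique; higher suggests more diverse output."""
    if not generated:
        return 0.0
    normalized = [normalize(c) for c in generated]
    return len(set(normalized)) / len(normalized)


def novel_caption_ratio(generated, training):
    """Share of generated captions that never occur verbatim in the training captions."""
    if not generated:
        return 0.0
    seen = {normalize(c) for c in training}
    novel = sum(1 for c in generated if normalize(c) not in seen)
    return novel / len(generated)


# Toy data, invented purely for illustration.
generated_captions = [
    "a man riding a horse",
    "A man riding a horse",
    "a dog on a red couch",
    "two kids flying a kite",
]
training_captions = ["a man riding a horse", "a cat on a bed"]

diversity = distinct_caption_ratio(generated_captions)
novelty = novel_caption_ratio(generated_captions, training_captions)
```

On this toy data, three of the four generated captions are distinct after normalization and two of the four are absent from the training set. A model that repeats the same generic caption for similar images would score low on both measures, which is the failure mode the abstract describes.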

Notes

1. https://pytorch.org/

Acknowledgments

This research was supported by the ADAPT Centre for Digital Content Technology which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Author information

Authors and Affiliations

ADAPT Centre, Dublin, Ireland
Annika Lindh, Robert J. Ross, Giancarlo Salton & John D. Kelleher
Dublin Institute of Technology (DIT), Dublin, Ireland
Annika Lindh, Robert J. Ross, Abhijit Mahalunkar, Giancarlo Salton & John D. Kelleher

Corresponding author

Correspondence to Annika Lindh.

Editor information

Editors and Affiliations

Czech Academy of Sciences, Prague 8, Czech Republic
Věra Kůrková
Open University of Cyprus, Latsia, Cyprus
Yannis Manolopoulos
CITEC Bielefeld University, Bielefeld, Germany
Barbara Hammer
Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
University of Piraeus, Piraeus, Greece
Ilias Maglogiannis

About this paper

Cite this paper

Lindh, A., Ross, R.J., Mahalunkar, A., Salton, G., Kelleher, J.D. (2018). Generating Diverse and Meaningful Captions. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds) Artificial Neural Networks and Machine Learning – ICANN 2018. ICANN 2018. Lecture Notes in Computer Science, vol 11139. Springer, Cham. https://doi.org/10.1007/978-3-030-01418-6_18

DOI: https://doi.org/10.1007/978-3-030-01418-6_18
Published: 27 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01417-9
Online ISBN: 978-3-030-01418-6
eBook Packages: Computer Science, Computer Science (R0)
