A Survey on Automatic Image Captioning

  • Gargi Srivastava
  • Rajeev Srivastava
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 834)


Automatic image captioning is the process of providing natural language captions for images automatically. Considering the huge number of images available in recent time, automatic image captioning is very beneficial in managing huge image datasets by providing appropriate captions. It also finds application in content based image retrieval. This field includes other image processing areas such as segmentation, feature extraction, template matching and image classification. It also includes the field of natural language processing. Scene analysis is a prominent step in automatic image captioning which is garnering the attention of many researchers. The better the scene analysis the better is the image understanding which further leads to generate better image captions. The survey presents various techniques used by researchers for scene analysis performed on different image datasets.


Image captioning Scene analysis Computer vision 


  1. 1.
    Sumathi, T., Hemalatha, M.: A combined hierarchical model for automatic image annotation and retrieval. In: International Conference on Advanced Computing (2011)Google Scholar
  2. 2.
    Yu, M.T., Sein, M.M.: Automatic image captioning system using integration of N-cut and color-based segmentation method. In: Society of Instrument and Control Engineers Annual Conference (2011)Google Scholar
  3. 3.
    Ushiku, Y., Harada, T., Kuniyoshi, Y.: Automatic sentence generation from images. In: ACM Multimedia (2011)Google Scholar
  4. 4.
    Federico, M., Furini, M.: Enhancing learning accessibility through fully automatic captioning. In: International Cross-Disciplinary Conference on Web Accessibility (2011)Google Scholar
  5. 5.
    Feng, Y., Lapata, M.: Automatic caption generation for news images. IEEE Trans. Pattern Anal. Mach. Intell. 35(4), 797–811 (2013)CrossRefGoogle Scholar
  6. 6.
    Xi, S.M., Im Cho, Y.: Image caption automatic generation method based on weighted feature. In: International Conference on Control, Automation and Systems (2013)Google Scholar
  7. 7.
    Horiuchi, S., Moriguchi, H., Shengbo, X., Honiden, S.: Automatic image description by using word-level features. In: International Conference on Internet Multimedia Computing and Service (2013)Google Scholar
  8. 8.
    Ramnath, K., Vanderwende, L., El-Saban, M., Sinha, S.N., Kannan, A., Hassan, N., Galley, M.: AutoCaption: automatic caption generation for personal photos. In: IEEE Winter Conference on Applications of Computer Vision (2014)Google Scholar
  9. 9.
    Sivakrishna Reddy, A., Monolisa, N., Nathiya, M., Anjugam, D.: A combined hierarchical model for automatic image annotation and retrieval. In: International Conference on Innovations in Information Embedded and Communication Systems (2015)Google Scholar
  10. 10.
    Shivdikar, K., Kak, A., Marwah, K.: Automatic image annotation using a hybrid engine. In: IEEE India Conference (2015)Google Scholar
  11. 11.
    Mathews, A.: Captioning images using different styles. In: ACM Multimedia Conference (2015)Google Scholar
  12. 12.
    Mathews, A., Xie, L., He, X.: Choosing basic-level concept names using visual and language context. In: IEEE Winter Conference on Applications of Computer Vision (2015)Google Scholar
  13. 13.
    Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: International Conference on Computer Vision (2015)Google Scholar
  14. 14.
    Vijay, K., Ramya, D.: Generation of caption selection for news images using stemming algorithm. In: International Conference on Computation of Power, Energy, Information and Communication (2015)Google Scholar
  15. 15.
    Shahaf, D., Horvitz, E., Mankoff, R.: Inside jokes: identifying humorous cartoon captions. In: International Conference on Knowledge Discovery and Data Mining (2015)Google Scholar
  16. 16.
    Li, X., Lan, W., Dong, J., Liu, H.: Adding Chinese captions to images. In: International Conference in Multimedia Retrieval (2016)Google Scholar
  17. 17.
    Jin, J., Nakayama, H.: Annotation order matters: recurrent image annotator for arbitrary length image tagging. In: International Conference on Pattern Recognition (2016)Google Scholar
  18. 18.
    Shi, Z., Zou, Z.: Can a machine generate humanlike language descriptions for a remote sensing image? IEEE Trans. Geosci. Remote Sens. 55(6), 3623–3634 (2016)CrossRefGoogle Scholar
  19. 19.
    Shetty, R., Tavakoli, H.R., Laaksonen, J.: Exploiting scene context for image captioning. In: Vision and Language Integration Meets Multimedia Fusion (2016)Google Scholar
  20. 20.
    Li, X., Song, X., Herranz, L., Zhu, Y., Jiang, S.: Image captioning with both object and scene information. In: ACM Multimedia (2016)Google Scholar
  21. 21.
    Wang, C., Yang, H., Bartz, C., Meinel, C.: Image captioning with deep bidirectional LSTMs. In: ACM Multimedia (2016)Google Scholar
  22. 22.
    Liu, C., Wang, C., Sun, F., Rui, Y.: Image2Text: a multimodal caption generator. In: ACM Multimedia (2016)Google Scholar
  23. 23.
    Blandfort, P., Karayil, T., Borth, D., Dengel, A.: Introducing concept and syntax transition networks for image captioning. In: International Conference on Multimedia Retrieval (2016)Google Scholar
  24. 24.
    Tariq, A., Foroosh, H.: A context-driven extractive framework for generating realistic image descriptions. IEEE Trans. Image Process. 26(2), 619–631 (2017)MathSciNetCrossRefGoogle Scholar

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. 1.Department of Computer Science and EngineeringIndian Institute of Technology (BHU) VaranasiVaranasiIndia

Personalised recommendations