SPICE: Semantic Propositional Image Caption Evaluation

  • Peter Anderson
  • Basura Fernando
  • Mark Johnson
  • Stephen Gould
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9909)


Abstract

There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric, defined over scene graphs, which we call SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as "which caption-generator best understands colors?" and "can caption-generators count?"
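To make the idea concrete, a SPICE-style score can be sketched as an F-score over semantic proposition tuples (objects, attributes, and relations) extracted from captions. This is a simplified illustration only: the actual metric parses captions into scene graphs with a dependency parser and matches tuples using WordNet synonyms, whereas here the tuples are supplied directly and matched exactly.

```python
# Illustrative sketch: F1 over semantic proposition tuples, the core idea
# behind a scene-graph-based caption metric. Tuples are (object,),
# (object, attribute), or (subject, relation, object).

def spice_f1(candidate_tuples, reference_tuples):
    """F1 between the candidate and reference tuple sets (exact match)."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    if not cand or not ref:
        return 0.0
    matched = len(cand & ref)       # tuples asserted by both captions
    precision = matched / len(cand)
    recall = matched / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical tuples for "a young girl standing on a tennis court"
# versus a reference caption's scene graph.
cand = [("girl",), ("girl", "young"), ("girl", "standing", "court"),
        ("court", "tennis")]
ref = [("girl",), ("girl", "young"), ("girl", "on-top-of", "court"),
       ("court", "tennis")]
print(spice_f1(cand, ref))  # 3 of 4 tuples match on each side -> 0.75
```

Because the score is computed over propositions rather than n-grams, restricting the tuple sets to a single category (e.g., only color attributes) yields the per-category diagnostics mentioned above.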



Acknowledgements

We are grateful to the COCO Consortium (in particular, Matteo R. Ronchi, Tsung-Yi Lin, Yin Cui and Piotr Dollár) for agreeing to run our SPICE code against entries in the 2015 COCO Captioning Challenge. We would also like to thank Sebastian Schuster for sharing the Stanford Scene Graph Parser code in advance of public release, Ramakrishna Vedantam and Somak Aditya for sharing their human caption judgments, and Kelvin Xu, Jacob Devlin and Qi Wu for providing model-generated captions for evaluation. This work was funded in part by the Australian Centre for Robotic Vision.


References

  1. Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR (2015)
  2. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A.C., Salakhutdinov, R., Zemel, R.S., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention (2015). arXiv preprint arXiv:1502.03044
  3. Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking task: data, models and evaluation metrics. JAIR 47, 853–899 (2013)
  4. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. TACL 2, 67–78 (2014)
  5. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part V. LNCS, vol. 8693, pp. 740–755. Springer, Heidelberg (2014)
  6. Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft COCO captions: data collection and evaluation server (2015). arXiv preprint arXiv:1504.00325
  7. Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L.: BabyTalk: understanding and generating simple image descriptions. PAMI 35(12), 2891–2903 (2013)
  8. Elliott, D., Keller, F.: Comparing automatic evaluation measures for image description. In: ACL, pp. 452–457 (2014)
  9. Bernardi, R., Cakici, R., Elliott, D., Erdem, A., Erdem, E., Ikizler-Cinbis, N., Keller, F., Muscat, A., Plank, B.: Automatic description generation from images: a survey of models, datasets, and evaluation measures. JAIR 55, 409–442 (2016)
  10. Papineni, K., Roukos, S., Ward, T., Zhu, W.: BLEU: a method for automatic evaluation of machine translation. In: ACL (2002)
  11. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: ACL Workshop, pp. 25–26 (2004)
  12. Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: consensus-based image description evaluation. In: CVPR (2015)
  13. Denkowski, M., Lavie, A.: Meteor Universal: language specific translation evaluation for any target language. In: EACL 2014 Workshop on Statistical Machine Translation (2014)
  14. Giménez, J., Màrquez, L.: Linguistic features for automatic evaluation of heterogenous MT systems. In: ACL Second Workshop on Statistical Machine Translation
  15. Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D.A., Bernstein, M.S., Fei-Fei, L.: Image retrieval using scene graphs. In: CVPR (2015)
  16. Schuster, S., Krishna, R., Chang, A., Fei-Fei, L., Manning, C.D.: Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In: EMNLP 4th Workshop on Vision and Language (2015)
  17. Wang, C., Xue, N., Pradhan, S.: A transition-based algorithm for AMR parsing. In: HLT-NAACL (2015)
  18. Lin, D., Fidler, S., Kong, C., Urtasun, R.: Visual semantic search: retrieving videos via complex textual queries. In: CVPR (2014)
  19. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: ACL (2003)
  20. De Marneffe, M.C., Dozat, T., Silveira, N., Haverinen, K., Ginter, F., Nivre, J., Manning, C.D.: Universal Stanford dependencies: a cross-linguistic typology. LREC 14, 4585–4592 (2014)
  21. Lo, C.K., Tumuluru, A.K., Wu, D.: Fully automatic semantic MT evaluation. In: ACL Seventh Workshop on Statistical Machine Translation (2012)
  22. Pradhan, S.S., Ward, W., Hacioglu, K., Martin, J.H., Jurafsky, D.: Shallow semantic parsing using support vector machines. In: HLT-NAACL, pp. 233–240 (2004)
  23. Ellebracht, L., Ramisa, A., Swaroop, P., Cordero, J., Moreno-Noguer, F., Quattoni, A.: Semantic tuples for evaluation of image sentence generation. In: EMNLP 4th Workshop on Vision and Language (2015)
  24. Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., Knight, K., Koehn, P., Palmer, M., Schneider, N.: Abstract meaning representation (AMR) 1.0 specification. In: EMNLP, pp. 1533–1544 (2012)
  25. Flanigan, J., Thomson, S., Carbonell, J., Dyer, C., Smith, N.A.: A discriminative graph-based parser for the abstract meaning representation. In: ACL (2014)
  26. Werling, K., Angeli, G., Manning, C.: Robust subgraph generation improves abstract meaning representation parsing. In: ACL (2015)
  27. Cai, S., Knight, K.: Smatch: an evaluation metric for semantic feature structures. In: ACL (2), pp. 748–752 (2013)
  28. Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k Entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: CVPR, pp. 2641–2649 (2015)
  29. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., Bernstein, M., Fei-Fei, L.: Visual Genome: connecting language and vision using crowdsourced dense image annotations (2016). arXiv preprint arXiv:1602.07332
  30. Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: CVPR, June 2011
  31. Hale, J.: A probabilistic Earley parser as a psycholinguistic model. In: NAACL, pp. 1–8 (2001)
  32. Levy, R.: Expectation-based syntactic comprehension. Cognition 106(3), 1126–1177 (2008)
  33. Stanojević, M., Kamran, A., Koehn, P., Bojar, O.: Results of the WMT15 metrics shared task. In: ACL Tenth Workshop on Statistical Machine Translation, pp. 256–273 (2015)
  34. Machacek, M., Bojar, O.: Results of the WMT14 metrics shared task. In: ACL Ninth Workshop on Statistical Machine Translation, pp. 293–301 (2014)
  35. Aditya, S., Yang, Y., Baral, C., Fermuller, C., Aloimonos, Y.: From images to sentences through scene description graphs using commonsense reasoning and knowledge (2015). arXiv preprint arXiv:1511.03292
  36. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)
  37. Rashtchian, C., Young, P., Hodosh, M., Hockenmaier, J.: Collecting image annotations using Amazon's Mechanical Turk. In: HLT-NAACL, pp. 139–147 (2010)
  38. Fang, H., Gupta, S., Iandola, F.N., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J.C., Zitnick, C.L., Zweig, G.: From captions to visual concepts and back. In: CVPR (2015)
  39. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: CVPR (2015)
  40. Devlin, J., Cheng, H., Fang, H., Gupta, S., Deng, L., He, X., Zweig, G., Mitchell, M.: Language models for image captioning: the quirks and what works (2015). arXiv preprint arXiv:1505.01809
  41. Mao, J., Wei, X., Yang, Y., Wang, J., Huang, Z., Yuille, A.L.: Learning like a child: fast novel visual concept learning from sentence descriptions of images. In: CVPR, pp. 2533–2541 (2015)
  42. Devlin, J., Gupta, S., Girshick, R.B., Mitchell, M., Zitnick, C.L.: Exploring nearest neighbor approaches for image captioning (2015). arXiv preprint arXiv:1505.04467
  43. Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., Yuille, A.: Deep captioning with multimodal recurrent neural networks (m-RNN) (2014). arXiv preprint arXiv:1412.6632
  44. Kolár, M., Hradis, M., Zemcík, P.: Technical report: image captioning with semantically similar images (2015). arXiv preprint arXiv:1506.03995
  45. Kiros, R., Salakhutdinov, R., Zemel, R.S.: Multimodal neural language models. ICML 14, 595–603 (2014)

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Peter Anderson (1)
  • Basura Fernando (1)
  • Mark Johnson (2)
  • Stephen Gould (1)
  1. The Australian National University, Canberra, Australia
  2. Macquarie University, Sydney, Australia