Pre-gen Metrics: Predicting Caption Quality Metrics Without Generating Captions

  • Marc Tanti
  • Albert Gatt
  • Adrian Muscat
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)

Abstract

Image caption generation systems are typically evaluated against reference outputs. We show that it is possible to predict output quality without generating the captions, based on the probability that the neural model assigns to the reference captions. Such pre-gen metrics correlate strongly with standard evaluation metrics.
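As a rough illustration of the idea (a minimal sketch, not taken from the paper): one plausible pre-gen metric is the mean per-token log-probability that a trained captioning model assigns to the human reference captions under teacher forcing, computed without decoding a single caption. The model interface `model(image_features, input_ids)` returning per-step vocabulary logits, and the per-token length normalisation, are assumptions made for illustration.

```python
import torch

def pregen_score(model, image_features, reference_captions, vocab):
    """Sketch of a pre-gen metric: average per-token log-probability the
    model assigns to the reference captions, with no caption generation."""
    per_caption = []
    for caption in reference_captions:  # token list incl. <s> ... </s>
        ids = torch.tensor([vocab[t] for t in caption])
        with torch.no_grad():
            # Teacher-forced pass (hypothetical interface): feed all tokens
            # but the last; get logits over the vocabulary at each step.
            logits = model(image_features, ids[:-1].unsqueeze(0))  # (1, T, V)
        log_probs = torch.log_softmax(logits, dim=-1).squeeze(0)   # (T, V)
        targets = ids[1:]  # the token the model should predict at each step
        token_lp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
        per_caption.append(token_lp.mean().item())  # length-normalised
    # One score per image: mean over its reference captions.
    return sum(per_caption) / len(per_caption)
```

Scores of this kind can then be correlated against standard generation-based metrics (BLEU, METEOR, CIDEr and the like) across systems or checkpoints, which is the relationship the paper measures.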

Keywords

Image captioning · Neural architectures · Evaluation metrics

Acknowledgments

The research in this paper is partially funded by the Endeavour Scholarship Scheme (Malta). Scholarships are part-financed by the European Union (European Social Fund, ESF) under Operational Programme II, Cohesion Policy 2014–2020, "Investing in human capital to create more opportunities and promote the well-being of society".

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Malta, Msida, Malta
