
LCEval: Learned Composite Metric for Caption Evaluation

  • Naeha Sharif
  • Lyndon White
  • Mohammed Bennamoun
  • Wei Liu
  • Syed Afaq Ali Shah

Abstract

Automatic evaluation metrics are of fundamental importance in the development and fine-grained analysis of captioning systems. While current evaluation metrics tend to achieve an acceptable correlation with human judgements at the system level, they fail to do so at the caption level. In this work, we propose a neural network-based learned metric to improve caption-level evaluation. To gain a deeper insight into the factors that influence a learned metric’s performance, this paper investigates the relationship between different linguistic features and the caption-level correlation of the learned metrics. We also compare metrics trained with different training examples to measure the variation in their evaluation. Moreover, we perform a robustness analysis, which highlights the sensitivity of learned and handcrafted metrics to various sentence perturbations. Our empirical analysis shows that our proposed metric not only outperforms existing metrics in terms of caption-level correlation but also achieves a strong system-level correlation with human assessments.
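
To make the idea of a learned composite metric concrete, the following is a minimal sketch, assuming the input to the network is a vector of caption-level scores from existing metrics (e.g. BLEU, METEOR, ROUGE-L, CIDEr, SPICE) and that a small feed-forward network is trained against binary human quality judgements. The feature set, layer sizes, and loss shown here are illustrative assumptions, not the exact LCEval configuration.

    # Minimal sketch (not the exact LCEval configuration): a small feed-forward
    # network that combines caption-level scores from existing metrics into a
    # single learned quality score. Feature set, sizes, and loss are assumptions.
    import numpy as np
    import tensorflow as tf

    def build_composite_metric(num_features: int) -> tf.keras.Model:
        """Map a vector of per-metric scores to a quality score in [0, 1]."""
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(num_features,)),
            tf.keras.layers.Dense(72, activation="relu"),
            tf.keras.layers.Dense(72, activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")
        return model

    # Hypothetical training data: each row holds caption-level scores from
    # existing metrics for one candidate caption; labels are binary human
    # quality judgements (placeholders here, for illustration only).
    rng = np.random.default_rng(0)
    features = rng.random((1000, 5))
    labels = (features.mean(axis=1) > 0.5).astype("float32")

    metric = build_composite_metric(num_features=features.shape[1])
    metric.fit(features, labels, epochs=5, batch_size=32, verbose=0)

    # Score a new candidate caption from its per-metric feature vector.
    print(float(metric.predict(rng.random((1, 5)), verbose=0)[0, 0]))

Once trained, such a metric is applied per caption, which is what allows it to be evaluated for caption-level (rather than only system-level) correlation with human judgements.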

Keywords

Image captioning · Automatic evaluation metric · Neural networks · Learned metrics · Correlation · Accuracy · Robustness

Acknowledgements

We are grateful to NVIDIA for providing a Titan-Xp GPU, which was used for the experiments. We also thank Somak Aditya for sharing the COMPOSITE dataset and Ramakrishna Vedantam for sharing the PASCAL50S and ABSTRACT50S datasets. Thanks to Yin Cui for providing the dataset containing the captions of the 12 teams that participated in the 2015 COCO captioning challenge. This work is supported by the Australian Research Council, Grant ARC DP150100294.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Naeha Sharif (1)
  • Lyndon White (2)
  • Mohammed Bennamoun (1)
  • Wei Liu (1)
  • Syed Afaq Ali Shah (3)

  1. Department of Computer Science, The University of Western Australia, Perth, Australia
  2. Invenia Labs, Cambridge, UK
  3. Discipline of Information Technology, Mathematics and Statistics, Murdoch University, Perth, Australia
