Multimedia Tools and Applications, Volume 78, Issue 3, pp 2689–2702

See and chat: automatically generating viewer-level comments on images

  • Jingwen Chen
  • Ting Yao
  • Hongyang Chao


Abstract

Images are becoming a predominant medium for social interaction. Automatically expressing opinions on an image, which we refer to as image commenting, has great potential to improve user engagement and has thus become an emerging yet very challenging research topic. Machine-generated comments should be both relevant to the image content and natural as human language. To address these challenges, we propose a novel two-stage approach consisting of similar image search and comment ranking. In the first stage, given an image, visually similar images are discovered by k-nearest neighbor (k-NN) search over a large image dataset. The comments associated with these images are taken as candidates to mimic how viewers respond to the given image. In the second stage, ranking canonical correlation analysis (RCCA), an extension of CCA that jointly learns a cross-view embedding space and a bilinear similarity function between the image and comment views, is exploited to rank the candidate comments. To create a benchmark for this emerging task, we collect a dataset of 426K images with 11 million associated comments. We show that our approach achieves superior performance and can suggest viewer-level comments.
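The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the image features, comment embeddings, and the projection matrix `W` (which stands in for the bilinear similarity learned by RCCA) are all assumed to be precomputed or given, and cosine similarity is used for the k-NN stage as a common default.

```python
import numpy as np

def knn_candidates(query_feat, gallery_feats, gallery_comments, k=3):
    """Stage 1: pool the comments attached to the k nearest gallery images.

    query_feat: (d,) feature of the query image (e.g. from a CNN).
    gallery_feats: (n, d) features of the dataset images.
    gallery_comments: list of n lists of comment strings.
    """
    # Cosine similarity between the query and every gallery image.
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q
    top = np.argsort(-sims)[:k]
    # The neighbors' comments become the candidate pool.
    return [c for i in top for c in gallery_comments[i]]

def bilinear_score(img_emb, cmt_emb, W):
    """Stage 2 scoring: bilinear similarity s(v, c) = v^T W c between views.

    In the paper W is learned by RCCA; here it is simply supplied.
    """
    return img_emb @ W @ cmt_emb

def rank_comments(img_emb, cmt_embs, W):
    """Rank candidate comment embeddings for one image, best first."""
    scores = [bilinear_score(img_emb, c, W) for c in cmt_embs]
    order = np.argsort(scores)[::-1]
    return order, scores
```

With an identity `W`, the bilinear score reduces to a dot product, which makes the ranking behavior easy to verify; a learned `W` instead weights and mixes the embedding dimensions to reflect image–comment relevance.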


Keywords: Image commenting · Cross-view embedding · Deep convolutional neural networks



Acknowledgments

This work is partially supported by the NSF of China under Grants 61672548, U1611461, and 61173081, and by the Guangzhou Science and Technology Program, China, under Grant 201510010165.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Sun Yat-sen University, Guangzhou, People’s Republic of China
  2. Microsoft Research Asia, Beijing, People’s Republic of China
