See and chat: automatically generating viewer-level comments on images
Abstract
Images are becoming a predominant medium for social interaction. Automatically expressing opinions on an image, which we refer to as image commenting, has great potential to improve user engagement and has thus emerged as a new yet very challenging research topic. Machine-generated comments should be both relevant to the image content and as natural as human language. To address these challenges, we propose a novel two-stage approach consisting of similar-image search and comment ranking. In the first stage, given an image, visually similar images are retrieved by k-nearest-neighbor (k-NN) search over a large image dataset, and the comments associated with these images are taken as candidates that mimic how viewers respond to the given image. In the second stage, the candidate comments are ranked by ranking canonical correlation analysis (RCCA), an extension of CCA that jointly learns a cross-view embedding space and a bilinear similarity function between the image and comment views. To create a benchmark for this emerging task, we collect a dataset of 426K images with 11 million associated comments. We show that our approach achieves superior performance and can suggest viewer-level comments.
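For concreteness, the two-stage pipeline can be sketched in a few lines of Python. This is a minimal illustration under assumptions not stated in the abstract: `knn_candidates`, `rcca_score`, `embed_comment`, `W_img`, `W_txt`, and `B` are hypothetical names; image features are assumed to come from a pre-trained CNN, comment features from a sentence encoder, and the projections and bilinear matrix are assumed to have been learned offline.

```python
import numpy as np

def knn_candidates(query_feat, gallery_feats, gallery_comments, k=50):
    """Stage 1: pool the comments attached to the k visually nearest images."""
    # Cosine similarity between the query image and every gallery image.
    sims = gallery_feats @ query_feat / (
        np.linalg.norm(gallery_feats, axis=1) * np.linalg.norm(query_feat) + 1e-8
    )
    nearest = np.argsort(-sims)[:k]
    # The neighbors' comments become candidate responses for the query image.
    return [c for i in nearest for c in gallery_comments[i]]

def rcca_score(img_feat, comment_feat, W_img, W_txt, B):
    """Stage 2: bilinear similarity in the learned cross-view embedding space.

    W_img and W_txt project image/comment features into a shared space;
    B is the bilinear matrix learned jointly with the embeddings (RCCA-style).
    """
    u = W_img @ img_feat      # image embedding
    v = W_txt @ comment_feat  # comment embedding
    return float(u @ B @ v)

def comment_image(query_feat, gallery_feats, gallery_comments,
                  embed_comment, W_img, W_txt, B, k=50, top_n=5):
    """Retrieve candidate comments, then rank them for the query image."""
    candidates = knn_candidates(query_feat, gallery_feats, gallery_comments, k)
    scored = [(rcca_score(query_feat, embed_comment(c), W_img, W_txt, B), c)
              for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)[:top_n]]
```

Retrieval first keeps the candidate pool small, so the learned bilinear scorer only has to rank a few hundred human-written comments rather than generate language from scratch.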
Keywords
Image commenting · Cross-view embedding · Deep convolutional neural networks
Acknowledgments
This work is partially supported by the NSF of China under Grants 61672548, U1611461, and 61173081, and by the Guangzhou Science and Technology Program, China, under Grant 201510010165.