Cross-domain personalized image captioning

  • Cuirong Long
  • Xiaoshan Yang
  • Changsheng Xu


Image captioning aims to translate an image into a complete, natural sentence, and involves both computer vision and natural language processing. Although image captioning has achieved good results with the rapid development of deep neural networks, excessive pursuit of evaluation scores makes the generated descriptions too conservative for practical applications. It is necessary to increase the diversity of the generated descriptions and to account for prior knowledge such as the user's favorite vocabulary and writing style. In this paper, we study personalized image captioning, which generates sentences that describe the user's own story and feelings about life in the user's preferred wording. Moreover, we propose cross-domain personalized image captioning (CDPIC) to learn domain-invariant captioning models that can be applied across different social media platforms. The proposed method flexibly models user interest by embedding the user ID as an interest vector. To the best of our knowledge, this is the first cross-domain personalized image captioning approach, combining user interest modeling with a simple and effective domain-invariant constraint. The effectiveness of the proposed method is verified on datasets from the Instagram and Lookbook platforms.
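The abstract describes two components: a user-interest vector obtained by embedding the user ID, and a domain-invariant constraint that aligns representations across platforms. The snippet below is not from the paper; it is a minimal stdlib-only sketch, assuming the constraint penalizes the distance between mean feature vectors of the two domains (one common choice — the paper's actual constraint may differ), and that user conditioning is done by concatenating the interest vector with the image feature.

```python
import random

EMBED_DIM = 4  # assumed interest-vector size for illustration

def make_user_embedding(user_ids, dim=EMBED_DIM, seed=0):
    """Map each user ID to an interest vector (randomly initialized here;
    in training these would be learned parameters)."""
    rng = random.Random(seed)
    return {uid: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for uid in user_ids}

def condition_on_user(image_feature, user_vector):
    """Concatenate the image feature with the user's interest vector
    before feeding the caption decoder (one simple conditioning scheme)."""
    return image_feature + user_vector  # list concatenation

def domain_invariance_penalty(features_a, features_b):
    """Squared distance between per-domain mean features: minimizing this
    encourages representations that look the same across the two platforms."""
    mean_a = [sum(col) / len(features_a) for col in zip(*features_a)]
    mean_b = [sum(col) / len(features_b) for col in zip(*features_b)]
    return sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))

emb = make_user_embedding(["user_42", "user_7"])
fused = condition_on_user([0.5, 0.1], emb["user_42"])
print(len(fused))  # image dims + embedding dims = 6
```

In a full model, the penalty would be added to the captioning loss so that the encoder cannot rely on platform-specific cues, which is what allows the captioner trained on one platform to transfer to another.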


Keywords: Personalization, Image captioning, Domain adaptation



This work was supported in part by the National Key Research and Development Program of China (No. 2017YFB1002804), the National Natural Science Foundation of China (Nos. 61702511, 61720106006, 1711530243, 61620106003, 61432019, 61632007, U1705262, U1836220), and the Key Research Program of Frontier Sciences, CAS (Grant No. QYZDJSSWJSC039). This work was also supported by the Research Program of the National Laboratory of Pattern Recognition (No. Z-2018007) and the CCF-Tencent Open Fund.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Hefei University of Technology, Hefei, China
  2. Institute of Automation, Chinese Academy of Sciences, Beijing, China
  3. University of Chinese Academy of Sciences, Beijing, China
