Unpaired Image Captioning by Language Pivoting

  • Jiuxiang Gu
  • Shafiq Joty
  • Jianfei Cai
  • Gang Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11205)


Abstract

Image captioning is a multimodal task involving computer vision and natural language processing, where the goal is to learn a mapping from an image to its natural language description. In general, the mapping function is learned from a training set of image-caption pairs. However, for some languages, a large-scale image-caption paired corpus may not be available. We present an approach to this unpaired image captioning problem by language pivoting. Our method effectively captures the characteristics of an image captioner from the pivot language (Chinese) and aligns it to the target language (English) using a pivot-target (Chinese-English) parallel sentence corpus. We evaluate our method on two image-to-English benchmark datasets: MSCOCO and Flickr30K. Quantitative comparisons against several baseline approaches demonstrate the effectiveness of our method.
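The pivoting idea in the abstract can be illustrated as the composition of two separately trained models: an image-to-pivot (Chinese) captioner and a pivot-to-target (Chinese-English) translator. The sketch below is a toy stand-in for that composition, not the paper's actual models; the stub outputs and the lexicon are illustrative assumptions.

```python
# Minimal sketch of unpaired captioning by language pivoting:
# target-language captions are produced without any image-English pairs,
# by composing two models trained on disjoint corpora.

def caption_in_pivot(image_path):
    # Stand-in for an image -> pivot-language (Chinese) captioning model,
    # trained on an image-Chinese paired corpus.
    return "一只狗在草地上奔跑"

def translate_pivot_to_target(pivot_sentence):
    # Stand-in for a pivot -> target (Chinese -> English) NMT model,
    # trained on a Chinese-English parallel sentence corpus.
    lexicon = {"一只狗在草地上奔跑": "a dog runs on the grass"}
    return lexicon.get(pivot_sentence, "")

def pivot_caption(image_path):
    # Compose the two models: image -> pivot caption -> target caption.
    return translate_pivot_to_target(caption_in_pivot(image_path))

print(pivot_caption("dog.jpg"))  # -> a dog runs on the grass
```

In the paper the two components are additionally adapted to each other (the captioner's output distribution is aligned with the translator's input distribution); the plain composition above shows only the basic pipeline.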


Keywords: Image captioning · Unpaired learning



This research was carried out at the Rapid-Rich Object Search (ROSE) Lab at the Nanyang Technological University, Singapore. The ROSE Lab is supported by the National Research Foundation, Singapore, and the Infocomm Media Development Authority, Singapore. We gratefully acknowledge the support of NVIDIA AI Tech Center (NVAITC) for our research at NTU ROSE Lab, Singapore.


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Jiuxiang Gu (1), Email author
  • Shafiq Joty (2)
  • Jianfei Cai (2)
  • Gang Wang (3)
  1. ROSE Lab, Nanyang Technological University, Singapore, Singapore
  2. SCSE, Nanyang Technological University, Singapore, Singapore
  3. Alibaba AI Labs, Hangzhou, China
