Hierarchical Vision-Language Alignment for Video Captioning

  • Junchao Zhang
  • Yuxin Peng
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11295)


Video captioning has advanced considerably in recent years, yet it remains a challenging task because it is hard to capture the semantic correspondences between visual content and language descriptions. Language components of different granularities (e.g., words, phrases, and sentences) correspond to visual elements of different granularities (e.g., objects, visual relations, and regions of interest). These correspondences can provide multi-level alignments and complementary information for transforming visual content into language descriptions. We therefore propose an Attention Guided Hierarchical Alignment (AGHA) approach for video captioning. In the proposed approach, hierarchical vision-language alignments, including object-word, relation-phrase, and region-sentence alignments, are extracted from a well-trained model that is suited to multiple vision-and-language tasks. These alignments are then embedded into parallel encoder-decoder streams to provide multi-level semantic guidance and rich complementary information for description generation. In addition, multi-granularity visual features are exploited to obtain a coarse-to-fine understanding of complex video content, and an attention mechanism is applied to extract comprehensive, discriminative visual information that enhances video captioning. Experimental results on the widely used MSVD dataset demonstrate that AGHA achieves promising improvements on popular evaluation metrics.
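As a rough illustration of how an attention mechanism can weight visual features of several granularities, the sketch below applies plain dot-product soft attention to three toy feature vectors. It is not the AGHA model: the scoring function, the feature values, and the names `softmax`, `attend`, `features`, and `query` are all assumptions made for this example, and the paper's actual attention formulation may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Dot-product soft attention (an assumed, simplified scoring function).

    Returns the attention weights and the weighted sum of the feature vectors.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical toy features standing in for three visual granularities
# (object-level, relation-level, region-level); the values are made up.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]  # stands in for a decoder hidden state

weights, context = attend(query, features)
```

The weights sum to one, so the context vector is a convex combination of the input features; features that score higher against the query contribute more to it.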


Video captioning · Hierarchical vision-language alignment · Multi-granularity



This work was supported by the National Natural Science Foundation of China under Grant 61771025.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Institute of Computer Science and Technology, Peking University, Beijing, China
