
Multimedia Tools and Applications, Volume 78, Issue 22, pp 31793–31805

Deep multimodal embedding for video captioning

  • Jin Young Lee
Article

Abstract

Automatically generating natural language descriptions from videos, simply called video captioning, is a very challenging task in computer vision. Thanks to the success of image captioning, rapid progress has been made in video captioning in recent years. Unlike images, videos carry several modalities of information, such as frames, motion, and audio. However, since each modality has different characteristics, how these modalities are embedded in a multimodal video captioning network is very important. This paper proposes a deep multimodal embedding network based on an analysis of the multimodal features. Experimental results show that the captioning performance of the proposed network is very competitive with that of conventional networks.
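
To illustrate the general idea of embedding heterogeneous modalities into a shared space before an LSTM decoder, the following is a minimal sketch only; it is not the paper's actual architecture. It assumes pre-extracted frame, motion, and audio features, and the class name, feature dimensions, fusion by summation, and all hyperparameters are hypothetical choices made for illustration.

```python
# Minimal PyTorch sketch of a multimodal embedding for video captioning.
# All dimensions, module names, and the fusion scheme are illustrative
# assumptions, not the architecture proposed in the paper.
import torch
import torch.nn as nn

class MultimodalCaptioner(nn.Module):
    def __init__(self, frame_dim=1536, motion_dim=1024, audio_dim=128,
                 embed_dim=512, vocab_size=10000):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.frame_proj = nn.Linear(frame_dim, embed_dim)
        self.motion_proj = nn.Linear(motion_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM decoder conditioned on the fused video embedding.
        self.decoder = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.classifier = nn.Linear(embed_dim, vocab_size)

    def forward(self, frame_feat, motion_feat, audio_feat, captions):
        # Fuse modalities by summing their projections (one simple choice;
        # concatenation or attention-based fusion are common alternatives).
        video_emb = (self.frame_proj(frame_feat)
                     + self.motion_proj(motion_feat)
                     + self.audio_proj(audio_feat))      # (B, embed_dim)
        words = self.word_embed(captions)                 # (B, T, embed_dim)
        # Prepend the video embedding as the first decoder input.
        inputs = torch.cat([video_emb.unsqueeze(1), words], dim=1)
        hidden, _ = self.decoder(inputs)
        return self.classifier(hidden)                    # (B, T+1, vocab)

# Example usage with random features.
model = MultimodalCaptioner()
logits = model(torch.randn(2, 1536), torch.randn(2, 1024),
               torch.randn(2, 128), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])
```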

Keywords

Deep embedding · LSTM network · Multimodal features · Video captioning

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. School of Intelligent Mechatronics Engineering, Sejong University, Seoul, South Korea
