Multimedia Tools and Applications

Volume 78, Issue 22, pp 31231–31243

Deep learning ensemble with data augmentation using a transcoder in visual description

  • Jin Young Lee


Visual description is a very challenging task in computer vision. Because it is usually performed on compressed videos, its performance strongly depends on coding distortion. It is therefore important to train visual description networks on video datasets of both high and low quality. To generate such datasets from a given training set, this paper introduces a new data augmentation method that employs a transcoder, which converts one video quality into another by controlling a quantization parameter (QP). Two different networks are trained on the high- and low-quality videos, respectively, and the proposed deep learning ensemble model then selects the optimum sentence from the candidates generated by these networks. Experimental results show that the proposed method is very robust to coding distortion.
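The ensemble step described above selects one sentence from the candidates produced by the two networks. As a minimal illustration only, the sketch below picks the candidate most similar to the others using a simple word-overlap consensus heuristic; the paper's actual ensemble is a learned deep model, so the heuristic, the function name, and the example captions here are all illustrative assumptions.

```python
def consensus_select(candidates):
    """Return the candidate sentence with the highest total word-overlap
    (Jaccard) similarity to the other candidates -- a simple consensus
    heuristic standing in for the paper's learned ensemble model."""
    def jaccard(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1)

    scores = [
        sum(jaccard(c, other) for j, other in enumerate(candidates) if j != i)
        for i, c in enumerate(candidates)
    ]
    return candidates[scores.index(max(scores))]


# Hypothetical candidates: one from a network trained on high-quality video,
# one from a network trained on low-quality (transcoded) video, plus an outlier.
candidates = [
    "a man is cooking",        # high-quality network output
    "a man is cooking food",   # low-quality network output
    "a dog runs",              # implausible outlier
]
print(consensus_select(candidates))  # -> "a man is cooking"
```

The outlier caption shares few words with the other two, so the mutually consistent candidates dominate the consensus score.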


Keywords: Deep learning · Ensemble · Data augmentation · Visual description · Transcoder



Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2018R1C1B5086072).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. School of Intelligent Mechatronics Engineering, Sejong University, Seoul, South Korea
