Video DeCaptioning Using U-Net with Stacked Dilated Convolutional Layers

  • Shivansh Mundra
  • Arnav Kumar Jain
  • Sayan Sinha
Conference paper
Part of The Springer Series on Challenges in Machine Learning book series (SSCML)


We present a supervised video decaptioning algorithm driven by encoder-decoder pixel prediction. By analogy with auto-encoders, we use a U-Net with stacked dilated convolution layers, a convolutional neural network trained to generate the decaptioned version of an arbitrary video with subtitles of any size, colour, or background. Moreover, our method does not require a mask of the text region to be removed. To succeed at this task, the model must both understand the content of entire video frames and produce a visually appealing hypothesis for the missing content behind the text overlay. When training our model, we experimented with both a standard pixel-wise reconstruction loss and a total variation loss. The latter produces much sharper results because it enforces the inherent local nature of the generated image. We found that our model learns a representation that captures not just the appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of including dilated convolution layers and residual connections in the bottleneck layer for reconstructing videos without captions. Furthermore, our model can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
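The two training losses named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not say whether the pixel-wise term is L1 or L2, which total variation variant is used, or how the terms are weighted. The squared-error form, the anisotropic total variation, the function names, and the `tv_weight` value below are therefore all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def pixel_loss(pred, target):
    """Pixel-wise reconstruction loss (assumed here to be mean squared error)."""
    return float(np.mean((pred - target) ** 2))


def total_variation(img):
    """Anisotropic total variation of an (H, W, C) image.

    Averages the absolute differences between vertically and
    horizontally adjacent pixels, so a constant image scores 0.
    """
    dh = np.abs(img[1:, :, :] - img[:-1, :, :])  # vertical neighbours
    dw = np.abs(img[:, 1:, :] - img[:, :-1, :])  # horizontal neighbours
    return float(dh.mean() + dw.mean())


def combined_loss(pred, target, tv_weight=0.1):
    """Reconstruction loss plus a weighted total variation penalty.

    tv_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return pixel_loss(pred, target) + tv_weight * total_variation(pred)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.random((8, 8, 3))                      # stand-in for a clean frame
    pred = target + 0.05 * rng.standard_normal((8, 8, 3))  # stand-in for a model output
    print("pixel loss:", pixel_loss(pred, target))
    print("total variation:", total_variation(pred))
    print("combined:", combined_loss(pred, target))
```

In this sketch the total variation term depends only on the predicted frame, so it acts as a regulariser on local pixel differences, while the reconstruction term ties the prediction to the caption-free ground truth.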



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Shivansh Mundra (1)
  • Arnav Kumar Jain (2)
  • Sayan Sinha (3)
  1. Department of Mechanical Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India
  2. Department of Mathematics, Indian Institute of Technology Kharagpur, Kharagpur, India
  3. Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India
