Efficient Uncertainty Estimation for Semantic Segmentation in Videos

  • Po-Yu Huang
  • Wan-Ting Hsu
  • Chun-Yueh Chiu
  • Ting-Fan Wu
  • Min Sun
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11205)


Uncertainty estimation in deep learning has become increasingly important in recent years. A deep learning model cannot be deployed in real applications unless we know whether the model is certain about its decisions. Prior work proposes Bayesian neural networks, which estimate uncertainty via Monte Carlo dropout (MC dropout). However, MC dropout requires N forward passes through the model, making inference N times slower. Real-time applications such as self-driving systems need both the prediction and its uncertainty as fast as possible, so MC dropout becomes impractical. In this work, we propose a region-based temporal aggregation (RTA) method that leverages temporal information in videos to simulate the sampling procedure. Our RTA method with a Tiramisu backbone is 10x faster than MC dropout with a Tiramisu backbone (\(N=5\)). Furthermore, the uncertainty estimates obtained by our RTA method are comparable to MC dropout's on both pixel-level and frame-level metrics.
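To make the MC-dropout baseline concrete: it keeps dropout active at test time, runs N stochastic forward passes, averages the per-pixel class probabilities, and reads off an uncertainty map (here, predictive entropy). The following is a minimal NumPy sketch of that procedure, not the paper's implementation; `noisy_logits` is a hypothetical toy stand-in for a segmentation network with dropout, and the image, layer sizes, and dropout rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_segment(logits_fn, x, n_samples=5):
    """Run n_samples stochastic forward passes (dropout left active at
    test time) and return the mean per-pixel class probabilities plus
    the predictive-entropy uncertainty map."""
    probs = []
    for _ in range(n_samples):
        logits = logits_fn(x)                       # (H, W, C), stochastic
        e = np.exp(logits - logits.max(-1, keepdims=True))
        probs.append(e / e.sum(-1, keepdims=True))  # per-pixel softmax
    mean_p = np.mean(probs, axis=0)                 # (H, W, C)
    entropy = -np.sum(mean_p * np.log(mean_p + 1e-12), axis=-1)
    return mean_p, entropy                          # entropy: (H, W)

# Toy "model": a fixed linear map whose hidden units are dropped
# independently on every call, mimicking MC-dropout sampling.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

def noisy_logits(x, p_drop=0.5):
    h = np.maximum(x @ W1, 0)                       # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop             # Bernoulli dropout mask
    h = h * mask / (1.0 - p_drop)                   # inverted-dropout scaling
    return h @ W2                                   # 3-class logits

x = rng.normal(size=(16, 16, 4))                    # fake 16x16 "image"
mean_p, uncertainty = mc_dropout_segment(noisy_logits, x, n_samples=5)
print(mean_p.shape, uncertainty.shape)              # (16, 16, 3) (16, 16)
```

The cost the paper targets is visible in the loop: one uncertainty map requires N full forward passes, which RTA avoids by reusing predictions from neighboring video frames instead of re-sampling the current one.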


Keywords: Uncertainty · Segmentation · Video · Efficient



We thank Umbo CV, MediaTek, and MOST 107-2634-F-007-007 for their support.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. National Tsing Hua University, Taipei, Taiwan
  2. Umbo Computer Vision, Taipei, Taiwan
