Learning multi-scale features for foreground segmentation

Abstract

Foreground segmentation algorithms aim to segment moving objects from the background robustly under various challenging scenarios. Encoder–decoder-type deep neural networks used in this domain have recently achieved impressive segmentation results. In this work, we propose a variation of our formerly proposed method (Lim and Keles 2018) that can be trained end-to-end using only a few training examples. The proposed method extends the feature pooling module of FgSegNet by fusing features inside this module, making it capable of extracting multi-scale features within images; the resulting feature pooling is robust against camera motion and can alleviate the need for multi-scale inputs to the network. Sample visualizations highlight the regions in the images on which the model focuses most; these regions are also the most semantically relevant. Our method outperforms all existing state-of-the-art methods on the CDnet2014 dataset with an average overall F-measure of 0.9847. We also evaluate the effectiveness of our method on the SBI2015 and UCSD Background Subtraction datasets. The source code of the proposed method is made available at https://github.com/lim-anggun/FgSegNet_v2.
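The multi-scale feature pooling described above can be illustrated with a minimal NumPy sketch. This is a deliberate simplification, not the paper's exact FgSegNet_v2 module: parallel dilated 3×3 convolutions sample context at several rates, and their responses are fused here by element-wise summation. The function names, the choice of dilation rates, and the sum fusion are all illustrative assumptions.

```python
import numpy as np

def dilated_conv2d(x, k, rate):
    """'Same'-padded 2D cross-correlation of image x with a dilated square kernel k."""
    kh, kw = k.shape
    pad = rate * (kh // 2)
    xp = np.pad(x, pad)                  # zero-pad so the output keeps x's shape
    out = np.zeros_like(x, dtype=float)
    for i in range(kh):                  # accumulate one shifted, weighted copy per tap
        for j in range(kw):
            out += k[i, j] * xp[i * rate:i * rate + x.shape[0],
                                j * rate:j * rate + x.shape[1]]
    return out

def feature_pooling(x, kernels, rates=(1, 2, 4, 8)):
    """Fuse parallel dilated-convolution responses by element-wise summation."""
    return sum(dilated_conv2d(x, k, r) for k, r in zip(kernels, rates))
```

In the actual network these operations run on multi-channel feature maps with learned kernels and nonlinearities; the sketch only conveys how fusing responses at several dilation rates pools context at multiple scales from a single-resolution input.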



References

1. Lim LA, Keles HY (2018) Foreground segmentation using convolutional neural networks for multiscale feature encoding. Pattern Recognit Lett 112:256–262
2. Babaee M, Dinh DT, Rigoll G (2017) A deep convolutional neural network for background subtraction. arXiv preprint arXiv:1702.01731
3. Badrinarayanan V, Kendall A, Cipolla R (2015) SegNet: a deep convolutional encoder–decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561
4. Barnich O, Van Droogenbroeck M (2011) ViBe: a universal background subtraction algorithm for video sequences. IEEE Trans Image Process 20(6):1709–1724
5. Basharat A, Gritai A, Shah M (2008) Learning object motion patterns for anomaly detection and improved object detection. In: IEEE conference on computer vision and pattern recognition (CVPR 2008). IEEE, pp 1–8
6. Bianco S, Ciocca G, Schettini R (2017) How far can you get by combining change detection algorithms? In: International conference on image analysis and processing. Springer, Berlin, pp 96–107
7. Braham M, Van Droogenbroeck M (2016) Deep background subtraction with scene-specific convolutional neural networks. In: 2016 international conference on systems, signals and image processing (IWSSIP). IEEE, pp 1–4
8. Brutzer S, Höferlin B, Heidemann G (2011) Evaluation of background subtraction techniques for video surveillance. In: 2011 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 1937–1944
9. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848
10. Chen LC, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
11. Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder–decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611
12. Cheung SCS, Kamath C (2004) Robust techniques for background subtraction in urban traffic video. Proc SPIE 5308:881–892
13. Chollet F et al (2015) Keras. https://keras.io. Accessed 29 Aug 2019
14. Cinelli LP, Thomaz LA, da Silva AF, da Silva EA, Netto SL (2017) Foreground segmentation for anomaly detection in surveillance videos using deep residual networks. In: Proceedings of the XXXV Brazilian communication and signal processing symposium, pp 914–918
15. Hofmann M, Tiefenbacher P, Rigoll G (2012) Background segmentation with feedback: the pixel-based adaptive segmenter. In: 2012 IEEE computer society conference on computer vision and pattern recognition workshops (CVPRW). IEEE, pp 38–43
16. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
17. KaewTraKulPong P, Bowden R (2002) An improved adaptive background mixture model for real-time tracking with shadow detection. Video-Based Surveill Syst 1:135–144
18. Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3128–3137
19. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems (NIPS 2012), 3–8 Dec 2012, Nevada, USA, pp 1097–1105
20. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
21. Lim K, Jang WD, Kim CS (2017) Background subtraction using encoder–decoder structured convolutional neural network. In: 2017 14th IEEE international conference on advanced video and signal based surveillance (AVSS). IEEE, pp 1–6
22. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
23. Maddalena L, Petrosino A (2015) Towards benchmarking scene background initialization. In: International conference on image analysis and processing. Springer, Berlin, pp 469–476
24. Mahadevan V, Vasconcelos N (2010) Spatiotemporal saliency in dynamic scenes. IEEE Trans Pattern Anal Mach Intell 32(1):171–177. https://doi.org/10.1109/TPAMI.2009.112
25. Poppe R (2010) A survey on vision-based human action recognition. Image Vis Comput 28(6):976–990
26. Porikli F, Tuzel O (2003) Human body tracking by adaptive background models and mean-shift analysis. In: IEEE international workshop on performance evaluation of tracking and surveillance, pp 1–9
27. Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, Berlin, pp 234–241
28. Sakkos D, Liu H, Han J, Shao L (2017) End-to-end video background subtraction with 3D convolutional neural networks. Multimed Tools Appl 77(17):23023–23041
29. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In: 2017 IEEE international conference on computer vision (ICCV). IEEE, pp 618–626
30. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
31. Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806
32. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
33. Stauffer C, Grimson WEL (1999) Adaptive background mixture models for real-time tracking. In: IEEE computer society conference on computer vision and pattern recognition, 1999, vol 2. IEEE, pp 246–252
34. Tompson J, Goroshin R, Jain A, LeCun Y, Bregler C (2015) Efficient object localization using convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 648–656
35. Ulyanov D, Vedaldi A, Lempitsky VS (2016) Instance normalization: the missing ingredient for fast stylization. CoRR abs/1607.08022
36. Van Droogenbroeck M, Paquot O (2012) Background subtraction: experiments and improvements for ViBe. In: 2012 IEEE computer society conference on computer vision and pattern recognition workshops (CVPRW). IEEE, pp 32–37
37. Wang Y, Jodoin PM, Porikli F, Konrad J, Benezeth Y, Ishwar P (2014) CDnet 2014: an expanded change detection benchmark dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 387–394
38. Wang Y, Luo Z, Jodoin PM (2017) Interactive deep learning method for segmenting moving objects. Pattern Recognit Lett 96:66–75. https://doi.org/10.1016/j.patrec.2016.09.014
39. Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
40. Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: European conference on computer vision. Springer, Berlin, pp 818–833
41. Zhu S, Xia L (2015) Human action recognition based on fusion features extraction of adaptive background subtraction and optical flow model. Math Probl Eng 2015:387–464
42. Zivkovic Z (2004) Improved adaptive Gaussian mixture model for background subtraction. In: Proceedings of the 17th international conference on pattern recognition (ICPR 2004), vol 2. IEEE, pp 28–31


Acknowledgements

We would like to thank the anonymous reviewers for their valuable suggestions and comments.

Author information


Corresponding author

Correspondence to Hacer Yalim Keles.



About this article


Cite this article

Lim, L.A., Keles, H.Y. Learning multi-scale features for foreground segmentation. Pattern Anal Applic 23, 1369–1380 (2020). https://doi.org/10.1007/s10044-019-00845-9


Keywords

  • Foreground segmentation
  • Convolutional neural networks
  • Feature pooling module
  • Background subtraction
  • Video surveillance