STFCN: Spatio-Temporal Fully Convolutional Neural Network for Semantic Segmentation of Street Scenes

Fayyaz, Mohsen; Saffar, Mohammad Hajizadeh; Sabokrou, Mohammad; Fathy, Mahmood; Huang, Fay; Klette, Reinhard

doi:10.1007/978-3-319-54407-6_33

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10116)

Included in the following conference series:

Asian Conference on Computer Vision

Abstract

This paper presents a novel method that uses both spatial and temporal features for semantic segmentation of street scenes. Recent work on convolutional neural networks (CNNs) has shown that CNNs provide strong spatial features that support very good performance on the semantic segmentation task. We investigate how adding temporal features also benefits the segmentation of video data. We propose a module based on the long short-term memory (LSTM) architecture of a recurrent neural network for interpreting the temporal characteristics of video frames over time. Our system takes video frames as input and produces a correspondingly sized output. To segment the video, our method combines three components: first, the regional spatial features of frames are extracted using a CNN; then, temporal features are added using an LSTM; finally, pixel-wise predictions are produced by deconvolving the spatio-temporal features. Our key insight is to build spatio-temporal convolutional networks (spatio-temporal CNNs) that have an end-to-end architecture for semantic video segmentation. We adapted some known fully convolutional network architectures (such as FCN-AlexNet and FCN-VGG16), as well as dilated convolution, into our spatio-temporal CNNs. Our spatio-temporal CNNs achieve state-of-the-art semantic segmentation, as demonstrated on the CamVid and NYUDv2 datasets.
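The three-stage pipeline described in the abstract (spatial features, then temporal features, then upsampling to pixel-wise labels) can be illustrated with a small NumPy sketch. This is not the authors' Caffe implementation: the CNN is replaced by average pooling plus a channel projection, the deconvolution by nearest-neighbour upsampling, the LSTM is assumed to run independently at each feature-map location with shared weights, and all weights are random. The function names (`spatial_features`, `lstm_step`, `upsample_and_classify`, `segment_video`) and the toy sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_features(frame, w_proj, stride=4):
    """Stand-in for the CNN stage (assumption): average-pool each
    stride x stride region, project channels, apply ReLU.
    frame: (H, W, C_in) -> (H/stride, W/stride, C_feat)."""
    h, w, c = frame.shape
    pooled = frame.reshape(h // stride, stride, w // stride, stride, c).mean(axis=(1, 3))
    return np.maximum(pooled @ w_proj, 0.0)

def lstm_step(x, h, c, w, b):
    """One standard LSTM step, applied to every spatial location at once.
    x: (N, D); h, c: (N, K); w: (D + K, 4K); b: (4K,)."""
    z = np.concatenate([x, h], axis=1) @ w + b
    i, f, o, g = np.split(z, 4, axis=1)  # input, forget, output gates, candidate
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def upsample_and_classify(feat, w_cls, stride=4):
    """Stand-in for the deconvolution stage (assumption): per-class scores,
    nearest-neighbour upsampling back to frame size, then argmax."""
    scores = feat @ w_cls
    scores = scores.repeat(stride, axis=0).repeat(stride, axis=1)
    return scores.argmax(axis=-1)

def segment_video(frames, params, stride=4):
    """Run the three stages frame by frame, carrying LSTM state over time."""
    w_proj, w_lstm, b_lstm, w_cls = params
    k = w_cls.shape[0]
    h_state = c_state = None
    outputs = []
    for frame in frames:
        feat = spatial_features(frame, w_proj, stride)
        fh, fw, d = feat.shape
        if h_state is None:
            h_state = np.zeros((fh * fw, k))
            c_state = np.zeros((fh * fw, k))
        h_state, c_state = lstm_step(feat.reshape(fh * fw, d), h_state, c_state, w_lstm, b_lstm)
        outputs.append(upsample_and_classify(h_state.reshape(fh, fw, k), w_cls, stride))
    return np.stack(outputs)

# Arbitrary toy sizes and random, untrained weights (illustration only).
rng = np.random.default_rng(0)
C_IN, C_FEAT, HIDDEN, N_CLASSES = 3, 8, 16, 11
params = (
    rng.normal(size=(C_IN, C_FEAT)),
    rng.normal(size=(C_FEAT + HIDDEN, 4 * HIDDEN)) * 0.1,
    np.zeros(4 * HIDDEN),
    rng.normal(size=(HIDDEN, N_CLASSES)),
)
video = rng.random((5, 32, 48, C_IN))  # 5 frames of 32 x 48 RGB
labels = segment_video(video, params)
print(labels.shape)  # (5, 32, 48): one class label per pixel per frame
```

The output has the same spatial size as each input frame, matching the "correspondingly sized output" property stated in the abstract. Because the LSTM state is carried from frame to frame, the prediction for a frame depends on earlier frames as well as on its own spatial features.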

Notes

1. Available at https://github.com/junhyukoh/caffe-lstm.
2. Our modified Caffe distribution and STFCN models are publicly available at https://github.com/MohsenFayyaz89/STFCN.
3. Available at mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/.
4. Available at https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html.

Author information

Authors and Affiliations

Malek-Ashtar University of Technology, Tehran, Iran
Mohsen Fayyaz, Mohammad Hajizadeh Saffar & Mohammad Sabokrou
Iran University of Science and Technology, Tehran, Iran
Mahmood Fathy
National Ilan University, Yilan, Taiwan
Fay Huang
Auckland University of Technology, Auckland, New Zealand
Reinhard Klette

Corresponding author

Correspondence to Mohsen Fayyaz.

Editor information

Editors and Affiliations

Institute of Information Science, Academia Sinica, Taipei, Taiwan
Chu-Song Chen
Tsinghua University, Beijing, China
Jiwen Lu
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore
Kai-Kuang Ma

About this paper

Cite this paper

Fayyaz, M., Saffar, M.H., Sabokrou, M., Fathy, M., Huang, F., Klette, R. (2017). STFCN: Spatio-Temporal Fully Convolutional Neural Network for Semantic Segmentation of Street Scenes. In: Chen, CS., Lu, J., Ma, KK. (eds) Computer Vision – ACCV 2016 Workshops. ACCV 2016. Lecture Notes in Computer Science, vol 10116. Springer, Cham. https://doi.org/10.1007/978-3-319-54407-6_33

DOI: https://doi.org/10.1007/978-3-319-54407-6_33
Published: 15 March 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54406-9
Online ISBN: 978-3-319-54407-6
eBook Packages: Computer Science, Computer Science (R0)
