Deep Learning for Scene Understanding

Nadeem, Uzair; Shah, Syed Afaq Ali; Sohel, Ferdous; Togneri, Roberto; Bennamoun, Mohammed

doi:10.1007/978-3-030-11479-4_2

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 136)

Abstract

As computer vision progresses, we move closer to the ultimate aim of human-like vision for machines. Scene understanding is an essential part of this research: its goal is to make any image as understandable and decipherable for computers as it is for humans. Progress in the different components of scene understanding had stalled because of the limitations of traditional algorithms; the introduction of neural networks for computer vision tasks has broken that stall. Advances in parallel computing hardware have made it possible to train very deep and complex neural network architectures, which has vastly improved the performance of algorithms for all the components of scene understanding. This chapter analyses these contributions of deep learning and presents advances in high-level scene understanding tasks, such as caption generation for images. It also sheds light on the need to combine these individual components into an integrated system.

Acknowledgements

This work is partially supported by a SIRF Scholarship from the University of Western Australia (UWA) and by Australian Research Council (ARC) Grant DP150100294.

Author information

Authors and Affiliations

Department of Computer Science and Software Engineering, The University of Western Australia, Crawley, Australia
Uzair Nadeem, Syed Afaq Ali Shah & Mohammed Bennamoun
Discipline of Information Technology, Mathematics & Statistics, Murdoch University, Perth, Australia
Ferdous Sohel
Department of Electrical, Electronics and Computer Engineering, The University of Western Australia, Crawley, Australia
Roberto Togneri

Corresponding author

Correspondence to Mohammed Bennamoun.

Editor information

Editors and Affiliations

Aurel Vlaicu University of Arad, Arad, Romania
Valentina Emilia Balas
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India
Sanjiban Sekhar Roy
University of Canberra, Bruce, ACT, Australia
Dharmendra Sharma
Department of Civil Engineering, National Institute of Technology Patna, Patna, Bihar, India
Pijush Samui

About this chapter

Cite this chapter

Nadeem, U., Shah, S.A.A., Sohel, F., Togneri, R., Bennamoun, M. (2019). Deep Learning for Scene Understanding. In: Balas, V., Roy, S., Sharma, D., Samui, P. (eds) Handbook of Deep Learning Applications. Smart Innovation, Systems and Technologies, vol 136. Springer, Cham. https://doi.org/10.1007/978-3-030-11479-4_2

DOI: https://doi.org/10.1007/978-3-030-11479-4_2
Published: 26 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11478-7
Online ISBN: 978-3-030-11479-4
eBook Packages: Intelligent Technologies and Robotics (R0)