Abstract
With progress in computer vision, we are moving ever closer to the ultimate aim of human-like vision for machines. Scene understanding is an essential part of this research. Its goal is that any image should be as understandable and decipherable for computers as it is for humans. Progress on the individual components of scene understanding had stalled because of the limitations of traditional algorithms; the introduction of neural networks for computer vision tasks has broken this stall. Advances in parallel computational hardware have made it possible to train very deep and complex neural network architectures. This has vastly improved the performance of algorithms for all the different components of scene understanding. This chapter analyses these contributions of deep learning and also presents the advances in high-level scene understanding tasks, such as caption generation for images. It also sheds light on the need to combine these individual components into an integrated system.
Acknowledgements
This work is partially supported by a SIRF Scholarship from the University of Western Australia (UWA) and by Australian Research Council (ARC) Grant DP150100294.
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Nadeem, U., Shah, S.A.A., Sohel, F., Togneri, R., Bennamoun, M. (2019). Deep Learning for Scene Understanding. In: Balas, V., Roy, S., Sharma, D., Samui, P. (eds) Handbook of Deep Learning Applications. Smart Innovation, Systems and Technologies, vol 136. Springer, Cham. https://doi.org/10.1007/978-3-030-11479-4_2
DOI: https://doi.org/10.1007/978-3-030-11479-4_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11478-7
Online ISBN: 978-3-030-11479-4
eBook Packages: Intelligent Technologies and Robotics (R0)