Deep Learning for Scene Understanding

Handbook of Deep Learning Applications

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 136)

Abstract

As computer vision progresses, we are moving closer to the ultimate aim of human-like vision for machines. Scene understanding is an essential part of this research: its goal is to make any image as understandable and decipherable for computers as it is for humans. Progress on the individual components of scene understanding had stalled because of the limitations of traditional algorithms; the introduction of neural networks for computer vision tasks has broken that stall. Advances in parallel computational hardware have made it possible to train very deep and complex neural network architectures, which has vastly improved the performance of algorithms for every component of scene understanding. This chapter analyses these contributions of deep learning and presents advances in high-level scene understanding tasks, such as caption generation for images. It also sheds light on the need to combine these individual components into an integrated system.

Acknowledgements

This work is partially supported by a SIRF Scholarship from the University of Western Australia (UWA) and by Australian Research Council (ARC) Grant DP150100294.

Correspondence to Mohammed Bennamoun.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Nadeem, U., Shah, S.A.A., Sohel, F., Togneri, R., Bennamoun, M. (2019). Deep Learning for Scene Understanding. In: Balas, V., Roy, S., Sharma, D., Samui, P. (eds) Handbook of Deep Learning Applications. Smart Innovation, Systems and Technologies, vol 136. Springer, Cham. https://doi.org/10.1007/978-3-030-11479-4_2