
Computer Vision for Image Understanding: A Comprehensive Review

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1066)

Abstract

Computer Vision has its own Turing test: can a machine describe the contents of an image or a video the way a human being would? In this paper, we analyze the progress of Deep Learning for image recognition in order to answer this question. In recent years, Deep Learning has considerably increased the accuracy of many computer vision tasks. Many datasets of labeled images are now available online, which has led to pre-trained models for many computer vision applications. In this work, we gather information on the latest techniques for image understanding and description. We conclude that combining Natural Language Processing (using Recurrent Neural Networks and Long Short-Term Memory) with Image Understanding (using Convolutional Neural Networks) could enable powerful and useful new applications in which a computer answers questions about the content of images and videos. Building datasets of labeled images requires a great deal of work, and most such datasets are built through crowdsourcing. These new applications have the potential to raise human-machine interaction to new levels of usability and user satisfaction.
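To make the encoder-decoder combination described above concrete, the following is a minimal sketch in PyTorch of a CNN-plus-LSTM captioning pipeline of the kind the abstract refers to. It is not the implementation evaluated in the paper: the ResNet-18 backbone, the 1000-word vocabulary, and the embedding and hidden dimensions are all illustrative assumptions (and the weights API assumes torchvision 0.13 or later).

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class CNNEncoder(nn.Module):
        """Encode an image into a fixed-length vector with a pre-trained CNN."""
        def __init__(self, embed_dim):
            super().__init__()
            resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            # Keep the convolutional trunk; drop the ImageNet classification head.
            self.backbone = nn.Sequential(*list(resnet.children())[:-1])
            self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

        def forward(self, images):               # images: (B, 3, 224, 224)
            with torch.no_grad():                # reuse pre-trained features as-is
                feats = self.backbone(images).flatten(1)
            return self.fc(feats)                # (B, embed_dim)

    class LSTMDecoder(nn.Module):
        """Generate a caption word by word, conditioned on the image embedding."""
        def __init__(self, vocab_size, embed_dim, hidden_dim):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, img_embed, captions):  # captions: (B, T) word ids
            words = self.embed(captions)         # (B, T, embed_dim)
            # Prepend the image embedding as the first input step of the sequence.
            inputs = torch.cat([img_embed.unsqueeze(1), words], dim=1)
            hidden, _ = self.lstm(inputs)
            return self.out(hidden)              # (B, T+1, vocab_size) logits

    # Illustrative usage on random data (batch of 4, captions of 12 words).
    encoder = CNNEncoder(embed_dim=256)
    decoder = LSTMDecoder(vocab_size=1000, embed_dim=256, hidden_dim=512)
    images = torch.randn(4, 3, 224, 224)
    captions = torch.randint(0, 1000, (4, 12))
    logits = decoder(encoder(images), captions)
    print(logits.shape)  # torch.Size([4, 13, 1000])

At inference time, caption generation would feed each predicted word back into the LSTM, greedily or with beam search; that loop is omitted here for brevity. Visual question answering works analogously, with the question encoded by the recurrent network and fused with the CNN image features before decoding an answer.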

Author information

Corresponding author

Correspondence to Marlon-Santiago Viñán-Ludeña.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Jácome-Galarza, L.R., Realpe-Robalino, M.A., Chamba-Eras, L.A., Viñán-Ludeña, M.S., Sinche-Freire, J.F. (2020). Computer Vision for Image Understanding: A Comprehensive Review. In: Botto-Tobar, M., León-Acurio, J., Díaz Cadena, A., Montiel Díaz, P. (eds) Advances in Emerging Trends and Technologies. ICAETT 2019. Advances in Intelligent Systems and Computing, vol 1066. Springer, Cham. https://doi.org/10.1007/978-3-030-32022-5_24
