What Vision Can, Can’t and Should Do

  • Michael Zillich
Part of the Cognitive Systems Monographs book series (COSMOS, volume 22)


Computer vision has come a long way since its beginnings. In this chapter, we review some of the recent successes, which seem to indicate that many aspects of vision have indeed been solved and that the way should now be paved for robotic systems that can operate freely in the real world. On closer inspection, though, that is not yet the case. A set of specialised solutions in different sub-areas, however impressive individually, does not constitute a unified theory of vision. We point out some of the problems of current approaches, most notably a lack of abstraction and difficulty in dealing with uncertainty. Finally, we suggest what research should and should not focus on in order to advance on a broader basis.


Keywords: Computer Vision, Rapid Serial Visual Presentation, Depth Sensor, Structure From Motion, Scene Reconstruction
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



The research leading to these results has received funding from the European Community’s Seventh Framework Programme FP7/2007-2013 under grant agreements No. 215181 (CogX) and No. 600623 (STRANDS), and from the Austrian Science Foundation (FWF) under grant agreements No. I513-N23 (vision@home) and No. TRP 139-N23 (InSitu).



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Vienna University of Technology, Vienna, Austria
