Multimedia Tools and Applications, Volume 78, Issue 10, pp 13767–13786

A unifying representation for pixel-precise distance estimation

  • Simone Bianco
  • Marco Buzzelli
  • Raimondo Schettini


We propose a new representation of distance information that is independent of any specific acquisition device and is based on the size of the portrayed subjects. In this alternative description, each pixel of an image is associated with the real-world size of what it represents. With the proposed representation, datasets acquired with different devices can be combined to build more powerful models, and monocular distance estimation can be performed on images from devices that were never used during training. To assess these advantages, we used the representation to train a fully convolutional neural network that predicts, with pixel precision, the size of the different subjects depicted in the image as a proxy for their distance. Experimental results show that, by allowing heterogeneous training datasets to be combined, our representation enables the trained network to achieve better results at test time.
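The device independence described above can be illustrated with a minimal sketch. It assumes an ideal pinhole camera, under which a pixel viewing a surface at depth Z (in metres) with focal length f (in pixels) covers roughly Z / f metres of the scene. This is an illustrative assumption and not necessarily the authors' exact formulation; the function names and camera parameters below are hypothetical.

```python
def depth_to_pixel_size(depth_m, focal_px):
    """Convert a depth map (metres) into a per-pixel size map (metres per pixel).

    Assumes an ideal pinhole camera: one pixel at depth Z spans about Z / f metres.
    """
    if focal_px <= 0:
        raise ValueError("focal length must be positive")
    return [[z / focal_px for z in row] for row in depth_m]


def pixel_size_to_depth(size_m, focal_px):
    """Recover depth (metres) from a per-pixel size map for a known camera."""
    if focal_px <= 0:
        raise ValueError("focal length must be positive")
    return [[s * focal_px for s in row] for row in size_m]


# Two hypothetical cameras with different focal lengths (values are made up).
focal_a, focal_b = 1000.0, 500.0
depth_a = [[10.0, 20.0]]  # metres, as seen by camera A
depth_b = [[5.0, 10.0]]   # metres, as seen by camera B

# Depth values differ between the devices, but the size maps coincide,
# so both could serve as targets for the same model.
size_a = depth_to_pixel_size(depth_a, focal_a)
size_b = depth_to_pixel_size(depth_b, focal_b)

# At test time, a predicted size map is turned back into distance
# using the focal length of whichever camera took the image.
recovered_a = pixel_size_to_depth(size_a, focal_a)
```

In this sketch the focal length is needed only when converting to or from metric depth, which is what would let images from an unseen device be handled at test time.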


Keywords: Distance estimation; Depth estimation; Perspective geometry; Convolutional neural network



We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Milan, Italy
