
3D Visual Object Detection from Monocular Images

  • Qiaosong Wang
  • Christopher Rasmussen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11844)

Abstract

3D visual object detection is a fundamental requirement for autonomous vehicles. However, accurate 3D object detection was until recently only achievable with expensive LiDAR ranging devices, and approaches based on cheaper monocular imagery typically cannot localize objects in 3D. In this paper, we propose a novel approach to predict accurate 3D bounding box locations from monocular images. We first train a generative adversarial network (GAN) to perform monocular depth estimation, with ground-truth training depth obtained via depth completion of LiDAR scans. Next, we combine the depth and appearance data into a bird's-eye-view representation with height, density, and grayscale intensity as the three feature channels. Finally, we train a convolutional neural network (CNN) on this feature map using bounding boxes annotated on the corresponding LiDAR scans. Experiments show that our method performs favorably against baselines.
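The bird's-eye-view encoding described in the abstract can be sketched concretely. The snippet below is a minimal, hypothetical illustration of rasterizing a point cloud with per-point grayscale values into height, density, and intensity channels; the function name, grid extents, resolution, and log-based density normalization are assumptions for illustration, not values taken from the paper.

    import numpy as np

    def bev_feature_map(points, intensities, x_range=(0.0, 40.0),
                        y_range=(-20.0, 20.0), resolution=0.1):
        """Rasterize a point cloud into a 3-channel bird's-eye-view map:
        per-cell max height, normalized point density, and mean grayscale
        intensity. Grid parameters are illustrative assumptions."""
        h = int((x_range[1] - x_range[0]) / resolution)
        w = int((y_range[1] - y_range[0]) / resolution)
        height = np.zeros((h, w), dtype=np.float32)     # assumes z >= 0 (above ground)
        density = np.zeros((h, w), dtype=np.float32)
        intensity = np.zeros((h, w), dtype=np.float32)

        # Keep only the points that fall inside the grid.
        mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
                (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
        pts, vals = points[mask], intensities[mask]

        # Map metric x/y coordinates to grid rows/columns.
        rows = ((pts[:, 0] - x_range[0]) / resolution).astype(int)
        cols = ((pts[:, 1] - y_range[0]) / resolution).astype(int)

        for r, c, z, v in zip(rows, cols, pts[:, 2], vals):
            height[r, c] = max(height[r, c], z)   # tallest point in the cell
            density[r, c] += 1.0                  # point count in the cell
            intensity[r, c] += v                  # accumulate grayscale values

        occupied = density > 0
        intensity[occupied] /= density[occupied]                      # mean intensity
        density = np.minimum(1.0, np.log1p(density) / np.log(64.0))   # normalized log count
        return np.stack([height, density, intensity], axis=-1)       # H x W x 3

In the pipeline described above, the monocular image would first be back-projected through the GAN-predicted depth map to obtain a pseudo point cloud with per-point grayscale values before such a rasterization step is applied.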

Keywords

3D object detection · Depth estimation · Monocular vision


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer and Information Sciences, University of Delaware, Newark, USA
