Image-to-Voxel Model Translation with Conditional Adversarial Networks

  • Vladimir A. Knyaz
  • Vladimir V. Kniaz
  • Fabio Remondino
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11129)

Abstract

We present a single-view voxel model prediction method that uses generative adversarial networks. Our method exploits correspondences between 2D silhouettes and slices of the camera frustum to predict a voxel model of a scene with multiple object instances. We use pyramid-shaped voxels and a generator network with skip connections between 2D and 3D feature maps. To train our framework, we collected two datasets, VoxelCity and VoxelHome, with 36,416 images of 28 scenes annotated with ground-truth 3D models, depth maps, and 6D object poses. The datasets are publicly available (http://www.zefirus.org/Z_GAN). We evaluate our framework on 3D shape datasets and show that it delivers robust 3D scene reconstruction results that compete with and surpass the state of the art in scene reconstruction with multiple non-rigid objects.
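
To make the architectural idea concrete, here is a minimal, hypothetical PyTorch sketch of such a generator. It is not the authors' published Z-GAN implementation: the layer widths, the network depth, and the lifting of each 2D feature map into a volume by tiling it along a new depth axis (standing in for the silhouette/frustum-slice correspondence) are all illustrative assumptions.

```python
# A minimal, hypothetical sketch of an image-to-voxel generator with skip
# connections between 2D encoder and 3D decoder feature maps. Layer widths,
# depth, and the 2D->3D lifting (tiling a feature map along a new depth
# axis) are illustrative assumptions, not the authors' published network.
import torch
import torch.nn as nn


def lift_2d_to_3d(feat2d: torch.Tensor, depth: int) -> torch.Tensor:
    """Tile an (N, C, H, W) feature map along a new depth axis -> (N, C, D, H, W)."""
    return feat2d.unsqueeze(2).expand(-1, -1, depth, -1, -1)


class Image2VoxelGenerator(nn.Module):
    def __init__(self, in_ch: int = 3, base: int = 32):
        super().__init__()
        # 2D encoder: each stage halves the spatial resolution.
        self.e1 = nn.Sequential(nn.Conv2d(in_ch, base, 4, 2, 1),
                                nn.LeakyReLU(0.2, inplace=True))
        self.e2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1),
                                nn.BatchNorm2d(base * 2),
                                nn.LeakyReLU(0.2, inplace=True))
        self.e3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 4, 2, 1),
                                nn.BatchNorm2d(base * 4),
                                nn.LeakyReLU(0.2, inplace=True))
        # 3D decoder: each stage doubles the volume resolution; the doubled
        # input channels of d2/d1 accommodate the concatenated lifted skips.
        self.d3 = nn.Sequential(nn.ConvTranspose3d(base * 4, base * 2, 4, 2, 1),
                                nn.BatchNorm3d(base * 2), nn.ReLU(inplace=True))
        self.d2 = nn.Sequential(nn.ConvTranspose3d(base * 4, base, 4, 2, 1),
                                nn.BatchNorm3d(base), nn.ReLU(inplace=True))
        self.d1 = nn.Sequential(nn.ConvTranspose3d(base * 2, 1, 4, 2, 1),
                                nn.Sigmoid())  # per-voxel occupancy in [0, 1]

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        f1 = self.e1(img)   # (N, base,     H/2, W/2)
        f2 = self.e2(f1)    # (N, base * 2, H/4, W/4)
        f3 = self.e3(f2)    # (N, base * 4, H/8, W/8)
        # Bottleneck: lift the deepest 2D map into a cubic volume.
        v = self.d3(lift_2d_to_3d(f3, f3.shape[-1]))
        # Skip connections: concatenate lifted 2D encoder features with the
        # 3D decoder features at the matching resolution.
        v = self.d2(torch.cat([v, lift_2d_to_3d(f2, v.shape[2])], dim=1))
        v = self.d1(torch.cat([v, lift_2d_to_3d(f1, v.shape[2])], dim=1))
        return v


if __name__ == "__main__":
    g = Image2VoxelGenerator()
    vox = g(torch.randn(1, 3, 64, 64))
    print(vox.shape)  # torch.Size([1, 1, 64, 64, 64])
```

In a conditional adversarial setup of the kind the title refers to, this generator would be trained against a discriminator that judges image/voxel-model pairs; that training loop is omitted here.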

Keywords

Conditional GAN · Voxel model · 6D pose estimation

Acknowledgments

The reported study was funded by the Russian Foundation for Basic Research (RFBR), research project No. 17-29-04509, and the Russian Science Foundation (RSF), research project No. 16-11-00082.

We would like to thank the volunteers who posed as statues for the 3D models of the class “human”: Zoya Kniaz, Lena Metelkina, Nastya Metelkina, and Anya Metelkina. We also thank Artyom Bordodymov and Pyotr Moshkantsev, the authors of the 3D CAD models in the dataset.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. State Research Institute of Aviation Systems (GosNIIAS), Moscow, Russia
  2. Moscow Institute of Physics and Technology (MIPT), Dolgoprudny, Russia
  3. 3D Optical Metrology (3DOM) unit, Bruno Kessler Foundation (FBK), Trento, Italy
