Image-to-Voxel Model Translation with Conditional Adversarial Networks
Abstract
We present a single-view voxel model prediction method based on generative adversarial networks. Our method exploits correspondences between 2D silhouettes and slices of the camera frustum to predict a voxel model of a scene containing multiple object instances. We use a pyramid-shaped voxel model and a generator network with skip connections between 2D and 3D feature maps. To train our framework, we collected two datasets, VoxelCity and VoxelHome, comprising 36,416 images of 28 scenes with ground-truth 3D models, depth maps, and 6D object poses; both datasets are publicly available (http://www.zefirus.org/Z_GAN). We evaluate our framework on 3D shape datasets and show that it delivers robust 3D scene reconstructions that match and surpass the state of the art in reconstructing scenes with multiple non-rigid objects.
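The skip connections between 2D and 3D feature maps require lifting an encoder's 2D activations into the decoder's 3D frustum volume. A minimal NumPy sketch of one plausible lifting operation, broadcasting a 2D feature map along the frustum depth axis so it can be concatenated with 3D decoder activations (the paper's exact operation may differ; `lift_2d_to_3d` is a hypothetical helper named here for illustration):

```python
import numpy as np

def lift_2d_to_3d(feat_2d: np.ndarray, depth: int) -> np.ndarray:
    """Broadcast a 2D feature map (C, H, W) along the frustum depth axis,
    producing a (C, D, H, W) volume in which every depth slice is a copy
    of the 2D map -- the shape a 3D decoder needs for a skip connection."""
    return np.repeat(feat_2d[:, np.newaxis, :, :], depth, axis=1)

# Example: a 64-channel 2D encoder activation lifted into a 32-slice volume.
feat = np.random.rand(64, 16, 16).astype(np.float32)
vol = lift_2d_to_3d(feat, depth=32)
print(vol.shape)  # (64, 32, 16, 16)
```

After lifting, the volume can be concatenated with the decoder's own 3D activations along the channel axis, exactly as 2D skip connections are concatenated in a U-Net-style generator.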
Keywords
Conditional GAN · Voxel model · 6D pose estimation
Acknowledgments
The reported study was funded by the Russian Foundation for Basic Research (RFBR), research project No. 17-29-04509, and the Russian Science Foundation (RSF), research project No. 16-11-00082.
We would like to thank the volunteers who posed as statues for the 3D models of the class "human": Zoya Kniaz, Lena Metelkina, Nastya Metelkina, and Anya Metelkina. We would also like to thank the authors of the 3D CAD models in the dataset: Artyom Bordodymov and Pyotr Moshkantsev.