
1 Introduction

Does a voxel model with a shape of \(128\times 128\times 1\) provide any information about a 3D object? If the XY plane of the voxel model is normal to the camera's optical axis, the voxel model is similar to the object's semantic segmentation.

Fig. 1. Results of our image-to-voxel translation based on a generative adversarial network (GAN) and a frustum voxel model. Input color image (left). Ground-truth frustum voxel model slices colored as a depth map (middle). The voxel model output (right). (Color figure online)

Modern methods [27, 34] demonstrate state-of-the-art results on the task of semantic segmentation. While deep networks trained for segmentation produce outputs at the resolution of the input color image, the resolution of the voxel models produced by modern networks is lower than the resolution of the input image [26, 53, 58, 59, 61].

We hypothesize that a pixel correspondence between an input color image and slices of a voxel model can improve the quality of fine details in the voxel output. The necessary correspondences are found using three interconnected steps: (1) we provide an aligned voxel model for each color image, (2) we use slices of a camera's frustum to build the voxel space, (3) we use a generator network with skip connections [27, 46] to feed high-resolution image features to the 3D deconvolutional layers of the generator network (Fig. 1).

It is challenging to predict a voxel model from a single color image. The color-to-voxel model translation problem has received considerable scholarly attention recently [9, 17, 26, 44, 48, 53, 58, 59, 61]. The trained models have demonstrated state-of-the-art results on large datasets with 3D object annotations [33, 60]. The main limitations of existing models are prediction focused on a single object and limited generalization ability.

The authors have recently started a research project focused on the development of a low-cost driver assistance system. We developed two new 3D shape datasets, VoxelCity and VoxelHome, to train our framework. The datasets include 36,416 images of 28 scenes with ground-truth 3D models, depth maps, and 6D object poses.

The results of the trained Z-GAN model are encouraging. We experimented with high-resolution voxel outputs of \(128\times 128\times 128\) and were able to predict the shapes of multiple objects accurately. We evaluated our Z-GAN model using the Pascal 3D+ [60] and the IKEA [33] datasets. The comparison with the state-of-the-art has demonstrated that Z-GAN outperforms modern models in the number of reconstructed objects, the generalization ability, and the resolution of the output voxel model. The Z-GAN model can be used in 3D vision applications such as robot vision, 6D pose estimation, and 3D model reconstruction.

The rest of the paper is organized as follows. Section 2 outlines modern approaches to voxel model reconstruction. In Sect. 3 we describe the structure of our VoxelCity and VoxelHome datasets. In Sect. 4 the developed conditional Z-GAN model is presented. Section 5 presents the evaluation of baselines and the developed model.

1.1 Contributions

The key contributions of this paper are: (1) the conditional adversarial volumetric Z-GAN framework for the generation of a voxel model from a single-view color image, (2) VoxelCity and VoxelHome datasets with 36,416 color images, ground-truth voxel models, depth maps and camera orientations of 21 outdoor and 7 indoor scenes, (3) an evaluation of baselines and our framework on 3D shape datasets.

2 Related Work

Generative Adversarial Networks. Generative Adversarial Networks (GANs) [18] provide a mapping from a random noise vector to a domain of desired outputs (e.g., images or voxel models). GANs have gained increasing attention in recent years and provide encouraging results in tasks such as image-to-image translation [27] and voxel model generation [59].

Single-Photo 3D Model Reconstruction. Accurate 3D reconstruction is challenging if only a single color image is provided. This problem has long been of great interest to the research community [12, 42, 43], and in recent years many new deep learning-based approaches have been proposed [9, 17, 26, 44, 48, 53, 58, 59, 61]. While a number of methods have been proposed for the prediction of unobserved voxels from a single depth map [15, 51, 62, 63, 64], prediction of the voxel model of a complex scene from a single color (RGB) image is more ambiguous. Prior knowledge of 3D shape is required for the robust performance of a single-image method. Hence, most methods split the problem into two steps: object recognition and 3D shape reconstruction. In [17] a deep learning method for single-image voxel model reconstruction was proposed. The method leverages an auto-encoder architecture for voxel model prediction. While the model demonstrated promising results, the resolution of the voxel model was limited to \(20\times 20\times 20\) elements. An approach that combines single-view and multi-view reconstruction modes was proposed in [9]. In [44] a new voxel decoder architecture was proposed that leverages voxel tube and shape layers to increase the resolution of the resulting voxel model. A comparison of surface-based and volumetric 3D model prediction is performed in [48].

3D shape synthesis from a latent space has received considerable scholarly attention recently [7, 17, 59]. Wu et al. proposed a GAN model [59] for voxel model generation (3D-GAN). The model was capable of predicting voxel models with a resolution of \(64\times 64\times 64\) from a randomly sampled noise vector. 3D-GAN was used for single-image 3D reconstruction with an approach proposed in [17]. While the 3D models produced by 3D-GAN provide more details compared to [17], the generalization ability of the approach was insufficient to predict voxel models of previously unseen 3D shapes.

3D Shape Datasets. Multiple 3D shape datasets have been designed [8, 33, 52, 60] for training deep models. Manual annotation was performed for the Pascal VOC dataset [14] to align a set of CAD models with color photos. The extended dataset was termed Pascal 3D+ [60]. While many models have been trained using the Pascal 3D+ dataset, it provides only a coarse correspondence between a 3D model and a photo. The large ShapeNet dataset [8] was collected to address the problems of shape recognition and generative modeling. However, it allows training for single-photo 3D model reconstruction only with synthetic data. Hinterstoisser et al. designed the large Linemod dataset [20] with aligned RGB-D data. The dataset is focused on object recognition in indoor settings. The Linemod dataset has been used intensively for training 6D pose estimation algorithms [2, 3, 5, 6, 10, 22, 23, 32, 35, 40, 50, 55]. In [21] a large dataset for 6D pose estimation of texture-less objects was developed. The MVTec ITODD dataset [11] addresses the challenging problem of 6D pose prediction in industrial applications.

6D pose estimation has received considerable scholarly attention recently [2, 3, 5, 6, 10, 22, 23, 29, 31, 32, 35, 40, 50, 55]. Accurate estimation of the camera pose relative to an object is of primary importance in fields such as autonomous driving [1, 4, 36, 39] and Simultaneous Localization and Mapping (SLAM) [13, 57]. However, most of the existing datasets provide 3D data only as LIDAR range scans. As no complete 3D shapes are provided, these datasets require additional annotation for single-photo 3D reconstruction.

3 Dataset

We collected two new datasets, VoxelCity and VoxelHome, to train our Z-GAN model. The primary motivation for creating the new datasets was the absence of large 3D shape datasets with pixel-level 3D object annotations. The annotations provided in the Pascal 3D+ dataset [60] are CAD models of abstract classes that do not match the real silhouettes of the 3D objects.

We capture multi-view images of scenes to generate our datasets (Sects. 3.2 and 3.3), which are composed of images, depth maps, reconstructed 3D models, and ground-truth 3D CAD models. We recover 6D camera poses for each image and textured 3D models using state-of-the-art SfM algorithms [30, 38, 41, 54]. We then manually annotate all objects in a scene to provide multi-object 6D poses. The SfM-based approach provides two benefits. Firstly, SfM models capture the real configuration of objects in a scene, which provides a pixel-level correspondence between images and a 3D model (Fig. 2). Secondly, SfM provides a 6D camera pose for each image. We made the datasets compliant with the SIXD Challenge dataset format [24].

Fig. 2. Comparison of the alignment of 3D models in the Pix3D (a) and our VoxelHome (b) datasets, and in the Pascal 3D+ (c) and our VoxelCity (d) datasets. Please note that (a) and (c) do not provide perfect alignment of the contours.

3.1 3D Model Generation Using SfM

The multi-view image-based 3D reconstruction pipeline, generally called Structure from Motion (SfM) and based on the integration of photogrammetric and computer vision algorithms, has in recent years become a powerful and valuable approach for 3D modeling. It generally ensures sufficient automation, low cost, efficient results, and ease of use, even for non-expert users. SfM now successfully reconstructs scenes from hundreds of thousands or even millions of images [19, 47]. Available online reconstruction services decouple the user from the powerful hardware that carries out the reconstruction, requiring only that the images be uploaded to a cloud server [54]. Recently, online SfM methods have demonstrated that it is possible to add new images to existing 3D reconstructions and build an incremental surface model [25, 38]. We manually created 3D CAD models using the coarse 3D models reconstructed with SfM as a basis. We use the CAD models to generate the voxel models for network training.

3.2 VoxelCity

Our VoxelCity dataset includes 3D models of 21 scenes, composed of 18,836 color images with reconstructed 3D models, ground-truth 3D CAD models, depth maps, and 6D poses for object classes including human, car, bicycle, truck, and van. Examples of 3D scenes and object pose annotations for various object classes are presented in Fig. 3. A comparison with previous outdoor 3D shape datasets is presented in Table 1.

Table 1. Comparison with previous outdoor 3D shape datasets. The type of data provided is listed: dense (D), coarse (C).
Fig. 3. Examples of color images with 6D pose annotations and ground-truth dense point clouds from our VoxelCity dataset. (Color figure online)

3.3 VoxelHome

Our VoxelHome dataset presents 3D models of 7 indoor scenes, composed of 17,580 color images with reconstructed 3D models, ground-truth 3D CAD models, depth maps, and 6D poses of nine object classes: chair, table, armchair, sofa, stool, cupboard, vase, washing machine, and oven. Examples of 3D models and object pose annotations for various object classes are presented in Fig. 4. We present a comparison with previous indoor 3D shape datasets in Table 2.

Table 2. Comparison with previous indoor 3D shape datasets. The type of data provided is listed: dense (D), coarse (C).
Fig. 4. Examples of color images with 6D pose annotations and ground-truth dense point clouds from our VoxelHome dataset.

4 Method

The aim of the present research is to apply a conditional generative adversarial network to the color-image-to-voxel-model translation task. The straightforward approach is to change the network output from an image to a voxel model. However, the convergence of the training process is poor in such a setting. We hypothesize that the performance can be improved if the voxel model is aligned with the input image.

A depth map is an example of a 3D representation aligned with the color image. While the depth map provides the 3D shape only for the visible surfaces of objects, the voxel model encodes the complete 3D model of the scene. We use the assumptions made by [58] as the starting point for our 3D model representation. To provide the aligned voxel model, we combine the depth map representation with a voxel grid. We term the resulting 3D representation a Frustum Voxel model (fruxel model).

4.1 Frustum Voxel Model

The main idea of the fruxel model is to provide precise alignment of voxel slices with contours in a color image. Such alignment can be achieved with a common voxel model if the camera has an orthographic projection and its optical axis coincides with the Z-axis of the voxel model (see Fig. 5, left). We generalize this alignment to the perspective projection. As the camera frustum no longer corresponds to cubic voxel elements, we use sections of a pyramid instead.

Fig. 5. Comparison between a voxel model (left) and the proposed frustum voxel model (right) with a shape of \(64\times 64\times 64\) fruxel elements.

The fruxel model representation provides multiple advantages. Firstly, each XY slice of the model is aligned with some contours in the corresponding color photo (some parts of them can be invisible). Secondly, a fruxel model encodes the shape of both visible and invisible surfaces. Hence, unlike the depth map, it contains complete information about the 3D shapes. In other words, the fruxel model is similar to theatre scenery composed of flat screens with drawings of objects that imitate perspective space. Please note that while fruxel elements have different dimensions in object space, all slices of the fruxel model have the same number of fruxel elements (e.g., \(128\times 128\times 1\)).

A fruxel model is characterized by the following set of parameters \(\{z_n,z_f,d,\alpha \}\), where \(z_n\) is the distance to the near clipping plane, \(z_f\) is the distance to the far clipping plane, d is the number of frustum slices, and \(\alpha \) is the field of view of the camera.
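As a minimal illustrative sketch (not our actual implementation), the parameter set can be kept in a small container together with the derived slice thickness \(s_z = (z_f - z_n)/d\); the class and attribute names below are hypothetical, and the example values are those used for the VoxelCity experiments in Sect. 5.2.

from dataclasses import dataclass

@dataclass
class FruxelParams:
    z_near: float   # z_n, distance to the near clipping plane
    z_far: float    # z_f, distance to the far clipping plane
    d: int          # number of frustum slices
    alpha: float    # camera field of view in degrees

    @property
    def s_z(self) -> float:
        # thickness of one fruxel slice along the camera Z-axis
        return (self.z_far - self.z_near) / self.d

params = FruxelParams(z_near=2.0, z_far=12.0, d=128, alpha=40.0)   # VoxelCity setting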

While a fruxel model provides contour correspondence with a color image, its interpretation by a human may be complicated. We consider fruxel models a special representation of a voxel model optimized for the training of conditional adversarial networks. Nevertheless, a fruxel model can be converted into three common data types: (1) a voxel model, (2) a depth map, (3) an object annotation.

A voxel model can be produced from the fruxel model by scaling each consecutive slice by the coefficient k, defined as follows:

$$\begin{aligned} k = \frac{z_n}{z_n + s_{z}}, \end{aligned}$$
(1)

where \(s_{z} = \frac{z_f - z_n}{d}\) is the size of the fruxel element along the Z-axis.
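A one-line check of Eq. (1); the function name is ours, and the numeric example uses the VoxelCity parameters from Sect. 5.2.

def scale_coefficient(z_n: float, z_f: float, d: int) -> float:
    s_z = (z_f - z_n) / d        # fruxel slice thickness along the Z-axis
    return z_n / (z_n + s_z)     # Eq. (1)

k = scale_coefficient(z_n=2.0, z_f=12.0, d=128)   # approximately 0.962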

To generate a depth map P from the fruxel model, we multiply the indices of the frontmost non-empty elements by the step \(s_z\):

$$\begin{aligned} P(x,y) = \mathrm{argmax}_i[F_i{(x,y)} = 1] \cdot s_z \end{aligned}$$
(2)

where \(P(x,y)\) is an element of the depth map and \(F_i(x,y)\) is the element of the fruxel model at position (x, y) in slice i.
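A short NumPy sketch of Eq. (2), assuming that the fruxel model is stored as a binary array of shape (d, h, w) with slices ordered from the near to the far clipping plane (this ordering is an assumption):

import numpy as np

def depth_map(F: np.ndarray, s_z: float) -> np.ndarray:
    # np.argmax returns the first occurrence of the maximum along the slice axis,
    # i.e. the index of the frontmost non-empty element for near-to-far ordering.
    first_occupied = np.argmax(F > 0, axis=0)          # shape (h, w)
    P = first_occupied.astype(np.float32) * s_z        # Eq. (2)
    P[~np.any(F > 0, axis=0)] = 0.0                    # convention: 0 where the ray hits nothing
    return P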

An object annotation is equal to the product of all elements with the given (x, y) coordinates:

$$\begin{aligned} A(x,y) = \prod ^{d}_{i=0} F(x,y,i) \end{aligned}$$
(3)
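A literal NumPy translation of Eq. (3) for the same (d, h, w) array layout (illustration only):

import numpy as np

def object_annotation(F: np.ndarray) -> np.ndarray:
    return np.prod(F, axis=0)    # A(x, y): product of fruxel elements along the viewing ray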

We use boolean operations to generate the fruxel model from a 3D scene. Firstly, we set the desired position of a virtual camera. After that, we find the boolean intersection between the 3D scene and the XY slices of the frustum space. We render each intersection using a white emission shader. We combine all slices into a single 3D array with dimensions \(w \times h \times d\), where w is the width and h is the height of the color image, and d is the number of slices. We term the resulting 3D array a fruxel model. We generate fruxel models for real photos using 3D models reconstructed with the structure-from-motion (SfM) algorithm. The SfM approach provides an estimation of camera poses with respect to the reconstructed 3D model. We place the virtual camera in the estimated pose and render the slices of the reconstructed model.
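The slicing itself is performed in a renderer with boolean operations as described above. Purely as a hedged, illustrative alternative, the same fruxel grid can be sampled directly when the scene is available as an occupancy query; the scene_occupancy placeholder and the assumption of a square image with equal field of view in both axes are ours.

import numpy as np

def scene_occupancy(points: np.ndarray) -> np.ndarray:
    # placeholder scene: a unit sphere 5 m in front of the camera
    return np.linalg.norm(points - np.array([0.0, 0.0, 5.0]), axis=-1) < 1.0

def build_fruxel_model(w, h, d, z_n, z_f, alpha_deg, occupancy=scene_occupancy):
    F = np.zeros((d, h, w), dtype=np.uint8)
    s_z = (z_f - z_n) / d
    tan_half = np.tan(np.radians(alpha_deg) / 2.0)
    u = (np.arange(w) + 0.5) / w * 2.0 - 1.0           # normalized pixel coordinates
    v = (np.arange(h) + 0.5) / h * 2.0 - 1.0
    uu, vv = np.meshgrid(u, v)                         # shape (h, w)
    for i in range(d):
        z = z_n + (i + 0.5) * s_z                      # depth of the slice centre
        pts = np.stack([uu * z * tan_half, vv * z * tan_half,
                        np.full_like(uu, z)], axis=-1) # frustum slice widens with depth
        F[i] = occupancy(pts.reshape(-1, 3)).reshape(h, w)
    return F

F = build_fruxel_model(w=128, h=128, d=128, z_n=2.0, z_f=12.0, alpha_deg=40.0)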

4.2 Conditional Adversarial Networks

Generative Adversarial Networks (GANs) generate a signal \(\hat{B}\) for a given random noise vector z, \(G : z \rightarrow \hat{B}\) [18, 27]. A conditional GAN transforms an input image A and the vector z into an output \(\hat{B}\), \(G : \{A, z\} \rightarrow \hat{B}\). The input A can be an image that is transformed by the generator network G. The discriminator network D is trained to distinguish “real” signals from the target domain B from the “fakes” \(\hat{B}\) produced by the generator. Both networks are trained simultaneously. The discriminator provides the adversarial loss that forces the generator to produce “fakes” \(\hat{B}\) that cannot be distinguished from the “real” signal B.

We train a generator \(G : \{A\} \rightarrow \hat{B}\) to synthesize a fruxel model \(\hat{B} \in \mathbb {R}^{w \times h \times d}\) conditioned by a color image \(A \in \mathbb {R}^{w \times h \times 3}\).
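A hedged PyTorch sketch of one training iteration for this mapping, using the standard cross-entropy adversarial loss of [27]; the discriminator is assumed to take the conditioning image together with a volume, and all tensor shapes are illustrative.

import torch
import torch.nn.functional as nnf

def train_step(G, D, opt_G, opt_D, A, B):
    # A: color images (N, 3, h, w); B: ground-truth fruxel models (N, 1, d, h, w)
    B_fake = G(A)
    # discriminator update: real pairs vs. generated pairs
    d_real, d_fake = D(A, B), D(A, B_fake.detach())
    loss_D = nnf.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             nnf.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # generator update: make the generated volume look "real" to D
    d_fake = D(A, B_fake)
    loss_G = nnf.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()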

4.3 Z-GAN Framework

We use the pix2pix [27] framework as a starting point to develop our Z-GAN model. We keep the encoder part of the generator unchanged. In the decoder, we replace the 2D deconvolution kernels with 3D deconvolution kernels to encode the correlation between neighboring slices along the Z-axis.

We keep the skip connections between layers of the same depth that were proposed in the U-Net model [46]. We believe that skip connections help to transfer high-frequency components of the input image to the high-frequency components of the 3D shape. The resulting architecture of our Z-GAN model is presented in Fig. 6.

Fig. 6. Z-GAN framework.

4.4 Volumetric Generator

The main idea of our volumetric generator G is to use the correspondence between silhouettes in a color image and slices of a fruxel model. We used the U-Net generator [46] as a starting point to develop our model. The original U-Net generator leverages skip connections between convolutional and deconvolutional layers of the same depth to transfer fine details from the source to the target domain effectively.

We added two contributions to the original U-Net model. Firstly, we replaced the 2D deconvolutional filters with 3D deconvolutional filters. Secondly, we modified the skip connections to provide the correspondence between the shapes of 2D and 3D features. The outputs of the 2D convolutional filters on the left (encoder) side of the Z-Net generator are tensors \(F_{2D} \in \mathbb {R}^{w\times h \times c}\), where w and h are the width and height of a feature map and c is the number of channels. The outputs of the 3D deconvolutional filters on the right (decoder) side are tensors \(F_{3D} \in \mathbb {R}^{w\times h \times d \times c}\). We use d copies of each channel of \(F_{2D}\) to fill the third dimension of \(F_{3D}\). We term this operation “copy inflate”. The architecture of the generator is presented in Fig. 7.
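A minimal PyTorch sketch of the “copy inflate” operation for the standard (N, C, H, W) and (N, C, D, H, W) tensor layouts; the function name is ours.

import torch

def copy_inflate(f2d: torch.Tensor, d: int) -> torch.Tensor:
    # f2d: encoder features of shape (N, C, H, W);
    # returns a (N, C, D, H, W) view in which every depth slice repeats f2d
    return f2d.unsqueeze(2).expand(-1, -1, d, -1, -1)

# Skip connection: concatenate the inflated encoder features with the decoder's
# 3D features along the channel dimension before the next 3D deconvolution, e.g.
# x = torch.cat([decoder_features, copy_inflate(encoder_features, d)], dim=1)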

Fig. 7. The architecture of the generator.

4.5 Volumetric Discriminator

We modify the PatchGAN discriminator [27] to process the 3D slices efficiently. The original PatchGAN discriminator is based on the assumption of Markovian independence of local image patches. Therefore, the discriminator penalizes structure only at the scale of local patches.

The PatchGAN discriminator consists of a stack of convolutional layers with a constant kernel size. The stride of each layer is balanced with the kernel size in such a way that the layer output remains in correspondence with the input image. In other words, each convolutional layer takes an input whose size corresponds to the size of the input color image and produces a feature map. The sequential application of convolutions with a constant kernel size increases the “aperture” of the discriminator. For example, the sequential application of seven convolutional layers results in a feature “aperture” of 140 pixels.
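The growth of the “aperture” follows the usual receptive-field recursion for stacked convolutions. The sketch below is generic; the example reproduces the known five-layer, 70-pixel configuration of [27] rather than the exact seven-layer setting described above.

def receptive_field(layers):
    # layers: list of (kernel_size, stride) pairs, first layer first
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k - 1) input-pixel steps
        jump *= s              # spacing between neighbouring outputs, in input pixels
    return rf

# 70x70 PatchGAN of [27]: five 4x4 convolutions with strides 2, 2, 2, 1, 1
print(receptive_field([(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]))   # -> 70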

Our Z-Patch discriminator has a similar structure to the PatchGAN discriminator [27]. We replaced all 2D convolutional layers with 3D convolutional layers to process 3D shapes.

5 Evaluation

We evaluate baseline models and our Z-GAN framework on the task of generating a voxel model from a single-view color image. We use two 3D shape datasets for the evaluation: Pascal 3D+ [60] and Pix3D [52]. Both datasets include real images with 6D object poses.

We use two metrics to provide a quantitative evaluation of 3D object reconstruction quality: (i) an Intersection over Union (IoU) metric to measure the difference between a ground-truth 3D model and the output of a method, and (ii) a surface distance metric similar to [45] to evaluate the accuracy of camera pose estimation for the 3D-R2N2 and our Z-GAN models. We also provide images of the resulting voxel models for qualitative evaluation.
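For completeness, the voxel IoU reduces to a ratio of occupied-cell counts; a minimal NumPy sketch follows (the 0.5 occupancy threshold is our choice):

import numpy as np

def voxel_iou(pred: np.ndarray, gt: np.ndarray, thresh: float = 0.5) -> float:
    p, g = pred > thresh, gt > thresh
    union = np.logical_or(p, g).sum()
    return 1.0 if union == 0 else float(np.logical_and(p, g).sum()) / union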

5.1 Baselines

We compare our model with three baselines: 3D-R2N2 [9, 48], TL-network [17], and MarrNet [58]. To the best of our knowledge, there are no baselines to date capable of predicting voxel models of multiple objects from a single image. TL-network and MarrNet perform object-centered [48] prediction of voxel models with resolutions of \(20\times 20\times 20\) and \(128\times 128\times 128\), respectively. 3D-R2N2 provides a view-centered prediction with a resolution of \(32\times 32\times 32\). Our Z-GAN model predicts a view-centered fruxel model with a resolution of \(128\times 128\times 128\).

5.2 Training Details

Our Z-GAN framework was trained on the VoxelCity and VoxelHome datasets using the PyTorch library [37]. We use the VoxelCity dataset for the evaluation on Pascal 3D+ with fruxel model parameters \(\{z_n=2,z_f=12,d=128,\alpha =40^{\circ }\}\). For the evaluation on the Pix3D dataset, we train our model on the VoxelHome dataset with fruxel model parameters \(\{z_n=0.5,z_f=5.5,d=128,\alpha =40^{\circ }\}\). The training was performed on an NVIDIA 1080 Ti GPU and took 11 h for G and D. For network optimization, we use minibatch SGD with an Adam solver. We set the learning rate to 0.0002 with momentum parameters \(\beta _1 = 0.5\), \(\beta _2 = 0.999\), similar to [27].
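These optimizer settings correspond to the following PyTorch configuration; the two modules below are placeholders standing in for the Z-GAN generator and discriminator.

import torch
import torch.nn as nn

G = nn.Conv2d(3, 1, kernel_size=3, padding=1)    # placeholder generator
D = nn.Conv3d(1, 1, kernel_size=3, padding=1)    # placeholder discriminator

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))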

5.3 3D Reconstruction on Pascal 3D+

Qualitative Evaluation. We show results of single-view voxel model generation in Fig. 8. We use three object classes: car, bicycle, and human. We selected 2,762 images from the Pascal 3D+ image sets with a field of view similar to that of our trained model. We manually annotated the images with human 3D models from the ShapeNet dataset [8]. The qualitative evaluation demonstrates that the models predicted by TL-network and MarrNet have limited resolution and do not show new details compared to the ground-truth models from the training set. While 3D-R2N2 shows more diversity in the output, it is capable of predicting only a single object in a scene. Our Z-GAN model produces voxel models of the whole scene with multiple object instances.

Fig. 8. An example of 3D reconstruction using TL-network, MarrNet, 3D-R2N2, and Z-GAN on the Pascal 3D+ [60] dataset, considering three object classes: car, bicycle, and human.

Quantitative Evaluation. We evaluate the results of the proposed Z-GAN method in terms of IoU and surface distance in Tables 3 and 4.

Table 3. Intersection over union metric for different object classes for Pascal 3D+ images.
Table 4. Surface distance metric [45] for different object classes for Pascal 3D+ images.

5.4 3D Reconstruction on Pix3D

Qualitative Evaluation. Evaluation results of single-view voxel model generation are presented in Fig. 9. We use two object classes: chair and table. We selected 1,512 images from the Pix3D image sets with a field of view similar to that of our model trained on the VoxelHome dataset. We draw the following conclusions from the qualitative evaluation. TL-network predicts the object as a voxel model from the training set. While MarrNet tries to imitate the shape of the object in the input, it is confused by images with multiple objects. 3D-R2N2 reconstructs a view-centered object voxel model, but its resolution is insufficient to show the details of multiple objects. The results of our Z-GAN model demonstrate fine object details and correct poses of multiple objects.

Fig. 9. Examples of 3D reconstructions on the Pix3D dataset.

Quantitative Evaluation. We evaluate the results of the proposed Z-GAN method in terms of IoU and surface distance in Tables 5 and 6.

Table 5. Intersection over union metric for different object classes for Pix3D images.
Table 6. Surface distance metric [45] for different object classes for Pix3D images.

6 Conclusions

This paper presented a new approach based on conditional generative adversarial networks capable of predicting a voxel model from a single image. We showed that conditional adversarial volumetric networks can generate voxel models of complex scenes with multiple objects. We demonstrated that skip connections between 2D convolutional and 3D deconvolutional layers facilitate the reconstruction of fine details. Furthermore, models utilizing skip connections require fewer trainable parameters for high-quality reconstruction of cluttered scenes with multiple 3D shapes of different classes.

We developed a new Z-GAN framework for the translation of a single color image into a voxel model of a scene. We collected two datasets, VoxelCity and VoxelHome, to train our model. The datasets include fine-grained scene models, color images, depth maps, and 6D object poses. We evaluated baselines and our model on multiple 3D shape datasets to show that it matches and surpasses the state-of-the-art in terms of the number of reconstructed objects and their details.