1 Introduction

Techniques for analyzing 3D shapes are becoming increasingly important due to the vast number of sensors capturing 3D data, as well as numerous computer graphics applications. In recent years a variety of deep architectures have been proposed for classifying 3D shapes. These range from multiview approaches that render a shape from a set of views and deploy image-based classifiers, to voxel-based approaches that analyze shapes represented as a 3D occupancy grid, to point-based approaches that classify shapes represented as a collection of points. However, there is relatively little work that studies the tradeoffs offered by these modalities and their associated techniques.

This paper aims to study three of these tradeoffs, namely the ability to generalize from a few examples, computational efficiency, and robustness to adversarial transformations. We pick a representative technique for each modality. For the multiview representation we choose the Multiview CNN (MVCNN) architecture [31]; for the voxel-based representation we choose VoxNet [17, 37], constructed using convolution and pooling operations on a 3D grid; and for the point-based representation we choose the PointNet architecture [32]. The analysis is done on the widely-used ModelNet40 shape classification benchmark [37].

Some of our analysis leads to surprising results. For example, with deeper architectures and a modification of the rendering technique, which uses a black background and better centers the object in the image, the performance of a vanilla MVCNN can be improved to 95.0% per-instance accuracy on the benchmark, outperforming several recent approaches. Another example is that while it is widely believed that the strong performance of MVCNN is due to the use of networks pretrained on large image datasets (e.g., ImageNet [26]), we find that even without such pretraining the MVCNN obtains 91.3% accuracy, outperforming several voxel-based and point-based counterparts that also do not rely on such pretraining. Furthermore, the performance of MVCNN remains at 93.6% even when trained with binary silhouettes (instead of shaded images) of shapes, suggesting that shading offers relatively little extra information on this benchmark for the MVCNN.

We then systematically analyze the generalization ability of the models. First we analyze the accuracy of various models as the number of training examples per category is varied. We find that the multiview approaches generalize faster, obtaining near-optimal performance with far fewer examples than the other approaches. We then analyze the role of initialization of these networks. As 3D shape datasets are currently small in comparison to large image datasets, we employ cross-modal distillation techniques [7, 10] to guide learning. In particular we use representations extracted from pretrained MVCNNs to guide the learning of voxel-based and point-based networks. Cross-modal distillation improves the performance of VoxNet and PointNet, especially when training data is limited.

Finally we analyze the robustness of these classifiers to adversarial perturbations. While generating adversarial inputs to VoxNet and PointNet is straightforward, this is not the case for multiview methods due to the rendering step. To this end we design an end-to-end differentiable MVCNN that takes as input a voxel representation and generates a set of views using a differentiable renderer. We analyze the robustness of these networks by estimating the amount of perturbation needed to obtain a misclassification. We find that PointNet is more robust, while MVCNN and VoxNet are both easily fooled by adding a small amount of noise. This is similar to observations in prior work on adversarial inputs for image-based networks [6, 16, 33]. Somewhat surprisingly, ImageNet pretraining reduces the robustness of MVCNNs to adversarial perturbations.

In summary, we performed a detailed analysis of several recently proposed approaches for 3D shape classification. This resulted in a new state-of-the-art of 95.0% on the ModelNet40 benchmark. The technical contributions include the use of cross-modal distillation for improving networks that operate on voxel-based and point-based representations, and a novel approach for generating adversarial inputs for multiview 3D shape classifiers using a differentiable renderer. The latter allows us to directly compare and generate adversarial inputs for voxel-based and view-based methods using gradient-based techniques. The conclusion is that while the PointNet architecture is less accurate, its use of an orderless aggregation mechanism likely makes it more robust to adversarial perturbations than VoxNet and MVCNN, both of which are easily fooled.

Fig. 1.

Different representations of the input shape. From the left: (a) the shape in the database represented as a triangle mesh, (b) the shape converted to a \(30^3\) voxel grid, (c) a point cloud representation with 2048 points, and (d–f) the model rendered using Phong shading, as a depth image, and as a binary silhouette, respectively.

2 Method

This section describes the protocol for evaluating the performance of the various classifiers. We describe the dataset, performance metrics, and training setup in Sect. 2.1, followed by the details of the deep classifiers we consider in Sect. 2.2, and the approach for generating adversarial examples in Sect. 2.3.

2.1 3D Shape Classification

Classification Benchmark. All our evaluation is done on the ModelNet40 shape classification benchmark [37], following the standard training and test splits provided with the dataset. There are 40 categories with 9483 training models and 2468 test models. The number of models is not equal across classes, hence we report both the per-instance and per-class accuracy on the test set. While most of the literature reports results from training on the entire training set, some earlier work, notably [31], reports results from training and evaluating on a subset consisting of 80 training and 20 test examples per category.

Input Representations. The dataset presents each shape as a collection of triangles, hence it is important to describe the exact way in which these are converted to point clouds, voxels, and images for input to different network architectures. These inputs are visualized in Fig. 1 and described below:

  • Voxel representation. To get voxel representations we use the data from [23], where models are discretized to a \(30\times 30\times 30\) occupancy grid. The data is available from the author’s page.

  • Point cloud. For point cloud representation we use the data from the PointNet approach [32] where 2048 points are uniformly sampled for each model.

  • Image representation. To generate multiple views of the model we use a setup similar to [31]. Since the models are assumed to be upright oriented, a set of virtual cameras is placed at 12 radially symmetric locations, i.e., every 30\(^\circ \), facing the object center at an elevation of 30\(^\circ \). Compared to [31], we render the images with a black background and set the field-of-view of the camera such that the object tightly fits the image canvas, producing images of size \(224\times 224\). A similar scheme was used to generate views for semantic segmentation of shapes in the Shape PFCN approach [12]. This had a non-negligible impact on the performance of the downstream models, as discussed in Sect. 3.1. Given this setup we consider three different ways to render the models, described below (a sketch of the camera placement follows this list):

    1. Phong shading, where images are rendered with the Phong reflection model [21] using Blender software [2]. The light and material setup is similar to the approach in [31].

    2. Depth rendering, where only the depth value is recorded.

    3. Silhouette rendering, where images are rendered as binary images for pixels corresponding to foreground.
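The following is a minimal sketch (not the authors' rendering code) of the camera placement described above: 12 azimuths in 30\(^\circ \) increments at a 30\(^\circ \) elevation, all looking at the object center. The camera radius and the choice of z as the upright axis are illustrative assumptions.

```python
# Minimal sketch of the 12-view camera setup: azimuths every 30 degrees at a
# 30-degree elevation, looking at the object center. The radius and the z-up
# convention are assumptions, not the authors' exact configuration.
import numpy as np

def camera_positions(radius=2.5, n_views=12, elevation_deg=30.0):
    elev = np.radians(elevation_deg)
    azimuths = np.radians(np.arange(n_views) * (360.0 / n_views))
    x = radius * np.cos(elev) * np.cos(azimuths)
    y = radius * np.cos(elev) * np.sin(azimuths)
    z = np.full(n_views, radius * np.sin(elev))
    return np.stack([x, y, z], axis=1)   # (12, 3) camera centers around the origin

print(camera_positions().round(2))
```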

Data Augmentation. Models in the dataset are upright oriented, but not consistently oriented about the upright axis, i.e., models can be rotated arbitrarily about the upright direction. Models that take voxel or point cloud input often benefit from rotation augmentation about the upright axis during training and testing. Analogous to the multiview setting, we add models rotated by 30\(^\circ \) increments as additional data during training, and optionally aggregate votes across these instances at test time.
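A minimal sketch of this rotation augmentation for the point cloud input is shown below; treating z as the upright axis is an assumption for illustration.

```python
# Minimal sketch of rotation augmentation about the upright axis (assumed to
# be z): each point cloud is duplicated at 30-degree increments, and the same
# copies can be used for vote aggregation at test time.
import numpy as np

def rotate_z(points, angle_rad):
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ R.T

def augment_rotations(points, step_deg=30):
    return [rotate_z(points, np.radians(a)) for a in range(0, 360, step_deg)]

cloud = np.random.rand(2048, 3)        # stand-in for a sampled model
augmented = augment_rotations(cloud)   # 12 rotated copies
```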

2.2 Classification Architectures

We consider the following deep architectures for shape classification.

Multiview CNN (MVCNN). The MVCNN architecture [31] uses rendered images of the model from different views as input. Each image is fed into a CNN with shared weights. A max-pooling layer across the different views performs an orderless aggregation of the individual representations, followed by several non-linear layers for classification (a minimal sketch of this view pooling is given after the list below). While the original paper [31] used the VGG-M network [30], we also report results using:

  • The VGG-11 network, which is the model with configuration A from [30]. The view-pooling layer is added before the first fc layer.

  • Variants of residual networks proposed in [8] such as ResNet18, ResNet34, and ResNet50. The view-pooling layer is added before the final fc layer.
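To make the view-pooling step concrete, below is a minimal PyTorch sketch of an MVCNN built on the torchvision VGG-11 backbone, with the pooling placed before the first fc layer as described above. Reusing the VGG classifier layers and the exact head layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal MVCNN sketch: a shared backbone encodes every rendered view, a max
# over the view dimension performs the orderless view pooling, and the
# remaining fc layers classify the pooled feature.
import torch
import torch.nn as nn
import torchvision.models as models

class MVCNN(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        vgg = models.vgg11(weights=None)          # optionally ImageNet-pretrained
        self.features = vgg.features              # shared across views
        self.classifier = nn.Sequential(          # view pooling happens before this
            nn.Flatten(), *list(vgg.classifier.children())[:-1],
            nn.Linear(4096, num_classes))

    def forward(self, x):                          # x: (batch, views, 3, 224, 224)
        b, v = x.shape[:2]
        feats = self.features(x.flatten(0, 1))     # encode every view
        feats = feats.view(b, v, *feats.shape[1:])
        pooled = feats.max(dim=1).values           # orderless max over views
        return self.classifier(pooled)

logits = MVCNN()(torch.randn(1, 12, 3, 224, 224))  # (1, 40)
```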

Voxel Network (VoxNet). VoxNet was proposed in several early works [17, 37] and uses convolution and pooling layers defined on 3D voxel grids. The early VoxNet models [17] used two 3D convolutional layers and two fully-connected layers. In our initial experiments we found the capacity of this network to be limited. We therefore also experimented with the deeper VoxNet architecture proposed in [32], which has five blocks of (conv3d-batchnorm-LeakyReLU) and thus includes batch normalization [11]. All conv3d layers have kernel size 5, stride 1, and 32 channels, and the LeakyReLU slope is 0.1. Two fully-connected layers (fc-batchnorm-ReLU-fc) are added on top to obtain class predictions.
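A minimal PyTorch sketch of this deeper VoxNet variant is given below. The padding, the pooling before the fully-connected layers, and the hidden width of the first fc layer are not specified in the text and are assumptions here.

```python
# Minimal VoxNet sketch: five (conv3d-batchnorm-LeakyReLU) blocks with kernel
# size 5, stride 1, and 32 channels, followed by fc-batchnorm-ReLU-fc.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=5, stride=1, padding=2),
        nn.BatchNorm3d(cout),
        nn.LeakyReLU(0.1))

class VoxNet(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(1, 32), *[conv_block(32, 32) for _ in range(4)],
            nn.AdaptiveAvgPool3d(4))               # assumption: pool before the fc layers
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 4 ** 3, 128),
            nn.BatchNorm1d(128), nn.ReLU(), nn.Linear(128, num_classes))

    def forward(self, v):                          # v: (batch, 1, 32, 32, 32) occupancies
        return self.head(self.blocks(v))

logits = VoxNet()(torch.rand(8, 1, 32, 32, 32))    # (8, 40)
```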

VoxMVCNN. We also consider a hybrid model that takes voxels as input and uses an MVCNN approach for classification via a differentiable renderer. To achieve this we make two simplifications. First, only six renderings are considered, corresponding to viewing directions along the six axes \((\pm x,\pm y,\pm z)\). Second, the rendering is approximated using the approach suggested in PrGAN [5], where line integrals are used to compute pixel values. For example, the line integral of a volume occupancy grid V along axis k is given by \(P((i,j),V) = 1-\exp (-\sum _k V(i,j,k))\): the higher the sum of occupancy values along the axis, the closer the integral is to 1. The generated views for two models are shown in Fig. 2. The renderings generated this way approximate the silhouette renderings described earlier. The primary advantage of this rendering method is that it is differentiable, and hence we use this model to analyze the robustness of the MVCNN architecture to adversarial inputs (described in Sect. 2.3).
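A minimal sketch of this line-integral rendering is shown below; mirroring the image for the opposite viewing direction is an assumption about the camera convention.

```python
# Minimal sketch of the PrGAN-style line-integral rendering: for each pair of
# axis-aligned viewing directions, the pixel value is 1 - exp(-sum of
# occupancies along the ray). Everything is differentiable w.r.t. the voxels.
import torch

def render_six_views(V):
    """V: (batch, D, D, D) occupancy grid in [0, 1] -> (batch, 6, D, D) images."""
    views = []
    for axis in (1, 2, 3):                          # integrate along x, y, z
        image = 1.0 - torch.exp(-V.sum(dim=axis))
        views.append(image)
        views.append(torch.flip(image, dims=[-1]))  # assumed mirror for the opposite view
    return torch.stack(views, dim=1)

V = torch.rand(2, 30, 30, 30, requires_grad=True)
images = render_six_views(V)                        # feed these into the MVCNN branch
images.sum().backward()                             # gradients flow back to the voxels
```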

Point Network (PointNet). We use the same architecture as PointNet [32], which operates on point cloud representations of a model. The architecture applies a series of non-linear mappings individually to each input point and performs an orderless aggregation using a max-pooling operation. Thus the model is invariant to the order in which the points are presented and can operate directly on point clouds without additional preprocessing such as spatial partitioning or graph construction. Additionally, some initial layers perform spatial transformations (rotation, scaling, etc.). Despite its simplicity, the model and its variants have been shown to be effective at shape classification and segmentation tasks [22, 32].
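The core idea, per-point non-linear mappings followed by an orderless max-pooling, can be sketched in a few lines of PyTorch. The layer widths below mirror the original PointNet, but the input transform networks are omitted, so this is an illustration rather than the full architecture.

```python
# Minimal PointNet-style sketch: shared per-point MLPs (1x1 convolutions)
# followed by a max over points, making the descriptor order-invariant.
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        self.point_mlp = nn.Sequential(            # applied to every point independently
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU())
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, pts):                        # pts: (batch, 3, num_points)
        global_feat = self.point_mlp(pts).max(dim=2).values   # orderless aggregation
        return self.classifier(global_feat)

logits = TinyPointNet()(torch.rand(4, 3, 1024))    # (4, 40)
```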

Training Details. All the MVCNN models are trained in two stages, as suggested in [31]. The model is first trained for single-image classification with the view-pooling layer removed, and then trained to jointly classify all the views with the view-pooling layer in the second stage. We use the Adam optimizer [14] with learning rates of \(5\times 10^{-5}\) and \(1\times 10^{-5}\) for the first and second stages respectively, and each stage is trained for 30 epochs. The batch size is set to 64 and 96 (eight models with twelve views each) for the two stages, and the weight decay parameter is set to 0.001. VoxNet is trained with the Adam optimizer with a learning rate of \(1\times 10^{-3}\) for 150 epochs; the batch size is set to 64 and the weight decay parameter to 0.001. VoxMVCNN is trained using the same procedure as MVCNN. For PointNet [32] we use the publicly available implementation released by the authors.

Fig. 2.

Examples of rendered images for the VoxMVCNN architecture. The voxel input is rendered using the simplified technique described in Sect. 2.2 to generate six images, which are processed using the MVCNN architecture [31] for classification.

2.3 Generating Adversarial Inputs

Adversarial examples for image-based deep neural networks have been thoroughly explored in the literature [6, 16, 19, 33]. However, there is no prior work addressing adversarial examples for deep neural networks based on 3D representations. Here we investigate whether adversarial shapes can be obtained for different 3D shape recognition models and, perhaps more importantly, which 3D representation is more robust to adversarial examples. We define an adversarial example as follows. Let \(\phi (\mathbf {s}, y)\) be the score that a classifier \(\phi \) gives to an input \(\mathbf {s}\) belonging to a class y. An adversarial example \(\mathbf {s^\prime }\) is a sample that is perceptually similar to \(\mathbf {s}\), but for which \(\mathrm {arg max}_y \phi (\mathbf {s}, y) \ne \mathrm {arg max}_y \phi (\mathbf {s^\prime }, y)\). It is known from [33] that an effective way to compute adversarial examples for image-based models is the following. Given a sample \(\mathbf {s}\) from the dataset and a class \(y^\prime \) whose score one wishes to maximize, an adversarial example \(\mathbf {s^\prime }\) can be computed as

$$\begin{aligned} \mathbf {s^\prime } = \mathbf {s} + \alpha \nabla _{\mathbf {s}}\phi (\mathbf {s}, y^\prime ) \end{aligned}$$

where \(\nabla _{\mathbf {s}}\phi \) is the gradient of the classifier with respect to the input \(\mathbf {s}\), and \(\alpha \) is the learning rate. For many image models, this single-step procedure is able to generate perceptually indistinguishable adversarial examples. However, for some of the examples we experimented with, a single step is not enough to cause a misclassification. Thus, we employ the following iterative procedure based on [16]:

$$\begin{aligned} \mathbf {s^\prime _0} = \mathbf {s}, \quad \mathbf {s^\prime _{t+1}} = clip_{\mathbf {s}, \epsilon } \big \{\mathbf {s^\prime _{t}} + \alpha \nabla _{\mathbf {s}}\phi (\mathbf {s^\prime _{t}}, y^\prime )\big \} \end{aligned}$$
(1)

where \(clip_{\mathbf {s}, \epsilon }\{x\}\) is an operator that clips the values of x so that the result lies in the \(L_\infty \) \(\epsilon \)-neighborhood of \(\mathbf {s}\). Notice that this procedure is agnostic to the representation of the input \(\mathbf {s}\). Thus, we use the same method to generate adversarial examples for all three 3D representations: voxels, point clouds, and multi-view images. For voxel grids, we also clip the voxel values to make sure they remain in [0, 1]. For multi-view representations, we need to make sure that all views are consistent with each other. We address this issue by using the VoxMVCNN architecture, which generates multiple views of the same object through a differentiable renderer, i.e., the line integral described in Sect. 2.2.
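A minimal sketch of the iterative procedure in Eq. (1) is given below. It assumes the classifier returns a matrix of class scores and, as in the voxel case above, optionally clamps the perturbed input to [0, 1].

```python
# Minimal sketch of the iterative attack in Eq. (1): gradient ascent on the
# target-class score with an L-infinity clip around the original input.
import torch

def iterative_attack(model, s, target, eps, alpha=1e-6, steps=1000, clamp01=True):
    s_adv = s.clone().detach()
    for _ in range(steps):
        s_adv.requires_grad_(True)
        score = model(s_adv)[:, target].sum()        # score of the target class y'
        grad = torch.autograd.grad(score, s_adv)[0]
        with torch.no_grad():
            s_adv = s_adv + alpha * grad
            s_adv = torch.max(torch.min(s_adv, s + eps), s - eps)  # clip_{s,eps}
            if clamp01:
                s_adv = s_adv.clamp(0.0, 1.0)        # voxel occupancies stay in [0, 1]
            if model(s_adv).argmax(dim=1).eq(target).all():
                break                                # misclassified as the target class
    return s_adv.detach()
```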

3 Experiments

We begin by investigating the model generalization in Sect. 3.1. Section 3.2 analyzes the effect of different architectures and renderings for the MVCNN. Section 3.3 uses cross-modal distillation to improve the performance of VoxNet and PointNet. Section 3.4 compares the tradeoffs between different representations. Section 3.5 compares the robustness of different classifiers to adversarial perturbations. Finally, Sect. 3.6 puts the results presented in this paper in the context of prior work.

3.1 Learning from a Few Examples

One of the most desirable properties of a classifier is its ability to generalize from a few examples. We test this ability by evaluating the accuracy of different models as a function of training set size. We select the first \(M_k\) models in the training set for each class, where

$$ M_k=\min (N_k,\{10,20,40,80,160,320,889\}), $$

and \(N_k\) is the number of models in class k. The maximum number of models per class in the training set of ModelNet40 is 889. Figure 3 shows the per-class and per-instance accuracy of three different models as a function of the training set size. The MVCNN with the VGG-11 architecture generalizes better than VoxNet and PointNet across all training set sizes. MVCNN obtains 77.8% accuracy using only 10 training models per class, while PointNet and VoxNet obtain 62.5% and 57.9% respectively. The performance of MVCNN is near optimal with 160 models per class, far fewer than PointNet and VoxNet require. When using the whole dataset for training, MVCNN (95.0%) outperforms PointNet (89.1%) and VoxNet (85.6%) by a large margin.

Several improvements have been proposed for both point-based and voxel-based architectures. The best-performing point-based model, to the best of our knowledge, is Kd-Networks [15], which achieves 91.8% per-instance accuracy. For voxel-based models, O-CNN [35] uses sparse convolutions on octrees [18] to handle higher resolutions and achieves 90.6% per-instance accuracy. However, all of these remain well below the MVCNN approach. More details and a comparison to the state of the art are given in Sect. 3.6.

Fig. 3.

Classification accuracy as a function of training set size. MVCNN generalizes better than the other approaches. The two MVCNN curves correspond to variants with and without ImageNet pretraining.

3.2 Dissecting the MVCNN Architecture

Given the high performance of MVCNN we investigate what factors contribute to its performance as described next.

Effect of Model Architecture. The MVCNN model in [31] used the VGG-M architecture. However, a number of different image networks have since been proposed. We train MVCNN with different CNN architectures and report the accuracies in Table 1. All models have similar performance, suggesting that MVCNN is robust across CNN architectures. In Table 3 we also compare with results using VGG-M and AlexNet. With the same shaded images and training subset, VGG-11 achieves 89.1% and VGG-M 89.9% accuracy.

Table 1. Accuracy (%) of MVCNN with different CNN architectures. The VGG-11 architecture is on par with the residual network variants.

Effect of ImageNet Pretraining. MVCNN benefits from transfer learning from the ImageNet classification task. However, even without ImageNet pretraining, MVCNN achieves 91.3% per-instance accuracy (Table 2), which is higher than several point-based and voxel-based approaches. Figure 3 plots the performance of the MVCNN with the VGG-11 network without ImageNet pretraining across training set sizes, showing that this trend holds throughout the training regime. In Sect. 3.3 we study whether ImageNet pretraining can benefit such approaches via cross-modal transfer learning.

Table 2. Effect of ImageNet pretraining on the accuracy (%) of MVCNN. The VGG-11 architecture and the full training/test split of the ModelNet40 dataset are used.

Effect of Shape Rendering. We analyze the effect of different rendering approaches used as input to the MVCNN model in Table 3. Sphere rendering, proposed in [24], refers to rendering each point as a sphere and was shown to improve performance with AlexNet MVCNN architectures. We first compare the tight field-of-view rendering with a black background used in this work to the rendering in [31]. Since [31] only reported results on the 80/20 training/test split, we first compared the performance of VGG-11 networks using the images from [31]; the performance difference was negligible. However, with our shaded images the performance of the VGG-11 network improves by more than 2%.

Using depth images, the per-instance accuracy is 3.4% lower than with shaded images, but concatenating shaded images with depth images gives a 1.2% improvement. Furthermore, we find that shading information provides only a 1.4% improvement over binary silhouette images. This suggests that most of the discriminative shape information used by the MVCNN lies on the boundary of the object.

Table 3. Accuracy (%) of MVCNN with different rendering methods. The numbers in brackets are the number of views used; 12 views are used if not specified.

3.3 Cross Modal Distillation

Knowledge distillation [4, 10] was originally proposed for model compression, where it was shown that the performance of a model can be improved by training it to imitate the output of a more accurate model. This technique has also been applied to transfer rich representations across modalities; for example, a model trained on images can be used to guide the learning of a model for depth images [7] or for sound waves [1]. We investigate such techniques for learning across different 3D representations, specifically from the MVCNN model to the PointNet and VoxNet models.

To do this we first train the ImageNet-initialized VGG-11 network on the full training set and extract the logits (the activations of the last layer before the softmax) on the training set. A PointNet (or VoxNet) model is then trained to minimize

$$\begin{aligned} \sum _{i=1}^{n} \mathcal{L}\left( \sigma (\mathbf{z}_i), y_i \right) + \lambda \sum _{i=1}^{n} \mathcal{L} \left( \sigma \left( \frac{\mathbf{z}_i}{T} \right) , \sigma \left( \frac{\mathbf{x}_i}{T} \right) \right) \end{aligned}$$
(2)

where \(\mathbf{x}_i\) and \(\mathbf{z}_i\) are the logits from the MVCNN model and from the model being trained respectively, \(y_i\) is the class label of the input \(\mathbf {s}_i\), \(\sigma \) is the softmax function, \(\mathcal{L}\) is the cross-entropy loss \(\mathcal{L}(p,q)=-\sum _i{p_i\log q_i}\), and T is a temperature for smoothing the targets. \(\lambda \) and T are set by grid search over \(T\in [1,20]\) and \(\lambda \in [1,100]\). For PointNet the best hyper-parameters are \(T=20,\lambda =50\) when the training set is small, and \(T=15,\lambda =10\) when the training set is larger; for VoxNet we set \(T=10,\lambda =100\) in all cases. Figure 4 shows the result of training VoxNet and PointNet with distillation. For VoxNet the per-instance accuracy improves from 85.6% to 87.4% with the whole training set; for PointNet the accuracy improves from 89.1% to 89.4%. The improvement is somewhat larger when there is less training data.
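A minimal sketch of the objective in Eq. (2) is shown below; following the standard distillation reading, the temperature-smoothed teacher distribution is used as the soft target for the student logits.

```python
# Minimal sketch of the distillation objective: a cross-entropy term on the
# labels plus a temperature-smoothed cross-entropy between the student logits
# z and the precomputed MVCNN teacher logits x.
import torch
import torch.nn.functional as F

def distillation_loss(z, x, y, T=10.0, lam=100.0):
    hard = F.cross_entropy(z, y)                              # L(softmax(z), y)
    soft_teacher = F.softmax(x / T, dim=1)                    # soft targets from the teacher
    soft_student = F.log_softmax(z / T, dim=1)
    soft = -(soft_teacher * soft_student).sum(dim=1).mean()   # smoothed cross-entropy term
    return hard + lam * soft

z = torch.randn(16, 40, requires_grad=True)   # student (VoxNet/PointNet) logits
x = torch.randn(16, 40)                       # teacher (MVCNN) logits
y = torch.randint(0, 40, (16,))
distillation_loss(z, x, y).backward()
```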

Fig. 4.

Model distillation from MVCNN to VoxNet and PointNet. The accuracy is improved by 1.8% for VoxNet and 0.3% for PointNet with the whole training set.

3.4 Tradeoffs Between Learned Representations

In this section we analyze the tradeoffs between the different shape classifiers. Table 4 compares their speed, memory usage, and accuracy. The MVCNN model has more parameters and is slower, but its accuracy is 5.9% higher than PointNet's and 9.4% higher than VoxNet's. Even though the number of FLOPS is far higher for MVCNN, the relative efficiency of 2D convolutions results in only a slightly longer evaluation time compared to VoxNet and PointNet.

We further consider an ensemble combining the image, voxel, and point cloud representations. A simple approach is to average the predictions from the different models. As shown in Fig. 5, the ensemble of VoxNet and PointNet performs better than either model alone. However, the predictions from MVCNN dominate those of VoxNet and PointNet, and combining the other models' predictions with MVCNN gives no benefit. A more complex scheme, where we trained a linear model on top of features extracted from the penultimate layers of these networks, did not provide any improvement either.
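The simple averaging scheme amounts to the following sketch, where the three logit tensors stand in for the outputs of MVCNN, VoxNet, and PointNet on the same test batch.

```python
# Minimal sketch of prediction averaging: mean of the per-class probabilities
# from the different models, followed by an argmax.
import torch
import torch.nn.functional as F

def ensemble_predict(logits_list):
    """logits_list: iterable of (batch, num_classes) logits from different models."""
    probs = torch.stack([F.softmax(l, dim=1) for l in logits_list]).mean(dim=0)
    return probs.argmax(dim=1)

mvcnn_logits, voxnet_logits, pointnet_logits = (torch.randn(5, 40) for _ in range(3))
preds = ensemble_predict([mvcnn_logits, voxnet_logits, pointnet_logits])   # (5,)
```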

Table 4. Accuracy, speed, and memory comparison of the different models. Memory usage during training, which includes parameters, gradients, and layer activations, is shown for a batch size of 64. Forward-pass time is also measured with batch size 64 using PyTorch on a single GTX Titan X for all models. The input resolutions are \(224\times 224\) for MVCNN, \(32^3\) for VoxNet, and 1024 points for PointNet. The accuracy numbers in brackets are for models trained with distillation as described in Sect. 3.3.
Fig. 5.

Accuracy obtained by ensembling models. Left: averaging the predictions; right: training a linear model on the concatenated features extracted from the different models.

3.5 Robustness to Adversarial Examples

In this section we analyze and compare the robustness of the three shape classification models to adversarial examples. Adversarial examples are generated using the iterative gradient ascent procedure described in Sect. 2.3. We search over thresholds \(\epsilon \) from 0.001 to 0.9 and find the minimum value of \(\epsilon \) at which we can generate an adversarial example within 1000 iterations with learning rate \(\alpha = 1\times 10^{-6}\).

To make a quantitative comparison between VoxMVCNN and VoxNet, we use the following procedure. Given an input \(\mathbf {s}\) and a classifier \(\phi (\mathbf {s}, y)\), the “hardest” target class is defined as \(y^\prime _{\mathbf {s},\phi } = \mathrm {arg min}_y \phi (\mathbf {s}, y)\). We select the first five models of each class from the test set. For each model, we select two target classes, \(y^\prime _{\mathbf {s},\phi _1}\) and \(y^\prime _{\mathbf {s},\phi _2}\), where \(\phi _1\) is VoxMVCNN and \(\phi _2\) is VoxNet. This gives a total of 400 test cases \((\mathbf {s},y^\prime )\). We say an adversarial example \(\mathbf {s^\prime }\) can be found when \(y^\prime = \mathrm {arg max}_y \phi (\mathbf {s^\prime }, y)\) for some \(\epsilon \le 0.9\).
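The evaluation protocol can be summarized by the sketch below, which reuses the `iterative_attack` sketch from Sect. 2.3; the discrete list of candidate \(\epsilon \) values is an assumption, since the text only specifies the 0.001–0.9 search range.

```python
# Minimal sketch of the evaluation protocol: pick the "hardest" target class
# (lowest score under the clean input), then find the smallest epsilon at
# which the iterative attack succeeds. Assumes a single input (batch size 1)
# and the iterative_attack function sketched in Sect. 2.3.
def min_epsilon_for_attack(model, s, eps_candidates=(0.001, 0.01, 0.1, 0.3, 0.6, 0.9)):
    hardest = model(s).argmin(dim=1).item()          # y' = argmin_y phi(s, y)
    for eps in eps_candidates:                       # smallest epsilon first
        s_adv = iterative_attack(model, s, hardest, eps, alpha=1e-6, steps=1000)
        if model(s_adv).argmax(dim=1).item() == hardest:
            return eps                               # attack succeeded within budget
    return None                                      # no adversarial example found
```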

As shown in Table 5, we can generate adversarial examples for 399 of the 400 test cases for VoxMVCNN, but for only 370 of them for VoxNet. We then report the minimum \(\epsilon \) at which an adversarial example can be found for each test case. The average \(\epsilon \) for VoxMVCNN is smaller than for VoxNet, which suggests that VoxMVCNN is more easily fooled by adversarial examples than VoxNet. We also evaluate the VoxMVCNN model trained without ImageNet pretraining. Surprisingly, the model without pretraining is more robust to adversarial examples, as its mean \(\epsilon \) is larger than that of the pretrained model. For PointNet, we can generate 379 adversarial examples. Note that the \(\epsilon \) values here are not comparable with those for the voxel representations, since in this case we perturb point coordinates instead of occupancy values.

We also show qualitative adversarial examples in Figs. 6 and 7. The voxels are scaled according to their occupancy level. In Fig. 6 the input and the target classes are the same within each row; the target labels are cup, keyboard, bench, and door from top to bottom. In Fig. 7 we use the same input model but set a different target class for each column. The perturbations in the adversarial examples for VoxMVCNN are almost imperceptible, since the required \(\epsilon \) is small and the model is easily fooled. The classification accuracy of each model is shown in Table 5 for reference.

Table 5. Robustness of the three models to adversarial examples. We generate adversarial examples for 400 test cases \((\mathbf {s},y^\prime )\) for each model. The \(\epsilon \) defines an \(L_\infty \) \(\epsilon \)-neighborhood of the values, which are either point coordinates or voxel occupancies. We report the average of the minimum \(\epsilon \) at which an adversarial example can be found; a larger \(\epsilon \) means more robustness to adversarial examples. The classification accuracies are reported in the last row for reference. We use our own PyTorch [20] implementation of PointNet for generating adversarial examples.
Fig. 6.

Adversarial examples of PointNet, VoxNet, and VoxMVCNN. The shapes are misclassified as cup, keyboard, bench, and door for each row from top to bottom. Voxel size represents occupancy level.

Fig. 7.

Adversarial examples of PointNet, VoxNet, and VoxMVCNN. The shapes are misclassified as plant, monitor, desk, and door for each column from left to right.

3.6 Comparison to Prior Work

We compare our MVCNN result with prior work in Table 6. The results are grouped by input type. Among multi-view image-based models, our MVCNN achieves 95.0% per-instance accuracy, the best result among all competing approaches. RotationNet [13], which predicts the object pose and the class label at the same time, is 0.2% below our MVCNN. Dominant Set Clustering [34] works by clustering image features across views and pooling within the clusters; its performance is 1.0% lower than RotationNet. MVCNN-MultiRes [23] is the most closely related to our work. It showed that MVCNN with sphere rendering can achieve better accuracy than a voxel-based network, suggesting that there is room for improvement in VoxNet; our VoxNet experiments corroborate this conclusion. Furthermore, MVCNN-MultiRes uses images at multiple resolutions to boost its performance.

Among point-based methods, PointNet [32] and DeepSets [38] use symmetric functions, i.e., max/mean pooling layers, to generate permutation-invariant point cloud descriptors. DynamicGraph [36] builds upon PointNet by applying symmetric aggregation to points within a neighborhood instead of the whole set; the neighborhoods are computed dynamically by building a nearest-neighbor graph using distances defined in feature space. Similarly, Kd-Networks [15] precompute a graph induced by a binary spatial partitioning tree and use it to apply local linear operations. The best point-based method is 2.8% less accurate than our MVCNN.

Among voxel-based methods, VoxNet [17] and 3DShapeNets [37] apply 3D convolutions to voxels. ORION [27] is based on VoxNet but predicts the orientation in addition to the class label. OctNet [25] and O-CNN [35] are able to process higher-resolution grids by using an octree representation. FusionNet [9] combines the voxel and image representations to improve performance to 90.8%. Our experiments in Sect. 3.4 suggest that, since MVCNN already achieves 95.0% accuracy, combining different representations is not effective.

Table 6. Accuracy (%) of state-of-the-art methods with different 3D representations. PC refers to point clouds. The methods are grouped by input type and sorted by accuracy.

4 Conclusion

We investigated different representations and models for the 3D shape classification task, resulting in a new state of the art on the ModelNet40 benchmark. We analyzed the generalization of MVCNN, PointNet, and VoxNet by varying the number of training examples. Our results indicate that multiview-based methods generalize better and outperform the other methods on the full dataset, even without ImageNet pretraining or when trained on binary silhouettes. We also analyzed cross-modal distillation and showed improvements for VoxNet and PointNet by distilling knowledge from MVCNN. Finally, we analyzed the robustness of the models to adversarial perturbations and concluded that point-based networks are more robust to point perturbations, while multi-view and voxel-based networks can be fooled by imperceptible perturbations.