1 Introduction

Estimating 6-DoF object pose from images is a core problem for a wide range of applications including robotic manipulation, navigation, augmented reality and autonomous driving. While numerous methods appear in the literature [1, 2, 5, 11, 16, 25, 38, 40], scalability (to large numbers of objects) and accuracy continue to be critical issues that limit existing methods. Recent work has attempted to leverage the power of deep CNNs to surmount these limitations [15, 24, 26, 29, 34, 37, 41, 43]. One naive approach is to train a network to estimate the pose of each object of interest (Fig. 1 (a)). More recent approaches follow the principle of “object per output branch” (Fig. 1 (b)) whereby each object class is associated with an output stream connected to a shared feature basis [15, 24, 29, 34, 43]. In both cases, the size of the network increases with the number of objects, which implies that large amounts of data are needed for each class to avoid overfitting. In this work, we present a multi-class pose estimation architecture (Fig. 1 (c)) which receives object images and class labels provided by a detection system and which has a single branch for pose prediction. As a result, our model is readily scalable to large numbers of object categories and works for unseen instances while providing robust and accurate pose prediction for each object.

Fig. 1.

Illustration of different learning architectures for single-view object pose estimation: (a) each object is trained on an independent network; (b) each object is associated with one output branch of a common CNN root; and (c) our network with single output stream via class prior fusion. Figure (d) illustrates our multi-view, multi-class pose estimation framework where \(h_{m,k}\), the k-th pose hypothesis on view m, is first aligned to a canonical coordinate system and then matched against other hypotheses for pose voting and selection.

The ambiguity of object appearance and occlusion in cluttered scenes is another problem that limits the application of pose estimation in practice. One solution is to exploit additional views of the same instance to compensate for recognition failure from a single view. However, naive “averaging” of multiple single-view pose estimates in SE(3) [4] does not work due to its sensitivity to incorrect predictions. Additionally, most current approaches to multi-view 6-DoF pose estimation [6, 21, 32] do not address single-view ambiguities caused by object symmetry. This exacerbates the complexity of view fusion when multiple correct estimates from single views do not agree in SE(3). Motivated by these challenges, we demonstrate a new multi-view framework (Fig. 1 (d)) which selects pose hypotheses, computed from our single-view multi-class network, based on a distance metric that is robust to object symmetry.

In summary, we make the following contributions to scalable and accurate pose estimation on multiple classes and multiple views:

  • We develop a multi-class CNN architecture for accurate pose estimation with three novel features: (a) a single pose prediction branch which is coupled with a discriminative pose representation in SE(3) and is shared by multiple classes; (b) a method to embed object class labels into the learning process by concatenating a tiled class map with convolutional layers; and (c) deep supervision with an object mask which improves the generalization from synthetic data to real images.

  • We present a multi-view fusion framework which reduces single-view ambiguity based on a voting scheme. An efficient implementation is proposed to enable fast hypothesis selection during inference.

  • We show that our method provides state-of-the-art performance on public benchmarks including YCB-Video [43] and JHUScene-50 [21] for 6-DoF object pose estimation, and ObjectNet-3D [41] for large-scale viewpoint estimation. Further, we present a detailed ablative study on all benchmarks to empirically validate the three innovations in the single-view pose estimation network.

2 Related Work

We first review three categories of work on single-view pose estimation and then investigate recent progress on multi-view object recognition.

Template Matching. Traditional template-based methods compute the 6-DoF pose of an object by matching image observations to hundreds or thousands of object templates that are sampled from a constrained viewing sphere [1, 11, 38, 40]. Recent approaches apply deep CNNs as end-to-end matching machines to improve the robustness of template matching [1, 18, 40]. Unfortunately, these methods do not scale well in general because the inference time grows linearly with the number of objects. Moreover, they generalize poorly to unseen object instances, as shown in [1], and suffer from the domain shift between synthetic and real images.

Bottom-Up Approaches. Given object CAD models, 6-DoF object pose can be inferred by registering a CAD model to part of a scene using coarse-to-fine ICP [46], Hough voting [36], RANSAC [27] and heuristic 3D descriptors [7, 31]. More principled approaches use random forests to infer local object coordinates for each image pixel based on hand-crafted features [2, 3, 25] or auto-encoders [5, 16]. However, local image patterns are ambiguous for objects with similar appearance, which prevents this line of work from being applied to generic objects and unconstrained background clutter.

Learning End-to-End Pose Machines. This class of work deploys deep CNNs to learn an end-to-end mapping from a single RGB or RGB-D image to object pose. [24, 26, 34, 41] train CNNs to directly predict the Euler angles of object instances and then apply them to unseen instances from the same object categories. Other methods decouple 6-DoF pose into rotation and translation components and infer each independently. SSD-6D [15] classifies an input into discrete bins of Euler angles and subsequently estimates 3D position by fitting 2D projections to a detected bounding box. PoseCNN [43] regresses rotation with a loss function that is robust to object symmetry, and follows this with a bottom-up approach to vote for the 3D location of the object center via RANSAC. In contrast to the above, our method formulates a discriminative representation of 6-DoF pose that enables predictions of both rotation and translation by a single forward pass of a CNN, while being scalable to hundreds of object categories.

Multi-view Recognition. In recent years, several multi-view systems have been developed to enhance 3D model classification [14, 33], 2D object detection [19, 28] and semantic segmentation [22, 35, 46]. For 6-DoF pose estimation, SLAM++ [32] is an early representative of a multi-view pose framework which jointly optimizes the poses of both the detected objects and the cameras. [22] computes object pose by registering 3D object models over an incrementally reconstructed scene via a dense SLAM system. These two methods are difficult to scale because they rely on [27], whose running time grows linearly with the number of objects. A more recent method [6] formulates a probabilistic framework to fuse pose estimates from different views. However, it requires computation of marginal probability over all subsets of a given number of views, which is computationally prohibitive when the number of views and/or objects is large.

Fig. 2.

Multi-class network architecture for a single view; the figure shows the actual number of layers used in our implementation. The XYZ map represents the normalized 3D coordinates of each image pixel. If depth data is not available, this stream is omitted.

3 Single-View Multi-class Pose Estimation Network

In this section, we introduce a CNN-based architecture for multi-class pose estimation (Fig. 2). The input can be an RGB or RGB-D image region of interest (ROI) of an object provided by an arbitrary object detection algorithm. The network outputs represent both the rotation R and the translation T of a 6-DoF pose (R, T) in SE(3).

We first note that a single rotation R relative to the camera corresponds to different object appearances in the image domain when T varies. This issue has been discussed in [26] in the case of 1-D yaw angle estimation. To create a consistent mapping from the ROI appearance to (R, T), we initially rectify the annotated pose to align to the current viewpoint as follows. We first compute the 3D orientation \(\varvec{v}\) towards the center of the ROI (x, y): \(\varvec{v} = [(x - c_x) / f_x, (y - c_y) / f_y, 1]\), where \((c_x, c_y)\) is the 2D camera center and \(f_x, f_y\) are the focal lengths for the X and Y axes. Subsequently, we compute the rectified XYZ axes \([X_{\varvec{v}}, Y_{\varvec{v}}, Z_{\varvec{v}}]\) by aligning the Z axis [0, 0, 1] to \(\varvec{v}\).

$$\begin{aligned} X_{\varvec{v}} = [0, 1, 0] \times Z_{\varvec{v}}, \ Y_{\varvec{v}} = Z_{\varvec{v}} \times X_{\varvec{v}}, \ Z_{\varvec{v}} = \frac{\varvec{v}}{\Vert \varvec{v}\Vert _2} \end{aligned}$$
(1)

where the symbol \(\times \) indicates the cross product of two vectors. Finally, we project (R, T) onto \([X_{\varvec{v}}, Y_{\varvec{v}}, Z_{\varvec{v}}]\) and obtain the rectified pose \((\widetilde{R}, \widetilde{T})\): \(\widetilde{R} = R_{\varvec{v}} \cdot R\) and \(\widetilde{T} = R_{\varvec{v}} \cdot T\), where \(R_{\varvec{v}} = [X_{\varvec{v}};Y_{\varvec{v}};Z_{\varvec{v}}]\). We refer readers to the supplementary material for more details about the rectification step. When depth is available, we rectify the XYZ value of each pixel by \(R_{\varvec{v}}\) and construct a normalized XYZ map by centering the point cloud to the median along each axis.
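As a concrete illustration, the rectification step can be sketched in a few lines of numpy. This is a minimal sketch with our own naming; the normalization of \(X_{\varvec{v}}\) is added so that \(R_{\varvec{v}}\) is a proper rotation, and the actual implementation may differ.

```python
# Minimal numpy sketch of the viewpoint rectification described above.
import numpy as np

def rectify_pose(R, T, x, y, fx, fy, cx, cy):
    """Rectify an annotated pose (R, T) toward the viewing ray of the ROI center (x, y)."""
    # Viewing direction toward the ROI center (Eq. 1).
    v = np.array([(x - cx) / fx, (y - cy) / fy, 1.0])
    Z_v = v / np.linalg.norm(v)
    X_v = np.cross(np.array([0.0, 1.0, 0.0]), Z_v)
    X_v /= np.linalg.norm(X_v)          # normalization added for a proper rotation
    Y_v = np.cross(Z_v, X_v)
    # Rows of R_v are the rectified axes, i.e. R_v = [X_v; Y_v; Z_v].
    R_v = np.stack([X_v, Y_v, Z_v], axis=0)
    return R_v @ R, R_v @ T             # rectified (R, T)
```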

Figure 2 illustrates the details of our network design. Two streams of convolutional layers receive the RGB image and the XYZ map, respectively, and the final outputs are bin and delta vectors (described below) for both rotation and translation (Sect. 3.1). These two streams are further merged with class priors (Sect. 3.2) and deeply supervised by an object mask (Sect. 3.3). When depth data is not available, we simply remove the XYZ stream.

3.1 Bin & Delta Representation for SE(3)

Direct regression to object rotation R has been shown to be inferior to a classification scheme over a discretized SO(3) [15, 26, 30]. One common discretization of SO(3) is to bin along each Euler angle \((\alpha , \beta , \gamma )\) (i.e. yaw, pitch and roll) [15, 34]. However, this binning scheme yields a non-uniform tessellation of SO(3). Consequently, a small error on one Euler angle may be magnified and result in a large deviation in the final rotation estimate. In the following, we formulate two new bin & delta representations which uniformly partition both SO(3) and R(3). They are further coupled with a classification & regression scheme for learning discriminative pose features.

Almost Uniform Partition of SO(3). We first exploit the sampling technique developed by [44] to generate N rotations \(\{\hat{R}_1,...,\hat{R}_N\}\) that are uniformly distributed on SO(3). These N rotations are used as the centers of N rotation bins in SO(3) and are shared across object classes. Given an arbitrary rotation matrix R, we convert it to a bin and delta pair \((\varvec{b}^R,\varvec{d}^R)\) based on \(\{\hat{R}_1,...,\hat{R}_N\}\). The bin vector \(\varvec{b}^R\) contains N dimensions, where the i-th dimension \(\varvec{b}^R_i\) indicates the confidence of R belonging to bin i. \(\varvec{d}^R\) stores N rotations (i.e. quaternions in our implementation), where the i-th rotation \(\varvec{d}^R_i\) is the deviation from \(\hat{R}_i\) to R. During inference, we take the bin with the maximum score and apply the corresponding delta value to the bin center to compute the final prediction. In training, we enforce a sparse confidence scoring scheme for \((\varvec{b}^R,\varvec{d}^R)\) to supervise the network:

$$\begin{aligned} \varvec{b}^R_i = {\left\{ \begin{array}{ll} \theta _1 &{} : i \in NN_1(R) \\ \theta _2 &{} : i \in NN_k(R) \setminus NN_1(R) \\ 0 &{} : \text {Otherwise} \\ \end{array}\right. }, \ \ \ \varvec{d}^R_i = {\left\{ \begin{array}{ll} R\cdot \hat{R}^T_i &{} : i \in NN_k(R) \\ 0 &{} : \text {Otherwise} \\ \end{array}\right. } \end{aligned}$$
(2)

where \(\theta _1 \gg \theta _2\) and \(NN_k(R)\) is the set of k nearest neighbors of R among \(\{\hat{R}_1,...,\hat{R}_N\}\) in terms of the geodesic distance \(d(R_1, R_2) = \frac{1}{2}\Vert \log (R_1^TR_2)\Vert _F\) between two rotations \(R_1\) and \(R_2\). Note that we design the delta \(\varvec{d}^R_i\) to achieve \(R = \varvec{d}^R_i\cdot \hat{R}_i\) and not \(R = \hat{R}_i \cdot \varvec{d}^R_i\) because the former is numerically more stable. Specifically, if d is the prediction of \(d^R_i\) with error \(\delta \) such that \(d = \delta \cdot d^R_i\), the error of the final prediction \(R'\) is also \(\delta \) because \(R' = d \cdot \hat{R}_i = \delta R\). If we define \(R=\hat{R}_i\cdot d^R_i\) instead, then \(R' = \hat{R}_i \cdot d = (\hat{R}_i \delta (\hat{R}_i)^{-1}) R\) and the error becomes \(\hat{R}_i \delta (\hat{R}_i)^{-1}\). Thus, the \(\delta \) error of \(d^R_i\) may be magnified in the final rotation estimate R.
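The encoding and decoding described above can be summarized by the following numpy sketch. It follows Eq. (2) but keeps the deltas as rotation matrices for brevity (our implementation uses quaternions); `bin_rotations`, `encode_rotation` and `decode_rotation` are illustrative names, not the paper's code.

```python
# Sketch of the rotation bin & delta encoding/decoding (Eq. 2).
import numpy as np

def geodesic(R1, R2):
    # Rotation angle between R1 and R2; monotonically related to the
    # metric 0.5 * ||log(R1^T R2)||_F used in the text.
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def encode_rotation(R, bin_rotations, k=4, theta1=0.7, theta2=0.1):
    N = len(bin_rotations)
    dist = np.array([geodesic(R, Rb) for Rb in bin_rotations])
    knn = np.argsort(dist)[:k]                   # k nearest bin centers
    b = np.zeros(N)
    b[knn] = theta2
    b[knn[0]] = theta1                           # nearest neighbor gets theta1
    deltas = np.zeros((N, 3, 3))
    for i in knn:
        deltas[i] = R @ bin_rotations[i].T       # d_i^R = R * \hat{R}_i^T
    return b, deltas

def decode_rotation(b, deltas, bin_rotations):
    i = int(np.argmax(b))
    return deltas[i] @ bin_rotations[i]          # R = d_i^R * \hat{R}_i
```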

Gridding XYZ Axes. The translation vector is the 3D vector from the camera origin to the object center. To divide the translation space, we uniformly grid the X, Y and Z axes independently. For RGB images, we align the X and Y axes to the image coordinates and the Z axis to the optical axis of the camera. We also rescale the ROI to a fixed scale for the CNN, so we further adjust the Z value of each pixel to \(Z'\) such that the image scale is consistent with the depth value: \(Z' = Z \cdot \frac{s'}{s}\), where \(s'\) and s are the image scales before and after rescaling, respectively. When depth data is available, the XYZ axes are simply chosen to be the coordinate axes of the normalized point cloud.

We now discuss how to construct the bin & delta pair \((\varvec{b}^{T_x}, \varvec{d}^{T_x})\) for the X axis; the Y and Z axes are handled in the same way. We first create M non-overlapping bins of equal size \(\frac{s_{max} - s_{min}}{M}\) within \([s_{min}, s_{max}]\). When the X value is lower than \(s_{min}\) (or larger than \(s_{max}\)), we assign it to the first (or last) bin. During inference, we compute the X value by adding the delta to the bin center which has the maximum confidence score. During training, similar to Eq. 2, we compute \(\varvec{b}^{T_x}\) of an X value by finding its \(K'\) nearest neighbors among the M bins. Then, we assign \(\theta '_1\) to the top nearest neighbor and \(\theta '_2\) to the remaining \(K'-1\) neighbors (\(\theta '_1 \gg \theta '_2\)). Correspondingly, the delta values of the \(K'\) nearest neighbor bins are the deviations from the bin centers to the actual X value and the others are 0. Finally, we concatenate all bins and deltas of the X, Y and Z axes: \(\varvec{b}^T=[\varvec{b}^{T_x},\varvec{b}^{T_y},\varvec{b}^{T_z}]\) and \(\varvec{d}^T=[\varvec{d}^{T_x},\varvec{d}^{T_y},\varvec{d}^{T_z}]\). One alternative is to grid the joint XYZ space. However, the total number of bins then grows as \(M^3\) and we found no performance gain by doing so in practice.
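A matching sketch for a single translation axis is given below; it mirrors the rotation case with scalar bin centers. The default values of M, \(s_{min}\), \(s_{max}\) and \(K'\) follow the RGB-D setting reported in Sect. 5, and all names are ours.

```python
# Sketch of the 1D bin & delta encoding/decoding for one translation axis.
import numpy as np

def encode_axis(x, M=10, s_min=-0.2, s_max=0.2, k=3, theta1=0.7, theta2=0.1):
    centers = s_min + (np.arange(M) + 0.5) * (s_max - s_min) / M
    # Values outside [s_min, s_max] naturally fall into the first / last bin.
    knn = np.argsort(np.abs(x - centers))[:k]
    b = np.zeros(M)
    b[knn] = theta2
    b[knn[0]] = theta1
    delta = np.zeros(M)
    delta[knn] = x - centers[knn]       # deviation from bin center to the value
    return b, delta

def decode_axis(b, delta, M=10, s_min=-0.2, s_max=0.2):
    centers = s_min + (np.arange(M) + 0.5) * (s_max - s_min) / M
    i = int(np.argmax(b))
    return centers[i] + delta[i]
```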

3.2 Fusion of Class Prior

Many existing methods assume known object class labels, provided by a detection system, prior to pose analysis [1, 15, 24, 30, 43]. However, they ignore the class prior during training and only apply it during inference. Our idea is to directly incorporate this known class label into the learning process of the convolutional filters for pose. This is partly inspired by prior work on CNN-based hand-eye coordination learning [20], where a tiled robot motor motion map is concatenated with one hidden convolutional layer for predicting the grasp success probability. Given the class label of the ROI, we create a one-hot vector where the entry corresponding to the class label is set to 1 and all others to 0. We further spatially tile this one-hot vector to form a 3D tensor of size \(H\times W\times C\), where C is the number of object classes and H, W are the height and width of a convolutional feature map at an intermediate layer chosen as part of the network design. As shown in Fig. 2, we concatenate this tiled class tensor with the last convolutional layers of both the color and depth streams along the filter channel. Therefore, the original feature map is embedded with class labels at all spatial locations and the subsequent layers are able to model class-specific patterns for pose estimation. This is critical in teaching the network to develop compact class-specific filters for each individual object while taking advantage of a shared basis of low-level features for robustness.
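The fusion itself is a simple tile-and-concatenate operation; the numpy sketch below (with illustrative names) shows the shape bookkeeping.

```python
# Sketch of fusing the class prior: tile a one-hot class vector to H x W x C
# and concatenate it with a convolutional feature map along the channel axis.
import numpy as np

def fuse_class_prior(feature_map, class_id, num_classes):
    """feature_map: H x W x F array from the last conv layer of a stream."""
    H, W, _ = feature_map.shape
    one_hot = np.zeros(num_classes)
    one_hot[class_id] = 1.0
    tiled = np.broadcast_to(one_hot, (H, W, num_classes))     # H x W x C
    return np.concatenate([feature_map, tiled], axis=-1)      # H x W x (F + C)
```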

3.3 Deep Supervision with Object Segmentation

Due to the limited availability of pose annotations on real images, synthetic CAD renderings are commonly used as training data for learning-based pose estimation methods [11, 15, 43]. We take this approach but, following [23], we also incorporate deep supervision with an object mask at a hidden layer (shown in Fig. 2) for additional regularization of the training process. We can view the object mask as an intermediate result for the final task of 6-DoF pose estimation. That is, good object segmentation is a prerequisite for the final success of pose estimation. Moreover, a precisely predicted object mask benefits a post-refinement step such as Iterative Closest Point (ICP).

To incorporate the mask with the feature and class maps (Sect. 3.2), we append one output branch for the object mask which contains one convolutional layer followed by two de-convolution layers with upsampling ratio 2. We assume that the object of interest dominates the input image, so only a binary mask (“1” indicates an object pixel and “0” background or other objects) is needed as an auxiliary cue. As such, the size of the output layer for binary segmentation prediction is fixed regardless of the number of object instances in the database, which enables our method to scale well to large numbers of objects. Conversely, when multiple objects appear in a scene, we must rely on some detection system to “roughly” localize them in the 2D image first.
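A hedged PyTorch sketch of such a mask branch is shown below: one convolution followed by two transposed convolutions, each upsampling by a factor of 2, ending in a two-channel object/background prediction. Kernel sizes and channel widths are our own choices and need not match the actual implementation.

```python
# Illustrative mask branch: conv + two deconvolutions with upsampling ratio 2.
import torch.nn as nn

class MaskBranch(nn.Module):
    def __init__(self, in_channels, mid_channels=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(mid_channels, mid_channels, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(mid_channels, 2, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):          # x: B x in_channels x H x W
        return self.head(x)        # binary-mask logits: B x 2 x 4H x 4W
```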

3.4 Network Architecture

The complete loss function for training the network consists of five loss components over the segmentation map, the rotation, and the three translation components:

$$\begin{aligned} \mathcal {L} = l_{seg} + l_{R_b}(\widetilde{\varvec{b}^R}, \varvec{b}^R) + l_{R_d}(\widetilde{\varvec{d}^R}, \varvec{d}^R) + \sum _{i\in \{X,Y,Z\}} \Big (l_{T_b}(\widetilde{\varvec{b}^{T_i}}, \varvec{b}^{T_i}) + l_{T_d}(\widetilde{\varvec{d}^{T_i}}, \varvec{d}^{T_i})\Big ) \end{aligned}$$
(3)

where \(\widetilde{\varvec{b}^R}\), \(\widetilde{\varvec{d}^R}\), \(\widetilde{\varvec{b}^{T_i}}\) and \(\widetilde{\varvec{d}^{T_i}}\) are the bin and delta estimates of the groundtruth \(\varvec{b}^R\), \(\varvec{d}^R\), \(\varvec{b}^{T_i}\) and \(\varvec{d}^{T_i}\), respectively. We apply a softmax cross-entropy loss to the segmentation loss \(l_{seg}\) at each pixel location and to the bin losses \(l_{R_b}\) and \(l_{T_b}\). We employ L2 losses for the delta terms \(l_{R_d}\) and \(l_{T_d}\). All losses are simultaneously backpropagated to update the network parameters on each batch. For simplicity, we apply a loss weight of 1 to each term.
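For concreteness, the sketch below assembles Eq. (3) in PyTorch with unit loss weights; `pred` and `gt` are hypothetical dictionaries of network outputs and encoded targets, the soft cross-entropy is written out explicitly against the bin scores of Eq. (2), and the masking of deltas for inactive bins is omitted for brevity.

```python
# Sketch of the total loss in Eq. (3).
import torch.nn.functional as F

def soft_ce(logits, target_scores):
    # Cross-entropy between predicted bin logits and the soft scores of Eq. (2).
    return -(target_scores * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def total_loss(pred, gt):
    # Per-pixel binary mask: pred B x 2 x H x W logits, gt B x H x W labels.
    loss = F.cross_entropy(pred["seg"], gt["seg"])
    loss = loss + soft_ce(pred["rot_bin"], gt["rot_bin"])
    loss = loss + F.mse_loss(pred["rot_delta"], gt["rot_delta"])
    for axis in ("x", "y", "z"):
        loss = loss + soft_ce(pred[f"t_bin_{axis}"], gt[f"t_bin_{axis}"])
        loss = loss + F.mse_loss(pred[f"t_delta_{axis}"], gt[f"t_delta_{axis}"])
    return loss
```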

Each convolutional layer is coupled with a batch-norm layer [12] and ReLU. The size of all convolutional filters is \(3\times 3\). The output layer for each bin and delta is constructed with one global average pooling (GAP) layer followed by one fully connected (FC) layer with 512 neurons. We employ a dropout [17] layer before each downsampling convolution with stride 2. We deploy 23 layers in total.

4 Multi-view Pose Framework

In this section, we present a multi-view framework which refines the outputs of our single-view network (Sect. 3) during the inference stage. We assume that the camera pose of each frame in a sequence is known. In practice, camera poses can be provided by many SLAM systems such as KinectFusion [13].

Fig. 3.

Top-K accuracies of our single-view pose network on YCB-Video [43].

4.1 Motivation

Recall that we can obtain top-K estimates from all subspaces in SE(3), including the SO(3), X, Y, and Z spaces (Sect. 3.1). Therefore, we can compute \(K^4\) pose hypotheses by composing the top-K results from all subspaces. In turn, we compute the top-K accuracy as the highest pose accuracy achieved among all \(K^4\) hypotheses. Fig. 3 shows the curve of top-K accuracies of our pose estimation network across all object instances, in terms of the mPCK metric on the YCB-Video benchmark [43]. We observe that pose estimation performance significantly improves when we initially increase K from 1 to 2 and almost saturates at \(K=4\). This suggests that the inferred confidence score is ambiguous in only a small range, which makes sense especially for objects that have symmetric geometry or texture. The question is how we can resolve this ambiguity and further improve the pose estimation performance. We now present a multi-view voting algorithm that selects the correct hypothesis from the top-K hypothesis set.
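Composing the \(K^4\) hypotheses is a simple Cartesian product over the per-subspace top-K predictions, e.g. (names are illustrative):

```python
# Sketch of composing K^4 hypotheses from the top-K results of each subspace.
from itertools import product

def compose_hypotheses(top_rotations, top_x, top_y, top_z):
    """Each argument is a list of the top-K decoded values for that subspace."""
    hypotheses = []
    for R, x, y, z in product(top_rotations, top_x, top_y, top_z):
        hypotheses.append((R, (x, y, z)))      # (rotation, translation) pair
    return hypotheses                          # K^4 hypotheses in total
```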

4.2 Hypothesis Voting

To measure the difference between hypotheses from different views, we first transform all hypotheses into view 1 using the known camera poses of all n views. We consider a hypothesis set \(\mathcal {H}=\{h_{1,1},\cdots ,h_{i,j},\cdots ,h_{n,K^4}\}\) from n views, where \(h_{i,j}\) indicates the pose hypothesis j in view i with respect to the camera coordinates of view 1. To handle single-view ambiguity caused by symmetrical geometry, we test the consistency of “fit” to the observed data. More specifically, we employ the distance metric proposed by [11] to measure the discrepancy between two hypotheses \(h_1=(R_1, T_1)\) and \(h_2=(R_2, T_2)\):

$$\begin{aligned} D(h_1, h_2) = \frac{1}{m}\sum _{x_1\in \mathcal {M}} \min _{x_2\in \mathcal {M}} \Vert (R_1x_1 + T_1) - (R_2x_2 + T_2) \Vert _2 \end{aligned}$$
(4)

where \(\mathcal {M}\) denotes the set of 3D model points and \(m=|\mathcal {M}|\). \(D(h_1, h_2)\) yields a small distance when the 3D object occupancies under poses \(h_1\) and \(h_2\) are similar, even if \(h_1\) and \(h_2\) have a large geodesic distance on SO(3). Finally, the voting score \(V(h_{i, j})\) for \(h_{i, j}\) is calculated as:

$$\begin{aligned} V(h_{i, j}) = \sum _{h_{p, q} \in \mathcal {H} \setminus h_{i, j}} \max \Big (\sigma - D(h_{i, j}, h_{p, q}), 0\Big ) \end{aligned}$$
(5)

where \(\sigma \) is the threshold for outlier rejection. We select the hypothesis with the highest vote score as the final prediction. Fig. 1 (d) illustrates this multi-view voting process.
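A direct (unoptimized) numpy sketch of Eq. (4) and Eq. (5) is given below; `model_points` is the \(m\times 3\) array of model points and the hypotheses are (R, T) pairs already expressed in the frame of view 1. Names are illustrative.

```python
# Brute-force sketch of the symmetry-aware distance (Eq. 4) and voting (Eq. 5).
import numpy as np

def symmetric_distance(h1, h2, model_points):
    (R1, T1), (R2, T2) = h1, h2
    p1 = model_points @ R1.T + T1               # m x 3 transformed model points
    p2 = model_points @ R2.T + T2
    # For each point of p1, distance to its closest point in p2 (Eq. 4).
    d = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def vote(hypotheses, model_points, sigma=0.02):
    scores = []
    for i, h in enumerate(hypotheses):
        others = hypotheses[:i] + hypotheses[i + 1:]
        s = sum(max(sigma - symmetric_distance(h, g, model_points), 0.0) for g in others)
        scores.append(s)
    return hypotheses[int(np.argmax(scores))]   # highest-vote hypothesis (Eq. 5)
```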

Efficient Implementation. The above hypothesis voting algorithm is computationally expensive because the time complexity of Eq. 4 is at least \(O(m\log m)\) via a KDTree implementation. Our solution is to decouple translation and rotation components in Eq. 4 and approximate \(D(h_1, h_2)\) by \(\widetilde{D}(h_1, h_2)\):

$$\begin{aligned} \widetilde{D}(h_1, h_2) = \Vert T_1 - T_2\Vert _2 + \frac{1}{m}\sum _{x_1\in \mathcal {M}} \min _{x_2\in \mathcal {M}} \Vert R_1x_1 - R_2x_2 \Vert _2 \end{aligned}$$
(6)

In fact, \(\widetilde{D}(h_1, h_2)\) is an upper bound on \(D(h_1, h_2)\): \(D(h_1, h_2)\le \widetilde{D}(h_1, h_2)\) for any \(h_1\) and \(h_2\), because \(\Vert (R_1x_1 + T_1) - (R_2x_2 + T_2) \Vert _2 \le \Vert R_1x_1 - R_2x_2\Vert + \Vert T_1 - T_2\Vert \) based on the triangle inequality. Since the complexity of \(\Vert T_1 - T_2\Vert \) is O(1), we can focus on speeding up the computation of rotation distance \(\frac{1}{m}\sum _{x_1\in \mathcal {M}} \min _{x_2\in \mathcal {M}} \Vert R_1x_1 - R_2x_2 \Vert _2\). Our approach is to pre-compute a table of all pairwise distances between every two rotations from N uniformly sampled rotation bins \(\{\hat{R}_1,...,\hat{R}_N\}\) by [44]. For arbitrary \(R_1\) and \(R_2\), we search for their nearest neighbors \(\hat{R}_{N_1(R_1)}\) and \(\hat{R}_{N_1(R_2)}\) from \(\{\hat{R}_1,...,\hat{R}_N\}\). In turn, we approximate the rotation distance as follows:

$$\begin{aligned} \frac{1}{m}\sum _{x_1\in \mathcal {M}} \min _{x_2\in \mathcal {M}} \Vert R_1x_1 - R_2x_2 \Vert _2 \approx \frac{1}{m}\sum _{x_1\in \mathcal {M}} \min _{x_2\in \mathcal {M}} \Vert \hat{R}_{N_1(R_1)}x_1 - \hat{R}_{N_1(R_2)}x_2 \Vert _2 \end{aligned}$$
(7)

where the right hand side can be directly retrieved from the pre-computed distance table during inference. When N is large enough, the approximation error of Eq. 7 has little effect on our voting algorithm. In practice, we find the performance gain saturates when \(N\ge 1000\). Thus, the complexity of Eq. 7 is \(O(\log N)\) for the nearest neighbor search, which is significantly smaller than the \(O(m\log m)\) of Eq. 4 (\(m \gg N\) in general).
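The lookup can be sketched as follows, reusing the `geodesic` and `symmetric_distance` helpers from the earlier sketches; here the nearest bin center is found by a linear scan for clarity, whereas a tree-based search yields the \(O(\log N)\) complexity mentioned above.

```python
# Sketch of the precomputed rotation-distance table used in Eq. (7).
import numpy as np

def build_rotation_table(bin_rotations, model_points):
    # Computed once offline: rotation-only distances between all bin centers.
    N = len(bin_rotations)
    zero = np.zeros(3)
    table = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            table[i, j] = symmetric_distance((bin_rotations[i], zero),
                                             (bin_rotations[j], zero), model_points)
    return table

def approx_distance(h1, h2, bin_rotations, table):
    (R1, T1), (R2, T2) = h1, h2
    i = int(np.argmin([geodesic(R1, Rb) for Rb in bin_rotations]))
    j = int(np.argmin([geodesic(R2, Rb) for Rb in bin_rotations]))
    return np.linalg.norm(T1 - T2) + table[i, j]   # upper-bound surrogate (Eq. 6-7)
```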

5 Experiments

In this section, we empirically evaluate our method on three large-scale datasets: YCB-Video [43], JHUScene-50 [21] for 6-DoF pose estimation, and ObjectNet-3D [41] for viewpoint estimation. Further, we conduct an ablative study to validate our three innovations for the single-view pose network.

Evaluation Metric. For 6-DoF pose estimation, we follow the recently proposed “ADD-S” metric [43]. The traditional metric [11] considers a pose estimate h to be correct if \(D(h, h^*)\) in Eq. 4 is below a threshold, where \(h^*\) is the ground truth pose. “ADD-S” improves this threshold-based metric by computing the area under the accuracy-threshold curve over a range of thresholds (i.e. [0, 0.1]). We rename “ADD-S” as “mPCK” because it is essentially the mean of the PCK accuracy [45]. For viewpoint estimation, we use the Average Viewpoint Precision (AVP) used in PASCAL3D+ [42] and the Average Orientation Similarity (AOS) used in KITTI [8].
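A minimal sketch of this metric, assuming `distances` is a numpy array of per-sample \(D(h, h^*)\) values and that the threshold range is discretized uniformly (the exact discretization in [43] may differ):

```python
# Sketch of mPCK (ADD-S area under the accuracy-threshold curve).
import numpy as np

def mpck(distances, max_threshold=0.1, num_steps=100):
    thresholds = np.linspace(0.0, max_threshold, num_steps)
    accuracies = [(distances < t).mean() for t in thresholds]
    return np.trapz(accuracies, thresholds) / max_threshold   # normalized AUC in [0, 1]
```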

Implementation Details. The number of nearest neighbors we use for soft binning is 4 for SO(3) and 3 for each of the XYZ axes. We set the binning scores as \(\theta _1=\theta '_1=0.7\) and \(\theta _2=\theta '_2=0.1\). The number of rotation bins is 60. For XYZ binning, we use 10 bins and \([s_{min}, s_{max}]=[-0.2, 0.2]\) for each axis when RGB-D data is used. For inference on RGB data, we use 20 bins with \([s_{min}, s_{max}]=[0.2, 0.8]\) for the XY axes and 40 bins with \([s_{min}, s_{max}]=[0.5, 4.0]\) for the Z axis. In multi-view voting, we set the distance threshold \(\sigma =0.02\) and the size of the precomputed distance table to 2700. The input image to our single-view pose network is \(64\times 64\). The tiled class map is inserted at convolutional layer 15 with size \(H=W=16\). We use stochastic gradient descent with momentum 0.9 to train our network from scratch. The learning rate starts at 0.01 and is decreased by a factor of 10 every 70,000 steps. The batch size is 105 for YCB-Video and 100 for both JHUScene-50 and ObjectNet-3D. We construct each batch by mixing an equal number of samples from each class. We name our Multi-Class pose Network “MCN”. The multi-view framework using n views is called “MVn-MCN”. Since MCN also infers an instance mask, we use it to extract object point clouds when depth data is available and then run ICP to refine the estimated poses by registering the object mesh to the extracted object clouds. We denote this ICP-based approach as “MCN+ICP”.

5.1 YCB-Video

The YCB-Video dataset [43] contains 92 real video sequences of 21 object instances. 80 videos along with 80,000 synthetic images are used for training, and 2949 key frames are extracted from the remaining 12 videos for testing. We fine-tune the current state-of-the-art “mask-RCNN” [10] on the training set as the detection system. Following the same scenario as [43], we assume that an object appears at most once in a scene. Therefore, we compute the bounding box of a particular object by finding the one with the highest detection score for that object. For our multi-view system, one view is coupled with 5 other randomly sampled views in the same sequence. Each view outputs the top-3 results from each of the SO(3), X, Y and Z spaces, which in turn yields \(3^4=81\) pose hypotheses.

Table 1. mPCK accuracies achieved by different methods on the YCB-Video dataset [43]. The last row indicates the average of the per-instance mPCKs over all instances.

Table 1 reports the mPCK accuracies of our methods and of variants of poseCNN [43] (denoted as “P-CNN”). All methods are trained and tested following the same experimental setting defined in [43]. We first observe that the multi-view framework (MV5-MCN) consistently improves over the single-view network (MCN) across different instances and achieves the overall state-of-the-art performance. This improvement is more significant on RGB data, where the mPCK margin between MV5-MCN and MCN is \(5.1\%\), much larger than the \(1.0\%\) margin on RGB-D data over all instances. This is mainly because single-view ambiguity is more severe without depth data. Moreover, MCN outperforms poseCNN by \(1.7\%\) on RGB and MCN+ICP is marginally better than poseCNN+ICP by \(0.2\%\) on RGB-D. We can see that MCN achieves a more balanced performance than poseCNN across different instances. For example, poseCNN+ICP only obtains \(51.6\%\) on class “052_larger_clamp”, which is \(24.4\%\) lower than the minimum single-class accuracy of MCN+ICP. This can be mainly attributed to our class fusion design, which learns discriminative class-specific features so that similar objects can be well separated in feature space (e.g. “051_large_clamp” and “052_larger_clamp”). We also observe that MCN is much inferior to poseCNN on some instances such as the foam brick. This is mainly caused by larger detection errors (less than 0.5 IoU with the ground truth) on these instances.

We also run MCN on ground truth bounding boxes and the overall mPCKs are \(86.9\%\) on RGB (\(11.8\%\) higher than the mPCK on detected bounding boxes) and \(91.0\%\) on RGB-D (\(0.4\%\) higher than the mPCK on detected bounding boxes). This indicates that MCN is sensitive to detection error on RGB while being robust on RGB-D data. The reason is that we rely on the image scale of the bounding box to recover the 3D translation for RGB input. In addition, MCN obtains a high instance segmentation accuracy across all object instances: \(89.9\%\) on RGB and \(90.9\%\) on RGB-D. This implies that MCN does actually learn the intermediate foreground mask as part of pose prediction. We refer readers to the supplementary material for more numerical results, including segmentation accuracies, PCK curves of MCN and per-instance mPCK accuracies on ground truth bounding boxes. Last, we show some qualitative results in the upper part of Fig. 4. We can see that MCN is capable of predicting object pose under occlusion and MV5-MCN further refines the MCN result.

5.2 JHUScene-50

JHUScene-50 [21] contains 50 scenes with diverse background clutter and severe object occlusion. Moreover, the target object set consists of 10 hand tool instances with similar appearance. Only textured CAD models are available during training and all 5000 real image frames comprise the test set. To train our pose learning framework, we simulate a large amount of synthetic data by rendering densely cluttered scenes similar to the test data, where objects are randomly piled on a table. We use UnrealCV [39] as the rendering tool and generate 100k training images.

Table 2. mPCK accuracies of all objects in the JHUScene-50 dataset [21]. The last row indicates the average of the per-instance mPCKs over all object instances. Best results are highlighted in bold.

We compare MCN and MV5-MCN with the baseline method ObjRecRANSAC [27] in JHUScene-50 and one recent state-of-the-art pose manifold learning technique [1]. All methods are trained on the same synthetic training set and tested on the 5000 real image frames from JHUScene-50. We compute the 3D translation for [1] by following the same procedure used in [11]. We evaluate the different methods on the ground truth locations of all objects. Table 2 reports the mPCK accuracies of all methods. We can see that MCN significantly outperforms the comparative methods by a large margin, though MCN performs much worse than on YCB-Video, mainly because of the severe occlusion and diverse cluttered background in JHUScene-50. Additionally, we observe that MV5-MCN is superior to MCN on both RGB and RGB-D data. The performance gain on RGB-D data achieved by MV5-MCN is much larger than the one on YCB-Video, especially for the hammer category due to its symmetrical 3D geometry. We visualize some results of MCN and MV5-MCN in the bottom of Fig. 4. The bottom-right example shows that MV5-MCN corrects an orientation error of MCN that frequently occurs for the hammer.

Table 3. Accuracies of object pose estimation on the ObjectNet-3D benchmark [41]. All methods operate on the same set of detected bounding boxes estimated by Fast R-CNN [9]. Best results on both the AOS and AVP metrics are shown in bold. For AVP, we also report \(\frac{\mathrm{AVP}}{\mathrm{mAP}}\) in parentheses.

5.3 ObjectNet-3D

To evaluate the scalability of our method, we conduct an experiment on ObjectNet-3D, which consists of viewpoint annotations for 201,888 instances from 100 object categories. In contrast to most existing benchmarks [11, 21, 43] which target indoor scenes and small objects, ObjectNet-3D covers a wide range of outdoor environments and diverse object categories such as airplane. We modify the MCN model by only using the rotation branch for viewpoint estimation and removing the deep supervision of the object mask because the object mask is not available in ObjectNet-3D. To our knowledge, only [41] reports viewpoint estimation accuracy on this dataset, where a viewpoint regression branch is added along with bounding box regression in the Fast R-CNN architecture [9]. For a fair comparison, we use the same detection results as [41] as the input to MCN. Because ObjectNet-3D only provides detection results on the validation set, we train our model on the training split and test on the validation set. Table 3 reports the viewpoint estimation accuracies of different methods on the validation set, in terms of two different metrics, AVP [42] and AOS [8]. The detection performance in mAP is the upper bound of AVP. The numbers in parentheses are the ratios of AVP versus mAP. We can see that MCN is significantly superior to the large-scale model [41] on both AOS and AVP, even though [41] actually optimizes the network hyper-parameters on the validation set. This shows that MCN scales to large-scale pose estimation problems. Moreover, object instances have little overlap between the training and validation sets in ObjectNet-3D, which indicates that MCN can generalize to unseen object instances within a category.

Table 4. An ablative study of different variants of pose estimation architectures on YCB-Video, JHUScene-50 and ObjectNet-3D. We follow the same metrics as in the previous sections. For ObjectNet-3D, we report accuracies formatted as AOS / AVP. The “*” symbol indicates that no segmentation mask is used in training because it is unavailable in ObjectNet-3D.

5.4 Ablative Study

In this section, we empirically validate the three innovations introduced in MCN: the bin & delta representation (“BD”), the tiled class map (“TC”) and the deep supervision of object segmentation (“Seg”). Additionally, we inspect the baseline architectures: a separate network for each object (“Sep-Net”) and a separate output branch for each object (“Sep-Branch”), as shown in Figs. 1 (a) and (b) respectively. To isolate the effect of “BD”, we also compare against directly regressing the quaternion and translation (“plain”). Table 4 presents the accuracies of different methods on all three benchmarks. We follow the previous sections and report mPCK for YCB-Video and JHUScene-50, and AOS/AVP for ObjectNet-3D. Because ObjectNet-3D does not provide segmentation groundtruth, we remove the “Seg” module in all analysis related to ObjectNet-3D. Also, we do not report the accuracy of “Sep-Net” on ObjectNet-3D because it would require 100 GPUs for training. We have three main observations: (1) when removing any of the three innovations, pose estimation performance consistently decreases; in particular, “BD” is a more critical design than “Seg” and the tiled class map because its removal causes a larger performance drop; (2) “Sep-Branch” coupled with “BD” and “Seg” appears to be the second best architecture, but it is still inferior to MCN, especially on YCB-Video and ObjectNet-3D; moreover, the model size of “Sep-Branch” grows rapidly with the number of classes; (3) “Sep-Net” is expensive to train and performs substantially worse than MCN because MCN exploits diverse data from different classes to reduce overfitting.

Fig. 4.

Illustration of pose estimation results by MCN on YCB-Video (upper) and JHUScene-50 (bottom). The projected object mesh points transformed by the pose estimates are highlighted in orange (YCB-Video) and pink (JHUScene-50). From left to right for each example, we show the original ROI, MCN estimates on RGB, MCN estimates on RGB-D and MV5-MCN estimates on RGB-D. (Color figure online)

6 Conclusion

We present a unified architecture for inferring 6-DoF object pose from single and multiple views. We first introduce a single-view pose estimation network with three innovations: a new bin & delta pose representation, the fusion of a tiled class map into the convolutional layers, and deep supervision with an object mask at an intermediate layer. These modules enable a scalable pose learning architecture for large numbers of object classes and unconstrained background clutter. Subsequently, we formulate a new multi-view framework for selecting single-view pose hypotheses while accounting for the ambiguity caused by object symmetry. In the future, an intriguing direction is to embed the multi-view procedure into the training process to jointly optimize both single-view and multi-view performance. Also, the multi-view algorithm could be improved to maintain a fixed number of “good” hypotheses that are incrementally updated as new frames arrive.