1 Introduction

The problem of tracking CAD models in images is frequently encountered in contexts such as robotics, augmented reality (AR) and medical procedures. Usually, tracking has to be carried out in the full 6D pose, i.e. one seeks to retrieve both the 3D metric translation as well as the 3D rotation of the object in each frame. Another typical scenario is pose refinement, where an object detector provides a rough 6D pose estimate, which has to be corrected in order to provide a better fit (Fig. 1). The usual difficulties that arise include viewpoint ambiguities, occlusions, illumination changes and differences in appearance between the model and the object in the scene. Furthermore, for tracking applications the method should also be fast enough to cover large inter-frame motions.

Most related work based on RGB data can be roughly divided into sparse and region-based methods. The former try to establish local correspondences between frames [23, 40] and work well for textured objects, whereas the latter exploit more holistic information about the object such as shape, contour or color [8, 27, 37, 38] and are usually better suited for texture-less objects. It is worth mentioning that mixtures of the two families of methods have been proposed as well [6, 24, 30, 31]. Recently, methods that use only depth [34] or both modalities [10, 18, 21] have shown that depth can make tracking more robust by providing more clues about occlusion and scale.

Fig. 1.

Exemplary illustration of our method. While (a) depicts an input RGB frame, (b) shows our four initial 6D pose hypotheses. For each incoming frame we refine each pose towards a better fit to the scene. In (c) we show the final results after convergence. Note the rough pose initializations as well as the varying amount of occlusion the objects of interest undergo. (Color figure online)

This work aims to explore how RGB information alone can be sufficient to perform visual tasks such as 3D tracking and 6-Degree-of-Freedom (6DoF) pose refinement by means of a Convolutional Neural Network (CNN). While this has already been proposed for camera pose and motion estimation [19, 39, 41, 43], it has not been well-studied for the problem at hand.

As a major contribution we provide a differentiable formulation of a new visual loss that aligns object contours and implicitly optimizes for metric translation and rotation. While our optimization is inspired by region-based approaches, we can track objects of any texture or shape since we do not need to model global [18, 27, 37] or local appearance [11, 38]. Instead, we show that we can do away with these hand-crafted approaches by letting the network learn the object appearance implicitly. We teach the CNN to align contours between synthetic object renderings and scene images under changing illumination and occlusions and show that our approach can deal with a variety of shapes and textures. Additionally, our method allows us to deal with geometrical symmetries and visual ambiguities without manual tweaking and is able to recover correct poses from very rough initializations.

Notably, our formulation is parameter-free and avoids typical pitfalls of hand-crafted tracking or refinement methods (e.g. via segmentation or correspondences + RANSAC) that require tedious tuning to work well in practice. Furthermore, like depth-based approaches such as ICP, we are robust to occlusion and produce results that come close to those of RGB-D methods without the need for depth data, thus making our approach well-suited to the domains of AR, medical applications and robotics.

2 Related Work

Since the field of tracking and pose refinement is vast, we will only focus here on works that deal with CAD models in RGB data. Early methods in this field used either 2D-3D correspondences [29, 30] or 3D edges [9, 32, 35] and fit the model in an ICP fashion with iterative, projective update steps. Successive methods in this direction managed to obtain improved performance [6, 31]. Additionally, other works focused on tracking the contour densely via level-sets [3, 8].

Based on these works, [27] presented a new approach that follows the projected model contours to estimate the 6D pose update. In a follow-up work [26], the authors extended their method to simultaneously track and reconstruct a 3D object on a mobile phone in real-time. The authors of [37] improved the convergence behavior with a new optimization scheme and presented a real-time implementation on a GPU. Subsequently, [38] showed how to improve the color segmentation by using local color histograms over time. Orthogonally, the work in [18] approximates the model pose space to avoid GPU computations and enables real-time performance on a single CPU core. All these approaches share the property that they rely on hand-crafted segmentation methods that can fail in the case of sudden appearance changes or occlusion. We instead want to avoid hand-crafted appearance descriptions entirely.

Another set of works tries to combine learning with simultaneous detection and pose estimation in RGB. The method presented in [17] couples the SSD paradigm [22] with pose estimation to produce 6D pose pools per instance, which are then refined with edge-based ICP. In contrast, the approach from [5] uses auto-context Random Forests to regress object coordinates in the scene that are used to estimate poses. In [28] a method is presented that instead regresses the projected 3D bounding box and recovers the pose from these 2D-3D correspondences, whereas the authors in [25] infer keypoint heatmaps that are then used for 6D pose computation. Similarly, the 3D Interpreter Network [42] infers heatmaps for categories and regresses projection and deformation to align synthetic with real imagery. In [10], a deep learning approach is used to track models in RGB-D data. Their work follows similar lines, but we differ in multiple ways, including data generation, energy formulation and the use of RGB-D data. In particular, we show that a naive formulation of pose regression does not work in the case of symmetry, which is common for man-made objects.

We also find common ground with Spatial Transformer Networks in 2D [16] and especially 3D [2], where the employed network architecture contains a submodule to transform the 2D/3D input via a regressed affine transformation on a discrete lattice. Our network instead regresses a rigid body motion on a set of continuous 3D points to minimize the visual error.

3 Methodology

In this section we explain our approach to train a CNN to regress a 6D pose refinement from RGB information alone. We design the problem in such a way that we supply two color patches (\(\mathcal {S}\) and \(\mathcal {H}\)) to the network in order to infer a translational and rotational update. In Fig. 2 we depict our pipeline and show a typical scenario where we have a 6D hypothesis (coming from a detector or tracker) that is not correctly aligned. We want to estimate a refinement such that eventually the updated hypothesis overlaps perfectly with the real object.

Fig. 2.

Schematic overview of the full pipeline. Given input image and pose hypothesis (R, t), we render the object, compute the center of the bounding box of the hypothesis (green point) and then cut out a scene patch \(\mathcal {S}\) and a render patch \(\mathcal {H}\). We resize both to \(224\times 224\) and feed them separately into pre-trained InceptionV4 layers to extract low-level features. Thereafter, we concatenate and compute high-level features before diverging into separate branches. Eventually, we retrieve our pose update as 3D translation and normalized 4D quaternion. (Color figure online)

3.1 Input Patch Sampling

We first want to discuss our patch extraction strategy. Provided a CAD model and a 6D pose estimate (R, t) in camera space, we create a rendering and compute the center of the associated bounding box of the hypothesis, around which we subsequently extract \(\mathcal {S}\) and \(\mathcal {H}\). Since different objects have varying sizes and shapes, it is important to adapt the cropping size to the spatial properties of the specific object. The most straightforward method would be to simply crop \(\mathcal {S}\) and \(\mathcal {H}\) with respect to a tight 2D bounding box of the rendered mask. However, when employing such crops, the network loses the ability to robustly predict an update along the Z-axis: since each crop would almost entirely fill out the input patch, no estimate of the difference in depth can be drawn. We therefore explicitly calculate the spatial extent in pixels at a minimum metric distance (with some added padding) and use this as a fixed-size ‘window’ into our scene. In particular, prior to training, we render the object from various viewpoints, compute their bounding boxes, and take the maximum width or height of all produced bounding boxes.
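A minimal sketch of this pre-computation, assuming a hypothetical render_mask_fn helper that rasterizes a binary object mask for a given pose; the viewpoint sampling and padding factor are illustrative, not taken from the paper:

```python
import numpy as np

def crop_window_size(render_mask_fn, viewpoints, z_min, padding=1.2):
    """Pre-compute a fixed pixel 'window' size for one object.

    render_mask_fn(R, t) -> binary mask (H x W) of the rendered object (assumed helper),
    viewpoints           -> list of rotation matrices covering the view sphere,
    z_min                -> minimum metric distance at which the object is rendered.
    """
    max_extent = 0
    for R in viewpoints:
        mask = render_mask_fn(R, np.array([0.0, 0.0, z_min]))
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue
        w = xs.max() - xs.min() + 1
        h = ys.max() - ys.min() + 1
        max_extent = max(max_extent, w, h)
    # Pad a little so the object never completely fills the patch.
    return int(np.ceil(max_extent * padding))
```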

3.2 Training Stage

To create training data we randomly sample a ground truth pose \((R^*, t^*)\) of the object in camera coordinates and render the object with that pose onto a random background to create a scene image. To learn pose refinement, we perturb the true pose to get a noisy version (R, t) and render a hypothesis image. Given those two images, we cut out patches \(\mathcal {S}\) and \(\mathcal {H}\) with the strategy mentioned above.
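As an illustration, one way to sample such a perturbed hypothesis; the perturbation ranges below are assumptions for the sketch, not the paper's exact training distribution:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def perturb_pose(R_gt, t_gt, diameter, max_angle_deg=45.0, max_trans_ratio=0.1):
    """Create a noisy hypothesis (R, t) from a ground-truth pose (R_gt, t_gt)."""
    # Random rotation axis and angle.
    axis = np.random.randn(3)
    axis /= np.linalg.norm(axis)
    angle = np.radians(np.random.uniform(0.0, max_angle_deg))
    R_noise = Rotation.from_rotvec(angle * axis).as_matrix()
    # Translation offset relative to the object diameter.
    t_noise = np.random.uniform(-1.0, 1.0, 3) * max_trans_ratio * diameter
    return R_noise @ R_gt, t_gt + t_noise
```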

The Naive Approach. Provided these patches, we now want to infer a correction \((R_\varDelta , t_\varDelta )\) of the perturbed pose (R, t) such that

$$\begin{aligned} R^* = R_\varDelta \cdot R, \quad t^* = t + t_\varDelta . \end{aligned}$$
(1)

Due to the difficulty of optimizing in SO(3) we parametrize via unit quaternions \(q^*, q, q_\varDelta \) to define a regression problem, i.e. similar to what [20] proposed for camera localization or [10] for model pose tracking:

$$\begin{aligned} \min _{q_\varDelta ,t_\varDelta } \big | \big | q^* - \frac{q_\varDelta }{||q_\varDelta ||} \big | \big | + \gamma \cdot \big | \big | t^* - t_\varDelta \big | \big | \end{aligned}$$
(2)

In essence, this energy weighs the numerical error in rotation against the one in translation by means of the hyper-parameter \(\gamma \) and can be optimized correctly when solutions are unique (as is the case, e.g., for camera pose regression). Unfortunately, the above formulation only works for injective relations where an input image pair is always mapped to the same transformation. In the case of one-to-many mappings, i.e. when an image pair can have multiple correct solutions, the optimization does not converge since it is pulled into multiple directions and regresses the average instead. In the context of our task, visual ambiguity is common for most man-made objects because they are either symmetric or share the same appearance from multiple viewpoints. For these objects there is a large (sometimes infinite) set of refinement solutions that yield the same visual result. In order to regress \(q_\varDelta \) and \(t_\varDelta \) under ambiguity, we therefore propose an alternative formulation.
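For reference, Eq. (2) written out as a NumPy sketch (quaternions as \((w, x, y, z)\), computed per training sample):

```python
import numpy as np

def naive_pose_loss(q_pred, t_pred, q_gt, t_gt, gamma=1.0):
    """Naive regression loss of Eq. (2): distance between unit quaternions plus a
    gamma-weighted translation error. For symmetric objects several q_gt are equally
    valid, so this loss pulls the prediction towards their (wrong) average."""
    q_unit = q_pred / np.linalg.norm(q_pred)
    return np.linalg.norm(q_gt - q_unit) + gamma * np.linalg.norm(t_gt - t_pred)
```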

Proxy Loss for Visual Alignment. Instead of explicitly minimizing an ambiguous error in transformation, we strive to minimize an unambiguous error that measures similarity in appearance. We thus treat our search for the pose refinement parameters as a subproblem inside another proxy loss that optimizes for visual alignment. While there are multiple ways to define a similarity measure, we seek one that fulfills the following properties: (1) invariant to symmetric or indistinguishable object views, (2) robust to color deviation, illumination change and occlusion as well as (3) smooth and differentiable with respect to the pose.

To fulfill the first two properties we propose to align the object contours. Tracking the 6D pose of objects via projective contours has been presented before [18, 27, 37] but, to the best of our knowledge, has so far not been introduced into a deep learning framework. Contour tracking allows us to reduce the difficult problem of 3D geometric alignment to the simpler task of 2D silhouette matching by moving through a distance transform, avoiding explicit correspondence search. Furthermore, a physical contour is not affected by deviations in coloring or lighting, which makes it even more appealing for pure RGB methods. We refer to Fig. 3 for a training example and the visualization of the contours we align.

Fig. 3.

Visualization of our training procedure. In (a) and (b) we show the two image patches that constitute one training sample and the input to our network. We highlight for the reader the contours for which we seek the projective alignment from white to red. In (c) we see the initial state of training with no refinement together with the distance transform of the scene \(\mathcal {D}_\mathcal {S}\) and the projection of 3D sample points \(V_\mathcal {H}\) from the initial 6D hypothesis. Finally, in (d) we can see the refinement after convergence. (Color figure online)

Fulfilling smoothness and differentiability is more difficult. An optimization step for this energy requires rendering the object with the current pose hypothesis for contour extraction, estimating the similarity with the target contour and back-propagating the error gradient such that the refined hypothesis’ projected contour is closer in the next iteration. Unfortunately, back-propagating through a rendering pipeline is non-trivial (due to, among others, z-buffering and rasterization). We therefore propose a novel formulation to drive the network optimization successfully through the ambiguous 6D solution space. We employ an idea, introduced in [18], that allows us to use an approximate contour for optimization without iterative rendering. When creating a training sample, we use the depth map of the rendering to compute a 3D point cloud in camera space and sample a sparse point set on the contour, denoted as \(V := \{v \in \mathbb {R}^3 \}\). The idea is then to transform these contour points with the current refinement estimate \((q_\varDelta ,t_\varDelta )\), followed by a projection into the scene. This mimics a rendering plus contour extraction at no cost and allows for back-propagation.
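A possible way to extract such a sparse contour set \(V\) from the rendered depth map; a sketch assuming OpenCV ≥ 4 and a standard pinhole intrinsic matrix K, not part of any released code:

```python
import numpy as np
import cv2

def sample_contour_points(depth, K, n_points=100):
    """Back-project a sparse set of 3D contour points V from a rendered depth map.
    depth: H x W rendering, 0 where no object; K: 3x3 camera intrinsics."""
    mask = (depth > 0).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    pixels = np.vstack([c.reshape(-1, 2) for c in contours])          # (x, y) pairs
    idx = np.random.choice(len(pixels), min(n_points, len(pixels)), replace=False)
    pts = []
    for x, y in pixels[idx]:
        z = depth[y, x]
        # Back-project the contour pixel into camera space.
        pts.append([(x - K[0, 2]) * z / K[0, 0], (y - K[1, 2]) * z / K[1, 1], z])
    return np.asarray(pts)                                            # V, in camera space
```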

For a given training sample with input patch pair \((\mathcal {S}, \mathcal {H})\), a distance transform of the scene contour \(\mathcal {D}_\mathcal {S}\) and hypothesis contour points \(V_\mathcal {H}\), we define the loss

$$\begin{aligned} \mathcal {L}(q_\varDelta , t_\varDelta , \mathcal {D}_\mathcal {S}, V_\mathcal {H}) := \sum _{v \in V_\mathcal {H}} \mathcal {D}_\mathcal {S} \bigg [\pi \big ( q_\varDelta \cdot v \cdot q^{-1}_\varDelta + t_\varDelta \big ) \bigg ] \end{aligned}$$
(3)

with \(q^{-1}_\varDelta \) being the conjugate quaternion. With the formulation above we also free ourselves from any \(\gamma \)-balancing issue between quaternion and translation magnitudes as in a standard regression formulation.

Minimizing the above loss with a gradient descent step forces a step towards the 0-level set of the distance transform. We essentially tune the network weights to rotate and translate the object in 6D so as to maximize the projected contour overlap. While this works well in practice, we have observed that for certain objects and stronger pose perturbations the optimization can get stuck in local minima. This occurs when our loss drives the contour points into a configuration where the distance transform allows them to settle in local valleys. To remedy this problem we introduce a bi-directional loss formulation that simultaneously aligns the contours of the hypothesis and the scene onto each other, coupled and constrained by the same pose update. We thus add a term that runs in the opposite direction:

$$\begin{aligned} \mathcal {L} := \mathcal {L}(q_\varDelta , t_\varDelta , \mathcal {D}_\mathcal {S}, V_\mathcal {H}) + \mathcal {L}(q^{-1}_\varDelta , -t_\varDelta , \mathcal {D}_\mathcal {H}, V_\mathcal {S}). \end{aligned}$$
(4)

This final loss \(\mathcal {L}\) not only alleviates the locality problem but has also been shown to lead to faster training overall. We therefore chose this energy for all experiments.
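For clarity, Eqs. (3) and (4) as a NumPy sketch. The nearest-neighbor lookup into the distance transform is a simplification of this sketch; for training, the lookup must be differentiable (e.g. bilinear) so that gradients flow back to \((q_\varDelta, t_\varDelta)\). Quaternions are \((w, x, y, z)\) and K denotes the camera intrinsics:

```python
import numpy as np

def rotate(q, v):
    """Rotate 3D points v (N x 3) by the unit quaternion q, i.e. q * v * q^-1."""
    w, x, y, z = q
    R = np.array([[1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
                  [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
                  [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])
    return v @ R.T

def contour_loss(q, t, D, V, K):
    """Eq. (3): sum of distance-transform values at the projections of the
    transformed contour points (nearest-neighbor lookup in this sketch)."""
    p = rotate(q, V) + t                           # q * v * q^-1 + t
    u = K[0, 0] * p[:, 0] / p[:, 2] + K[0, 2]      # pinhole projection pi(.)
    v_ = K[1, 1] * p[:, 1] / p[:, 2] + K[1, 2]
    ui = np.clip(np.round(u).astype(int), 0, D.shape[1] - 1)
    vi = np.clip(np.round(v_).astype(int), 0, D.shape[0] - 1)
    return D[vi, ui].sum()

def bidirectional_loss(q, t, D_S, V_H, D_H, V_S, K):
    """Eq. (4): align hypothesis onto scene and scene onto hypothesis, coupled
    through the same (conjugated / negated) pose update."""
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])   # conjugate of a unit quaternion
    return contour_loss(q, t, D_S, V_H, K) + contour_loss(q_conj, -t, D_H, V_S, K)
```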

3.3 Network Design and Implementation

We give a schematic overview of our network structure in Fig. 2 and provide here more details. In order to ensure fast inference, our network follows a fully-convolutional design. The network is fed with two \(224\times 224\times 3\) input patches representing the cropped scene image \(\mathcal {S}\) and cropped render image \(\mathcal {H}\). Both patches run in separate paths through the first levels of an InceptionV4 [33] instance to extract low-level features. Thereafter we concatenate the two feature tensors, down-sample by employing max-pooling as well as a strided \(3\times 3\) convolution, and concatenate the results again. After two Inception-A blocks we branch off into two separate paths for the regression of rotation and translation. In each we employ two more Inception-A blocks before down-sampling by another strided \(3\times 3\) convolution. The resulting tensors are then convolved with either a \(6\times 6\times 4\) kernel to regress a 4D quaternion or a \(6\times 6\times 3\) kernel to predict a 3D update translation vector.

Initial experiments showed clearly that training the network from scratch made it impossible to bridge the domain gap between synthetic and real images. Similarly to [13, 17] we found that the network focused on specific appearance details of the rendered CAD models and the performance on real imagery collapsed drastically. Synthetic images usually possess very sharp edges and clear corners. Since the first layers learn low-level features they overfit quickly to this perfect rendered world during training. We therefore copied the first five convolutional blocks from a pre-trained model and froze their parameters. We show the improvements in terms of generalization to real data in the supplement.

Further, we initialize the final regression layers such that the bias equals the identity quaternion and zero translation, whereas the weights are given small Gaussian noise with \(\sigma =0.001\). This ensures that we start refinement from a neutral pose, which is crucial for the evaluation of the projective visual loss.
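A minimal sketch of this initialization; the layer shapes and framework-specific wiring are omitted and only the bias/weight values follow the description above:

```python
import numpy as np

def init_regression_head(weight_shape, is_rotation_branch):
    """Initialize a final regression kernel with small Gaussian noise and a bias equal
    to the identity quaternion (rotation branch) or zero (translation branch), so the
    very first predicted update corresponds to the neutral pose."""
    weights = np.random.normal(0.0, 0.001, size=weight_shape)
    bias = np.array([1.0, 0.0, 0.0, 0.0]) if is_rotation_branch else np.zeros(3)
    return weights, bias
```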

While our approach produces very good refinements in a single shot, we also implemented an iterative version where we run the pose refinement multiple times until the regressed update falls below a threshold.
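A schematic version of this loop. In this sketch, 'net' maps the two patches to \((q_\varDelta, t_\varDelta)\) and 'render_and_crop' produces the patches of Sect. 3.1; both callables are assumed to be supplied by the caller, and the stopping thresholds mirror those reported in Sect. 4.3:

```python
import numpy as np

def refine_iteratively(net, render_and_crop, R, t, max_iter=5,
                       rot_eps_deg=1.5, trans_eps_m=0.0075):
    """Apply the regressed 6D update repeatedly until it becomes negligible."""
    for _ in range(max_iter):
        S, H = render_and_crop(R, t)                   # patches as in Sect. 3.1
        q, t_delta = net(S, H)
        q = q / np.linalg.norm(q)                      # normalize the 4D output
        w, x, y, z = q
        R_delta = np.array([                           # quaternion -> rotation matrix
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])
        R, t = R_delta @ R, t + t_delta                # compose as in Eq. (1)
        angle = np.degrees(2.0 * np.arccos(np.clip(abs(w), 0.0, 1.0)))
        if angle < rot_eps_deg and np.linalg.norm(t_delta) < trans_eps_m:
            break
    return R, t
```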

4 Evaluation

We ran our method with TensorFlow 1.4 [1] on an i7-5820K@3.3GHz with an NVIDIA GTX 1080. For all experiments we trained for 100k iterations with a batch size of 16 and ADAM with a learning rate of \(3 \cdot 10^{-4}\). Furthermore, we fixed the number of 3D contour points per view to \(|V_\mathcal {S}| = |V_\mathcal {H}| = 100\). Our method is real-time capable since one iteration requires approximately 25 ms during testing.

To evaluate our method, we carried out experiments on three datasets, both synthetic and real, and will show that our method can come close to RGB-D based approaches. The first dataset, referred to as ‘Hinterstoisser’, was introduced in [12] and consists of 15 sequences, each possessing approximately 1000 images with clutter and mild occlusion. Only 13 of these provide water-tight CAD models and we therefore, like others before us, skip the other two sequences. The second one, which we refer to as ‘Tejani’, was proposed in [36] and consists of six mostly semi-symmetric, textured objects, each undergoing different levels of occlusion. In contrast to these two real datasets, the third, referred to as ‘Choi’ [7], consists of four synthetic tracking sequences.

In essence, we first conduct a self-evaluation in which we illustrate our convergence properties with respect to different degrees of pose perturbation on real data. Then we apply our method to object tracking on ‘Choi’. As a second application, we compare our approach to a variety of other state-of-the-art RGB and RGB-D methods by conducting pose refinement experiments on ‘Hinterstoisser’, the ‘Occlusion’ dataset and ‘Tejani’. Finally, we depict some failure cases and conclude with a qualitative category-level experiment.

4.1 Pose Perturbation

We study the convergence behavior of our method by taking correct poses, applying a perturbation of a certain magnitude and measuring how well we can refine back to the original pose. To this end, we use the ‘Hinterstoisser’ dataset since it provides a lot of variety in terms of both colors and shapes. For each frame of a particular sequence we perturb the ground truth pose either by an angle or by a translation vector. In Fig. 4 we illustrate our results for the ‘ape’ and the ‘bvise’ objects and kindly refer the reader to the supplement for all graphs. In particular, we report our results for increasing angular perturbations from 5\(^{\circ }\) to 45\(^{\circ }\) and for increasing translation perturbations from 0 to 1 relative to the object’s diameter. We define divergence as a refined rotation error above 45\(^{\circ }\) or a refined translation error larger than half of the object’s diameter, and we employ 10 iterative steps to maximize precision.
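The divergence criterion, written out as a sketch with the rotation error measured as the geodesic angle between refined and ground-truth rotation matrices:

```python
import numpy as np

def diverged(R_est, t_est, R_gt, t_gt, diameter):
    """Divergence as defined in Sect. 4.1: rotation error above 45 degrees or
    translation error above half of the object diameter."""
    cos_angle = (np.trace(R_gt @ R_est.T) - 1.0) / 2.0
    angle_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return angle_deg > 45.0 or np.linalg.norm(t_gt - t_est) > 0.5 * diameter
```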

Fig. 4.

Top: Perturbation results for two objects from [12] for increasing rotation and translation levels. Bottom: Qualitative results from the same experiment.

In general, our method can recover poses very robustly even under strong perturbations. Even for the extreme case of rotating the ‘bvise’ by 45\(^{\circ }\), we can refine back to an error of less than 5\(^{\circ }\) in more than 60% of all trials, and to an error of less than 10\(^{\circ }\) in more than 80% of all runs. Additionally, our approach diverged in less than 1% of the runs. However, for the more difficult ‘ape’ object our numbers worsen. In particular, in almost 50% of the cases we were not able to rotate the object back to an error of less than 10\(^{\circ }\). Yet, this can be easily explained by the object’s appearance. The ‘ape’ is a rather small object with poor texture and non-distinctive shape, which does not provide enough information to latch onto, whereas the ‘bvise’ is large and rich in appearance. It is noteworthy that the actual divergence behavior in rotation is similar for both and that the visual alignment for the ‘ape’ is often very good despite the error in pose.

The translation error correlates almost linearly between initial and final pose. We also observe an interesting tendency starting at perturbation levels of around 0.6, after which the results divide into two distinct sets: either the pose diverges or the error settles at a certain level. This implies that certain viewpoints are easy to align as long as they have some visual overlap to begin with, rather independently of how strongly we perturb. Other views are more difficult under higher perturbations and diverge from some point on.

Fig. 5.

Left: Translation (mm) and rotation (degrees) errors on Choi for PCL’s ICP, Choi and Christensen (C&C) [7], Krull [21], Tan [34], Kehl [18], Tjaden [37] and our method. Right: Comparing [37] (left) to us (right) using only RGB.

4.2 Tracking

As a first use case we evaluated our method as a tracker on the ‘Choi’ benchmark [7]. This RGB-D dataset consists of four synthetic sequences and we present detailed numbers in Fig. 5. Note that all other methods utilize depth information. We chose this dataset because it is very hard for RGB-only methods: it is poor in terms of color and the objects are of (semi-)symmetric nature. To provide an interesting comparison, we also qualitatively evaluated against our tracker implementation of [37]. While their method is usually robust for texture-less objects, it diverges on three sequences, which we show and for which we provide reasoning in Fig. 5 and in the supplementary material. In essence, except for the ‘Milk’ sequence we can report very good results. The reason we perform comparably poorly on ‘Milk’ is that our method treats it as a rather symmetric object. Thus, it sometimes rotates the object along its Y-axis, which has a negative impact on the overall numbers. In particular, while already misaligned, the method still tries to fit the object completely into the scene and thus rotates and translates it slightly further. Regarding the remaining objects, we easily outperform PCL’s ICP for all objects and also Choi and Christensen [7] in most cases. Compared to Krull [21], which is a learned RGB-D approach, we perform better for some values and worse for others. Note that our translation error along the Z-axis is quite high. Since the difference in pixels is almost nonexistent when the object is moved only a few millimeters, it is almost impossible to estimate the exact distance of the object without leveraging depth information. This has also been discussed in [15] and is especially true for CNNs due to pooling operations.

Table 1. VSS scores for each sequence of [12] with poses initialized from SSD-6D [17]. The first three rows are provided by [17]. We evidently outperform 2D-based ICP by a large margin and are on par with 3D-based ICP.
Table 2. Refinement scores with poses initialized from SSD-6D [17]. Left: Average ADD scores on ‘Hinterstoisser’ [12] (top) and ‘Occlusion’ [4] (bottom). Right: VSS scores on ‘Tejani’. We compare our visual loss to naive pose regression as well as two state-of-the-art trackers for the case of RGB [37] and RGB-D [18].

4.3 Detection Refinement

This set of experiments analyzes our performance in a detection scenario where an object detector provides rough 6D poses and the goal is to refine them. We use the results from SSD-6D [17], an RGB-based detection method that outputs 2D detections with a pool of 6D pose estimates each. The authors publicly provide their trained networks and we use them to detect and create 6D pose estimates which we feed into our system. Tables 1 and 2 (a) and (b) depict our results for the ‘Hinterstoisser’, ‘Occlusion’ and ‘Tejani’ datasets using different metrics. We ran at most 5 iterations of our method and stopped early if the last update was smaller than 1.5\(^{\circ }\) and 7.5 mm. Since our method is particularly strong at recovering from bad initializations, we employ the same RGB-verification strategy as SSD-6D. However, we apply it before conducting the refinement since, in contrast to them, we can also deal with imperfect initializations as long as they are not completely misaligned. We report our errors with the VSS metric (which is VSD from [14] with \(\tau =\infty \)) that calculates a visual 2D error as the pixel-wise overlap between the renderings of the ground truth pose and the estimated pose. Furthermore, to compare better to related work, we also use the ADD score [12], which measures a 3D metric error as the average point cloud deviation between the real pose and the inferred pose when transformed into the scene. A pose is counted as correct if the deviation is less than one tenth of the object diameter.
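Both metrics as NumPy sketches. Our reading is that with \(\tau =\infty \) VSS reduces to the intersection-over-union of the two rendered masks; 'pts' denotes the model point cloud:

```python
import numpy as np

def add_correct(pts, R_gt, t_gt, R_est, t_est, diameter):
    """ADD [12]: mean 3D distance between the model points under the ground-truth
    and the estimated pose; counted as correct if below diameter / 10."""
    dist = np.linalg.norm((pts @ R_gt.T + t_gt) - (pts @ R_est.T + t_est), axis=1)
    return dist.mean() < 0.1 * diameter

def vss(mask_gt, mask_est):
    """VSS: pixel-wise overlap (IoU) of the renderings of both poses."""
    inter = np.logical_and(mask_gt, mask_est).sum()
    union = np.logical_or(mask_gt, mask_est).sum()
    return inter / union if union > 0 else 1.0
```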

Fig. 6.

Comparison on Tejani between (from left to right) our visual loss, mean squared error loss, the RGB-D tracker from [18] and the RGB tracker from [37].

Referring to ‘Hinterstoisser’ with the VSS metric, we can strongly improve the state of the art for most objects. In particular, for the case of RGB only, we report an average VSS score of 83%, which is an impressive improvement and thus successfully bridges the gap between RGB and RGB-D in terms of pose accuracy.

Except for the ‘cam’ and the ‘cat’ objects, our results are on par with or even better than SSD-6D + 3D refinement. ICP relies on good correspondences and robust outlier removal, which in turn require very careful parameter tuning. Furthermore, ICP is often unstable for rougher initializations. In contrast, our method learns refinement end-to-end and is more robust since it adapts to the specific properties of the object during training. However, due to this, our method requires meshes of good quality. Hence, similar to SSD-6D, we have particular problems with the ‘cam’ object, since its model appearance strongly differs from the real images, which exacerbates training. Also note that their 3D refinement strategy uses ICP for each pose in the pool, followed by a verification over depth normals to decide on the best pose. Our method instead uses a simple check over image gradients to pick the best.

With respect to the ADD metric we fall slightly behind the other state-of-the-art RGB methods [5, 28]. We obtained the 3D-ICP refined poses from the SSD-6D authors and analyzed the errors in more detail in Table 2(a). We see again that we have larger errors along the Z-axis, but smaller errors along X and Y. Unfortunately, the ADD metric penalizes this deviation overly strongly. Interestingly, [5, 28] have better scores and we attribute this to two facts. The datasets are annotated via ICP with 3D models against depth data. Unfortunately, inaccurate intrinsics and the sensor registration error between RGB and D lead to an inherent mismatch, where the ICP 6D pose does not always align perfectly in RGB. Purely synthetic RGB methods like ours or [17] suffer from (1) a domain gap in terms of texture/shape and (2) the dilemma that better RGB performance can worsen results when comparing to that ‘true’ ICP pose. We suspect that [5, 28] can learn this registration error implicitly since they train on real RGB cut-outs with associated ICP pose information and thus avoid both problems. We often observe that our visually perfect alignments in RGB fail the ADD criterion and we show examples in the supplement. Since our loss actually optimizes a form of VSS to maximize contour overlap, we can expect the ADD scores to go up only when perfect alignment in color equates to perfect alignment in depth.

Fig. 7.

Qualitative category-level experiment where we train our network on a specific set of mugs and bowls and track hitherto unseen models. The first frame depicts a very rough initialization, while the following frames show intermediate refined poses throughout the sequence. The supplement shows the full video.

Finally, referring to the ‘Occlusion’ dataset, we can report a strong improvement compared to the original numbers from SSD-6D, despite the presence of strong occlusion. In particular, while the rotational error decreased by approximately \(8^{\circ }\), the translational error dropped by 4 mm along the ‘X’ and ‘Y’ axes and by 28 mm along ‘Z’. Thus, we can increase ADD from 6.2% up to 28.5%, which demonstrates that we can deal with strong occlusion in the scene.

For ‘Tejani’ we show the improvement over networks trained with a standard regression loss (MSE). Additionally, we re-implemented the RGB tracker from [37] and were kindly provided with numbers from the authors of the RGB-D tracker from [18] (see Fig. 6). Since the dataset mostly consists of objects with geometric symmetry, we do not measure absolute pose errors here but instead report our numbers with the VSS metric. The MSE-trained networks consistently underperform since the dataset models are of symmetric nature, which leads to a large difference of 14% in comparison to our visual loss. This result stresses the importance of handling symmetry correctly during training. The RGB tracker was not able to refine well because the color segmentation was corrupted by either occlusions or imperfect initialization. The RGB-D tracker, which builds on the same idea, performed better because it uses the additional depth channel for segmentation and optimization.

4.4 Category-Level Tracking

We were curious to find out whether our approach can generalize beyond a specific CAD model, given that many objects from the same category share similar appearance and shape properties. To this end, we conducted a final qualitative experiment (see Fig. 7) in which we collected a total of eight CAD models of cups, mugs and a bowl and trained simultaneously on all of them. During testing we then used this network to track new, unseen models from the same category. We were surprised to see that the approach has indeed learned to metrically track previously unseen but nonetheless similar structures. While the poses are not as accurate as in the single-instance case, it seems that one can indeed learn the projective relation of structure and how it changes under 6D motion, provided at least that the projection function (i.e. the camera intrinsics) stays constant. We show the full sequence in the supplementary material.

Fig. 8.

Two prominent failure cases: Occlusion (left pair) and objects of very similar colors and shapes (right pair) can negatively influence the regression.

4.5 Failure Cases

Figure 8 illustrates two known failure cases, where the left image of each pair represents the initialization and the right image the refined result. Although we train with occlusion, certain occurrences can nonetheless worsen our refinement. While two ‘milk’ instances were refined well despite occlusion, the left ‘milk’ instance could not be recovered correctly. The network assumes the object to end at the yellow pen and only maximizes the remaining pixel-wise overlap. Besides occlusion, objects of similar color and shape can in rare cases lead to confusion. As shown in the right pair, the network mistakenly assumed the stapler, instead of the cup, to be the real object of interest.

5 Conclusion

We have presented a new approach towards 6D model tracking in RGB with the help of deep learning and demonstrated its power on multiple datasets for the scenarios of pose refinement and instance/category-level tracking. Future work will include investigating generalization to other domains, e.g. the suitability for visual odometry.