
1 Introduction

If there is one topic that has been obsessively drawing the attention of the computer vision research community, it has to be object detection. Object detectors are at the heart of complex models able to interact with and understand our world. However, to enable true interaction we need not only a precise localization but also an accurate pose estimation of the object. That is, a bounding box alone does not help a robot grasp an object: the robot also needs a viewpoint estimate to infer the object's visual affordances.

Since 2006, in parallel with the enormous progress in object detection, different approaches have appeared which go further and propose to solve the 3D generic object localization and pose estimation problem (e.g. [1–17]). But in this ecosystem the fauna exhibits a high level of heterogeneity. Some approaches decouple the object localization and pose estimation tasks, while others do not. Nor is there consensus on whether to treat pose estimation as a discrete or a continuous problem. Different datasets, with different experimental setups and even different evaluation metrics, have been proposed along the way.

This paper aims to bring attention to this situation. We believe that to make progress, it is now time to consolidate the work, comparing the different models proposed and drawing some general conclusions. Therefore, in the spirit of the work of Hoiem et al. [18] on the diagnosis of object detectors, we here propose a thorough diagnosis of pose estimation errors.

Our work mainly provides a publicly available diagnostic tool to take full advantage of the results reported by state-of-the-art models on the PASCAL 3D+ dataset [19]. This can be considered our first contribution (Sect. 2). Specifically, our diagnosis first analyzes the influence of the main object characteristics (e.g. visibility of parts, size, aspect ratio) on the detection and pose estimation performance. We also provide a detailed study of the impact of the different types of false positive pose estimations. Our procedure considers up to five different evaluation metrics, which are carefully analyzed with the aim of identifying their pros and cons.

Our second contribution consists in offering a detailed diagnosis of four state-of-the-art models [4, 14, 19, 20] (Sect. 3).

We end the paper with our prescription for success. This is our last contribution and the topic of Sect. 4. There, we offer a comparative analysis of the different approaches, identifying their main weaknesses and suggesting directions for improvement. As an example, we even show how to use the information provided by our diagnostic tool to improve the results of two of the models.

Many studies have been proposed for the analysis of errors in object localization only [18, 21–23]. Only in [20] are some pose estimation error modes for the Viewpoints&Keypoints (V&K) model analyzed, but that analysis is restricted to the setting where localization and pose estimation are not evaluated simultaneously. We present here a more thorough comparison of four different approaches, in two different settings: pose estimation over the ground truth (GT) bounding boxes (BBs), and simultaneous detection and pose estimation.

Overall, our main objective with this work is to provide new insights into the problem of object detection and pose estimation, helping other researchers in the hard task of developing more precise solutions.

2 Diagnostic Tool

2.1 Dataset and Object Detection and Pose Estimation Models

Different datasets for the evaluation of simultaneous object localization and pose estimation have been proposed, e.g. [2, 11–13, 24]. Although most of them have been widely used during the last decade, one can rapidly identify their most important limitation: objects do not appear in the wild. Other important issues are: (a) background clutter is often limited, and therefore methods trained on these datasets cannot generalize well to real-world scenarios; (b) some of these datasets do not include occluded or truncated objects; (c) finally, only a few object classes are annotated, and the number of object instances and annotated viewpoints is small too.

To overcome these limitations, the PASCAL 3D+ dataset [19] has been proposed. It is a challenging dataset for 3D object detection, which augments 11 rigid categories of PASCAL VOC 2012 [25] with 3D annotations. Furthermore, more images are added for each category from ImageNet [26], attaining on average more than 3,000 object instances per class. Analyzing the viewpoint annotation distribution for the training and test sets, shown in Fig. 1, it can be observed that the dataset covers all the viewpoints, although the annotation seems to be biased towards frontal poses. Since its release in 2014, PASCAL 3D+ has been widely adopted by the research community (e.g. [1, 5, 14, 16, 20]), and it is rapidly becoming the de facto benchmark for the experimental validation of object detection and pose estimation methods.

Fig. 1. Viewpoint distribution (in terms of azimuth). F: frontal. F-L: frontal-left. L: left. L-RE: left-rear. RE: rear. RE-R: rear-right. R: right. R-F: right-frontal.

We apply our diagnostic tool to four different approaches [4, 14, 19, 20]. All these models either provide the code or have officially submitted their results to PASCAL 3D+. These solutions have been selected not only because they define the state of the art on PASCAL 3D+, like V&K [20], but also because they are representative of different approaches to the pose estimation problem, allowing a variety of interesting analyses: models based on hand-crafted features [4, 14, 19] vs. deep learning models [20]; Hough Forest (HF) voting models [14] vs. deformable part models (DPM) [4, 19] and template models [20].

We have two DPM-based approaches. VDPM [19] simply modifies DPM such that each mixture component represents a different viewpoint. DPM-VOC+VP [4] formulates learning as a structured labeling problem, where a viewpoint variable is associated with each mixture component of the DPM. We also include in the study the Boosted Hough Forest (BHF) model [14], a Hough voting approach able to perform simultaneous object detection and pose estimation. As is usual [27, 28], we incorporate a verification step using the Faster R-CNN model [29] trained on PASCAL VOC 2007, in order to re-score the detections of BHF and augment its recall. Finally, we diagnose V&K [20], a CNN-based architecture for the prediction of the viewpoint. For the object localization, V&K relies on the R-CNN [30] detector. This is the only method that does not perform simultaneous object detection and pose estimation.

2.2 Diagnosis Details and Evaluation Metrics

We offer a complete diagnosis which is split into two analyses. The first one focuses only on the viewpoint estimation performance, assuming the detections are given by the GT bounding boxes. In the second one, the performance for the simultaneous object detection and viewpoint estimation task is evaluated.

Our diagnostic tool analyzes the frequency and impact of the different types of false positives, and the influence of the main object characteristics on the performance. By analyzing the different types of false pose estimations of the methods, we can gather very useful information to improve them. Since it is difficult to characterize the error modes for generic rotations, we restrict our analysis to the predicted azimuth only. We discretize the azimuth angle into K bins, such that the bin centers have an equidistant spacing of \(\frac{2\pi }{K}\). We then define the following error modes. Opposite viewpoint errors measure the effect of flipped estimates (e.g. confusion between frontal and rear views of a car). Nearby viewpoint errors occur when nearby pose bins are confused, because they are highly correlated in terms of appearance. Finally, other rotation errors include the rest of the false positives.
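A minimal sketch of this azimuth discretization (in Python; function and variable names are illustrative, and we assume angles in radians with bin 0 centered at azimuth zero):

```python
import math

def azimuth_to_bin(az, K):
    """Map an azimuth angle (radians) to one of K viewpoint bins whose
    centers are spaced 2*pi/K apart, with bin 0 centered at azimuth 0."""
    width = 2.0 * math.pi / K
    return int(round(az / width)) % K
```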

With respect to the impact of the main object characteristics, we use the definitions provided in [18]. In particular, the following characteristics are considered in our study: occlusion/truncation, which indicates whether the object is occluded/truncated or not; object size and aspect ratio, which organize the objects into different sets depending on their size or aspect ratio; visible sides, which indicates whether the object is in a frontal, rear or side view position; and part visibility, which marks whether a 3D part is visible or not. For the object size, we measure the pixel area of the bounding box. We assign each object to a size category, depending on the object's percentile size within its object category: extra-small (XS: bottom 10\(\,\%\)); small (S: next 20\(\,\%\), i.e. up to the 30th percentile); large (L: up to the 80th percentile); extra-large (XL: the remaining top 20\(\,\%\)). Likewise, for the aspect ratio, objects are categorized into extra-tall (XT), tall (T), wide (W), and extra-wide (XW), using the same percentiles.
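A sketch of this size categorization, under the assumption (ours, for illustration) that the percentages above are cumulative percentile bounds:

```python
import numpy as np

def size_categories(areas):
    """Assign each object a size label from its percentile rank within
    its class. Assumed cumulative bounds: XS up to the 10th percentile,
    S up to the 30th, L up to the 80th, XL above."""
    areas = np.asarray(areas, dtype=float)
    ranks = np.argsort(np.argsort(areas))        # 0-based rank of each object
    pct = 100.0 * (ranks + 0.5) / len(areas)     # percentile rank
    bounds = [(10.0, "XS"), (30.0, "S"), (80.0, "L"), (100.0, "XL")]
    return [next(lab for b, lab in bounds if p <= b) for p in pct]
```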

Finally, we consider it essential to incorporate into the diagnostic tool an adequate evaluation metric for the problem of simultaneous object localization and pose estimation. Traditionally, these two tasks have been evaluated separately, which complicates a fair and meaningful comparison among competing methods. For instance, a method with a very low average precision (AP) in detection can still offer an excellent mean angle error in the task of viewpoint estimation. How can we then compare these models?

In order to overcome this problem, our diagnostic tool considers three metrics, all evaluating the pose estimation and object detection performance simultaneously. They all have an associated precision/recall (prec/rec) curve. First, for the problem of detection and discrete viewpoint estimation, we use the Pose Estimation Average Precision (PEAP) [3]. PEAP is obtained as the area under the corresponding prec/rec curve by numerical integration. In contrast to AP, for PEAP a candidate detection can only qualify as a true positive if it satisfies the PASCAL VOC [25] intersection-over-union criterion for the detection and provides the correct viewpoint class estimate. We also use the Average Viewpoint Precision (AVP) [19], which is similar to PEAP, with the exception that the recall of its associated prec/rec curve counts the true positives according to the detection criterion only. The third metric is the Average Orientation Similarity (AOS) [24], which weights the detection precision with a cosine similarity term based on the angular difference between the GT and the estimation. Finally, we report the mean angle error (MAE) and the median angle error (MedError) [2, 20], which do not consider the object localization performance.
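The following sketch summarizes how the three prec/rec constructions differ. It is a simplified illustration under stated assumptions: detections are already matched to the GT, duplicate suppression is omitted, and the cosine term follows the \((1+\cos \Delta \theta )/2\) form of [24]:

```python
import numpy as np

def pose_metric_curves(scores, det_tp, pose_tp, angle_err, n_gt):
    """det_tp[i]:  detection i passes the IoU criterion (bool)
    pose_tp[i]: detection i passes IoU AND the viewpoint criterion (bool)
    angle_err[i]: angular error in radians for matched detections."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    det_tp = np.asarray(det_tp, dtype=float)[order]
    pose_tp = np.asarray(pose_tp, dtype=float)[order]
    angle_err = np.asarray(angle_err, dtype=float)[order]
    rank = np.arange(1, len(order) + 1)

    # PEAP: pose-correct true positives drive both precision and recall.
    prec_peap = np.cumsum(pose_tp) / rank
    rec_peap = np.cumsum(pose_tp) / n_gt

    # AVP: same precision, but recall counts detection true positives only.
    rec_avp = np.cumsum(det_tp) / n_gt

    # AOS: detection precision weighted by orientation similarity.
    sim = det_tp * (1.0 + np.cos(angle_err)) / 2.0
    prec_aos = np.cumsum(sim) / rank

    return prec_peap, rec_peap, rec_avp, prec_aos
```

Each metric is then the area under its curve, obtained by numerical integration as for standard AP.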

3 Diagnosis

3.1 Pose Estimation Performance over the GT

We start the diagnosis of the different models by analyzing their performance at estimating the viewpoint of an object when the GT BBs are given. We run the models over the cropped images, using the detector scores to build the detection ranking. The main results are shown in Table 1, which clearly demonstrates that the deep learning method [20] performs significantly better than the models based on hand-crafted features [14, 19]. VDPM exhibits a performance, in terms of AOS and MedError, slightly better than BHF. However, BHF achieves a better MAE than VDPM. If we now compare these models using AVP or PEAP, VDPM is clearly superior. This fact reveals that VDPM is able to report a higher number of accurate pose estimations. These results also reveal that AVP and PEAP are more severe than AOS, penalizing pose estimation errors harder. This makes them more appropriate than AOS for establishing meaningful comparisons between different approaches. Interestingly, this conclusion is reinforced when a random pose assignment is used and evaluated in terms of AOS; see the RAND model in Table 1. This approach reports a very high AOS of 52.2, compared to 10.8 and 1.0 for AVP and PEAP, respectively.

Table 1. Pose estimation with GT. Viewpoint threshold is \(\frac{\pi }{12}\) for AVP and PEAP.

Types of false pose estimations. Figure 2 shows the frequency and impact on the performance of each type of false positive. For this figure, a pose estimation is considered: (a) correct if its error is \({<}15^\circ \); (b) opposite if its pose error is \({>}165^\circ \); (c) nearby if its pose error is \(\in [15^\circ , 30^\circ ]\); (d) other for the rest of situations. The message of this analysis is clear: errors with opposite viewpoints are not the main problem for any of the three models; the highest confusion is with other viewpoints. However, we show here that the DPM-based methods are more likely to exhibit opposite errors, as was also shown in [3]. Overall, the large visual similarity between opposite views for some classes, and the imbalance of the training set (see Fig. 1), have a negative impact on DPM-based models.
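A minimal sketch of this error-mode classification (angles in degrees; names are illustrative):

```python
def error_mode(pred_az, gt_az):
    """Classify one azimuth estimate into the error modes of Fig. 2:
    correct < 15, nearby in [15, 30], opposite > 165, other in between."""
    diff = abs(pred_az - gt_az) % 360.0
    diff = min(diff, 360.0 - diff)      # geodesic angular error in [0, 180]
    if diff < 15.0:
        return "correct"
    if diff <= 30.0:
        return "nearby"
    if diff > 165.0:
        return "opposite"
    return "other"
```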

The deep learning model, V&K [20], exhibits the highest confusion between nearby viewpoints. This error type is above \(25\,\%\) for V&K, while for the other methods [14, 19] it does not exceed \(15\,\%\). A priori, this may seem to be a good property of V&K, since its error distribution is concentrated on small values. However, these nearby errors are treated as false positives by the AVP and PEAP metrics, hence reducing the performance.

Fig. 2. Pie chart: percentage of errors that are due to confusions with opposite, nearby or other viewpoints, and correct estimations. Bar graphs: pose performance in terms of AOS (left) and AVP (right). The blue bar displays the overall AOS or AVP. The green bar displays the AOS or AVP improvement obtained by removing all confusions of one type: OTH (other errors), NEAR (nearby viewpoints), OPP (opposite viewpoints). The brown bar displays the AOS or AVP improvement obtained by correcting all estimations of one type: OTH, NEAR or OPP. (Color figure online)

If we now focus on the evaluation metrics, i.e. the bar graphs in Fig. 2, we observe that the nearby errors barely penalize the AOS metric: if we remove all the estimations of this type (see green bar), the performance actually decreases, and if we correct these errors (see brown bar), AOS does not significantly improve for any method. In contrast, the AVP metric always improves when any error type is removed or corrected.

Impact of object characteristics. Figure 3 provides a summary of the sensitivity to each characteristic and of the potential impact of improving pose estimation robustness. The worst-performing and best-performing combinations for each object characteristic are averaged over the 11 categories. The difference between the best and the worst performance indicates sensitivity; the difference between the best and the overall performance indicates the potential impact. Figure 3 shows that the three methods are very sensitive to the occlusion and truncation properties of the objects, but the impact is very small. The reduction in performance for VDPM and V&K is larger than for BHF, indicating that they perform worse for occluded objects than BHF. Remember that BHF is a part-based approach, an aspect that increases its robustness to occlusions and truncations.
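These two summary statistics are straightforward to compute. A sketch following the definitions of [18] (names are illustrative):

```python
def sensitivity_and_impact(subset_scores, overall_score):
    """subset_scores maps each subset of one characteristic to its AOS,
    e.g. {"occluded": 0.21, "non-occluded": 0.45}; overall_score is the
    overall AOS. Sensitivity = best - worst; potential impact = best - overall."""
    best = max(subset_scores.values())
    worst = min(subset_scores.values())
    return best - worst, best - overall_score
```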

All models show sensitivity to the object size. BHF is trained by cropping and rescaling all training objects to the same size. Therefore, this model is not very robust to changes in the size of the test objects, but it works well with (extra) large objects (see Fig. 4(b)). As described in [20], the effect of small and extra-small objects on V&K is very significant. The worst performance is exhibited by the VDPM model, which has difficulties with both (extra) small and (extra) large objects.

Fig. 3. Summary of sensitivity and impact of object characteristics. We show the AOS of the highest-performing and lowest-performing subsets within each characteristic (occ-trn: occlusion/truncation, size: object size, asp: aspect ratio, side: visible sides, part: part visibility). The dashed line is the overall AOS.

All models are sensitive to the aspect ratio of the objects. Since the mixture component concept of VDPM is closely related to the aspect ratio, this characteristic has a stronger negative effect on this approach. V&K and VDPM have difficulties with tall and extra-tall objects, while BHF does not (see Fig. 4(c)). VDPM works poorly for the wide and extra-wide categories, while BHF is the only one that improves its performance on these aspect ratios. Note that these aspect ratios are the most common in the training set (\(82\,\%\) of the training objects).

Part visibility exhibits a very high impact (roughly 0.102 for VDPM, 0.136 for BHF and 0.051 for V&K). Due to its learning process based on local object parts, BHF is the most sensitive to this object property. In general (see Fig. 5), we have observed that the parts that are most likely to be visible have a positive impact on the pose performance, and a negative effect when they are barely visible. But a high level of visibility does not imply that these parts are the most discriminative. For instance, in the sofa class, the seat bottom parts (p2 and p3) are the most visible, but the models are more sensitive to the back parts (p5 and p6). For car, the wheels (the first 4 parts) and the frontal lights (p9 and p10) are the least visible parts, but while the wheels seem not to affect the performance, the frontal lights do. There are some differences between the behaviors of the models. For aeroplanes, the wings (p2 and p5) are not important for VDPM and V&K, while they are for BHF. BHF and V&K vary in a similar way with the parts of the diningtable.

Fig. 4. Effect of object characteristics. Dashed lines are the overall AOS. (a) Effect of visible sides. fr: frontal, re: rear, side: side; '1': visible, '0': not visible. (b) Effect of object sizes. (c) Effect of aspect ratios.

Fig. 5. Effect of visible parts. '1': visible; '0': not visible.

One interpretation of our results is that the analyzed estimators do well on the most common modes of appearance (e.g. side or frontal views), but fail when a characteristic such as viewpoint is unusual. All models show a strong preference for the frontal views. The main problem for all the approaches seems to be how to achieve a precise pose estimation when the rear view is visible. Overall, VDPM and V&K are more robust than BHF to the bias towards frontal viewpoints.

3.2 Simultaneous Object Detection and Pose Estimation

It is now time for our second diagnosis: joint object localization and pose estimation. Table 2 shows a detailed comparison of all the methods. Note that we now report the AP for the detection, followed by the AOS, AVP and PEAP metrics.

V&K [20] again reports the best average performance. Interestingly, AVP and PEAP reveal that all methods lose less pose estimation performance than when working with the GT BBs (compare Tables 1 and 2). This indicates that all the models report more accurate pose estimations when the detections are given by a detector, instead of by the GT annotations. In other words, the good pose estimations seem to be associated with clear or easy detections. Take into account that when the GT BBs are used, many difficult, truncated or occluded objects, which might not have been detected, are considered.

It is clearly the excellent performance of the R-CNN [30] detector that makes V&K the winner (observe the high AP for some categories). This suggests an intriguing question. Given that V&K is the only model that does not consider localization and pose estimation jointly, is it adequate to decouple the detection and pose estimation tasks? We come back to this question in Sect. 4.

Note that the BHF performance for object detection is far from the state of the art on the PASCAL 3D+ dataset. This conclusion is not new: Gall et al. [31] already observed that Hough transform-based models struggle with the variation of data that contains many truncated examples.

Table 2. Simultaneous object detection and pose estimation comparison.

Does Table 2 support the same conclusions we obtained before about the AOS metric and its bias towards the detection performance? First, it seems clear that AOS tends to be closer to AP than the rest of the metrics. Second, while in terms of detection VDPM is better than DPM-VOC+VP, for the pose estimation task DPM-VOC+VP reports a better performance according to AVP and PEAP. Figure 7 corroborates this fact too. However, in terms of AOS, VDPM is better. This is a contradiction, which again reveals that AOS is biased towards the detection performance, while AVP and PEAP are more restrictive, penalizing pose estimation errors harder.

Figure 6 shows an analysis of the influence of the overlap criterion on all the metrics. For this overlap criterion we follow the PASCAL VOC formulation: to be considered a true positive, the intersection-over-union between the predicted BB and the GT BB must exceed a threshold. This figure also shows that the AOS metric is dominated by the detection, while AVP and PEAP are more independent of it.
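For reference, a sketch of the overlap computation (we assume continuous box coordinates here; PASCAL VOC works with inclusive pixel coordinates, which adds a +1 to widths and heights):

```python
def voc_overlap(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    A detection counts as a true positive when this exceeds the threshold."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```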

This leads us to conclude that the AVP and PEAP metrics are more adequate for evaluating the pose estimation performance of the models. We also observe that the overlap criterion can be relaxed, allowing less precise detections to be evaluated. This way we retain more object hypotheses per method, and let the metrics decide which one estimates the viewpoints best.

Observing the evolution of MAE and MedError, VDPM, V&K and BHF improve their performance with respect to the GT analysis. The detection seems to work as a filter stage, letting pass only those candidates whose pose is likely to be correctly estimated. Only V&K and BHF almost maintain these errors when the overlap criterion increases, which means they are not sensitive to the detection accuracy. This is not the case for the DPM-based models, for which the more precise the BB localization, the better the pose estimation.

Fig. 6. Analysis of the influence of the overlap criterion on the different metrics.

Types of false pose estimations. Figure 7 shows the results corresponding to the types of false pose estimations for each method. We follow the same analysis detailed in Sect. 3.1. Remarkably, the detection stage has caused a decrease in the confusion with nearby viewpoints for V&K, improving the percentage of correct estimates (from \(47\,\%\) with GT to \(58\,\%\)). BHF is probably the most stable model, while VDPM is the one that benefits most from the detection: note that all its error percentages have been reduced. As we said, overall, the detection stage seems to select those candidates whose pose is easy to estimate.

All methods exhibit a similar confusion with nearby and opposite poses. The classes with the highest opposite errors are boat, car and train. For the nearby poses, the most problematic classes are boat, motorbike and sofa.

Fig. 7. False positive analysis on detection and pose estimation.

Impact of object characteristics. In Fig. 8 we show the impact of the object characteristics for each method. Now, the size of the objects is the most influential characteristic, mainly affecting the detection performance (small objects are difficult to detect, a common problem for all the approaches). Surprisingly, observing Fig. 9(b), all methods improve their pose performance when working with small or extra-large objects (this does not happen in the GT analysis). This fact again reveals that the detection seems to work as a filter stage.

The second aspect worth analyzing is the effect of occluded/truncated objects. Now the impact of these characteristics is really considerable, compared with the numbers reported in the previous section with the GT. One conclusion is clear: if we treat the object localization and pose estimation tasks jointly, more effort has to be devoted to tackling this problem.

Fig. 8. Summary of sensitivity and impact of object characteristics for the simultaneous object detection and pose estimation performance.

Fig. 9. Effect of object characteristics on detection and pose estimation.

All models are sensitive to the aspect ratio, but in this case the impact of this characteristic is reinforced by the detection. In contrast to the GT analysis, now, the models seem to prefer extra wide objects (see Fig. 9(c)). Interestingly, VDPM, V&K and BHF do not work well with tall objects when GT is used, but this seems to be solved by the detection stage. Remarkably, DPM-VOC+VP is the only one that works well with all unusual aspect ratios of the objects.

Again, the visible sides aspect is the one that most adversely affects the performance of the models. Even for the winning method, i.e. V&K, the accuracy dramatically drops from the average AOS of 0.533 to 0.154. From a careful inspection of the results, we conclude that the main problem for all approaches is to obtain simultaneously a precise detection and a quality pose estimation for the rear views of the objects (see Fig. 9(a)).

If we introduce the difficult objects into the evaluation, the results show that these examples have a slight negative effect. V&K is the one that really suffers from this situation, although its AOS performance decreases by just 0.04 points.

Fig. 10. Part visibility influence on detection and pose estimation.

Visibility of Parts Influences Pose Estimation Performance. What is the influence of the different parts for each of the models and categories? Figure 10 shows a detailed analysis of this question. The main difference with respect to the previous analysis is that the parts now play an important and different role for the detection. For instance, in car, the wheels (the first 4 parts) are now more influential than the back trunk parts (the last 2 parts). In aeroplane, the wings (p2 and p5) are now not important for BHF. However, there are object parts that have the same effect as before: for instance, the models remain very sensitive to the back part of the sofa (p5). The variability presented by the models towards the train parts in the detection is similar to the one reported for the pose estimation analysis with GT. The class tvmonitor remains paradigmatic, affecting all the approaches very noticeably.

All the methods and classes exhibit a great variability depending on the visible parts. But, as has been discussed, not all parts have the same effect. Models vary their sensitivity according to whether they are detecting an object or estimating its pose. Therefore, we should seek parts that are very discriminative for both the object detection and the pose estimation tasks.

4 Our Prescription for Success

We have performed a complete diagnosis for all the models. The following are the main problems identified, and our prescription for success.

All the analyzed approaches are biased towards the common modes of object appearance in the PASCAL 3D+ dataset. For example, in the aeroplane and tvmonitor classes, where the side view and the frontal view account for \(62\,\%\) and \(63\,\%\) of the objects, respectively, the models obtain for these views a pose estimation performance which is almost \(10\,\%\) better than the average. Furthermore, for categories that typically appear occluded in training, such as car (\(82\,\%\) occluded/truncated) or chair (\(97\,\%\) occluded/truncated), the models also do a good job with the slightly occluded or truncated test instances. That is, models are biased towards the data distribution used during training. To deal with the scarcity of training data with viewpoint annotations, one solution could be to design models able to encode the shape and appearance of the categories with robustness to moderate changes in viewpoint. The very recent work of [16], where a CNN-based architecture is trained using 3D CAD models, shows this is a promising direction.

HF and DPM-based models prefer balanced training sets. Other works [1, 3] have already shown how the performance of DPM-based solutions improves if the training data is sufficiently balanced and clean. One can try to artificially balance the dataset or control the number of training instances per viewpoint. Following these hints, we have completed an extra experiment: we have re-trained a BHF, but with a balanced car training set. By doing so, we achieve gains of \(1\,\%\) and \(1.1\,\%\) in AP and AOS, respectively.
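A sketch of one possible balancing scheme, oversampling each viewpoint bin up to the size of the most populated one (an illustration only; the exact scheme is a design choice, and the 'view_bin' field is a hypothetical name):

```python
import random
from collections import defaultdict

def balance_by_viewpoint(samples, seed=0):
    """Oversample every viewpoint bin to the size of the largest bin,
    so training sees a roughly uniform pose distribution."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for s in samples:
        bins[s["view_bin"]].append(s)
    target = max(len(v) for v in bins.values())
    balanced = []
    for v in bins.values():
        balanced.extend(v + [rng.choice(v) for _ in range(target - len(v))])
    return balanced
```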

Size matters for the detection and pose estimation tasks: all methods have difficulties with (extra) small objects. One solution could be to combine detectors at multiple resolutions (e.g. [32]). We also encourage the use of contextual cues, which have been shown to be of great benefit for the related task of object detection. For instance, the Allocentric pose estimation model in [33] could be a good strategy to follow. This approach integrates the pose information of all the objects in a scene to improve the viewpoint estimations.

Should we decouple the pose estimation and detection tasks? This is a fundamental question we wanted to answer in this study. In this diagnosis, only the V&K model decouples the two tasks, and it is the one systematically reporting the best performance. We could not find and analyze a deep learning approach where both problems are treated jointly, hence we cannot provide a conclusive answer to the question. However, our analysis reveals that the performance increases when the pose estimations are given over detections instead of GT bounding boxes. Figures 2 and 7 show that V&K has been able to reduce its large number of confusions with nearby views simply thanks to the detection stage. This reveals that there is a correlation between easy-to-detect objects and objects whose pose is easy to estimate. Therefore, we can affirm that the detection seems to help the pose estimation. Furthermore, knowing that results are generally better when training and testing data belong to the same distribution, a good strategy could be to re-train the models on detected objects in the training set, i.e. not using GT BBs. We show the results of this extra experiment below.

What is the most convenient evaluation metric? After our diagnosis, the answer to this question is clear to us: PEAP and AVP offer more meaningful results and comparisons than AOS, which is strongly dominated by the detection performance. Both PEAP and AVP provide more information regarding the precision of the pose estimations, while the localization precision is also considered.

Fig. 11. Random part vs. selected part extraction for training a BHF.

Fig. 12. Extra experiment for V&K.

How can we use this diagnostic tool to improve our models? The main objective of this work is that other researchers can use the tool to improve the performance of their approaches. We provide here some examples. For instance, to improve the V&K performance, we proceed as follows. As previously explained, our diagnosis indicates that the detection step improves the pose estimation performance. We therefore propose to re-train the V&K model with detections on the training set. First, we collect detected BBs using the R-CNN model [30] on the training images. Only those BBs satisfying the PASCAL VOC overlap criterion with respect to the annotations, with a threshold of 0.7, are selected (\(70\,\%\) of the new training BBs). Following this strategy, we achieve improvements of \(2\,\%\), \(2.3\,\%\) and \(2.4\,\%\) in terms of AOS, AVP and PEAP, respectively. Interestingly, the nearby error is reduced by \(8\,\%\), and the correct estimations are increased by \(20\,\%\) (see Fig. 12).
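A sketch of this selection step, reusing the voc_overlap helper shown in Sect. 3.2 (list and box names are illustrative):

```python
def select_training_boxes(detections, gt_boxes, thr=0.7):
    """Keep a detected box only if it overlaps some GT annotation of
    the same image with IoU >= thr (0.7 in our experiment)."""
    return [d for d in detections
            if any(voc_overlap(d, g) >= thr for g in gt_boxes)]
```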

We can also improve the BHF performance. A careful inspection of the diagnosis for BHF shows that it exhibits the highest sensitivity to the visibility of object parts. For instance, for the class motorbike (see Fig. 11(a)), there is a specific part (the headlight) that is very discriminative. BHF is normally trained by randomly extracting image patches from the training images. If, instead of this random patch extraction, we check whether this part is visible and extract a patch centered at its annotated position, we obtain an increase of \(4\,\%\) in AP, while the AVP increases from 0.037 to 0.045 (see Fig. 11(b)).

Conclusion. We hope that our work will inspire research that targets and evaluates the reduction of specific error modes in object detection and pose estimation. Our tool is publicly available, giving other researchers the opportunity to perform similar analyses with other pose estimation methods on the PASCAL 3D+ dataset.