
1 Introduction

Suppose an image is given and we are interested in extracting some 3D information from it, such as the scene layout or the pose of the visible objects. One potential approach would be to annotate a dataset for every single desired problem and train a fully supervised system for each (i.e., supervised learning). This is undesirable, as an annotated dataset would be needed for each problem and the problems would be treated independently. In addition, unlike semantic annotations such as object labels, certain 3D annotations are cumbersome to collect and often require special sensors (imagine manually annotating the exact pose of an object or its surface normals). An alternative approach is to develop a system with a rather generic perception that can conveniently generalize to novel tasks. In this paper, we take a step towards developing a generic 3D perception system that (1) can solve novel 3D problems without fine-tuning, and (2) is capable of certain abstract generalizations in the 3D context (e.g., reasoning about pose similarity between two drastically different objects).

But how could one learn such a generalizable system? Cognitive studies suggest living organisms can perform cognitive tasks for which they have not received supervision through supervised learning of other foundational tasks [28, 45, 51]. Learning the relationship between visual appearance and a change of vantage point (self-motion) is among the first visual skills developed by infants and plays a fundamental role in developing other skills, e.g., depth perception. A classic experiment [28] showed that a kitten deprived of self-motion experienced fundamental issues in 3D perception, such as failing to understand depth when placed on the Visual Cliff [22]. Later works [45] argued this finding was not, at least fully, due to motion intentionality, and that the supervision signal of self-motion was indeed a crucial element in learning basic visual skills. What these studies essentially suggest is: (1) by receiving supervision on a certain proxy task (in this case, self-motion), other tasks (depth understanding) can be solved sufficiently without requiring explicit supervision, and (2) some vision tasks are more foundational than others (e.g., self-motion perception vs. depth understanding).

Fig. 1. Learning a generic 3D representation: we develop a supervised joint framework for camera pose estimation and wide baseline matching. We then show the internal representation of this framework can be used as a 3D representation generalizable to various 3D prediction tasks.

Inspired by the above discussion, we develop a supervised framework where a ConvNet is trained to perform 6DOF camera pose estimation. This basic task allows learning the relationship between an arbitrary change in viewpoint and the appearance of an object/scene point. One property of our approach is that the camera pose estimation is performed in an object/scene-centric manner: the training data is formed of image bundles that show the same point of an object/scene while the camera moves around (i.e., it fixates; see Fig. 2(c)). This is different from existing video+metadata datasets [20], the problem of Visual Odometry [20, 42], and recent works on ego-motion estimation [4, 29], where in the training data the camera moves independently of the scene. Our object/scene-centric approach is equivalent to allowing a learner to focus on a physical point while moving around and observing how the appearance of that particular point transforms with the viewpoint change. The learner therefore receives an additional piece of information, namely that the observed pixels are indeed showing the same object, giving more information about how the element looks under different viewpoints and providing better grounds for learning a visual encoding of an observation. Infants explore object-motion relationships [51] in a similar way, as they hold an object in hand and observe it from different views.

Our dataset also provides supervision for the task of wide baseline matching, defined as identifying whether two images/patches show the same point regardless of the magnitude of the viewpoint change. Wide baseline matching is also an important 3D problem and is closely related to object/scene-centric camera pose estimation: to identify whether two images could be showing the same point despite drastic changes in appearance, an agent could learn how viewpoint change impacts appearance. Therefore, we perform our supervised training in a multi-task manner to simultaneously solve for both wide baseline matching and pose estimation. This has the advantage of learning a single representation that encodes both problems. In the experiments (Sect. 4.1), we show it is possible to have a single representation solving both problems without a performance drop compared to having two dedicated representations. This provides practical computational and storage advantages. Also, training ConvNets using multiple tasks/losses is desirable as it has been shown to be better regularized [23, 58, 63].Footnote 1

We train the ConvNet (a siamese structure with weight sharing) on patch pairs extracted from the training data and use the last FC vector of one siamese tower as the generic 3D representation (see Fig. 1). We empirically investigate whether this representation can be used for solving novel 3D problems (we evaluate on scene layout estimation, object pose estimation, and surface normal estimation), and whether it can perform certain 3D abstractions (we experiment on cross-category pose estimation and on relating the pose of synthetic geometric elements to images).

Dataset: We developed an object-centric dataset of street view scenes from the cities of Washington DC, NYC, San Francisco, Paris, Amsterdam, Las Vegas, and Chicago, augmented with camera pose information and point correspondences (with >half a billion training data points). We release the dataset, trained models, and an online demo at http://3Drepresentation.stanford.edu/.

Novelty in the Supervised Tasks: Independent of providing a generic 3D representation, our approach to solving the two supervised tasks is novel in a few aspects. There is a large amount of previous work on detecting, describing, and matching image features, either through handcrafting the feature [5, 11, 35, 38–40] or learning it [9, 19, 25, 49, 50, 65, 66]. Unlike the majority of such features, which utilize pre-rectification (within either the method or the training data), we argue that rectification prior to descriptor matching is not required; our representation can learn the impact of viewpoint change, rather than canceling it (by training directly on non-rectified data and supplying camera pose information during training). Therefore, it does not need an a priori rectification and is capable of performing wide baseline matching at the descriptor level. We report state-of-the-art results on feature matching. Wide baseline matching has also been the topic of many papers [24, 44, 54, 62, 67], with the majority of them focused on leveraging various geometric constraints for ruling out incorrect ‘already-established’ correspondences, as well as a number of methods that operate by generating exhaustive warps [41] or assuming 3D information about the scene is given [61]. In contrast, we learn a descriptor that is supervised to internally handle a wide baseline in the first place.

In the context of pose estimation, we show that estimating a 6DOF camera pose given only a pair of local image patches, and without the need for several point correspondences, is feasible. This is different from many previous works [3, 8, 14, 20, 26, 52, 59] from both the visual odometry and SfM literature, which perform the estimation through a two-step process consisting of finding point correspondences between images followed by pose estimation. Koser and Koch [30] also demonstrate pose estimation from a local region, though the plane on which the region lies is assumed to be given. The recent works of [4, 29] supervise a ConvNet on the camera pose from image batches but do not provide results on matching and pose estimation. We report human-level accuracy on this task.

Existing Unsupervised Learning and ConvNet Initialization Works: The majority of previous unsupervised learning, transfer learning, and representation learning works have been targeted towards semantics [17, 18, 46, 47, 53]. It has been well observed in practice [18, 46] that the representation of a ConvNet trained on ImageNet [32] can generalize to other, mostly semantic, tasks. A number of methods have investigated initialization techniques for ConvNet training based on unsupervised/weakly supervised data to alleviate the need for a large training dataset for various tasks [17, 57]. Very recently, the methods of [4, 29] explored using motion metadata associated with videos (the KITTI dataset [20]) as a form of supervision for training a ConvNet. However, they either do not investigate developing a 3D representation or intend to provide initialization strategies that are meant to be fine-tuned with supervised data for a desired task. In contrast, we investigate developing a generalizable 3D representation, perform the learning in an object-centric manner, and evaluate its unsupervised performance on various 3D tasks without any fine-tuning of the representation. We experimentally compare against the related recent works that made their models available [4, 57].

The primary contributions of this paper are summarized as follows: (I) A generic 3D representation with empirically validated abstraction and generalization abilities. (II) A learned joint descriptor for wide baseline matching and camera pose estimation at the level of local image patches. (III) A large-scale object-centric dataset of street view scenes including camera pose and correspondence information.

2 Object-Centric Street View Dataset

The dataset for the formulated task needs not only to provide a large amount of training data but also to exhibit a rich variety of camera poses, while the scale of the intended learning problem rules out any manual annotation procedure. We present a procedure that allows acquiring a large amount of training data in an automated manner, based on two sources of information: (1) Google street view [2], an almost inexhaustible source of geo-referenced and calibrated images, and (2) 3D city models [1, 2], which cover thousands of cities around the world.

Fig. 2. Illustration of the object-centric data collection process. We use large-scale geo-registered 3D building models to register pixels in street view images to the world coordinate system (see (a)) and use that for finding correspondences and their relative pose across multiple street view images (see (b)). Each ray represents one pixel to 3D world coordinate correspondence. Each of the red, green, and blue colors represents one street view location. Each row in (c) shows a sample collected image bundle. The center pixel (marker) is expected to correspond to the same physical point. (Color figure online)

The core idea of our approach is to form correspondences between the geo-referenced street view cameras and physical 3D points given by the 3D models. More specifically, at any given street view location, we densely shoot rays into space in order to find intersections with nearby buildings. Each ray back-projects one image pixel into 3D space, as shown in Fig. 2(a). By projecting the resulting intersection points onto adjacent street view panoramas (see Fig. 2(b)), we can form image-to-image correspondences (see Fig. 2(c)). Each image is then associated with a (virtual) camera that fixates on the physical target point on a building by placing it at the optical center. To make the ray intersection procedure scalable, we perform occlusion reasoning on the 3D models to pre-identify from which GPS locations an arbitrary target would be visible and perform the ray intersection only on those locations.
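For concreteness, the geometric core of this collection step can be sketched as follows: given a 3D target point on a building facade and two geo-registered cameras, projecting the point into both views yields a pixel-level correspondence. The snippet below is a deliberate simplification (our own, not the pipeline's code): it assumes a pinhole camera with hypothetical calibration and pose values, whereas the actual data comes from spherical street view panoramas intersected with 3D city models.

```python
# Minimal sketch (our simplification): project one 3D facade point into two
# geo-registered cameras to form an image-to-image correspondence. All
# numeric values and the pinhole model are illustrative assumptions.
import numpy as np

def project(point_w, R_wc, t_wc, K):
    """Project a world point into a pinhole camera; R_wc, t_wc map world to camera."""
    p_cam = R_wc @ point_w + t_wc
    if p_cam[2] <= 0:                       # behind the camera -> not visible
        return None
    p_img = K @ (p_cam / p_cam[2])
    return p_img[:2]

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])   # hypothetical intrinsics
target = np.array([10.0, 2.0, 25.0])                          # 3D point on a facade
R1, t1 = np.eye(3), np.zeros(3)                               # camera 1 at the origin
theta = np.deg2rad(30)                                        # camera 2: rotated and shifted
R2 = np.array([[np.cos(theta), 0, -np.sin(theta)],
               [0, 1, 0],
               [np.sin(theta), 0, np.cos(theta)]])
t2 = np.array([-5.0, 0.0, 3.0])

uv1, uv2 = project(target, R1, t1, K), project(target, R2, t2, K)
if uv1 is not None and uv2 is not None:
    print("correspondence:", uv1, "<->", uv2)   # same physical point in both views
```

The occlusion reasoning and the virtual-camera fixation (re-centering the target at the optical center) are omitted here.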

Pixel Alignment and Pruning: This system requires the integration of multiple resources, including elevation maps, GPS from street view, and 3D models. Though the quality of the output exceeded our expectations (see samples in Fig. 2(c)), any slight inaccuracy in the metadata or 3D models can cause pixel misalignment in the collected images (examples shown in the first and last rows of Fig. 2(c)). Also, there are undocumented objects, such as trees or moving objects, that cause occlusions. Thus, a content-based post-alignment and pruning step was necessary. We again used the metadata in our alignment procedure so as to handle image bundles with arbitrarily wide baselines (note that the collected image bundles can show large, often \(>\!\!100^{\circ }\), viewpoint changes). In the interest of space, we describe this procedure in the supplementary material (Sect. 3).

This process forms our dataset, composed of matching and non-matching patches as well as the relative camera pose for the matching pairs. We stopped collecting data when we reached a coverage of >200 km\(^{2}\) across the 7 cities mentioned in Sect. 1. The collection procedure is currently done on Google street view, but it can be performed using any geo-referenced calibrated imagery. We will experimentally show that the representation trained on this data does not manifest a clear bias towards street view scenes and outperforms existing feature learning methods on non-street view benchmarks.

Noise Statistics: We performed a user study through Amazon Mechanical Turk to quantify the amount of noise in the final dataset. Please see the supplementary material (Sect. 3.2) for the complete discussion and results. Briefly, \(68\,\%\) of the patch pairs were found to have at least \(25\,\%\) overlap in their content. The mean and standard deviation of the pixel misalignment were 16.12 pixels (\(\approx \)11 % of patch width) and 11.55 pixels, respectively. We did not perform any filtering or geo-fencing on top of the collected data, as the amount of noise appeared to be within the robustness tolerance of ConvNet training and the networks converged.

3 Learning Using ConvNets

A joint feature descriptor was learnt by supervising a Convolutional Neural Network (ConvNet) to perform 6DOF camera pose estimation and wide baseline matching between pairs of image patches. For the purpose of training, any two image patches depicting the same physical target point in the street view dataset were labelled as matching, and all other pairs were labelled as non-matching. The training for camera pose estimation was performed using matching patches only. The patches were always cropped from the center of the collected street view image to keep the optical center at the target point.

The camera pose between each pair of matching patches was represented by a 6D vector; the first three dimensions were Tait-Bryan angles (roll, yaw, pitch) and the last three dimensions were Cartesian (x, y, z) translation coordinates expressed in meters. For the purpose of training, the 6D pose vectors were pre-processed to be zero mean and unit standard deviation (i.e., z-scoring). The ground-truth and predicted pose vectors for the \(i^{th}\) example are denoted by \(p^{*}_i\) and \(p_i\), respectively. The pose estimation loss \(L_{pose}(p^{*}_i, p_i)\) was set to be the robust regression loss described in Eq. 1:

$$\begin{aligned} L_{pose}(p^{*}_i, p_i) = \begin{cases} e &{} \text{ if } e \le 1 \\ 1 + \log e &{} \text{ if } e > 1 \end{cases} \quad \text{ where } e = \Vert p^{*}_i - p_i \Vert _{2}. \end{aligned}$$
(1)

The loss function for patch matching \(L_{match}({m_i^{*}, m_i})\) was set to be sigmoid cross entropy, where \(m_i^{*}\) is the ground-truth binary variable indicating matching/non-matching and \(m_i\) is the predicted probability of matching.

ConvNet training was performed to optimize the joint matching and pose estimation loss (\(L_{joint}\)) described in Eq. 2. The relative weighting between the pose (\(L_{pose}\)) and matching (\(L_{match}\)) losses was controlled by \(\lambda \) (we set \(\lambda = 1\)).

$$\begin{aligned} L_{joint}(p^{*}_i, m^{*}_i, p_i, m_i) = L_{pose}(p^{*}_i, p_i) + \lambda L_{match}(m^{*}_i, m_i). \end{aligned}$$
(2)
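For reference, Eqs. 1 and 2 can be written compactly in code. The following is a minimal PyTorch reconstruction under our own assumptions about tensor shapes and batching (it is not the authors' released code); the pose targets are the z-scored 6D vectors described above, and only matching pairs contribute to the pose term, mirroring the training setup.

```python
# Minimal PyTorch sketch of Eqs. 1-2; tensor names and batching are assumptions.
import torch
import torch.nn.functional as F

def pose_loss(p_pred, p_gt):
    """Robust regression loss of Eq. 1: e if e <= 1, else 1 + log(e)."""
    e = torch.norm(p_pred - p_gt, dim=1)
    # clamp keeps the (unused) log branch finite when e is very small
    return torch.where(e <= 1, e, 1 + torch.log(e.clamp(min=1.0))).mean()

def joint_loss(p_pred, p_gt, match_logit, match_gt, lam=1.0):
    """Eq. 2: pose loss on matching pairs + lambda * sigmoid cross entropy."""
    l_match = F.binary_cross_entropy_with_logits(match_logit, match_gt.float())
    is_match = match_gt > 0.5
    l_pose = pose_loss(p_pred[is_match], p_gt[is_match]) if is_match.any() \
             else p_pred.new_zeros(())
    return l_pose + lam * l_match
```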

Our training set consisted of patch pairs drawn from a wide distribution of baseline changes ranging from \(0^{\circ }\) to over \(120^{\circ }\). We considered patches of size 192\(\,\times \,\)192 (<15 % of the full image size) and rescaled them to 101\(\,\times \,\)101 before passing them to the ConvNet.

A ConvNet model with a siamese architecture [15], containing two identical streams with a shared set of weights, was used for computing the relative pose and the matching score between the two input patches. A standard ConvNet architecture was used for each stream: C(20, 7, 1)-ReLU-P(2, 2)-C(40, 5, 1)-ReLU-P(2, 2)-C(80, 4, 1)-ReLU-P(2, 2)-C(160, 4, 2)-ReLU-P(2, 2)-F(500)-ReLU-F(500)-ReLU. The naming convention is as follows: C(n, k, s): convolutional layer with n filters of spatial size \(k\times k\) and stride s; P(k, s): max pooling layer of size \(k\times k\) and stride s; ReLU: rectified linear unit; F(n): fully connected linear layer with n output units. The feature descriptors of both streams were concatenated and fed into a fully connected layer of 500 units, which was then fed into the pose and matching losses. With this ConvNet configuration, the size of the image representation (i.e., the last FC vector of one siamese half; see Fig. 1) is 500. Our architecture is deliberately common and standard; this allows us to evaluate whether our good end performance is attributable to our hypothesis of learning on foundational tasks and to the new dataset, rather than to a novel architecture.
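The stream description above maps directly to a few lines of PyTorch. The sketch below is our reconstruction, not released code: the input channels, padding (none assumed), the flattened size (160 for 101\(\,\times \,\)101 inputs under these assumptions), and the ReLU after the 500-unit fusion layer are inferred rather than specified in the text.

```python
# Sketch of the siamese model (assumed details: RGB input, no padding).
import torch
import torch.nn as nn

def make_stream():
    # C(20,7,1)-ReLU-P(2,2)-C(40,5,1)-ReLU-P(2,2)-C(80,4,1)-ReLU-P(2,2)
    # -C(160,4,2)-ReLU-P(2,2)-F(500)-ReLU-F(500)-ReLU
    return nn.Sequential(
        nn.Conv2d(3, 20, kernel_size=7, stride=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        nn.Conv2d(20, 40, kernel_size=5, stride=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        nn.Conv2d(40, 80, kernel_size=4, stride=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        nn.Conv2d(80, 160, kernel_size=4, stride=2), nn.ReLU(), nn.MaxPool2d(2, 2),
        nn.Flatten(),
        nn.Linear(160, 500), nn.ReLU(),   # 160 = 160x1x1 for 101x101 inputs, no padding
        nn.Linear(500, 500), nn.ReLU(),   # the 500-d generic 3D representation
    )

class SiameseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stream = make_stream()          # one stream applied to both patches (shared weights)
        self.fusion = nn.Sequential(nn.Linear(1000, 500), nn.ReLU())
        self.pose_head = nn.Linear(500, 6)   # 6DOF relative pose (z-scored)
        self.match_head = nn.Linear(500, 1)  # matching logit

    def forward(self, patch_a, patch_b):
        f_a, f_b = self.stream(patch_a), self.stream(patch_b)
        fused = self.fusion(torch.cat([f_a, f_b], dim=1))
        # f_a is the per-patch descriptor used as the generic 3D representation
        return self.pose_head(fused), self.match_head(fused).squeeze(1), f_a
```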

We trained the ConvNet model from scratch (i.e., from randomly initialized weights) using SGD with momentum (initial learning rate of 0.001, divided by 10 every 60K iterations), gradient clipping, and a batch size of 256. We found the use of gradient clipping essential for training, as even robust regression losses produce unstable gradients at the start of training. Our network converged after 210K iterations. Training using Euler angles performed better than quaternions (\(17.7^{\circ }\) vs. \(29.8^{\circ }\) median angular error), and the robust loss outperformed the non-robust \(l_2\) loss (\(17.7^{\circ }\) vs. \(22.3^{\circ }\) median angular error). Additional details about the training procedure can be found in the supplementary material.
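The optimization recipe can be summarized as the following sketch; where the text is silent, the values here are placeholders of ours (momentum 0.9, clipping norm 10.0), and SiameseNet and joint_loss refer to the sketches above.

```python
# Sketch of the training setup: SGD with momentum, step LR decay, gradient clipping.
import torch

model = SiameseNet()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60_000, gamma=0.1)

def train_step(patch_a, patch_b, pose_gt, match_gt):
    pose_pred, match_logit, _ = model(patch_a, patch_b)
    loss = joint_loss(pose_pred, pose_gt, match_logit, match_gt)
    optimizer.zero_grad()
    loss.backward()
    # clipping was essential early on, when even the robust loss gives unstable gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
    scheduler.step()   # iteration-based schedule: lr divided by 10 every 60K steps
    return loss.item()
```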

4 Experimental Discussions and Results

We implemented our framework using data parallelism [31] on a cluster of 5–10 GPUs. At test time, computing the representation is a feed-forward pass through a siamese half of the ConvNet and takes \(\sim \)2.9 ms per image on a single processor. Sections 4.1 and 4.2 provide the evaluations of the learned representation on the supervised and novel 3D tasks, respectively.

4.1 Evaluations on the Supervised Tasks

Evaluations on the Street View Dataset. The test set for pose estimation is composed of 7725 pairs of matching patches from our dataset. The test set for matching includes 4223 matching and 18648 non-matching pairs. We made sure that no data from those areas or their vicinity was used in training. Each patch pair in the test sets was checked by three Amazon Mechanical Turkers to verify that the ground truth is indeed correct. For the matching pairs, the Turkers also ensured that the center pixels of the patches are no more than 25 pixels (\(\sim \)3 % of the image width) apart. Visualizations of the test set can be seen on our website.

Fig. 3. (a) Sample qualitative results of camera pose estimation. The \(1^{st}\) and \(2^{nd}\) rows show the patches. The \(3^{rd}\) row depicts the estimated relative camera poses on a unit sphere (black: patch 1’s camera (reference), red: ground-truth pose of patch 2, blue: estimated pose of patch 2). Rightward and upward are the positive directions. (b) Sample wide baseline matching results. Green and red represent ‘matching’ and ‘non-matching’, respectively. Three failure cases are shown on the right. (Color figure online)

Pose Estimation. Figure 3(a) provides qualitative results of pose estimation. The angular evaluation metric is the standard overall angular error [20, 33], defined as the angle between the predicted pose vector and the ground-truth vector in the plane defined by their cross product. The translational error metric is the \(l_2\) norm of the difference between the normalized predicted translation vector and the ground truth [20, 33]. The translation vector was normalized to enable comparison with up-to-scale SfM.
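A sketch of our reading of these metrics in code (the exact conventions of [20, 33] may differ in minor details):

```python
# Angular and translational pose errors, as described above.
import numpy as np

def angular_error_deg(v_pred, v_gt):
    """Angle between predicted and ground-truth pose vectors, in degrees."""
    cos = np.dot(v_pred, v_gt) / (np.linalg.norm(v_pred) * np.linalg.norm(v_gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translational_error(t_pred, t_gt):
    """l2 distance between unit-normalized translations (up-to-scale comparison)."""
    return np.linalg.norm(t_pred / np.linalg.norm(t_pred)
                          - t_gt / np.linalg.norm(t_gt))
```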

Figure 4-right provides the quantitative evaluations. Plots (a) and (c) illustrate the distribution of the test set with respect to the pose estimation error for each method (the more skewed to the left, the better). The green curve shows pose estimation results by human subjects. Two users with computer vision knowledge, but unaware of the particular use case, were asked to estimate the relative pitch and yaw between a random subset of 500 test pairs. They were allowed to train themselves with as many training samples as they wished. The ConvNet outperformed the humans on this task by a margin of \(8^{\circ }\) in median error.

Fig. 4. Left: Quantitative evaluation of matching. ROC curves of each method and the corresponding AUC and FPR@95 values are shown in (a). Right: Quantitative evaluation of camera pose estimation. VO and SfM denote Visual Odometry (LIBVISO2) and Structure-from-Motion (visualSfM), respectively. Evaluation of robustness to wide baseline camera shifts is shown in the (b) plots. (Color figure online)

Pose Estimation Baselines: We compared against Structure-from-Motion (visualSfM [59, 60], with default components and tuned hyper-parameters, for pairwise pose estimation on \(192\times 192\) patches and on full images) and LIBVISO2 Visual Odometry [21] on full images. Both SfM and LIBVISO2 VO suffer from a large RANSAC failure rate, mostly due to the wide baselines of the test pairs.

Figure 4-right (b) shows how the median angular error (Y axis) changes as the baseline of the test pairs (X axis) increases. This is achieved by binning the test set into 8 bins based on baseline size. This plot quantifies the ability of the evaluated methods to handle a wide baseline. We adopt the slope of the curves as a quantification of the deterioration in accuracy as the baseline increases.
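The binning itself is straightforward; a small sketch of how the per-bin median error and the reported slope could be computed (array names are ours):

```python
# Bin test pairs by baseline size (8 bins) and take the median angular error per bin.
import numpy as np

def median_error_per_bin(baselines_deg, errors_deg, n_bins=8):
    edges = np.linspace(baselines_deg.min(), baselines_deg.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(baselines_deg, edges) - 1, 0, n_bins - 1)
    medians = np.array([np.median(errors_deg[bin_idx == b]) for b in range(n_bins)])
    # slope of median error vs. baseline quantifies deterioration with wider baselines
    slope = np.polyfit((edges[:-1] + edges[1:]) / 2, medians, deg=1)[0]
    return medians, slope
```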

Wide Baseline Matching. Figure 3(b) shows sample feature matching results using our approach, with three failure cases on the right. Figure 4-left provides the quantitative results. The standard metric [12] for descriptor matching is the ROC curve acquired by sorting the test set pairs according to their matching score. For unsupervised methods, e.g., SIFT, the matching score is the \(l_2\) distance. The False Positive Rate at 95 % recall (FPR@95) and the Area Under Curve (AUC) of the ROC are standard scalar quantifications of descriptor matching [12, 50].
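These scalar metrics can be computed from the sorted scores as follows (a sketch; it assumes a higher score means more likely to match, so for \(l_2\)-based descriptors such as SIFT the negated distance would be used as the score):

```python
# ROC AUC and false positive rate at 95% recall for descriptor matching.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def matching_metrics(labels, scores):
    """labels: 1 for matching pairs, 0 for non-matching; scores: matching scores."""
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = min(np.searchsorted(tpr, 0.95), len(fpr) - 1)   # first point with TPR >= 0.95
    return roc_auc_score(labels, scores), fpr[idx]        # (AUC, FPR@95)
```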

Matching Baselines: We compared our results with the handcrafted features SIFT [35], Root-SIFT [7], DAISY [55], VIP [61] (which requires surface normals as input, for which we used the normals from the 3D models), and ASIFT [41]. The matching score of ASIFT was the number of correspondences found in the test pair given the full images. We also compared against the learning-based features of Zagoruyko and Komodakis [65] (using the authors’ models), Simonyan et al. [50] (with and without retraining), and Simo-Serra et al. [49] (using the authors’ best pretrained model), as well as human subjects (the red dot on the ROC plot). Figure 4-left (b) provides the evaluations in terms of handling wide baselines, similar to Fig. 4-right (b).

Brown et al.’s Benchmark and Mikolajczyk’s Benchmark. We performed evaluations on the non-street view benchmarks of Brown et al. [12] and Mikolajczyk and Schmid [39] to determine (1) whether our representation performs well only on street view scenery, and (2) whether the wide baseline handling capability was achieved at the expense of lower performance on small baselines (as these benchmarks have, for the most part, narrower baselines than our dataset). Tables 1 and 2 provide the quantitative results. We include a thorough description of the evaluation setup and detailed discussions in the supplementary material (Sect. 2).

Table 1. Evaluations on Brown’s Benchmark [12]. FPR@95 (\(\downarrow \)) is the metric.
Table 2. Evaluation on Mikolajczyk and Schmid’s [39]. The metric is mAP(\(\uparrow \)).

Joint Feature Learning. We studied different aspects of jointly learning the representation and of information sharing among the core supervised tasks. In the interest of space, we provide the quantitative results in the supplementary material (Sect. 1). The conclusions of these tests were that, first, the problems of wide baseline matching and camera pose estimation share a great deal of information, and second, one descriptor can encode both problems with no performance drop.

4.2 Evaluating the 3D Representation on Novel Tasks

The results of evaluating our representation on novel 3D tasks are provided in this section. The tasks, as well as the images used in these evaluations (e.g., Airship images from ImageNet), are significantly different from what our representation was trained on (i.e., camera pose estimation and matching on local patches of street view images). The fact that, despite such differences, our representation achieves the best results among all unsupervised methods and gets close to supervised methods for each of the tasks empirically validates our hypothesis on learning on foundational tasks (see Sect. 1).

Our ways of evaluating and probing the representation in an unsupervised manner are: (1) tSNE [36], a large-scale 2D embedding of the representation, which allows visualizing the space and getting a sense of similarity from the perspective of the representation; (2) Nearest Neighbors (NN) on the full-dimensional representation; and (3) training a simple classifier (e.g., KNN or a linear classifier) on the frozen representation (i.e., no fine-tuning) to read out a desired variable. The latter quantifies whether the information required for solving a novel task is encoded in the representation and can be extracted using a simple function. We compare against the representations of related methods that made their models available [4, 57], various layers of AlexNet trained on ImageNet [32], and a number of supervised techniques for some of the tasks. Additional results are provided in the supplementary material and on the website.
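Protocol (3) amounts to reading a label out of the frozen 500-d descriptor with a simple classifier. A sketch, reusing the SiameseNet sketch from Sect. 3 (helper names are ours, and no fine-tuning of the representation is involved):

```python
# k-NN readout on the frozen representation (one siamese half, no fine-tuning).
import torch
from sklearn.neighbors import KNeighborsClassifier

@torch.no_grad()
def extract_features(model, images):
    """images: (N, 3, 101, 101) tensor -> (N, 500) numpy features."""
    model.eval()
    return model.stream(images).cpu().numpy()

def knn_readout(model, train_imgs, train_labels, test_imgs, k=5):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(extract_features(model, train_imgs), train_labels)
    return clf.predict(extract_features(model, test_imgs))
```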

Fig. 5. 2D embedding of our representation on 3,000 unseen patches using tSNE. An organization based on the Manhattan pose of the patches can be seen. See the comparable AlexNet embedding in the supplementary material (Sect. 6). (best seen on screen)

Fig. 6. (a) tSNE of a superset of various vanishing point benchmarks [6, 16, 34] (to counter the small size of the datasets). (b) Inversion [37] of our representation. Both plots show traits of vanishing points.

Surface Normals and Vanishing Points. Figure 5 shows the tSNE embedding of 3,000 unseen patches, showing that the organization of the representation space is based on geometry rather than semantics/appearance. The ConvNet was trained to estimate the pose between matching patches only, yet in the embedding, non-matching patches with a similar pose are placed nearby. This suggests the representation has generalized the concept of pose to non-matching patches. This is indeed related to surface normals, as the relative pose between an arbitrary patch and a frontal patch equals the pose of the arbitrary patch; Fig. 5 can thus be perceived as an organization of the patches based on their surface normals.

To better understand how this was achieved, we visualized the activations of the ConvNet at different layers. Similar to other ConvNets, the first few layers formed general gradient-based filters, while in higher layers, edges that are parallel in the physical world seemed to persist and cluster together. This is similar to the concept of vanishing points and, from a theoretical perspective, would be intriguing and would explain the pose estimation results, since three common vanishing points are theoretically enough for full angular pose estimation [13, 26]. To further investigate this, we generated the inversion of our representation using the method of [37] (see Fig. 6(b)), which shows patterns correlating with the vanishing points of the image. Figure 6(a) also illustrates the tSNE of a superset of several vanishing point benchmarks, showing that images with similar vanishing points are embedded nearby. Therefore, we speculate that the ConvNet has developed a representation based on the concept of vanishing points.Footnote 2 This would also explain the results shown in the following sections.

Surface Normal Estimation on NYUv2 [48]: A numerical evaluation of unsupervised surface normal estimation is provided in the supplementary material (Sect. 4).

Fig. 7. Scene layout NN search results between LSUN images and synthetic concave cubes defining abstract 3D layouts. Images with a yellow boundary show the ground-truth layout. (Color figure online)

Table 3. Layout classification (LSUN)
Table 4. Layout estimation (LSUN)
Table 5. Object pose estimation (PASCAL3D)

Scene Layout Estimation. We evaluated our representation on the LSUN [64] layout benchmark using the standard protocol [64]. Table 4 provides the results of layout estimation using a simple NN classifier on our representation, along with two supervised baselines, showing that our representation (with no fine-tuning) achieves performance close to the supervised method of Hedau et al. [27] on this novel task. Table 3 provides the results of layout classification [64] using an NN classifier on our representation, compared to AlexNet’s FC7 and Pool5.

Abstraction: Cube \(\leftrightarrows \) Layout: To evaluate the abstract generalization abilities of our representation, we generated a sparse set of 88 images showing the interior of a simple synthetic cube parametrized over different view angles. The rendered images can be seen as an abstract cubic layout of a room. We then performed NN search between these images and the LSUN dataset using our representation and several baselines. As apparent in Fig. 7, our representation retrieves meaningful NNs, while the baselines mostly overfit to appearance and retrieve either an incorrect or always the same NN. This suggests our representation can abstract away irrelevant information and encode the information essential to the 3D structure of the image.

Fig. 8. NN search results between EPFL dataset images and a synthetic cube defining an abstract 3D pose. See the supplementary material (Sect. 5) for tSNE embedding of all cubes and car poses in a joint space. Note that the 3D poses defined by the cubes are 90\(^{\circ }\) congruent.

3D Object Pose Estimation.

Abstraction: Cube \(\leftrightarrows \) Object: We performed a similar abstraction test between a set of 88 convex cubes and the images of the EPFL Multi-View Car dataset [43], which includes a dense sampling of various viewpoints of cars in an exhibition. We picked this simple cube pattern as it is the simplest geometric element that defines three vanishing points. The same observation as in the abstraction experiment on LSUN is made here, with our NNs being meaningful while the baselines mostly overfit to appearance with no clear geometric abstraction trait (Fig. 8).

ImageNet: Figure 9 shows the tSNE embedding of several ImageNet categories based on our representation and the baselines. The embeddings of our representation are geometrically meaningful, while the baselines either perform a semantic organization or overfit to other aspects, such as color.

Fig. 9. tSNE of several ImageNet categories using our unsupervised representation along with several baselines. Our representation manifests a meaningful geometric organization of objects. tSNE of more categories in the supplementary material and the website. (best seen on screen) (Color figure online)

Fig. 10. Qualitative results of cross-category NN-search on PASCAL3D using our representation along with baselines.

PASCAL3D: Figure 10 shows cross-category NN search results for our representation along with several baselines. This experiment also evaluates a certain level of abstraction, as some of the object categories look drastically different. We also quantitatively evaluated 3D object pose estimation on PASCAL3D. For this experiment, we trained a ConvNet from scratch, fine-tuned an AlexNet pre-trained on ImageNet, and fine-tuned our network; we read out the pose using a linear regressor layer.Footnote 3 Our results outperform the scratch network and come close to the AlexNet, which has seen thousands of images of the same categories and of other objects from ImageNet (Table 5). Note that certain aspects of object pose estimation, e.g., distinguishing between the front and the back of a bus, are more of a semantic task than a geometric/3D one. This explains a considerable part of the failures of our representation, which is object/semantic agnostic.

5 Discussion and Conclusion

To summarize, we developed a generic 3D representation through solving a set of supervised foundational proxy tasks. We reported state-of-the-art results on the supervised tasks and showed that the learned representation manifests generalization and abstraction traits. However, a number of questions remain open:

Though we were inspired by cognitive studies in defining the foundational supervised tasks leading to a generalizable representation, this remains at the level of inspiration. Given that a ‘taxonomy’ among basic 3D tasks has not been developed, it is not concretely defined which tasks are foundational and which ones are secondary. Developing such a taxonomy (i.e., whether task A is inclusive of, overlapping with, or disjoint from task B), or more generally efforts towards understanding the task space, would be a rewarding step towards soundly developing a 3D-complete representation. Also, the semantic and 3D aspects of the visual world are tangled together. So far, we have developed independent semantic and 3D representations, but investigating concrete techniques for integrating them (beyond simplistic late fusion or ConvNet fine-tuning) is a worthwhile future direction for research. Perhaps inspirations from the partitioning of the visual cortex could be insightful towards developing the ultimate vision-complete representation.