1 Introduction

Automatic 3D reconstruction from a single image has long been a challenging problem in computer vision. Previous work has demonstrated that an effective approach to this problem is to exploit structural regularities in man-made environments, such as planar surfaces, repetitive patterns, symmetries, rectangles, and cuboids [5, 12, 14, 15, 21, 28, 33]. Further, the 3D models obtained by harnessing such structural regularities are often attractive in practice, because they provide a high-level, compact representation of the scene geometry, which is desirable for many applications such as large-scale map compression, semantic scene understanding, and human-robot interaction.

In this paper, we study how to recover 3D planes – arguably the most common structure in man-made environments – from a single image. In the literature, several methods have been proposed to fit a scene with a piecewise planar model. These methods typically take a bottom-up approach: First, geometric primitives such as straight line segments, corners, and junctions are detected in the image. Then, planar regions are discovered by grouping the detected primitives based on their spatial relationships. For example, [3, 6, 27, 34] first detect line segments in the image, and then cluster them into several classes, each associated with a prominent vanishing point. [21] further detects junctions formed by multiple intersecting planes to generate model hypotheses. Meanwhile, [9, 11, 16] take a learning-based approach to predict the orientations of local image patches, and then group the patches with similar orientations to form planar regions.

However, despite its popularity, there are several inherent difficulties with the bottom-up approach. First, geometric primitives may not be reliably detected in man-made environments (e.g., due to the presence of poorly textured or specular surfaces). Therefore, it is very difficult to infer the geometric properties of such surfaces. Second, the detected primitives often contain a large number of irrelevant features or outliers (e.g., due to the presence of non-planar objects), making the grouping task highly challenging. This is the main reason why most existing methods resort to rather restrictive assumptions, e.g., requiring “Manhattan world” scenes with three mutually-orthogonal dominant directions or a “box” room model, to filter outliers and produce reasonable results. But such assumptions greatly limit the applicability of those methods in practice.

Fig. 1. We propose a new, end-to-end trainable deep neural network to recover 3D planes from a single image. (a) Given an input image, the network simultaneously predicts (i) a plane segmentation map that partitions the image into planar surfaces plus non-planar objects, and (ii) the plane parameters \(\{\mathbf {n}_j\}_{j=1}^m\) in 3D space. (b) With the output of our network, a piecewise planar 3D model of the scene can be easily created.

In view of these fundamental difficulties, we take a very different route to 3D plane recovery in this paper. Our method does not rely on grouping low-level primitives such as line segments and image patches. Instead, inspired by the recent success of convolutional neural networks (CNNs) in object detection and semantic segmentation, we design a novel, end-to-end trainable network to directly identify all planar surfaces in the scene, and further estimate their parameters in the 3D space. As illustrated in Fig. 1, the network takes a single image as input, and outputs (i) a segmentation map that identifies the planar surfaces in the image and (ii) the parameters of each plane in the 3D space, thus effectively creating a piecewise planar model for the scene.

One immediate difficulty with our learning-based approach is the lack of training data with annotated 3D planes. To avoid the tedious manual labeling process, we propose a novel plane structure-induced loss which essentially casts our problem as one of single-image depth prediction. Our key insight here is that, if we can correctly identify the planar regions in the image and predict the plane parameters, then we can also accurately infer the depth in these regions. In this way, we are able to leverage existing large-scale RGB-D datasets to train our network. Moreover, as pixel-level semantic labels are often available in these datasets, we show how to seamlessly incorporate the labels into our network to better distinguish planar and non-planar objects.

In summary, the contributions of this work are: (i) We design an effective, end-to-end trainable deep neural network to directly recover 3D planes from a single image. (ii) We develop a novel learning scheme that takes advantage of existing RGB-D datasets and the semantic labels therein to train our network without extra manual labeling effort. Experimental results demonstrate that our method significantly outperforms, both qualitatively and quantitatively, existing plane detection methods. Further, our method runs in real time at test time and is thus suitable for a wide range of applications such as visual localization and mapping, and human-robot interaction.

2 Related Work

3D Plane Recovery from a Single Image. Existing approaches to this problem can be roughly grouped into two categories: geometry-based methods and appearance-based methods. Geometry-based methods explicitly analyze the geometric cues in the 2D image to recover 3D information. For example, under the pinhole camera model, parallel lines in 3D space are projected to converging lines in the image plane. The common point of intersection, perhaps at infinity, is called the vanishing point [13]. By detecting the vanishing points associated with two sets of parallel lines on a plane, the plane’s 3D orientation can be uniquely determined [3, 6, 27]. Another important geometric primitive is the junction formed by two or more lines of different orientations. Several works make use of junctions to generate plausible 3D plane hypotheses or remove impossible ones [21, 34]. A different approach is to detect rectangular structures in the image, which are typically formed by two sets of orthogonal lines on the same plane [26]. However, all these methods rely on the presence of strong regular structures, such as parallel or orthogonal lines in a Manhattan world scene, and hence have limited applicability in practice.

To overcome this limitation, appearance-based methods focus on inferring geometric properties of an image from its appearance. For example, [16] proposes a diverse set of features (e.g., color, texture, location and shape) and uses them to train a model to classify each superpixel in an image into discrete classes such as “support” and “vertical (left/center/right)”. [11] uses a learning-based method to predict continuous 3D orientations at a given image pixel. Further, [9] automatically learns meaningful 3D primitives for single image understanding. Our method also falls into this category. But unlike existing methods which take a bottom-up approach by grouping local geometric primitives, our method trains a network to directly predict global 3D plane structures. Recently, [22] also proposes a deep neural network for piecewise planar reconstruction from a single image. But its training requires ground truth 3D planes and does not take advantage of the semantic labels in the dataset.

Machine Learning and Geometry. There is a large body of work on developing machine learning techniques to infer pixel-level geometric properties of the scene, mostly in the context of depth prediction [7, 30] and surface normal prediction [8, 18]. But little work has been done on detecting mid/high-level 3D structures with supervised data. A notable exception which is also related to our problem is the line of research on indoor room layout estimation [5, 14, 15, 20, 28]. In these works, however, the scene geometry is assumed to follow a simple “box” model which consists of several mutually orthogonal planes (e.g., ground, ceiling, and walls). In contrast, our work aims to detect 3D planes under arbitrary configurations.

3 Method

3.1 Difficulty in Obtaining Ground Truth Plane Annotations

As with most computer vision problems, a large-scale dataset with ground truth annotations is needed to effectively train the neural network for our task. Unfortunately, since planar regions often have complex boundaries in an image, manually labeling such regions could be very time-consuming. Further, it is unclear how to extract precise 3D plane parameters from an image.

To avoid the tedious manual labeling process, one strategy is to automatically convert the per-pixel depth maps in existing RGB-D datasets into planar surfaces. To this end, existing multi-model fitting algorithms can be employed to cluster 3D points derived from the depth maps. However, this is not an easy task either. Here, the fundamental difficulty lies in the choice of a proper threshold in practice to distinguish the inliers of a model instance (e.g., 3D points on a particular plane) from the outliers, regardless of which algorithm one chooses.

To illustrate this difficulty, we use the SYNTHIA dataset [29] which provides a large number of photo-realistic synthetic images of urban scenes and the corresponding depth maps (see Sect. 4.1 for more details). The dataset is generated by rendering a virtual city created using the Unity game development platform. Thus, the depth maps are noise-free. To detect planes from the 3D point cloud, we apply a popular multi-model fitting method called J-Linkage [31]. Similar to the RANSAC technique, this method is based on sampling consensus. We refer interested readers to [31] for a detailed description of the method.

A key parameter of J-Linkage is a threshold \(\epsilon \) which controls the maximum distance between a model hypothesis (i.e., a plane) and the data points belonging to the hypothesis. In Fig. 2, we show example results produced by J-Linkage with different choices of \(\epsilon \). As one can see in Fig. 2(c), when a small threshold (\(\epsilon =0.5\)) is used, the method breaks the building facade on the right into two planes. This is because the facade is not completely planar due to small indentations (e.g., the windows). When a large threshold (\(\epsilon =2\)) is used (Fig. 2(d)), the stairs on the building on the left are incorrectly grouped with another building. Also, some objects (e.g., cars, pedestrians) are merged with the ground. If we use these results as ground truth to train a deep neural network, the network will also likely learn the systematic errors in the estimated planes. And the problem becomes even worse if we want to train our network on real datasets. Due to the limitation of existing 3D acquisition systems (e.g., RGB-D cameras and LIDAR devices) and computational tools, the depth maps in these datasets are often noisy and of limited resolution and limited reliable range. Clustering based on such depth maps is prone to errors.

Fig. 2. Difficulty in obtaining ground truth plane annotations. (a–b): Original image and depth map. (c–d): Plane fitting results generated by J-Linkage with \(\epsilon = 0.5\) and \(\epsilon = 2\), respectively.

3.2 A New Plane Structure-Induced Loss

The challenge in obtaining reliable labels motivates us to develop alternative training schemes for 3D plane recovery. Specifically, we ask the following question: Can we leverage the wide availability of large-scale RGB-D and/or 3D datasets to train a network to recognize geometric structures such as planes without obtaining ground truth annotations about the structures?

To address this question, our key insight is that, if we can recover 3D planes from the image, then we can use these planes to (partially) explain the scene geometry, which is generally represented by a 3D point cloud. Specifically, let \(\{I_i, D_i\}_{i=1}^n\) denote a set of n training RGB image and depth map pairs with known camera intrinsic matrix K. Then, for any pixel \(\mathbf {q}\doteq [x,y,1]^T\) (in homogeneous coordinates) on image \(I_i\), it is easy to compute the corresponding 3D point as \(Q = D_i(\mathbf {q})\cdot K^{-1}\mathbf {q}\). Further, let \(\mathbf {n}\in \mathbb {R}^3\) represent a 3D plane in the scene. If Q lies on the plane, then we have \(\mathbf {n}^T Q = 1\).
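As a quick illustration of this parameterization, the following NumPy sketch back-projects a pixel with known depth and evaluates the plane equation. The intrinsic matrix K and the plane \(\mathbf {n}\) below are made-up values for illustration, not taken from the paper.

```python
import numpy as np

K = np.array([[320.0,   0.0, 320.0],
              [  0.0, 320.0, 192.0],
              [  0.0,   0.0,   1.0]])   # assumed camera intrinsics (illustrative)

def backproject(q_xy, depth, K):
    """Lift pixel (x, y) with depth D(q) to a 3D point Q = D(q) * K^{-1} q."""
    q = np.array([q_xy[0], q_xy[1], 1.0])          # homogeneous pixel coordinates
    return depth * np.linalg.inv(K) @ q

# Example plane with n^T Q = 1 for all points on it: a ground plane 1.5 units
# below the camera center, assuming the y-axis points downward.
n = np.array([0.0, 1.0 / 1.5, 0.0])

Q = backproject((400, 300), depth=4.2, K=K)
residual = abs(n @ Q - 1.0)   # zero iff Q lies exactly on the plane
```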

With the above observation, assuming there are m planes in the image \(I_i\), we can now train a network to simultaneously output (i) a per-pixel probability map \(S_i\), where \(S_i(\mathbf {q})\) is an \((m+1)\)-dimensional vector with its j-th element \(S_i^j(\mathbf {q})\) indicating the probability of pixel \(\mathbf {q}\) belonging to the j-th plane, and (ii) the plane parameters \(\varPi _i = \{\mathbf {n}_i^j\}_{j=1}^m\), by minimizing the following objective function:

$$\begin{aligned} \mathcal {L}= \sum _{i=1}^n \sum _{j=1}^m \left( \sum _{\mathbf {q}} S_i^j(\mathbf {q}) \cdot | (\mathbf {n}_i^j)^T Q - 1| \right) + \alpha \sum _{i=1}^n \mathcal {L}_{reg}(S_i), \end{aligned}$$
(1)

where \(\mathcal {L}_{reg}(S_i)\) is a regularization term preventing the network from generating a trivial solution \(S_i^0(\cdot ) \equiv 1\), i.e., classifying all pixels as non-planar, and \(\alpha \) is a weight balancing the two terms.
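For concreteness, here is a minimal NumPy sketch of this loss for a single image, assuming the network outputs a segmentation map S of shape (H, W, m+1) with channel 0 reserved for non-planar pixels, plane parameters of shape (m, 3), and per-pixel 3D points Q precomputed from the ground truth depth. The actual implementation uses differentiable ops (TensorFlow in our case), and the regularizer shown here is the simple form detailed in Sect. 3.3.

```python
import numpy as np

def plane_structure_loss(S, planes, Q, alpha=0.1):
    """S: (H, W, m+1) probabilities, planes: (m, 3), Q: (H, W, 3) 3D points."""
    m = planes.shape[0]
    fit = 0.0
    for j in range(m):
        residual = np.abs(Q @ planes[j] - 1.0)     # |n_j^T Q(q) - 1| at every pixel
        fit += np.sum(S[..., j + 1] * residual)    # weighted by S_i^j(q)
    # simple regularizer (cf. Sect. 3.3): push the total planar probability towards 1
    p_plane = np.clip(S[..., 1:].sum(axis=-1), 1e-6, 1.0)
    reg = -np.sum(np.log(p_plane))
    return fit + alpha * reg
```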

Before proceeding, we make two important observations about our formulation in Eq. (1). First, the term \(| (\mathbf {n}_i^j)^T Q - 1|\) measures the deviation of a 3D scene point Q from the j-th plane in \(I_i\), parameterized by \(\mathbf {n}_i^j\). In general, for a pixel \(\mathbf {q}\) in the image, we know from perspective geometry that the corresponding 3D point must lie on a ray characterized by \(\lambda K^{-1}\mathbf {q}\), where \(\lambda \) is the depth at \(\mathbf {q}\). If this 3D point is also on the j-th plane, we must have

$$\begin{aligned} (\mathbf {n}_i^j) ^T \cdot \lambda K^{-1}\mathbf {q}= 1 \Longrightarrow \lambda = \frac{1}{(\mathbf {n}_i^j) ^T \cdot K^{-1}\mathbf {q}}. \end{aligned}$$
(2)

Hence, in this case, \(\lambda \) can be regarded as the depth at \(\mathbf {q}\) constrained by \(\mathbf {n}_i^j\). Now, we can rewrite the term as:

$$\begin{aligned} | (\mathbf {n}_i^j)^T Q - 1| = |(\mathbf {n}_i^j)^T D_i(\mathbf {q})\cdot K^{-1}\mathbf {q}- 1| = | D_i(\mathbf {q}) / \lambda -1 |. \end{aligned}$$
(3)

Thus, the term \(| (\mathbf {n}_i^j)^T Q - 1|\) essentially compares the depth \(\lambda \) induced by the j-th predicted plane with the ground truth \(D_i(\mathbf {q})\), and penalizes the difference between them. In other words, our formulation casts the 3D plane recovery problem as a depth prediction problem.

Second, Eq. (1) couples plane segmentation and plane parameter estimation in a loss that encourages consistent explanations of the visual world through the recovered plane structure. It mimics the behavior of biological agents (e.g., humans) which also employ structural priors for 3D visual perception of the world [32]. This is in contrast to alternative methods that rely on ground truth plane segmentation maps and plane parameters as direct supervision signals to tackle the two problems separately.
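To make the first observation concrete, the following NumPy sketch computes the depth map induced by a single plane, Eq. (2), and the per-pixel residual of Eq. (3). The function names are ours; the intrinsics K and depth map come from the dataset.

```python
import numpy as np

def plane_induced_depth(n, K, H, W):
    """lambda(q) = 1 / (n^T K^{-1} q) for every pixel q = [x, y, 1]^T, as in Eq. (2)."""
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.linalg.inv(K) @ np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1)
    return (1.0 / (n @ rays)).reshape(H, W)

def plane_residual(n, K, depth_gt):
    """|n^T Q - 1| = |D(q) / lambda(q) - 1|, the per-pixel term in Eq. (1), as in Eq. (3)."""
    lam = plane_induced_depth(n, K, *depth_gt.shape)
    return np.abs(depth_gt / lam - 1.0)
```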

3.3 Incorporating Semantics for Planar/Non-planar Classification

Now we turn our attention to the regularization term \(\mathcal {L}_{reg}(S_i)\) in Eq. (1). Intuitively, we wish to use the predicted planes to explain as much scene geometry as possible. Therefore, a natural choice of \(\mathcal {L}_{reg}(S_i)\) is to encourage plane predictions by minimizing the cross-entropy loss with a constant label of 1 at each pixel. Specifically, letting \(p_{plane}(\mathbf {q}) = \sum _{j=1}^m S_i^j(\mathbf {q})\) be the sum of probabilities of pixel \(\mathbf {q}\) being assigned to each plane, we write

$$\begin{aligned} \mathcal {L}_{reg}(S_i) = \sum _{\mathbf {q}} - 1\cdot \log (p_{plane}(\mathbf {q})) - 0\cdot \log (1 - p_{plane}(\mathbf {q})). \end{aligned}$$
(4)

Note that, while the above term effectively encourages the network to explain every pixel in the image using the predicted plane models, it treats all pixels equally. However, in practice, some objects are more likely to form meaningful planes than others. For example, a building facade is often regarded as a planar surface, whereas a pedestrian or a car is typically viewed as non-planar. In other words, if we can incorporate such high-level semantic information into our training scheme, the network is expected to achieve better performance in differentiating planar vs. non-planar surfaces.

Motivated by this observation, we propose to further utilize the semantic labels in existing datasets. Take the SYNTHIA dataset as an example. The dataset provides precise pixel-level semantic annotations for 13 classes in urban scenes. For our purpose, we group these classes into “planar” = {building, fence, road, sidewalk, lane-marking} and “non-planar” = {sky, vegetation, pole, car, traffic signs, pedestrians, cyclists, miscellaneous}. Then, letting \(z(\mathbf {q}) = 1\) if pixel \(\mathbf {q}\) belongs to one of the “planar” classes, and \(z(\mathbf {q})=0\) otherwise, we revise our regularization term as:

$$\begin{aligned} \mathcal {L}_{reg}(S_i) = \sum _{\mathbf {q}} - z(\mathbf {q}) \cdot \log (p_{plane}(\mathbf {q})) - (1-z(\mathbf {q}))\cdot \log (1 - p_{plane}(\mathbf {q})). \end{aligned}$$
(5)

Note that the choices of planar/non-planar classes are dataset- and problem-dependent. For example, one may argue that “sky” can be viewed as a plane at infinity and thus should be included in the “planar” classes. Regardless of the particular choices, we emphasize that we provide a flexible way to incorporate high-level semantic information (generated by human annotators) into the plane detection problem. This is in contrast to traditional geometric methods that rely solely on a single threshold to distinguish planar vs. non-planar surfaces.
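As a small illustration, the sketch below builds the binary mask \(z(\mathbf {q})\) from a semantic label map and evaluates the regularizer of Eq. (5). The integer class ids are placeholders for whatever encoding the dataset uses.

```python
import numpy as np

# placeholder ids standing in for {building, fence, road, sidewalk, lane-marking}
PLANAR_CLASSES = {2, 3, 7, 8, 11}

def semantic_regularizer(S, semantic_map, eps=1e-6):
    """Eq. (5): cross-entropy between p_plane(q) and the planar/non-planar label z(q)."""
    z = np.isin(semantic_map, list(PLANAR_CLASSES)).astype(np.float64)
    p_plane = np.clip(S[..., 1:].sum(axis=-1), eps, 1.0 - eps)   # sum_j S^j(q)
    return np.sum(-z * np.log(p_plane) - (1.0 - z) * np.log(1.0 - p_plane))
```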

3.4 Network Architecture

In this paper, we choose a fully convolutional network (FCN), following its recent success in various pixel-level prediction tasks such as semantic segmentation [2, 23] and scene flow estimation [25]. Figure 3 shows the overall architecture of our proposed network. To simultaneously estimate the plane segmentation map and plane parameters, our network consists of two prediction branches, as we elaborate below.

Plane Segmentation Map. To predict the plane segmentation map, we use an encoder-decoder design with skip connections and multi-scale side predictions, similar to the DispNet architecture proposed in [25]. Specifically, the encoder takes the whole image as input and produces high-level feature maps via a convolutional network. The decoder then gradually upsamples the feature maps via deconvolutional layers to make final predictions, taking into account also the features from different encoder layers. The multi-scale side predictions further allow the network to be trained with deep supervision. We use ReLU for all layers except for the prediction layers, where the softmax function is applied.

Fig. 3. Network architecture. The width and height of each block indicate the channel and the spatial dimension of the feature map, respectively. Each reduction (or increase) in size indicates a change by a factor of 2. The first convolutional layer has 32 channels. The filter size is 3 except for the first four convolutional layers (7, 7, 5, 5).

Plane Parameters. The plane parameter prediction branch shares the same high-level feature maps with the segmentation branch. The branch consists of two stride-2 convolutional layers (\(3\times 3\times 512\)) followed by a \(1\times 1\times 3m\) convolutional layer to output the parameters of the m planes. Global average pooling is then used to aggregate predictions across all spatial locations. We use ReLU for all layers except for the last layer, where no activation is applied.
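The sketch below shows one way to express the plane-parameter branch, together with the final prediction layer of the segmentation branch, using tf.keras layers. Our implementation is in TensorFlow, but the exact code here (including the kernel size of the segmentation prediction layer) is an illustrative assumption rather than a verbatim excerpt.

```python
import tensorflow as tf

def plane_parameter_branch(features, m=5):
    """Two stride-2 3x3x512 convs, a 1x1x3m conv, then global average pooling."""
    x = tf.keras.layers.Conv2D(512, 3, strides=2, padding='same', activation='relu')(features)
    x = tf.keras.layers.Conv2D(512, 3, strides=2, padding='same', activation='relu')(x)
    x = tf.keras.layers.Conv2D(3 * m, 1, activation=None)(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)          # aggregate over spatial locations
    return tf.keras.layers.Reshape((m, 3))(x)                # m plane parameter vectors n_j

def segmentation_head(decoder_features, m=5):
    """(m+1)-way per-pixel softmax over {non-planar, plane 1, ..., plane m}."""
    return tf.keras.layers.Conv2D(m + 1, 3, padding='same', activation='softmax')(decoder_features)
```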

Implementation Details. Our network is trained from scratch using the publicly available Tensorflow framework. By default, we set the weight in Eq. (1) as \(\alpha =0.1\), and the number of planes as \(m=5\). During training, we adopt the Adam [17] method with \(\beta _1 = 0.99\) and \(\beta _2=0.9999\). The batch size is set to 4, and the learning rate is set to 0.0001. We also augment the data by scaling the images with a random factor in [1, 1.15] followed by a random cropping. Convergence is reached at about 500K iterations.
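For reference, a minimal sketch of the stated training configuration and the scale-then-crop augmentation, written with standard tf.keras / tf.image calls; the paper only lists the hyperparameter values, so the code itself is an assumption.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.99, beta_2=0.9999)
BATCH_SIZE = 4
ALPHA = 0.1        # weight of the regularization term in Eq. (1)
NUM_PLANES = 5     # m, the assumed maximum number of planes per image

def augment(image):
    """Scale by a random factor in [1, 1.15], then randomly crop back to the original size."""
    h, w, c = image.shape
    s = tf.random.uniform([], 1.0, 1.15)
    new_size = tf.cast(tf.stack([s * h, s * w]), tf.int32)
    return tf.image.random_crop(tf.image.resize(image, new_size), [h, w, c])
```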

4 Experiments

In this section, we conduct experiments to study the performance of our method, and compare it to existing ones. All experiments are conducted on one Nvidia GTX 1080 Ti GPU device. At testing time, our method runs at about 60 frames per second, and is thus suitable for potential real-time applications.

4.1 Datasets and Ground Truth Annotations

SYNTHIA: The recent SYNTHIA dataset [29] comprises more than 200,000 photo-realistic images rendered from virtual city environments with precise pixel-wise depth maps and semantic annotations. Since the dataset is designed to facilitate autonomous driving research, all frames are acquired from a virtual car as it navigates in the virtual city. The original dataset contains seven different scenarios. For our experiment, we select three scenarios (SEQS-02, 04, and 05) that represent city street views. For each scenario, we use the sequences for all four seasons (spring, summer, fall, and winter). Note that, to simulate real traffic conditions, the virtual car makes frequent stops during navigation. As a result, the dataset has many near-identical frames. We filter these redundant frames using a simple heuristic based on the vehicle speed. Finally, from the remaining frames, we randomly sample 8,000 frames as the training set and another 100 frames as the testing set.

For quantitative evaluation, we need to label all the planar regions in the test images. As discussed in Sect. 3.1, automatic generation of ground truth plane annotations is difficult and error-prone. Thus, we adopt a semi-automatic method to interactively determine the ground truth labels with user input. To label one planar surface in the image, we ask the user to draw a quadrilateral region within that surface. Then, we fit a plane to the 3D points (derived from the ground truth depth map) that fall into that region to obtain the plane parameters and an instance-specific estimate of the variance of the distances between the 3D points and the fitted plane. Note that, with the instance-specific variance estimate, we are able to handle surfaces that deviate from a perfect plane to varying degrees but are still commonly regarded as “planes” by humans. Finally, we use the plane parameters and the variance estimate to find all pixels that belong to the plane. We repeat this process until all planes in the image are labeled.
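A hedged NumPy sketch of this labeling step: fit a plane in the \(\mathbf {n}^T Q = 1\) parameterization to the points inside the user-drawn region by least squares, estimate the residual spread, and collect all pixels within a few standard deviations of the plane. The threshold multiplier k is an illustrative choice, not a value specified above.

```python
import numpy as np

def fit_plane(region_points):
    """region_points: (N, 3) 3D points from the user-drawn quadrilateral."""
    n, *_ = np.linalg.lstsq(region_points, np.ones(len(region_points)), rcond=None)
    sigma = (region_points @ n - 1.0).std()       # instance-specific spread estimate
    return n, sigma

def plane_pixels(all_points, n, sigma, k=3.0):
    """all_points: (H, W, 3); returns the mask of pixels assigned to the plane."""
    return np.abs(all_points @ n - 1.0) < k * sigma
```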

Cityscapes: Cityscapes [4] contains a large set of real street-view video sequences recorded in different cities. From the 3,475 images with publicly available fine semantic annotations, we randomly select 100 images for testing, and use the rest for training. To generate the planar/non-planar masks for training, we label pixels in the following classes as “planar” = {ground, road, sidewalk, parking, rail track, building, wall, fence, guard rail, bridge, and terrain}.

In contrast to SYNTHIA, the depth maps in Cityscapes are highly noisy because they are computed from stereo correspondences. Fitting planes on such data is extremely difficult even with user input. Therefore, to identify planar surfaces in the image, we manually label the boundary of each plane using polygons, and further leverage the semantic annotations to refine it by ensuring that the plane boundary aligns with the object boundary, if they overlap.

4.2 Methods for Comparison

As discussed in Sect. 2, a common approach to plane detection is to use geometric cues such as vanishing points and junction features. However, such methods all make strong assumptions on the scene geometry, e.g., a “box”-like model for indoor scenes or a “vertical-ground” configuration for outdoor scenes. They would fail when these assumptions are violated, as in the case of SYNTHIA and Cityscapes datasets. Thus, we do not compare to these methods. Instead, we compare our method to the following appearance-based methods:

Depth + Multi-model Fitting: For this approach, we first train a deep neural network to predict pixel-level depth from a single image. We directly adopt the DispNet architecture [25] and train it from scratch with ground truth depth data. Following recent work on depth prediction [19], we minimize the berHu loss during training.

To find 3D planes, we then apply two different multi-model fitting algorithms, namely J-Linkage [31] and RansaCov [24], to the 3D points derived from the predicted depth map. We call the corresponding methods Depth + J-Linkage and Depth + RansaCov, respectively. For a fair comparison, we only keep the top-5 planes detected by each method. As mentioned earlier, a key parameter in these methods is the distance threshold \(\epsilon \). We favor them by running J-Linkage or RansaCov multiple times with various values of \(\epsilon \) and retaining the best results.

Geometric Context (GC) [16]: This method uses a number of hand-crafted local image features to predict discrete surface layout labels. Specifically, it trains decision tree classifiers to label the image into three main geometric classes {support, vertical, sky}, and further divides the “vertical” class into five subclasses {left, center, right, porous, solid}. Among these labels, we consider the “support” class and the “left”, “center”, “right” subclasses as four different planes, and the rest as non-planar.

To retrain their classifiers using our training data, we translate the labels in the SYNTHIA dataset into theirs and use the source code provided by the authors. We found that this yields better performance on our testing set than the pre-trained classifiers provided by the authors. We do not include this method in the experiment on the Cityscapes dataset because it is difficult to determine the orientation of the vertical structures from the noisy depth maps.

Finally, we note that there is another closely related work [11], which also detects 3D planes from a single image. Unfortunately, the source code needed to train this method on our datasets is currently unavailable. And it is reported in [11] that its performance on plane detection is on par with that of GC. Thus, we decided to compare our method to GC instead.

4.3 Experiment Results

Plane Segmentation. Figure 4 shows example plane segmentation results on the SYNTHIA dataset. We make several important observations below.

Fig. 4. Plane segmentation results on SYNTHIA. From left to right: Input image; Ground truth; Depth + J-Linkage; Depth + RansaCov; Geometric Context; Ours.

First, neither Depth + J-Linkage nor Depth + RansaCov performs well on the test images. In many cases, they fail to recover the individual planar surfaces (except the ground). To understand the reason, we show the 3D point cloud derived from the predicted depth map in Fig. 5. As one can see, the point cloud tends to be very noisy, making the task of choosing a proper threshold \(\epsilon \) in the multi-model fitting algorithm extremely hard, if possible at all: if \(\epsilon \) is small, the algorithm cannot tolerate the large noise in the point cloud; if \(\epsilon \) is large, it incorrectly merges multiple planes/objects into one cluster. Also, these methods are unable to distinguish planar and non-planar objects because they lack the ability to reason about scene semantics.

Second, GC does a relatively good job in identifying major scene categories (e.g., separating the ground, sky from buildings). However, it has difficulty in determining the orientation of vertical structures (e.g., Fig. 4, first and fifth rows). This is mainly due to the coarse categorization (left/center/right) used by this method. In complex scenes, such a discrete categorization is often ineffective and ambiguous. Also, recall that GC is unable to distinguish planes that have the same orientation but are at different distances (e.g., Fig. 4, fourth row), not to mention finding the precise 3D plane parameters.

Fig. 5. Comparison of 3D models. First column: Input image. Second and third columns: Model generated by depth prediction. Fourth and fifth columns: Model generated by our method.

Table 1. Plane segmentation results. Left: SYNTHIA. Right: Cityscapes.

Third, our method successfully detects most prominent planes in the scene, while excluding non-planar objects (e.g., trees, cars, light poles). This is no surprise because our supervised framework implicitly encodes high-level semantic information as it learns from the labeled data provided by humans. Interestingly, one may observe that, in the last row of Fig. 4, our method classifies the unpaved ground next to the road as non-planar. This is because such surfaces are not considered part of the road in the original SYNTHIA labels. Figure 5 further shows some piecewise planar 3D models obtained by our method.

Fig. 6. Plane segmentation results on Cityscapes. From left to right: Input image; Ground truth; Depth + J-Linkage; Depth + RansaCov; Ours (w/o fine-tuning); Ours (w/ fine-tuning).

For quantitative evaluation, we use three popular metrics [1] to compare the plane segmentation maps obtained by an algorithm with the ground truth: Rand index (RI), variation of information (VOI), and segmentation covering (SC). Table 1 (left) compares the performance of all methods on the SYNTHIA dataset. As one can see, our method outperforms existing methods by a significant margin w.r.t. all evaluation metrics.
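As a reference for how such metrics are computed, the following sketch implements one standard definition of the Rand index from the contingency table between the predicted and ground-truth segment labels; the exact evaluation protocol follows [1].

```python
import numpy as np

def rand_index(seg_pred, seg_gt):
    """Rand index between two integer label maps of the same shape."""
    a, b = seg_pred.ravel(), seg_gt.ravel()
    n = a.size
    table = np.zeros((a.max() + 1, b.max() + 1))     # contingency table
    np.add.at(table, (a, b), 1)
    same_both = (table * (table - 1) / 2).sum()      # pairs co-grouped in both segmentations
    same_pred = (table.sum(1) * (table.sum(1) - 1) / 2).sum()
    same_gt = (table.sum(0) * (table.sum(0) - 1) / 2).sum()
    total = n * (n - 1) / 2
    # agreeing pairs = pairs co-grouped in both + pairs separated in both
    return (total + 2 * same_both - same_pred - same_gt) / total
```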

Table 1 (right) further reports the segmentation accuracies on the Cityscapes dataset. We test our method under two settings: (i) directly applying our model trained on the SYNTHIA dataset, and (ii) fine-tuning our network on the Cityscapes dataset. Again, our method achieves the best performance among all methods. Moreover, fine-tuning on the Cityscapes dataset significantly boosts the performance of our network, even though the provided depth maps are very noisy. Finally, we show example segmentation results on Cityscapes in Fig. 6.

Depth Prediction. To further evaluate the quality of the 3D planes estimated by our method, we compare the depth maps derived from the 3D planes with those obtained via the standard depth prediction pipeline (see Sect. 4.2 for details). Recall that our method outputs a per-pixel probability map \(S(\mathbf {q})\). For each pixel \(\mathbf {q}\) in the test image, we pick the 3D plane with the maximum probability to compute our depth map. We exclude pixels that are considered “non-planar” by our method, since our network is not designed to make depth predictions in that case.
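A minimal NumPy sketch of this conversion, assuming the network outputs S of shape (H, W, m+1) and plane parameters of shape (m, 3); pixels whose most probable class is non-planar are marked with NaN and excluded from the comparison.

```python
import numpy as np

def depth_from_planes(S, planes, K):
    """Per-pixel depth induced by the most probable plane, via Eq. (2)."""
    H, W = S.shape[:2]
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.linalg.inv(K) @ np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1)
    lam = 1.0 / (planes @ rays)                       # (m, H*W): depth induced by each plane
    label = S.argmax(axis=-1).reshape(-1)             # 0 = non-planar
    chosen = lam[np.maximum(label - 1, 0), np.arange(H * W)]
    return np.where(label > 0, chosen, np.nan).reshape(H, W)
```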

As shown in Table 2, our method achieves competitive results on both datasets, but the accuracies are slightly lower than those of the standard depth prediction pipeline. The decrease in accuracy may be partly attributed to the fact that our method is designed to recover large planar structures in the scene and therefore ignores small variations and details in the scene geometry.

Table 2. Depth prediction results.
Fig. 7. Failure examples.

Failure Cases. Figure 7 shows typical failure cases of our method, which include occasionally separating one plane into two (first column) or merging multiple planes into one (second column). Interestingly, in the former case, one can still obtain a decent 3D model (Fig. 5, last row), suggesting opportunities to further refine our results via post-processing. Our method also has problems with curved surfaces (third column).

Other failures are typically associated with our assumption that there are at most \(m=5\) planes in the scene. For example, in Fig. 7, fourth column, the building on the right has a large number of facades. The problem becomes even more difficult when multiple planes lie at a great distance (fifth column). We leave adaptively choosing the number of planes within our framework for future work.

5 Conclusion

This paper has presented a novel approach to recovering 3D planes from a single image using convolutional neural networks. We have demonstrated how to train the network, without 3D plane annotations, via a novel plane structure-induced loss. In fact, the idea of exploring structure-induced loss to train neural networks is by no means restricted to planes. We plan to generalize the idea to detect other geometric structures, such as rectangles and cuboids.

Another promising direction for future work would be to improve the generalizability of the networks via unsupervised learning, as suggested by [10]. For example, it is interesting to probe the possibility of training our network without depth information, which is hard to obtain in many real world applications.