1 Introduction

Semi-Global Matching (SGM) is a popular stereo matching algorithm proposed by Hirschmüller [15] that has found widespread use in applications ranging from 3D mapping [17, 34, 39, 40] and robot and drone navigation [19, 38] to assisted driving [8]. The technique is efficient, parallelizable, and suitable for real-time stereo reconstruction on FPGAs and GPUs [2, 9, 19]. SGM incorporates regularization in the form of smoothness priors, similar to global stereo methods but at lower computational cost. The main idea in SGM is to approximate a 2D Markov random field (MRF) optimization problem with several independent 1D scanline optimization problems corresponding to multiple canonical scanline directions in the image (typically 4 or 8). These 1D problems are optimized exactly using dynamic programming (DP) by aggregating matching costs along the multi-directional 1D scanlines. The costs of the minimum cost paths for the various directions are then summed up to compute a final aggregated cost per pixel. Finally, a winner-take-all (WTA) strategy is used to select the disparity with the minimum aggregated cost at each pixel.

Summation of the aggregated costs from multiple directions and the final WTA strategy are both ad-hoc steps in SGM that lack proper theoretical justification. The summation was originally proposed to reduce 1D streaking artifacts [15] but is ineffective for weakly textured slanted surfaces and also generally inadequate when multiple scanline optimization solutions are inconsistent.

Fig. 1. Fusing Multiple Scanline Proposals. Left: Visualization of disparity maps from SGM, two (out of 8) scanline optimizations (SO) and our proposed SGM-Forest method. While SGM is more accurate than each SO on the whole image, each SO solution is better in some specific areas. SGM-Forest identifies the best SO proposal at each pixel and produces the best overall result. Right: Error plots for SGM, SO and SGM-Forest solutions (solid line) and upper bounds for oracles making optimal selections (dotted line). In this example, SGM-Forest gets close to the upper bounds.

Our main motivation in this work is to devise a better strategy to fuse 1D scanline optimization costs from multiple directions. We argue that the scanline optimization solutions should be considered as independent disparity map proposals and the WTA step should be replaced by a more general fusion step. Figure 1 shows two of the eight scanline optimization solutions for the Adirondack pair from the Middlebury 2014 dataset [35]. While both solutions suffer from directional bias due to their respective propagation directions, each solution is accurate in certain image regions where the other one is inaccurate. For example, the horizontal pass produces accurate disparities near the left occlusion boundaries of the chair, whereas the diagonal pass performs better on the right occlusion edges. In those regions, the final SGM solution is slightly worse. The error plot in Fig. 1 quantifies this observation for the entire image. Whereas SGM is more accurate than each scanline optimization individually, the joint accuracy of all scanlines is much higher than SGM. Here, joint accuracy refers to a theoretical upper bound of the achievable accuracy of an oracle, which has access to ground truth and selects the best out of all the scanline solution proposals.
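The oracle upper bound described above can be illustrated with a short NumPy sketch. This is only a minimal illustration under stated assumptions: the function names `bad_pixel_rate` and `oracle_selection` are hypothetical, and the error metric (fraction of pixels with an absolute disparity error above a threshold) is assumed here as a typical benchmark measure rather than taken from the text.

```python
import numpy as np

def bad_pixel_rate(disparity, ground_truth, threshold=1.0):
    """Fraction of pixels whose absolute disparity error exceeds the threshold."""
    return float(np.mean(np.abs(disparity - ground_truth) > threshold))

def oracle_selection(proposals, ground_truth):
    """Select, per pixel, the proposal closest to the ground truth.

    proposals:    (N, H, W) array holding N disparity map proposals.
    ground_truth: (H, W) array of ground truth disparities.
    """
    errors = np.abs(proposals - ground_truth[None])
    best = errors.argmin(axis=0)  # (H, W) index of the best proposal
    return np.take_along_axis(proposals, best[None], axis=0)[0]
```

The error of `oracle_selection` is never larger than the error of any single proposal, which is why it serves as an upper bound on the accuracy that a per-pixel selection method can reach.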

Based on this insight, we formulate the fusion step as the task of selecting the best amongst all the scanline optimization proposals at each pixel in the image. We propose to solve this task using supervised learning. Our method, named SGM-Forest, uses a per-pixel random forest classifier. As shown in Fig. 1, it gets close to the theoretical upper bound and significantly outperforms SGM.

The per-pixel classifier in SGM-Forest is trained on a low-dimensional input feature that encodes a sparse set of aggregated cost samples. Specifically, these cost values are sampled from the cost volumes computed during the scanline optimization passes. The sampling locations correspond to the disparity candidates for all scanline directions at each pixel. In fact, the proposals need not be limited to the usual scanline directions. Including the SGM solution and two horizontal scanline optimization solutions from the right image as additional proposals improves accuracy further. We train and evaluate the forest using ground truth disparity maps provided by stereo benchmarks [35, 37, 41]. At test time, the random forest predicts the disparity proposal to be selected at each pixel. Inference is fast and parallelizable and thus has small overhead. The forest automatically outputs per-pixel posterior class probabilities from which suitable confidence maps are derived, for use in a final disparity refinement step.

Thus, the main contribution in this paper is a new, efficient learning-based fusion method for SGM that directly predicts the best amongst all the 1D scanline optimization disparity proposals at each pixel based on a small set of scanline optimization costs. SGM-Forest uses this fusion method instead of SGM’s sum-based aggregation and WTA steps and our results show that it consistently outperforms SGM in many different settings. We evaluate SGM-Forest on three stereo benchmarks. Currently, it is ranked 1st on ETH3D [41] and is competitive on Middlebury 2014 [35] and KITTI 2015 [10]. We run extensive ablation studies and show that our method is extremely robust to dataset bias. It outperforms SGM even when the forests are trained on datasets from different domains.

2 Related Work

In this section, we review SGM and learning-based methods for stereo. We then compare and contrast our proposed SGM-Forest to closely related works.

SGM was built on top of earlier methods such as 1D scanline optimization [29, 37, 50] and dynamic programming stereo [46] with a new aggregation scheme to fix the lack of proper 2D regularization in those methods. However, a proper derivation of the aggregation step remained elusive until Drory et al. [6] showed its connection to non-loopy belief propagation on a special graph structure. Veksler [47] and Bleyer et al. [3] advanced dynamic programming stereo to tree structures connecting all pixels, but those methods have not been widely adopted. SGM has been extended to improve speed and accuracy [1, 2, 7, 9, 13, 14, 16, 19], reduce memory usage [18, 19, 23], and to compute optical flow [45, 49].

Scharstein and Pal [36] were among the first to use learning in stereo. They trained a conditional random field (CRF) on Middlebury 2005–06 datasets to model the relationship between the CRF’s penalty terms and local intensity gradients in the image. The KITTI and Middlebury 2014 [10, 35] benchmarks encouraged much work on learning. In particular, CNNs have been trained to compute robust matching costs [5, 25, 48]. Zbontar and LeCun were the first; they proposed MC-CNN [48] and reported higher accuracy when using MC-CNN in conjunction with SGM for regularization and additional post-processing steps. Newer methods combined MC-CNN with better optimization but as a result are much slower. The method of Taniai et al. [44] uses iterative graph cut optimization and MC-CNN-acrt [48] and is the current state of the art on Middlebury.

End-to-end training of CNNs is nowadays popular on KITTI [11, 21, 27, 30] but is almost never tested on Middlebury. In one rare case, moderate results were reported [22]. In contrast, our method generalizes across three benchmarks [10, 35, 41] on which it consistently outperforms baseline SGM. Furthermore, we train three separate models on Middlebury 2005–06, KITTI, and ETH3D. All three outperform SGM when tested on the Middlebury 2014 training set. SGM-Net [42] is a CNN-based method for improving SGM. SGM-Net performs more accurate scanline optimization by using a CNN to predict the parameters of the underlying scanline optimization objective. In contrast, we use regular scanline optimization but propose a learning-based fusion step using random forests.

Stereo matching has been solved by combining multiple disparity maps using MRF fusion moves [4, 24, 44]. Fusion moves are quite general, but computationally expensive and need many iterations. This makes them slow. Alternatively, multiple disparity maps can also be fused using learning, based on random forests [43] and CNNs [32]. Other methods first predict confidence maps [20], often via learning [12, 26, 31, 33], and then use the predicted confidence values in a greedy fashion to combine multiple solutions. Drory et al. [6] proposed a different uncertainty measure for SGM but do not show how to use it. Unlike MRF fusion moves [24], our fusion method is not general. It combines a specific number and specific type of proposals but does so in a single efficient step.

Michael et al. [28] and Poggi and Mattoccia [33] (SGM-RF) proposed replacing SGM’s sum-based aggregation with a weighted sum, setting smaller weights in areas with 1D streaking artifacts. The former work [28] proposes using global weights per scanline direction. SGM-RF [33] is more effective as it predicts per-pixel weights for each scanline direction using random forests based on disparity-based features. However, SGM-RF was not evaluated on the official test sets of the Middlebury 2014 and KITTI 2015 benchmarks. Mac Aodha et al. [26] also used random forests to fuse optical flow proposals using flow-based features.

Our SGM-Forest differs from these methods in several ways. First, it avoids predicting confidence separately for each proposal [26, 33] but instead directly predicts the best proposal at each pixel. The forest is invoked only once at each pixel and has information from all the scanline directions. This makes inference more effective. Furthermore, the features used by our forest are directly obtained by sampling the aggregated cost volumes of each scanline optimization problem at multiple selective disparities. This is much more effective than handcrafted disparity-based features [33, 43]. Finally, our confidence maps derived from posterior class probabilities are normalized and hence better for refining the disparities during post-processing. Haeusler et al. [12] aim to detect unreliable disparities and suggest adding SGM’s aggregated (summed) costs to their handcrafted disparity-based features. In contrast, we focus on fusing multiple proposals and propose to sample all the cost volumes for each independent scanline optimization at multiple disparities to better exploit contextual information.

3 Semi-Global Matching

We now review SGM as proposed by Hirschmüller [15] for approximate energy minimization of a 2D Markov Random Field (MRF)

$$\begin{aligned} E(D) = \sum _{\mathbf{p}} C_\mathbf{p}(d_\mathbf{p}) + \sum _{\mathbf{p},\mathbf{q}\in \mathcal {N}} V(d_\mathbf{p}, d_\mathbf{q}), \end{aligned}$$
(1)

where \(C_\mathbf{p}(d)\) is a unary data term that encodes the penalty of assigning pixel \(\mathbf{p}\in \mathbb {R}^2\) to disparity \(d \in \mathcal {D} =\) \( \{d_{\min }, \ldots , d_{\max }\}\). The pairwise smoothness term \(V(d, d')\) penalizes disparity differences between neighboring pixels \(\mathbf{p}\) and \(\mathbf{q}\). In SGM, the term V is chosen to have the following specific form

$$\begin{aligned} V(d, d') = \left\{ \begin{array}{ll} 0 &{} \text{ if } d = d' \\ P_1 &{} \text{ if } |d - d'| = 1 \\ P_2 &{} \text{ if } |d - d'| \ge 2\, , \end{array} \right. \end{aligned}$$
(2)

which favors first-order smoothness, i.e., has a preference for fronto-parallel surfaces. Minimizing the 2D MRF is NP-hard. Therefore, SGM instead solves multiple scanline optimization problems, each of which involves solving the 1D version of Eq. 1 along 1D scanlines in 8 cardinal directions \(\mathbf{r}\in \{(0, 1), (0, -1), (1, 0), ... \}\). For each direction \(\mathbf{r}\), SGM computes an aggregated matching cost

$$\begin{aligned} L_\mathbf{r}(\mathbf{p}, d) = C_\mathbf{p}(d) + \min _{d' \in \mathcal {D}}(L_\mathbf{r}(\mathbf{p}-\mathbf{r}\,, d')+ V(d, d')) . \end{aligned}$$
(3)

The definition of \(L_\mathbf{r}(\mathbf{p}, d)\) is recursive and is typically started from a pixel on the image border. An aggregated cost volume \(S(\mathbf{p},d)\) is finally computed by summing up the eight individual aggregated cost volumes

$$\begin{aligned} S(\mathbf{p},d) = \sum _\mathbf{r}L_\mathbf{r}(\mathbf{p},d)\,. \end{aligned}$$
(4)

The final disparity map is obtained using a WTA strategy by selecting per-pixel minima in the aggregated cost volume

$$\begin{aligned} d_\mathbf{p}= {\mathop {{\text {arg}}\,{\text {min}}}\limits _{d}}\,S(\mathbf{p}, d) . \end{aligned}$$
(5)

The steps in Eqs. 4 and 5 are accurate when the costs from different scanline directions are mostly consistent wrt. each other. However, these steps are likely to fail as the scanlines become more inconsistent. To overcome this problem, we propose a novel fusion method to robustly compute the disparity \(d_\mathbf{p}\) from the multiple scanline costs \(L_\mathbf{r}(\mathbf{p}, d)\).
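The recursion in Eq. 3 and the aggregation steps in Eqs. 4 and 5 can be sketched in a few lines of NumPy. The following is a minimal, unoptimized illustration and not the implementation evaluated in this paper. The function names, the (row, column) convention for the directions, and the default penalties are assumptions made for this example, and \(P_2\) is kept constant for simplicity.

```python
import numpy as np

# Assumed convention: each direction r is (row step, column step).
DIRECTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0),
              (1, 1), (1, -1), (-1, 1), (-1, -1)]

def smoothness_penalty(num_disp, p1, p2):
    """Table of V(d, d') from Eq. 2 with shape (num_disp, num_disp)."""
    d = np.arange(num_disp)
    diff = np.abs(d[:, None] - d[None, :])
    return np.where(diff == 0, 0.0, np.where(diff == 1, p1, p2))

def scanline_costs(unary, direction, p1, p2):
    """Aggregated cost L_r of Eq. 3 for one direction.

    unary: (H, W, D) array holding the unary costs C_p(d).
    """
    h, w, num_disp = unary.shape
    dy, dx = direction
    v = smoothness_penalty(num_disp, p1, p2)
    agg = np.zeros(unary.shape, dtype=np.float64)
    # Visit pixels so that the predecessor p - r is always processed first.
    ys = range(h) if dy >= 0 else range(h - 1, -1, -1)
    xs = range(w) if dx >= 0 else range(w - 1, -1, -1)
    for y in ys:
        for x in xs:
            py, px = y - dy, x - dx
            if 0 <= py < h and 0 <= px < w:
                # min over d' of L_r(p - r, d') + V(d, d')
                prev = (agg[py, px][None, :] + v).min(axis=1)
                agg[y, x] = unary[y, x] + prev
            else:
                agg[y, x] = unary[y, x]  # recursion starts at the border
    return agg

def sgm(unary, p1=1.0, p2=4.0):
    """Return the WTA disparity map and the per-direction cost volumes."""
    per_direction = [scanline_costs(unary, r, p1, p2) for r in DIRECTIONS]
    summed = np.sum(per_direction, axis=0)        # Eq. 4
    return summed.argmin(axis=2), per_direction   # Eq. 5
```

The per-direction volumes returned by `sgm` correspond to the costs \(L_\mathbf{r}(\mathbf{p}, d)\) that the fusion method takes as input in place of the summed volume.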

4 Learning to Fuse Scanline Optimization Solutions

We start by analyzing some difficult examples for scanline optimization in order to motivate our fusion method and then describe the method in detail.

Fig. 2. 1D Scanline Optimization Costs. Each of the four subfigures shows the following – Top Left: Image and reference scanline section in green centered around yellow patch. Top Right: x–d slice of unary cost volume C along the reference scanline and ray of reference patch center in yellow. Bottom: Aggregated costs \(L_\mathbf{r}\) for four scanline directions on the left and the corresponding disparities on the right. The WTA solution is shown in red whereas the ground truth disparity is in blue. (Color figure online)

4.1 Scanline Optimization Analysis

Figure 2 shows four scanlines from the left Adirondack image with the corresponding x–d slices of the unary cost C and the four horizontal and vertical aggregated scanline costs \(L_\mathbf{r}\) alongside their respective WTA solutions. Notice the patterns in the \(L_\mathbf{r}\) cost slices for the different passes. When the smoothness prior is effective, the noisy unary costs get filtered, producing strong minima at the correct disparities. However, when the unary costs are weak and the prior is ineffective, multiple noisy minima are present or the minimum is at an incorrect location. We now investigate these problematic cases in further detail.

Weak Texture. Figure 2(a)–(d) focus on weakly textured image patches. Whenever the unary cost is weak, the smoothness prior in the 1D optimization favors propagating several equally likely disparity estimates along the propagation direction. This effect is seen clearly on the vertical wooden plank in Fig. 2(d) in the horizontal passes. Here, the left-right propagation continues the solution from the left occlusion boundary to the right, while the right-left solution continues from the corner of the chair to the left. In contrast, the two vertical passes are in agreement at the correct disparity as the surface along that propagation direction is indeed fronto-parallel.

Slanted Surface. Figure 2(b), (c), (d) show examples of weakly textured slanted surfaces, where the 1D scanline solutions are typically biased and jump at random pixel locations, leading to inconsistent solutions in different scanlines. A prominent example is the arm rest in Fig. 2(b), where the left-right pass underestimates the disparity, whereas the right-left and bottom-up passes overestimate the disparity. In this case, there is no clear outlier in the solution but final cost summation leads to a biased estimate. Notice also the asymmetry in the two vertical passes where the bottom-up direction has a much more consistent solution while the top-down solution jumps at random locations. On weakly textured slanted surfaces, adjacent scanline solutions are mostly inconsistent, leading to noisy disparity maps and well-known streaking artifacts.

Occlusion. Figure 2(a) is centered around a region which is occluded in the right image. In this case, the unary cost is invalid and the only pass producing a correct prediction is the left-to-right direction. Here, the occluded surface is fronto-parallel and the smoothness prior is likely to propagate the correct disparity to the occluded region. Typically, only a small subset of the scanline results is correct in occluded areas, whereas SGM’s standard cost summation is not robust and therefore produces gross outliers (see Fig. 1).

Repetitive Structure. The wooden planks on the chair’s backrest in Fig. 2(c) are repetitive and produce multiple ambiguous local cost minima. In this example, the solutions of the left-right and top-down directions are incorrectly estimated, since the centered patch is almost identical to the symmetric patch on the right-most wooden plank. Notice also that the right-left and bottom-up directions are much less susceptible to this specific ambiguity problem.

These examples show that the joint distribution of aggregated costs over the disparity range at each pixel appears to provide strong clues about which scanline proposal or which subset of proposals are likely to be correct. This insight forms the basis of our fusion model which is described next.

4.2 Definition of Fusion Model

The disparities of the different scanline solutions are often inconsistent, especially in areas of weak data cost. Yet, in almost all cases there is at least one scanline that is either correct or is very close to the correct solution. The main challenge for robust and accurate scanline fusion is to identify the scanlines which agree on the correct estimate. In our proposed approach, we cast the fusion of scanlines as a classification problem that chooses the optimal estimate from the given set of candidate scanlines. Typically, the pattern at which specific scanlines perform well is consistent and repeatable. We aim to encode these patterns into rules that can identify the correct solution from a given set of candidate solutions. However, manually hand-crafting these rules is unfeasible and error-prone, which is why we resort to automatically learning these rules from training data in a supervised fashion. To facilitate the learning of these rules, we provide the model with discriminative signals that allow for a robust and efficient disparity prediction. Our proposed model takes sparse samples from a set of proposal cost volumes \(K_n(\mathbf{p}, d)\) (e.g., the optimized scanline costs \(L_\mathbf{r}(\mathbf{p}, d)\)) and concatenates them into a per-pixel feature vector \(\mathbf{f}_\mathbf{p}\). This feature vector is then fed into a learned model that predicts a disparity estimate \(\hat{d}_\mathbf{p}\) together with a posterior probability \(\hat{\rho }_\mathbf{p}\), which we use as a confidence measure for further post-processing.

More specifically, our model is defined as \((\hat{d}_\mathbf{p}, \hat{\rho }_\mathbf{p}) = F(\mathbf{f}_\mathbf{p})\) with \(\hat{d}_\mathbf{p}\in \mathbb {R}^+_0\), \(\hat{\rho }_\mathbf{p}\in [0, 1]\), and \(\mathbf{f}_\mathbf{p}\in \mathbb {R}^{N+N^2}\), where N is the number of proposal costs \(K_n(\mathbf{p}, d)\). For each of the \(n = 1...N\) proposals \(K_n(\mathbf{p}, d)\), the feature \(\mathbf{f}_\mathbf{p}\) stores the location of its per-pixel WTA solution \(d^*_\mathbf{p}(n) = \text {arg}\,\text {min}_d K_n(\mathbf{p}, d)\) and the corresponding costs \(K_m(\mathbf{p}, d^*_\mathbf{p}(n))\) in all proposals \(m = 1 ... N\). Overall, the feature is composed of N WTA solutions and the \(N^2\) sparsely sampled costs. For each disparity proposal \(d^*_\mathbf{p}(n)\), we thereby encode its relative significance wrt. the other proposals in a compact representation. The intuition is that when multiple proposals agree, their minima \(d^*_\mathbf{p}(n)\) are close and their respective costs \(K_m(\mathbf{p}, d^*_\mathbf{p}(n))\) are low.
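The construction of the feature vector can be sketched as follows. This is a minimal NumPy illustration under assumptions: the function name `build_features`, the array layout (N, H, W, D), and the ordering of the entries within the feature vector are chosen for this example and are not prescribed by the text.

```python
import numpy as np

def build_features(cost_volumes):
    """Build per-pixel features from N proposal cost volumes K_n(p, d).

    cost_volumes: (N, H, W, D) array.
    Returns an (H, W, N + N * N) array: the N WTA disparities d*_p(n),
    followed by the N * N sampled costs K_m(p, d*_p(n)).
    """
    n, h, w, _ = cost_volumes.shape
    wta = cost_volumes.argmin(axis=3)        # (N, H, W): d*_p(n)
    wta_hw = np.moveaxis(wta, 0, -1)         # (H, W, N)
    # sampled[m, y, x, n] = K_m(p, d*_p(n)); indices broadcast over m.
    sampled = np.take_along_axis(cost_volumes, wta_hw[None], axis=3)
    # Reorder so that, for each n, the costs of all m are contiguous.
    sampled = sampled.transpose(1, 2, 3, 0).reshape(h, w, n * n)
    return np.concatenate([wta_hw.astype(np.float64), sampled], axis=2)
```

For 8 proposals the last dimension has \(8 + 8^2 = 72\) entries, independent of the size of the disparity range.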

Note that the naïve approach of concatenating the per-pixel costs of all proposals into a feature vector is not feasible for two reasons. First, we want a light-weight feature representation and model with small runtime overhead wrt. regular SGM. However, the naïve approach would result in a very high-dimensional feature representation of size \(N \cdot \vert \mathcal {D} \vert \) (e.g., \(8 \cdot 256 = 2048\) for 256 disparity candidates and 8 scanlines), which would require a complex model and eliminate the computational efficiency of SGM. In contrast, our proposed feature vector is only \(8+8^2=72\)-dimensional in case of 8 scanline proposals. Second, we strive to learn a generalizable model, which uses a fixed-size feature representation during training and inference even though the disparity range \(\mathcal {D}\) may vary between scenes. In summary, our proposed feature encodes discriminative signals for our classification task without sacrificing efficiency, compactness, or accuracy.

Table 1. Validation performance for non-occluded pixels on the Middlebury 2014 training set (15 half resolution pairs). Rows 1–5 show results for SGM baselines. Rows 6–14 report ablation studies for SGM-Forest. Bottom three rows show results for the best SGM-Forest setting, trained on different datasets. Letters M, K, and E refer to Middlebury 2005–06, KITTI, and ETH3D, respectively. The matching cost is always MC-CNN-acrt. Runtimes exclude the matching cost computation and are timed on the same CPU.

4.3 Random Forests for Disparity and Confidence Prediction

Given ground truth disparities, there are many ways to learn the model \(F(\mathbf{f}_\mathbf{p})\) using supervised learning. The first principal design decision is whether to pose the problem as a classification or regression task. Arguably, classification problems are often considered easier to solve. As shown in Fig. 1, at least one of the different scanline solutions is often accurate. We therefore chose to formulate an N-class classification task that predicts the best solution from the set of candidates \(d^*_\mathbf{p}(n)\). This approach gave much better results than modeling the problem as a regression task. The second principal design decision is the specific type of classifier to use, e.g., k-NN, support vector machines, decision trees, neural nets, etc. In our experiments, random forests provided the best trade-off between accuracy and efficiency (see Sect. 5.2 and Table 1).

At test time, we first perform 1D scanline optimization to construct the proposal cost volumes \(K_n(\mathbf{p}, d)\), from which we build the per-pixel feature vectors \(\mathbf{f}_\mathbf{p}\). In the second stage, we simply feed the feature vectors \(\mathbf{f}_\mathbf{p}\) of all pixels \(\mathbf{p}\) through our model to obtain a posterior probability \(\rho _\mathbf{p}(n)\) for each proposal n. We select the proposal with the maximum posterior probability \(n^*_\mathbf{p}= \text {arg}\,\text {max}_n\,\rho _\mathbf{p}(n)\) as our initial disparity estimate \(d^*_\mathbf{p}({n^*})\) for pixel \(\mathbf{p}\). To further refine this initial estimate, we find the subset of disparity proposals close to the initial estimate and their corresponding posteriors:

$$\begin{aligned} \mathcal {D}^*_\mathbf{p}= \{ (d^*_\mathbf{p}(k), \rho _\mathbf{p}(k)) \, \vert \, k = 1...N \wedge \vert d^*_\mathbf{p}(k) - d^*_\mathbf{p}({n^*}) \vert < \epsilon _d \} \end{aligned}$$
(6)

When multiple scanlines agree on a solution, the inlier set \(\mathcal {D}^*_\mathbf{p}\) contains multiple elements, even for small disparity thresholds \(\epsilon _d\). The final per-pixel disparity estimate \(\hat{d}_\mathbf{p}\) and confidence measure \(\hat{\rho }_\mathbf{p}\) are computed over the elements k of the inlier set \(\mathcal {D}^*_\mathbf{p}\) as

$$\begin{aligned} \hat{d}_\mathbf{p}= \frac{\sum _k \rho _\mathbf{p}(k) \, d^*_\mathbf{p}(k)}{\sum _k \rho _\mathbf{p}(k)} \quad \text {and} \quad \hat{\rho }_\mathbf{p}= \sum _k \rho _\mathbf{p}(k) \end{aligned}$$
(7)

Note that the final disparity estimate has sub-pixel precision. Moreover, all steps are fully parallelizable on the pixel level and therefore suitable for real-time FPGA implementations (see Sects. 5.2 and 5.5). Next, we will describe our spatial edge-aware filtering scheme for disparity refinement.
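The selection and refinement steps of Eqs. 6 and 7 can be sketched in NumPy as follows, given the WTA disparities of all proposals and the per-proposal posteriors predicted by the classifier. This is a minimal sketch, not the evaluated implementation: the function name `fuse` and the array layout are assumptions, the sums in Eq. 7 are interpreted as running over the inlier set of Eq. 6, and the posteriors are assumed to be not all zero at any pixel.

```python
import numpy as np

def fuse(wta, posteriors, eps_d=2.0):
    """Fuse N disparity proposals using their posterior probabilities.

    wta:        (N, H, W) array with the proposals d*_p(n).
    posteriors: (N, H, W) array with the posteriors rho_p(n).
    Returns the refined disparity and confidence maps, each (H, W).
    """
    best = posteriors.argmax(axis=0)                         # n*_p
    d_best = np.take_along_axis(wta, best[None], axis=0)[0]  # d*_p(n*)
    inlier = np.abs(wta - d_best[None]) < eps_d              # Eq. 6
    weights = np.where(inlier, posteriors, 0.0)
    confidence = weights.sum(axis=0)                         # Eq. 7, right
    disparity = (weights * wta).sum(axis=0) / confidence     # Eq. 7, left
    return disparity, confidence
```

For example, with proposals 10, 11, and 30 and posteriors 0.5, 0.3, and 0.2 at a pixel, the first two proposals form the inlier set for \(\epsilon _d = 2\), giving a confidence of 0.8 and a sub-pixel disparity of 8.3 / 0.8 = 10.375.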

4.4 Confidence-Based Spatial Filtering

The random forest produces a per-pixel estimate for disparity and confidence. In a final filtering step, we now enhance the spatial smoothness of the disparity and confidence maps. Towards this goal, we define the adaptive local neighborhood

$$\begin{aligned} \mathcal {N}_\mathbf{p}= \{ \mathbf{q}\, \vert \, \Vert \mathbf{q}- \mathbf{p}\Vert< \epsilon _\mathbf{p}\, \wedge \, \hat{\rho }_\mathbf{q}> \epsilon _\rho \, \wedge \vert I(\mathbf{p}) - I(\mathbf{q}) \vert < \epsilon _I \} \end{aligned}$$
(8)

centered around each pixel \(\mathbf{p}\), where \(I(\mathbf{q})\) is the image intensity at pixel \(\mathbf{q}\). The filtered disparity and confidence estimates are finally given as \(\bar{d}_\mathbf{p}= \text {median}\,\hat{d}_\mathbf{q}\) and \(\bar{\rho }_\mathbf{p}= \text {median}\,\hat{\rho }_\mathbf{q}\) with \(\mathbf{q}\in \mathcal {N}_\mathbf{p}\). The filter essentially computes a median on the selective set of neighborhood pixels \(\mathcal {N}_\mathbf{p}\) which have high confidence and similar color as the center pixel \(\mathbf{p}\).
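A direct, unoptimized sketch of this filter is given below. It is an illustration under assumptions rather than the evaluated implementation: the function name is hypothetical, the image is assumed to be single-channel, and a pixel whose neighborhood \(\mathcal {N}_\mathbf{p}\) is empty simply keeps its input values, a case that the definition above leaves open.

```python
import numpy as np

def confidence_median_filter(disp, conf, image,
                             eps_p=5.0, eps_rho=0.1, eps_i=10.0):
    """Median filter over the adaptive neighborhood of Eq. 8.

    disp, conf, image: (H, W) arrays (disparity, confidence, intensity).
    """
    disp = disp.astype(np.float64)
    conf = conf.astype(np.float64)
    image = image.astype(np.float64)  # avoids wrap-around for uint8 input
    h, w = disp.shape
    out_d, out_c = disp.copy(), conf.copy()
    r = int(np.ceil(eps_p))
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            mask = (
                (np.hypot(yy - y, xx - x) < eps_p)        # spatially close
                & (conf[y0:y1, x0:x1] > eps_rho)          # confident
                & (np.abs(image[y0:y1, x0:x1] - image[y, x]) < eps_i)
            )
            if mask.any():
                out_d[y, x] = np.median(disp[y0:y1, x0:x1][mask])
                out_c[y, x] = np.median(conf[y0:y1, x0:x1][mask])
    return out_d, out_c
```

Because the intensity test restricts the neighborhood to pixels that resemble the center pixel, the median does not mix disparities across strong image edges.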

5 Experiments

We report a thorough evaluation of SGM-Forest on three stereo benchmarks – Middlebury 2014, KITTI 2015, and ETH3D 2017 [10, 35, 41]. Our evaluation protocol contrasts with that of most top-ranked stereo methods, which are often evaluated on only one benchmark [11, 21, 27, 30, 42, 44]. In all our experiments, SGM-Forest outperforms SGM by a significant margin and ranks competitively against the state-of-the-art learning-based and global stereo methods, which are computationally more expensive. It also robustly generalizes across different dataset domains.

5.1 Implementation Details

Scanline Optimization and SGM. To facilitate an unbiased comparison, we use the same SGM implementation throughout all experiments. We compare three different matching costs (NCC, MC-CNN-fast [48], MC-CNN-acrt [48]) as the unary term C, which is quantized to 8 bits for reduced memory usage using linear rescaling to the range [0, 255]. Image intensities are given in the range [0, 255]. For NCC, we use a patch size of \(7 \times 7\). We follow standard procedure and improve the right image rectification using sparse feature matching before computing the matching cost. The smoothness term \(V(d, d')\) uses the constant penalty \(P_1 = 100\) and the intensity-adaptive penalty \(P_2 = P_1 (1 + \alpha e^{-\vert \varDelta I \vert / \beta })\), where \(\alpha = 8\), \(\beta = 10\), and \(\varDelta I\) is the intensity difference between neighboring pixels.
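To make the behavior of the penalty \(P_2\) concrete, the formula can be evaluated directly. The helper name `adaptive_p2` is hypothetical and the snippet only restates the formula with the parameters given above.

```python
import numpy as np

def adaptive_p2(delta_i, p1=100.0, alpha=8.0, beta=10.0):
    """P2 = P1 * (1 + alpha * exp(-|delta_i| / beta))."""
    return p1 * (1.0 + alpha * np.exp(-np.abs(delta_i) / beta))
```

In homogeneous regions (\(\varDelta I = 0\)) the penalty is \(P_2 = 9 P_1 = 900\), so large disparity jumps are strongly discouraged, whereas across strong intensity edges \(P_2\) decays towards \(P_1 = 100\), which makes disparity discontinuities cheaper at likely object boundaries.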

SGM-Forest. In all our experiments, we train random forests with 128 trees, a maximum depth of 25, and the Gini impurity measure to decide on the optimal data split. We set \(\epsilon _d = 2\), \(\epsilon _\rho = 0.1\), \(\epsilon _\mathbf{p}= 5\), and \(\epsilon _I = 10\). These parameters were selected using a grid search and 3-fold cross validation on the Middlebury 2014 training scenes. For generalization across different disparity ranges between training and test datasets, we normalize to relative disparities prior to the extraction of the feature \(\mathbf{f}_\mathbf{p}\) using the average of the input disparity proposals \(d^*_\mathbf{p}(n)\). The relative disparity estimates are then denormalized to obtain absolute disparities. To showcase the generalization robustness of our approach, we train and evaluate our SGM-Forest on different dataset combinations. In all settings, the training and test scenes are non-overlapping and we provide a detailed list of training/test splits in the supplementary material. For learning our SGM-Forest model, we sample a maximum of 500K random pixels with ground-truth disparity uniformly in each training image.

5.2 Ablation Study

We now evaluate several aspects of our algorithm using an extensive ablation study summarized in Tables 1 and 2 (full tables in the supplementary material).

Table 2. This table shows the validation performance for non-occluded pixels using 3-fold cross-validation for different matching costs and datasets at different error thresholds. Our method (SGM-F.) outperforms baseline SGM in all settings.

SGM Baseline. We compare our SGM baseline against two simple methods that robustify Eqs. 4 and 5 (see Table 1): SGM – \(\min _d L_\mathbf{r}(\mathbf{p}, d)\) selects the scanline solution with minimum cost as the disparity estimate, while SGM – \(\min _d \text {median}_\mathbf{r}\,L_\mathbf{r}(\mathbf{p}, d)\) uses the robust median instead of summation for aggregating the costs from multiple scanlines. Both methods perform worse than baseline SGM, underlining the need for a more sophisticated fusion approach.
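The three aggregation rules compared here differ only in how the per-direction costs are reduced before the WTA step. A minimal NumPy sketch follows, with hypothetical function names and an assumed (R, H, W, D) layout for the stacked scanline costs \(L_\mathbf{r}(\mathbf{p}, d)\).

```python
import numpy as np

def wta_sum(costs):
    """Baseline SGM: sum over directions (Eq. 4), then WTA (Eq. 5)."""
    return costs.sum(axis=0).argmin(axis=2)

def wta_min(costs):
    """Disparity of the scanline solution with the minimum cost."""
    return costs.min(axis=0).argmin(axis=2)

def wta_median(costs):
    """Median over directions instead of the sum, then WTA."""
    return np.median(costs, axis=0).argmin(axis=2)
```

On a toy pixel with costs [0, 9, 9], [9, 1, 9], and [9, 1, 9] for three directions, the minimum rule follows the single outlier direction and selects disparity 0, while the sum and the median both select disparity 1.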

Input Proposals. The input to our algorithm is a set of proposal cost volumes \(K_n(\mathbf{p}, d)\). As demonstrated in Fig. 1, a single scanline performs worse than SGM while the best of multiple scanlines is significantly better. In fact, our method is general and the input proposals to our system need not be limited to the canonical 1D scanline optimizations. We always consider the regular SGM cost volume \(S(\mathbf{p},d)\) as a proposal. Using only this proposal leads to a trivial 1-class classification problem and is equivalent to running baseline SGM (see Table 1). Adding the four horizontal and vertical scanlines from the left image as proposals improves the accuracy significantly, which is further boosted by adding the remaining 4 diagonal scanlines. Using only scanlines that propagate in the five top-down or five bottom-up directions degrades performance slightly but is still much better than regular SGM and enables real-time implementation of our algorithm on an FPGA [19]. We also experimented with running two horizontal scanline optimizations on the right image and warping the results to the left view to be used as two additional proposals. This is because the occluded pixels in the left image are invisible in the right image and the left occlusion edges are usually more accurately recovered in the right disparity map. These additional proposals provide a small but consistent improvement.

Classification Model. In Sect. 4.3, we argued that, for our task, random forests provide the best trade-off in terms of accuracy and efficiency. We experimented with many different classification models, including k-NN search, SVMs, (gradient boosted) decision trees, AdaBoost, neural nets, etc. In Table 1, we show results for two other well-performing models: SGM-SVM uses a linear SVM classifier and SGM-MLP is a multi-layer perceptron using 3 hidden layers with ReLU activation and twice the neurons after each layer followed by a final softmax layer for classification. SGM-MLP outperforms the SGM baseline but has slightly lower accuracy and efficiency on the CPU than SGM-Forest.

Table 3. Middlebury Benchmark. Left: Official results for the top 10 performing methods using MC-CNN-acrt for our SGM-Forest. Our method achieves the best runtime among the top performing methods. Right: Unofficial results on the training scenes for models trained on Middlebury 2005–06 using MC-CNN-fast. SGM-Forest with MC-CNN-fast outperforms baseline SGM with MC-CNN-acrt and is an order of magnitude faster.

Filtering. The final step in our algorithm is the confidence-based spatial filtering of the disparity and confidence maps. While the biggest accuracy improvement stems from the initial fusion step (see Table 1), the final filtering further improves the results by eliminating spatially inconsistent outliers.

Efficiency. The reported runtimes in Table 1 show only a small computational overhead of SGM-Forest and our proposed filtering over baseline SGM, enabling a potential real-time implementation on the GPU or FPGA (see Sect. 5.5). Note that the runtimes exclude the matching cost computation, i.e., the overhead of SGM-Forest becomes negligible if, for example, MC-CNN-acrt is used.

Generalization and Robustness. All results in Table 1 were obtained by training on Middlebury 2005–06 and evaluating on Middlebury 2014, which already demonstrates good generalization properties. Note that Middlebury 2014 images are much more challenging than those in Middlebury 2005–06. Moreover, we also evaluate cross-domain generalization by training on KITTI (outdoors) and ETH3D (outdoors and indoors) and evaluating on Middlebury 2014 (indoors). In both cases, our approach achieves almost the same performance as when training on Middlebury. Table 2 shows that SGM-Forest improves over baseline SGM in every single metric irrespective of matching cost and dataset. In contrast to most learning-based methods, we demonstrate that our learned fusion approach is general and extremely robust across different domains and settings: SGM-Forest performs well outdoors when trained on indoor scenes, handles different image resolutions, disparity ranges and diverse matching costs, and consistently outperforms baseline SGM by a large margin.

5.3 Benchmark Results

Unlike most existing methods, we evaluate SGM-Forest on three benchmarks and achieve competitive performance with respect to the state of the art. For all benchmark submissions, we use the best setting found in our ablation study, i.e., we include 8 proposals from the left view and 2 from the right view, and run disparity refinement.

Table 4. KITTI and ETH3D Benchmarks. Left: KITTI results over all pixels for all ranked SGM variants. Our SGM-Forest uses MC-CNN-fast as matching cost and achieves high accuracy at comparatively low runtime. Right: ETH3D results over non-occluded and all pixels for all ranked methods. Our SGM-Forest uses MC-CNN-fast as matching cost and achieves the best accuracy at comparatively low runtime.

Middlebury. Table 3 reports our results on Middlebury 2014. For the benchmark submission, we use MC-CNN-acrt matching costs and jointly train on Middlebury 2005–06 and the training scenes of Middlebury 2014. Our method ranks competitively among the top ten methods in terms of accuracy but is significantly faster. In addition to our official submission, we also report unofficial results for MC-CNN-fast evaluated on the training scenes (see footnote 1). The models for this submission were trained only on the Middlebury 2005–06 scenes. Using MC-CNN-fast, SGM-Forest outperforms SGM by two percentage points on non-occluded pixels. Evaluated on all pixels, SGM-Forest with MC-CNN-fast outperforms baseline SGM with MC-CNN-acrt by two percentage points and is an order of magnitude faster.

KITTI. Table 4 lists all SGM-based methods evaluated on KITTI. We use MC-CNN-fast for this submission and are ranked right behind the original MC-CNN-acrt method [48], CNNF+SGM [51], and SGM-Net [42]. However, our method is an order of magnitude faster even though our scanline optimization and the proposed additional steps are implemented on the CPU while MC-CNN-WS runs on the GPU. Note that CNNF+SGM and SGM-Net report results only on KITTI whereas our method generalizes across domains and datasets.

ETH3D. On this fairly new benchmark with diverse indoor and outdoor images, SGM-Forest is currently ranked 1st with competitive running times (see Table 4). Our submission uses MC-CNN-fast, which was surprisingly more accurate than MC-CNN-acrt on ETH3D (also see Table 2). Here, our SGM-Forest submission has almost half the error of the original SGM method [17].

5.4 Qualitative Results

Figure 3 shows qualitative results for Middlebury. Compared to baseline SGM, our SGM-Forest produces fewer streaking artifacts and performs significantly better in occluded areas. High-confidence regions in general correspond to low errors. This is further confirmed by the monotonically decreasing precision-recall curves, which were produced by thresholding on the predicted confidences. For further qualitative results, e.g., comparisons between raw predictions and filtered results, we refer the reader to the supplementary material.
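A curve of this kind can be obtained by sweeping a threshold over the predicted confidences, as in the following minimal sketch. The definitions used here are assumptions for illustration: recall is taken as the fraction of pixels whose confidence passes the threshold, precision as the fraction of those pixels whose disparity error is at most a tolerance tau, and the toy errors and confidences are made up.

```python
def precision_recall_curve(errors, confidences, tau=1.0):
    """Sweep a confidence threshold from strict to lenient.

    Returns (recall, precision) pairs, where recall is the fraction of pixels
    kept and precision is the fraction of kept pixels with error <= tau.
    Each pixel acts as its own threshold; ties in confidence are not merged.
    """
    order = sorted(range(len(errors)), key=lambda i: -confidences[i])
    curve, good = [], 0
    for kept, i in enumerate(order, start=1):
        good += errors[i] <= tau
        curve.append((kept / len(errors), good / kept))
    return curve


# Toy example (hypothetical): confident pixels have small errors.
errors = [0.2, 0.5, 0.4, 3.0, 6.0]
confidences = [0.95, 0.9, 0.8, 0.4, 0.2]
curve = precision_recall_curve(errors, confidences)
for recall, precision in curve:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

When the confidence is a good predictor of correctness, as in this toy example, precision does not increase as more pixels are admitted.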

Fig. 3. Qualitative Middlebury results for SGM and SGM-Forest. Absolute error maps clipped to \([0\text {px}, 8\text {px}]\). Precision (Y) and Recall (X) in [0, 1]. Confidence maps log-scaled.

5.5 Limitations and Future Work

Our current SGM and random forest implementation is CPU-based and is not real-time capable since we buffer all scanline cost volumes before fusion. The learned forests in this paper use 128 trees, so our method could be sped up easily by using fewer trees. In our experiments, even a single decision tree improved upon baseline SGM. An implementation of our method on the GPU would be straightforward, where SGM-MLP would probably outcompete SGM-Forest in efficiency at the cost of a small degradation in accuracy. Real-time implementation on embedded systems [19] requires a one-pass, buffer-less algorithm prohibiting the use of all 8 scanline directions. In Table 1, we demonstrated that our idea also works well for top-down/bottom-up directions only.

6 Conclusion

We proposed a learning-based approach to fuse scanline optimization proposals in SGM, replacing the brittle and heuristic scanline aggregation steps in standard SGM. Our method is efficient and accurate and ranks 1st on the ETH3D benchmark while being competitive on Middlebury and KITTI. We have demonstrated consistent improvements over SGM on three stereo benchmarks. The learning appears to be extremely robust and generalizes well across datasets. Our method can be readily integrated into existing SGM variants and allows for real-time implementation in practical, high-quality stereo systems.