1 Introduction

Single-view depth prediction and optical flow estimation are two fundamental problems in computer vision. While the two tasks aim to recover highly correlated information from the scene (i.e., the scene structure and the dense motion field between consecutive frames), existing efforts typically study each problem in isolation. In this paper, we demonstrate the benefits of exploring the geometric relationship between depth, camera motion, and flow for unsupervised learning of depth and flow estimation models.

With the rapid development of deep convolutional neural networks (CNNs), numerous approaches have been proposed to tackle dense prediction problems in an end-to-end manner. However, supervised training of CNNs for such tasks often involves constructing large-scale, diverse datasets with dense pixelwise ground truth labels. Collecting such densely labeled datasets in the real world requires significant human effort and is prone to error. Existing RGB-D dataset construction efforts [18, 45, 53, 54] often have limited scope (e.g., in terms of locations, scenes, and objects) and hence lack diversity. For optical flow, dense motion annotations are even more difficult to acquire [37]. Consequently, existing CNN-based methods rely on synthetic datasets for training [5, 12, 16, 24]. These synthetic datasets, however, do not capture the complexity of motion blur, occlusion, and natural image statistics of real scenes. The trained models usually do not generalize well to unseen scenes without fine-tuning on sufficient ground truth data in the new visual domain.

Fig. 1.

Joint learning vs. separate learning. Single-view depth prediction and optical flow estimation are two highly correlated tasks. Existing work, however, often addresses these two tasks in isolation. In this paper, we propose a novel cross-task consistency loss to couple the training of these two problems using unlabeled monocular videos. By enforcing the underlying geometric constraints, we show substantially improved results for both tasks.

Several methods [17, 21, 28] have been proposed to capitalize on large-scale real-world videos to train CNNs in the unsupervised setting. The main idea is to exploit the brightness constancy and spatial smoothness assumptions on flow fields or disparity maps as supervisory signals. These assumptions, however, often do not hold at motion boundaries and hence make the training unstable.

Many recent efforts [59, 60, 65, 73] explore the geometric relationship between the two problems. With the estimated depth and camera pose, these methods can produce dense optical flow by backprojecting the 3D scene flow induced from camera ego-motion. However, these methods implicitly assume perfect depth and camera pose estimation when “synthesizing” the optical flow. The errors in either depth or camera pose estimation inevitably produce inaccurate flow predictions.

In this paper, we present a technique for jointly learning a single-view depth estimation model and a flow prediction model using unlabeled videos as shown in Fig. 2. Our key observation is that the predictions from depth, pose, and optical flow should be consistent with each other. By exploiting this geometry cue, we present a novel cross-task consistency loss that provides additional supervisory signals for training both networks. We validate the effectiveness of the proposed approach through extensive experiments on several benchmark datasets. Experimental results show that our joint training method significantly improves the performance of both models (Fig. 1). The proposed depth and flow models compare favorably with state-of-the-art unsupervised methods.

We make the following contributions. (1) We propose an unsupervised learning framework to simultaneously train a depth prediction network and an optical flow network. We achieve this by introducing a cross-task consistency loss that enforces geometric consistency. (2) We show that through the proposed unsupervised training our depth and flow models compare favorably with existing unsupervised algorithms and achieve competitive performance with supervised methods on several benchmark datasets. (3) We release the source code and pre-trained models to facilitate future research: http://yuliang.vision/DF-Net/.

Fig. 2.

Supervised vs. unsupervised learning. Supervised learning of depth or flow networks requires large amounts of training data with pixelwise ground truth annotations, which are difficult to acquire in real scenes. In contrast, our work leverages readily available unlabeled video sequences to jointly train the depth and flow models.

2 Related Work

Supervised Learning of Depth and Flow. Supervised learning using CNNs has emerged as an effective approach for depth and flow estimation, avoiding hand-crafted objective functions and computationally expensive optimization at test time. The availability of RGB-D datasets and deep learning has led to a line of work on single-view depth estimation [13, 14, 35, 38, 62, 72]. While promising results have been shown, these methods rely on absolute ground truth depth maps, which are expensive and difficult to collect. Some efforts [8, 74] relax this requirement by learning from relative/ordinal depth annotations. Recent work also explores gathering training data from web videos [7] or Internet photos [36] using structure-from-motion and multi-view stereo algorithms.

Compared to ground truth depth datasets, constructing optical flow datasets of diverse real-world scenes is even more challenging. Consequently, existing approaches [12, 26, 47] typically rely on synthetic datasets [5, 12] for training. Due to the limited scalability of constructing diverse, high-quality training data, fully supervised approaches often require fine-tuning on sufficient ground truth labels in new visual domains to perform well. In contrast, our approach leverages readily available real-world videos to jointly train the depth and flow models. The ability to learn from unlabeled data enables unsupervised pre-training for domains with limited amounts of ground truth data.

Self-supervised Learning of Depth and Flow. To alleviate the dependency on large-scale annotated datasets, several methods have been proposed that exploit the classical assumptions of brightness constancy and spatial smoothness on the disparity map or flow field [17, 21, 28, 43, 71]. The core idea is to treat the estimated depth and flow as latent layers and use them to differentiably warp the source frame to the target frame, where the source and target frames can either be a stereo pair or two consecutive frames in a video sequence. A photometric loss between the synthesized frame and the target frame can then serve as an unsupervised proxy loss to train the network. Using the photometric loss alone, however, is not sufficient due to the ambiguity in textureless regions and at occlusion boundaries. Hence, the network training is often unstable and requires careful hyper-parameter tuning of the loss functions. Our approach builds upon existing unsupervised losses for training our depth and flow networks. We show that the proposed cross-task consistency loss provides a sizable performance boost over individually trained models.

Methods Exploiting Geometry Cues. Recently, a number of works exploit the geometric relationship between depth, camera pose, and flow for learning depth or flow models [60, 65, 68, 73]. These methods first estimate the depth of the input images. Together with the estimated camera poses between two consecutive frames, they then "synthesize" the flow field of rigid regions. The synthesized flow from depth and pose can either be used directly as the flow prediction for rigid regions [48, 60, 65, 68] or be used for view synthesis to train depth models from monocular videos [73]. Additional cues such as surface normals [67], edges [66], and physical constraints [59] can be incorporated to further improve the performance.

These approaches exploit the inherent geometric relationship between structure and motion. However, the errors produced by either the depth or the camera pose estimation propagate to flow predictions. Our key insight is that for rigid regions the estimated flow (from flow prediction network) and the synthesized rigid flow (from depth and camera pose networks) should be consistent. Consequently, coupled training allows both depth and flow networks to learn from each other and enforce geometrically consistent predictions of the scene.

Structure from Motion. Joint estimation of structure and camera pose from multiple images of a given scene is a long-standing problem [15, 46, 64]. Conventional methods can recover (semi-)dense depth estimation and camera pose through keypoint tracking/matching. The outputs of these algorithms can potentially be used to help train a flow network, but not the other way around. Our work differs as we are also interested in learning a depth network to recover dense structure from a single input image.

Multi-task Learning. Simultaneously addressing multiple tasks through multi-task learning [52] has shown advantages over methods that tackle individual tasks [70]. For example, joint learning of video segmentation and optical flow through layered models [6, 56] or feature sharing [9] helps improve accuracy at motion boundaries. Single-view depth model learning can also benefit from joint training with surface normal estimation [35, 67] or semantic segmentation [13, 30].

Our approach tackles the problem of learning both depth and flow models. Unlike existing multi-task learning methods that often require direct supervision with ground truth training data for each task, our approach instead leverages meta-supervision to couple the training of the depth and flow models. While our models are jointly trained, they can be applied independently at test time.

Fig. 3.

Overview of our unsupervised joint learning framework. Our framework consists of three major modules: (1) a Depth Net for single-view depth estimation; (2) a Pose Net that takes two stacked input frames and estimates the relative camera pose between them; and (3) a Flow Net that estimates the dense optical flow field between the two input frames. Given a pair of input images \(\mathbf {I}_t\) and \(\mathbf {I}_{t+1}\) sampled from an unlabeled video, we first estimate the depth of each frame, the 6D camera pose, and the dense forward and backward flows. Using the predicted scene depth and the estimated camera pose, we can synthesize 2D forward and backward optical flows (referred to as rigid flow) by backprojecting the induced 3D forward and backward scene flows (Sect. 3.2). As we do not have ground truth depth and flow maps for supervision, we leverage standard photometric and spatial smoothness costs to regularize the network training (Sect. 3.3, not shown in this figure for clarity). To enforce the consistency of flow and depth predictions in both directions, we exploit the forward-backward consistency (Sect. 3.4) and adopt the valid masks derived from it to filter out invalid regions (e.g., occlusion/dis-occlusion) for the photometric loss. Finally, we propose a novel cross-task consistency loss (Sect. 3.5), encouraging the optical flow estimate (from the Flow Net) and the rigid flow (from the Depth and Pose Nets) to be consistent with each other within valid regions.

3 Unsupervised Joint Learning of Depth and Flow

3.1 Method Overview

Our goal is to develop an unsupervised learning framework for jointly training the single-view depth estimation network and the optical flow prediction network using unlabeled video sequences. Figure 3 shows a high-level sketch of our proposed approach. Given two consecutive frames \((I_t, I_{t+1})\) sampled from an unlabeled video, we first estimate the depth of frames \(I_t\) and \(I_{t+1}\) as well as the forward and backward optical flow fields between them. We then estimate the 6D camera pose transformation between the two frames \((I_t, I_{t+1})\).

With the predicted depth map and the estimated 6D camera pose, we can produce the 3D scene flow induced by camera ego-motion and backproject it onto the image plane to synthesize the 2D flow (Sect. 3.2). We refer to this synthesized flow as the rigid flow. Assuming the scene is mostly static, the synthesized rigid flow should be consistent with the estimated optical flow (produced by the optical flow prediction model). In practice, however, the predictions from the two branches may not be consistent with each other. Our intuition is that the discrepancy between the rigid flow and the estimated flow provides additional supervisory signals for both networks. Hence, we propose a cross-task consistency loss to enforce this constraint (Sect. 3.5). To handle non-rigid motion that cannot be explained by the camera motion, as well as occluded/dis-occluded regions, we exploit a forward-backward consistency check to identify valid regions (Sect. 3.4). We avoid enforcing the cross-task consistency in those forward-backward inconsistent regions.

Our overall objective function can be formulated as follows:

$$\begin{aligned} L = L_\text {photometric} + \lambda _{s} L_\text {smooth} + \lambda _{f}L_\text {forward-backward} + \lambda _{c} L_\text {cross}. \end{aligned}$$
(1)

All four loss terms are applied to both the depth and flow networks. They are also symmetric with respect to the forward and backward directions; for simplicity, we derive them only for the forward direction.
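For concreteness, a minimal sketch of how the four terms are combined is shown below; the plain-Python form and function name are illustrative (the actual implementation is in TensorFlow), and the default weights are the values reported later in Sect. 4.2.

```python
def total_loss(L_photometric, L_smooth, L_forward_backward, L_cross,
               lambda_s=3.0, lambda_f=0.2, lambda_c=0.2):
    """Overall objective of Eq. (1); default weights follow Sect. 4.2."""
    return (L_photometric
            + lambda_s * L_smooth
            + lambda_f * L_forward_backward
            + lambda_c * L_cross)
```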

3.2 Flow Synthesis Using Depth and Pose Predictions

Given the two input frames \(I_t\) and \(I_{t+1}\), the predicted depth map \(\hat{D}_t\), and the relative camera pose \(\hat{T}_{t\rightarrow t+1}\), we wish to establish the dense pixel correspondence between the two frames. Let \(p_t\) denote the 2D homogeneous coordinate of a pixel in frame \(I_t\) and K denote the intrinsic camera matrix. We can compute the corresponding point of \(p_t\) in frame \(I_{t+1}\) using the equation [73]:

$$\begin{aligned} p_{t+1} = K \hat{T}_{t\rightarrow t+1} \hat{D}_t(p_t) K^{-1} p_t. \end{aligned}$$
(2)

We can then obtain the synthesized forward rigid flow at pixel \(p_t\) in \(I_t\) by

$$\begin{aligned} F_\text {rigid}(p_t) = p_{t+1} - p_t \end{aligned}$$
(3)
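To make Eqs. (2) and (3) concrete, the NumPy sketch below synthesizes the rigid flow from a predicted depth map, the camera intrinsics, and a relative pose. It is an illustrative re-implementation under our own variable names, not the paper's TensorFlow code.

```python
import numpy as np

def synthesize_rigid_flow(depth, K, T):
    """Backproject depth into 3D, apply the relative camera pose, and
    reproject to obtain the rigid flow (Eqs. 2-3).

    depth: (H, W) predicted depth map D_t
    K:     (3, 3) camera intrinsic matrix
    T:     (4, 4) relative camera pose T_{t->t+1}
    Returns the rigid flow as an (H, W, 2) array.
    """
    H, W = depth.shape
    # Homogeneous pixel coordinates p_t = (u, v, 1), shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=0)

    # Backproject to camera coordinates: X = D_t(p_t) * K^{-1} p_t.
    cam = depth.ravel() * (np.linalg.inv(K) @ pix)

    # Transform the 3D points into the coordinate frame of I_{t+1}.
    cam_h = np.vstack([cam, np.ones((1, H * W))])
    cam_next = (T @ cam_h)[:3]

    # Reproject: p_{t+1} = K X' / Z'.
    proj = K @ cam_next
    p_next = proj[:2] / (proj[2:3] + 1e-8)

    # Rigid flow F_rigid(p_t) = p_{t+1} - p_t (Eq. 3).
    flow = (p_next - pix[:2]).reshape(2, H, W)
    return np.transpose(flow, (1, 2, 0))
```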

3.3 Brightness Constancy and Spatial Smoothness Priors

Here we briefly review two loss functions that we used in our framework to regularize network training. Leveraging the brightness constancy and spatial smoothness priors used in classical dense correspondence algorithms [4, 23, 40], prior work has used the photometric discrepancy between the warped frame and the target frame as an unsupervised proxy loss function for training CNNs without ground truth annotations.

Photometric Loss. Given frames \(I_t\) and \(I_{t+1}\), as well as an estimated flow \(F_{t\rightarrow t+1}\) (either the optical flow predicted by the flow model or the synthesized rigid flow induced from the estimated depth and camera pose), we can produce the warped frame \(\bar{I}_t\) by inverse warping from frame \(I_{t+1}\). Since the projected image coordinates \(p_{t+1}\) might not lie exactly on the image pixel grid, we apply the differentiable bilinear interpolation strategy used in spatial transformer networks [27] to perform frame synthesis.

With the warped frame \(\bar{I}_t\) from \(I_{t+1}\), we formulate the brightness constancy objective function as

$$\begin{aligned} L_\text {photometric} = \sum _{p}\rho {\left( I_{t}(p), \bar{I}_{t}(p)\right) }. \end{aligned}$$
(4)

where \(\rho (\cdot )\) is a function measuring the difference between pixel values. Previous work simply chooses the \(L_1\) norm or the appearance matching loss [21], which are not invariant to illumination changes in real-world scenarios [61]. Here we adopt the ternary census transform based loss [43, 55, 69], which can better handle complex illumination changes.
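To illustrate the inverse warping and the photometric term of Eq. (4), here is a simplified NumPy sketch. It uses bilinear sampling and, for brevity, a Charbonnier penalty as the difference function \(\rho\); the paper instead adopts the ternary census transform loss.

```python
import numpy as np

def bilinear_warp(img, flow):
    """Inverse-warp I_{t+1} to frame t by sampling img at p + F_{t->t+1}(p)
    with bilinear interpolation (a stand-in for the differentiable
    spatial-transformer sampling used in the paper).

    img:  (H, W, C) frame I_{t+1}
    flow: (H, W, 2) flow F_{t->t+1} in (x, y) order
    """
    H, W, _ = img.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = np.clip(u + flow[..., 0], 0, W - 1)
    y = np.clip(v + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom

def photometric_loss(I_t, I_warped, eps=1e-2):
    """Eq. (4) with a Charbonnier penalty as rho (illustrative choice)."""
    diff = I_t - I_warped
    return np.sum(np.sqrt(diff ** 2 + eps ** 2))
```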

Smoothness Loss. The brightness constancy loss is not informative in low-texture or homogeneous regions of the scene. To handle this issue, existing work incorporates a smoothness prior to regularize the estimated disparity map or flow field. We adopt the spatial smoothness loss proposed in [21].
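A corresponding sketch of the smoothness term is given below; the edge-aware weighting is a common formulation in the spirit of [21] and is shown here only as an illustration of the prior, not necessarily the exact variant used in the paper.

```python
import numpy as np

def edge_aware_smoothness(pred, img):
    """Penalize gradients of the prediction (disparity or one flow
    channel), down-weighted at image edges.

    pred: (H, W) disparity map or a single flow channel
    img:  (H, W, C) reference image
    """
    dpx = np.abs(pred[:, 1:] - pred[:, :-1])   # horizontal prediction gradient
    dpy = np.abs(pred[1:, :] - pred[:-1, :])   # vertical prediction gradient
    dix = np.mean(np.abs(img[:, 1:] - img[:, :-1]), axis=-1)
    diy = np.mean(np.abs(img[1:, :] - img[:-1, :]), axis=-1)
    return np.sum(dpx * np.exp(-dix)) + np.sum(dpy * np.exp(-diy))
```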

3.4 Forward-Backward Consistency

According to the brightness constancy assumption, the warped frame should be similar to the target frame. However, the assumption does not hold for occluded and dis-occluded regions. We address this problem by applying the widely used forward-backward consistency check to identify invalid regions and excluding them from the photometric loss.

Valid Masks. We implement the occlusion detection based on the forward-backward consistency assumption [58] (i.e., traversing a flow vector forward and then backward should return to the same position). Here we use a simple criterion proposed in [43] and mark pixels as invalid whenever this constraint is violated. Figure 4 shows two examples of the invalid regions marked by the forward-backward consistency check using the synthesized rigid flow.
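A minimal sketch of how such a valid mask can be computed is shown below. The nearest-neighbour lookup of the backward flow and the thresholds alpha1/alpha2 are illustrative simplifications of the criterion in [43].

```python
import numpy as np

def valid_mask(flow_fw, flow_bw, alpha1=0.01, alpha2=0.5):
    """Mark a pixel as valid if traversing the forward flow and then the
    backward flow (approximately) returns to the starting position.

    flow_fw, flow_bw: (H, W, 2) forward and backward flow fields
    Returns a boolean (H, W) mask of valid (non-occluded) pixels.
    """
    H, W, _ = flow_fw.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Nearest-neighbour lookup of the backward flow at p + F_fw(p);
    # bilinear sampling would be used in practice.
    x = np.clip(np.rint(u + flow_fw[..., 0]).astype(int), 0, W - 1)
    y = np.clip(np.rint(v + flow_fw[..., 1]).astype(int), 0, H - 1)
    flow_bw_warped = flow_bw[y, x]
    sq_diff = np.sum((flow_fw + flow_bw_warped) ** 2, axis=-1)
    sq_mag = np.sum(flow_fw ** 2 + flow_bw_warped ** 2, axis=-1)
    return sq_diff < alpha1 * sq_mag + alpha2
```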

Denoting the valid region by V (derived from either the rigid flow or the estimated flow), we can modify the photometric loss term (4) as

$$\begin{aligned} L_\text {photometric} = \sum _{p\in V}\rho {\left( I_{t}(p), \bar{I}_{t}(p)\right) }. \end{aligned}$$
(5)

Forward-Backward Consistency Loss. In addition to using forward-backward consistency check for identifying invalid regions, we can further impose constraints on the valid regions so that the network can produce consistent predictions for both forward and backward directions. Similar ideas have been exploited in [25, 43] for occlusion-aware flow estimation. Here, we apply the forward-backward consistency loss to both flow and depth predictions.

For flow prediction, the forward-backward consistency loss is of the form:

$$\begin{aligned} L_\text {forward-backward, flow} = \sum _{p\in V_\mathrm {flow}}\left\| F_{t\rightarrow t+1}(p) + F_{t+1\rightarrow t}(p+F_{t\rightarrow t+1}(p))\right\| _1 \end{aligned}$$
(6)

Similarly, we impose a consistency penalty for depth:

$$\begin{aligned} L_\text {forward-backward, depth} = \sum _{p\in V_\mathrm {depth}}\Vert D_{t}(p) - \bar{D}_{t}(p)\Vert _1 \end{aligned}$$
(7)

where \(\bar{D}_t\) is warped from \(D_{t+1}\) using the synthesized rigid flow from t to \(t+1\).
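To spell out how Eqs. (6) and (7) are evaluated over the valid regions, here is an illustrative NumPy sketch; the nearest-neighbour composition of the flows and the precomputed warped depth \(\bar{D}_t\) are simplifying assumptions.

```python
import numpy as np

def fb_consistency_losses(flow_fw, flow_bw, depth_t, depth_t_warped,
                          mask_flow, mask_depth):
    """Forward-backward consistency losses of Eqs. (6) and (7).

    flow_fw, flow_bw: (H, W, 2) forward/backward flow fields
    depth_t:          (H, W) depth of frame t
    depth_t_warped:   (H, W) depth of frame t+1 warped to frame t with
                      the rigid flow (the warped depth of Eq. 7)
    mask_flow, mask_depth: boolean (H, W) valid masks from the FB check
    """
    H, W, _ = flow_fw.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = np.clip(np.rint(u + flow_fw[..., 0]).astype(int), 0, W - 1)
    y = np.clip(np.rint(v + flow_fw[..., 1]).astype(int), 0, H - 1)
    flow_residual = flow_fw + flow_bw[y, x]                  # Eq. (6)
    loss_flow = np.sum(np.abs(flow_residual)[mask_flow])
    loss_depth = np.sum(np.abs(depth_t - depth_t_warped)[mask_depth])  # Eq. (7)
    return loss_flow, loss_depth
```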

Fig. 4.

Valid mask visualization. We estimate the invalid mask by checking the forward-backward consistency of the synthesized rigid flow, which not only detects occluded regions but also identifies moving objects (e.g., cars), as they cannot be explained by the estimated depth and pose. (See supplementary material.)

While we exploit robust functions for the photometric loss and enforce forward-backward consistency for each task, training the depth and flow networks on unlabeled data remains non-trivial and sensitive to the choice of hyper-parameters [33]. Building upon these existing loss functions, we introduce in the following a novel cross-task consistency loss to further regularize the network training.

3.5 Cross-Task Consistency

In Sect. 3.2, we show that the motion of rigid regions in the scene can be explained by the ego-motion of the camera and the corresponding scene depth. On the one hand, we can estimate the rigid flow by backprojecting the induced 3D scene flow from the estimated depth and relative camera pose. On the other hand, we have direct estimates from the optical flow network. Our core idea is that these two flow fields should be consistent with each other in non-occluded and static regions. Minimizing the discrepancy between the two flow fields allows us to simultaneously update the depth and flow models.

We thus propose to minimize the endpoint distance between the flow vectors in the rigid flow (computed from the estimated depth and pose) and that in the estimated flow (computed from the flow prediction model). We denote the synthesized rigid flow as \(F_\text {rigid}=(u_\text {rigid},v_\text {rigid})\) and the estimated flow as \(F_\text {flow}=(u_\text {flow},v_\text {flow})\). Using the computed valid masks (Sect. 3.4), we impose the cross-task consistency constraints over valid pixels.

$$\begin{aligned} L_\text {cross} = \sum _{p\in V_\mathrm {depth}\cap V_\mathrm {flow}}\Vert F_\text {rigid}(p) - F_\text {flow}(p)\Vert _1 \end{aligned}$$
(8)
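In code, Eq. (8) reduces to a masked L1 difference between the two flow fields, as in the following minimal NumPy sketch (variable names are ours):

```python
import numpy as np

def cross_task_loss(flow_rigid, flow_est, mask_depth, mask_flow):
    """Cross-task consistency loss of Eq. (8): L1 endpoint distance between
    the rigid flow (from depth and pose) and the estimated flow (from the
    flow network), summed over jointly valid pixels.

    flow_rigid, flow_est:  (H, W, 2) flow fields
    mask_depth, mask_flow: boolean (H, W) valid masks from Sect. 3.4
    """
    valid = mask_depth & mask_flow
    return np.sum(np.abs(flow_rigid - flow_est)[valid])
```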

4 Experimental Results

In this section, we validate the effectiveness of our proposed method for unsupervised learning of depth and flow on several standard benchmark datasets. More results can be found in the supplementary material. Our source code and pre-trained models are available on http://yuliang.vision/DF-Net/.

4.1 Datasets

Datasets for Joint Network Training. We use video clips from the train split of KITTI raw dataset [18] for joint learning of depth and flow models. Note that our training does not involve any depth/flow labels.

Datasets for Pre-training. To avoid the joint training process converging to trivial solutions, we (unsupervisedly) pre-train the flow network on the SYNTHIA dataset [51]. For pre-training both depth and pose networks, we use either KITTI raw dataset or the CityScapes dataset [11].

The SYNTHIA dataset [51] contains multi-view frames captured by driving vehicles in different scenarios and traffic conditions. We take all the four-view images of the left camera from all summer and winter driving sequences, which yields around 37K image pairs. The CityScapes dataset [11] contains real-world driving sequences; following Zhou et al. [73], we pre-process the dataset to generate around 75K training image pairs.

Datasets for Evaluation. For evaluating the performance of our depth network, we use the test split of the KITTI raw dataset. The depth maps for KITTI raw are sampled at irregularly spaced positions, captured using a rotating LIDAR scanner. Following the standard evaluation protocol, we evaluate the performance using only the regions with ground truth depth samples (bottom parts of the images). We also evaluate the generalization of our depth network on general scenes using the Make3D dataset [53].

For evaluating our flow network, we use the challenging KITTI flow 2012 [19] and KITTI flow 2015 [44] datasets. The ground truth optical flow is obtained from a 3D laser scanner and thus only covers about 50% of the pixels.

Fig. 5.

Sample results on the KITTI raw test set. The ground truth depth is interpolated from the sparse point cloud for visualization only. Compared to Zhou et al. [73] and Eigen et al. [14], our method better captures object contours and thin structures.

4.2 Implementation Details

We implement our approach in TensorFlow [1] and conduct all the experiments on a single Tesla K80 GPU with 12 GB memory. We set \(\lambda _s = 3.0\), \(\lambda _f=0.2\), and \(\lambda _c=0.2\). For network training, we use the Adam optimizer [31] with \(\beta _1=0.9\) and \(\beta _2=0.99\). In the following, we provide more implementation details on the network architecture, network pre-training, and the proposed unsupervised joint training.

Network Architecture. For the pose network, we adopt the architecture from Zhou et al. [73]. For the depth network, we use ResNet-50 [22] as our feature backbone with ELU [10] activation functions. For the flow network, we adopt the UnFlow-C structure [43], a variant of FlowNetC [12]. As our training framework is model-agnostic, more advanced network architectures (e.g., for pose [20], depth [36], or flow [57]) can be used to further improve the performance.

Unsupervised Depth Pre-training. We train the depth and pose networks with a mini-batch size of 6 image pairs of size \(576 \times 160\), sampled from the KITTI raw dataset or the CityScapes dataset, for 100K iterations. We use a learning rate of 2e-4. Each iteration takes around 0.8 s (forward and backprop) during training.

Unsupervised Flow Pre-training. Following Meister et al. [43], we train the flow network with a mini-batch size of 4 image pairs of size \(1152\times 320\) from the SYNTHIA dataset for 300K iterations. We keep the initial learning rate of 1e-4 for the first 100K iterations and then halve it after every subsequent 100K iterations. Each iteration takes around 2.4 s (forward and backprop).

Unsupervised Joint Training. We jointly train the depth, pose, and flow networks with a mini-batch size of 4 image pairs from the KITTI raw dataset for 100K iterations. The input size for the depth and pose networks is \(576\times 160\), while the input size for the flow network is \(1152\times 320\). We divide the learning rate by 2 every 20K iterations. Our depth network produces depth predictions at 4 spatial scales, while the flow network produces flow fields at 5 scales. We enforce the cross-task consistency at the finest 4 scales. Each iteration takes around 3.6 s (forward and backprop) during training.

Image Resolution of Network Inputs/Outputs. As the input size of the UnFlow-C network [43] must be divisible by 64, we resize input image pairs of the two KITTI flow datasets to \(1280 \times 384\) using bilinear interpolation. We then resize the estimated optical flow back to the original resolution and rescale the predicted flow vectors accordingly. For depth estimation, we first resize the input image to the same size as the training input to predict the disparity. We then resize and rescale the predicted disparity to the original size and compute its inverse to obtain the final depth prediction.
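The resizing and rescaling described above can be sketched as follows; nearest-neighbour resampling is used here for brevity (bilinear interpolation in practice), and the width-based disparity rescaling and the simple inversion to depth are our own simplifying assumptions.

```python
import numpy as np

def resize_flow(flow, out_h, out_w):
    """Resize a flow field to (out_h, out_w) and rescale its vectors."""
    H, W, _ = flow.shape
    ys = np.arange(out_h) * H // out_h
    xs = np.arange(out_w) * W // out_w
    resized = flow[ys][:, xs].astype(np.float64)
    resized[..., 0] *= out_w / W   # rescale x-displacements
    resized[..., 1] *= out_h / H   # rescale y-displacements
    return resized

def disparity_to_depth(disp, out_h, out_w):
    """Resize/rescale the predicted disparity to the original resolution
    and invert it to obtain the final depth prediction (up to scale)."""
    H, W = disp.shape
    ys = np.arange(out_h) * H // out_h
    xs = np.arange(out_w) * W // out_w
    resized = disp[ys][:, xs].astype(np.float64) * (out_w / W)
    return 1.0 / np.maximum(resized, 1e-8)
```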

Table 1. Single-view depth estimation results on the test split of the KITTI raw dataset [18]. Methods trained on the KITTI raw dataset [18] are denoted by K; models with additional training data from CityScapes [11] are denoted by CS+K. Methods are also marked by their supervision: depth supervision (D), stereo input pairs, or monocular video clips. The best and the second best performance in each block are highlighted in bold and underlined.

4.3 Evaluation Metrics

Following Zhou et al. [73], we evaluate our depth network using several error metrics (absolute relative difference, squared relative difference, RMSE, and log RMSE). For optical flow estimation, we compute the average endpoint error (EPE) on pixels with ground truth flow available for each dataset. On the KITTI flow 2015 dataset [44], we also compute the F1 score, i.e., the percentage of pixels whose EPE is greater than 3 pixels and more than 5% of the ground truth flow magnitude.
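The metrics can be computed as in the sketch below; this is an illustrative NumPy version of the standard formulas, not the official evaluation code.

```python
import numpy as np

def flow_metrics(flow_pred, flow_gt, valid):
    """Average endpoint error (EPE) and the KITTI-style outlier rate
    (error > 3 px and > 5% of the ground-truth magnitude), computed only
    on pixels with ground truth.

    flow_pred, flow_gt: (H, W, 2); valid: boolean (H, W) ground-truth mask
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)[valid]
    mag = np.linalg.norm(flow_gt, axis=-1)[valid]
    outlier_rate = np.mean((err > 3.0) & (err > 0.05 * mag))
    return err.mean(), outlier_rate

def depth_metrics(pred, gt):
    """Standard single-view depth error metrics used in Table 1."""
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log
```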

4.4 Experimental Evaluation

Single-View Depth Estimation. We compare our depth network with state-of-the-art algorithms on the test split of the KITTI raw dataset provided by Eigen et al. [14]. As shown in Table 1, our method achieves state-of-the-art performance when compared with models trained on monocular video sequences. However, our method performs slightly worse than models that exploit calibrated stereo image pairs (i.e., pose supervision) or additional ground truth depth annotations. We believe that the performance gap can be attributed to the error induced by our pose network. Extending our approach to calibrated stereo videos is an interesting future direction.

We also conduct an ablation study by removing the forward-backward consistency loss or the cross-task consistency loss. In both cases, our results show significant performance degradation, highlighting the importance of the proposed consistency losses. Figure 5 shows a qualitative comparison with [14, 73]; our method better captures thin structures and delineates clearer object contours.

To evaluate the generalization ability of our depth network to general scenes, we also apply our trained model to the Make3D dataset [53]. Table 2 shows that our method achieves state-of-the-art performance compared with existing unsupervised models and is competitive with supervised learning models, even without fine-tuning on the Make3D training set.

Table 2. Results on the Make3D dataset [54]. Our results were obtained by the model trained on CityScapes + KITTI without fine-tuning on the training images of Make3D. Following the evaluation protocol of [21], errors are computed only where the ground truth depth is less than 70 m. The best and the second best performance in each block are highlighted in bold and underlined.
Table 3. Quantitative evaluation on optical flow. Results on the KITTI flow 2012 [19] and KITTI flow 2015 [44] datasets. We denote the FlyingChairs dataset [12] by "C", the FlyingThings3D dataset [42] by "T", the KITTI raw dataset [18] by "K", and the SYNTHIA dataset [51] by "SYN". Each model is additionally marked as trained with ground truth annotations or in an unsupervised manner. The best and the second best performance in each block are highlighted in bold and underlined.
Table 4. Pose estimation results on the KITTI Odometry dataset [19].

Optical Flow Estimation. We compare our flow network with conventional variational algorithms, supervised CNN methods, and several unsupervised CNN models on the KITTI flow 2012 and 2015 datasets. As shown in Table 3, our method achieves state-of-the-art performance on both datasets. A visual comparison can be found in Fig. 6. With optional fine-tuning on the available ground truth labels of the KITTI flow datasets, our approach achieves performance competitive with supervised methods sharing similar network architectures. This suggests that our method can serve as an unsupervised pre-training technique for learning optical flow in domains where ground truth data are scarce.

Pose Estimation. For completeness, we provide a performance evaluation of the pose network. We follow the same evaluation protocol as [73] and use a 5-frame pose network. As shown in Table 4, our pose network shows competitive performance with respect to state-of-the-art visual SLAM methods and other unsupervised learning methods. We believe that a better pose network would further improve the performance of both depth and optical flow estimation.

Fig. 6.

Visual results on the KITTI flow datasets. All models are applied directly without fine-tuning on KITTI flow annotations. Our model delineates clearer object contours compared to both supervised and unsupervised methods.

5 Conclusions

We presented an unsupervised learning framework for both single-view depth prediction and optical flow estimation using unlabeled video sequences. Our key technical contribution lies in the proposed cross-task consistency loss that couples the network training. At test time, the trained depth and flow models can be applied independently. We validate the benefits of joint training through extensive experiments on benchmark datasets. Our single-view depth prediction model compares favorably against existing unsupervised models trained on unstructured videos on both the KITTI and Make3D datasets. Our flow estimation model achieves competitive performance with state-of-the-art approaches. By leveraging geometric constraints, our work suggests a promising direction for advancing the state of the art in multiple dense prediction tasks using unlabeled data.