1 Introduction

Despite recent advances in optical flow estimation, it remains challenging to account for complicated motion patterns. At video rates, however, even complicated motion patterns are smooth for longer than just two consecutive frames, which suggests that information from frames adjacent in time could be used to improve optical flow estimates. Indeed, numerous multi-frame methods have been developed [2, 3, 9, 10]. However, none of the top three optical flow algorithms on the major benchmark datasets uses more than two frames [4, 6].

We observe that, for some types of motion and in certain regions, past frames may carry more valuable information than recent ones, even when the optical flow changes abruptly, as is the case for occlusion regions and out-of-boundary pixels. Kennedy and Taylor [8] also leverage this observation and select which of multiple flow estimates from adjacent frames is best for a given pixel. We instead propose to fuse the available information: we first estimate optical flow for each frame pair using a two-frame network module, then warp the resulting estimates from the past to the current frame, and finally fuse them with a second neural network module.

Our approach offers several advantages. First, it fully capitalizes on motion information from past frames. Second, our fusion network is agnostic to the algorithm that generates the two-frame optical flow estimates; any standard method can be used as input, making our framework flexible and straightforward to upgrade when improved two-frame algorithms become available. Finally, if the underlying optical flow algorithm is differentiable, our approach can be trained end-to-end. Extensive experiments show that the proposed algorithm outperforms published state-of-the-art two-frame optical flow methods by significant margins on the KITTI [6] and Sintel [4] benchmarks. To further validate our results, we present alternative baselines that incorporate recurrent neural networks into state-of-the-art deep-learning optical flow estimation methods, and show that the fusion approach achieves significant performance gains over them.

2 Proposed Model: Temporal FlowFusion

For clarity, we focus on three-frame optical flow estimation. Given three input frames \(\mathbf {I}_{t\!-\!1}\), \(\mathbf {I}_{t}\), and \(\mathbf {I}_{t\!+\!1}\), our aim is to estimate the optical flow from frame \(t\) to frame \(t+1\), \(\mathbf {w}^f_{t \rightarrow t+1}\); the superscript ‘f’ indicates that it fuses information from all the frames. We use a two-frame method, such as PWC-Net [11], to estimate three motion fields, \(\mathbf {w}_{t \rightarrow t+1}\), \(\mathbf {w}_{t-1 \rightarrow t}\), and \(\mathbf {w}_{t \rightarrow t-1}\). We then backward warp \(\mathbf {w}_{t-1\rightarrow t}\) using \(\mathbf {w}_{t \rightarrow t\!-\!1}\): \(\widehat{\mathbf {w}}_{t \rightarrow t+1} \!=\! \mathcal {W} (\mathbf {w}_{t\!-\!1 \rightarrow t}; \mathbf {w}_{t \rightarrow t\!-\!1})\), where \(\mathcal {W} (\mathbf {x}; {\mathbf {w}})\) denotes warping the input \(\mathbf {x}\) using the flow \(\mathbf {w}\).
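The warping operator \(\mathcal {W}\) can be implemented with bilinear sampling. Below is a minimal sketch in PyTorch, assuming flows are stored as \((B, 2, H, W)\) tensors in pixel units; the function name and tensor layout are our own conventions and not prescribed by the method.

```python
import torch
import torch.nn.functional as F

def backward_warp(x, flow):
    """W(x; flow): sample x at positions displaced by `flow` (in pixels)."""
    b, _, h, w = flow.shape
    # Pixel-coordinate grid, shifted by the flow to get the sampling positions.
    ys, xs = torch.meshgrid(torch.arange(h, dtype=flow.dtype, device=flow.device),
                            torch.arange(w, dtype=flow.dtype, device=flow.device),
                            indexing="ij")
    coords = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow   # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                        # (B, H, W, 2)
    return F.grid_sample(x, grid, mode="bilinear", align_corners=True)

# The warped candidate is then:
#   w_hat_t_tp1 = backward_warp(w_tm1_t, w_t_tm1)
```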

This gives us two candidate flows for the same frame pair, \(\widehat{\mathbf {w}}_{t \rightarrow t+1}\) and \(\mathbf {w}_{t \rightarrow t+1}\). To combine them, we take inspiration from the work of Ilg et al. [7], who perform optical flow fusion in the spatial domain for two-frame flow estimation, and extend their approach to the temporal domain. Our fusion network takes as input the two flow estimates \(\widehat{\mathbf {w}}_{t \rightarrow t+1}\) and \(\mathbf {w}_{t \rightarrow t+1}\), the corresponding brightness constancy errors \(E_{\widehat{\mathbf {w}}} = |\mathbf {I}_t - \mathcal {W} ({\mathbf {I}}_{t+1}; \widehat{\mathbf {w}}_{t \rightarrow t+1})|~\text {and}~ E_\mathbf {w} = |\mathbf {I}_t - \mathcal {W} ({\mathbf {I}}_{t+1}; \mathbf {w}_{t \rightarrow t+1})|\), and the current frame \(\mathbf {I}_t\). The network structure is visualized in Fig. 1; dotted lines indicate that two sub-networks share the same weights, and double vertical lines denote feature concatenation.
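For concreteness, the fusion-network input described above could be assembled as follows. This is a hedged sketch that reuses the `backward_warp` helper from the previous listing; the channel ordering of the concatenation is an assumption on our part, not the paper's specification.

```python
import torch  # backward_warp is defined in the previous sketch

def fusion_input(I_t, I_tp1, w_cur, w_hat):
    """I_t, I_tp1: (B, 3, H, W) frames; w_cur, w_hat: (B, 2, H, W) flow candidates."""
    # Brightness constancy error of each candidate flow.
    E_w    = (I_t - backward_warp(I_tp1, w_cur)).abs()
    E_what = (I_t - backward_warp(I_tp1, w_hat)).abs()
    # Concatenate flows, their errors, and the reference frame along channels.
    return torch.cat([w_cur, w_hat, E_w, E_what, I_t], dim=1)   # (B, 13, H, W)
```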

Fig. 1. Architecture of the proposed fusion approach.

We also propose two deep-learning baseline methods, shown in Fig. 2. FlowNetS++: FlowNetS [5] is a standard U-Net; we copy the encoded features from the previous image pair to the current one. FlowNetS \(+\) GRU: we use GRU-RCN [1] to extract abstract representations from video and to propagate the encoded features of previous frames through time. We preserve the overall U-Net structure and apply GRU-RCN units at different levels of the encoder, each with a different spatial resolution; the encoded features at the sixth level have the smallest resolution. A minimal sketch of such a recurrent unit is given below.
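The sketch shows a convolutional GRU cell in the spirit of GRU-RCN [1]; the kernel size, gating layout, and zero initialization of the hidden state are illustrative choices and may differ from the configuration actually used in FlowNetS \(+\) GRU.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: the hidden state is a feature map, the gates are convolutions."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)  # update/reset gates
        self.conv_h  = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)      # candidate state

    def forward(self, x, h=None):
        if h is None:  # initialize the hidden state with zeros at t = 0
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
        z, r = torch.sigmoid(self.conv_zr(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.conv_h(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde  # new hidden state, passed to the next time step
```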

3 Experimental Results

We test two architectures as building blocks: FlowNetS [5] for its wide adoption, and PWC-Net [11] for its efficiency and performance on standard benchmarks. We follow Sun et al. [11] in designing our training procedure and loss function. For consistency among the different multi-frame algorithms, we use three frames as input.

The fusion network has a structure similar to that of FlowNet2 [7], except for the first convolution layer, because our fusion input has a different number of channels. For the single optical flow prediction output by the fusion network, we set \(\alpha =0.005\) in the loss function of [11] and use a learning rate of 0.0001 for fine-tuning.
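Since the fusion network produces a single full-resolution flow, a plausible reading of the setting above is the sketch below, where \(\alpha = 0.005\) weights the loss on that single prediction; the robust-penalty parameters follow common practice for fine-tuning in [11] and are assumptions on our part, not values stated here.

```python
import torch

def fusion_loss(w_pred, w_gt, alpha=0.005, fine_tune=True, eps=0.01, q=0.4):
    """Loss on the single fused flow prediction; both tensors are (B, 2, H, W)."""
    diff = w_pred - w_gt
    if fine_tune:
        # Robust penalty (|.|_1 + eps)^q, in the spirit of the fine-tuning loss of [11].
        per_pixel = (diff.abs().sum(dim=1) + eps) ** q
    else:
        # Plain L2 endpoint error, as used for pre-training in [11].
        per_pixel = torch.norm(diff, p=2, dim=1)
    return alpha * per_pixel.sum(dim=(1, 2)).mean()
```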

We perform an ablation study of the two-frame and multi-frame methods using the virtual KITTI and Monkaa datasets, as summarized in Table 1. The fusion approach consistently outperforms all other methods, including those using GRU units. On the MPI Sintel [4] and KITTI [6] benchmarks, PWC-Fusion outperforms all two-frame optical flow methods, including the state-of-the-art PWC-Net (Tables 2 and 3). This is also the first time a multi-frame optical flow algorithm consistently outperforms two-frame approaches across different datasets. We provide some visual results in Fig. 3.

Fig. 2. Baseline network structures.

Table 1. Ablation study on the virtual KITTI dataset.
Table 2. Results on the MPI Sintel benchmark [4].
Table 3. Results on the KITTI benchmark [6].
Fig. 3. Visual results of our fusion method. Green in the indication map means that PWC-Fusion is more accurate than PWC-Net, and red means the opposite.

4 Conclusions

We have presented a simple and effective fusion approach for multi-frame optical flow estimation. Multiple frames provide information beyond what is available from two adjacent frames, in particular for occluded and out-of-boundary pixels, so we fuse the warped previous flow with the current flow estimate. Extensive experiments demonstrate the benefit of our approach: it outperforms both two-frame baselines and sensible multi-frame baselines based on GRUs. Moreover, it is top-ranked among all published flow methods on the MPI Sintel and KITTI 2015 benchmarks.