1 Introduction

Despite recent advances in optical flow estimation, it remains challenging to account for complicated motion patterns. At video rates, however, even complicated motion patterns are smooth for longer than just two consecutive frames, which suggests that information from frames adjacent in time could be used to improve optical flow estimates. Indeed, numerous multi-frame methods have been developed [2, 3, 9, 10]. However, none of the top three optical flow algorithms on the major benchmark datasets uses more than two frames [4, 6].

We observe that, for some types of motion and in certain regions, past frames may carry more valuable information than recent ones, even when the optical flow changes abruptly, as is the case for occlusion regions and out-of-boundary pixels. Kennedy and Taylor [8] also leverage this observation and select which of multiple flow estimates from adjacent frames is best for a given pixel. We instead propose to fuse the available information: we first estimate optical flow for each frame pair using a two-frame network module, then warp the resulting estimates from the past to the current frame, and finally fuse them with a second neural network module.

Our approach offers several advantages. First, it fully capitalizes on motion information from past frames. Second, our fusion network is agnostic to the algorithm that generates the two-frame optical flow estimates; any standard method can be used as input, making our framework flexible and straightforward to upgrade when improved two-frame algorithms become available. Finally, if the underlying optical flow algorithm is differentiable, our approach can be trained end-to-end. Extensive experiments show that the proposed algorithm outperforms published state-of-the-art two-frame optical flow methods by significant margins on the KITTI [6] and Sintel [4] benchmarks. To further validate our results, we present alternative baselines that incorporate recurrent neural networks into state-of-the-art deep-learning optical flow estimation methods, and show that the fusion approach achieves significant performance gains over them.

2 Proposed Model: Temporal FlowFusion

For clarity, we focus on three-frame optical flow estimation. Given three input frames \(\mathbf {I}_{t\!-\!1}\), \(\mathbf {I}_{t}\), and \(\mathbf {I}_{t\!+\!1}\), our aim is to estimate the optical flow from frame \(t\) to frame \(t+1\), \(\mathbf {w}^f_{t \rightarrow t+1}\); the superscript ‘f’ indicates that it fuses information from all the frames. We use a two-frame method, such as PWC-Net [11], to estimate three motion fields, \(\mathbf {w}_{t \rightarrow t+1}\), \(\mathbf {w}_{t-1 \rightarrow t}\), and \(\mathbf {w}_{t \rightarrow t-1}\). We then backward warp \(\mathbf {w}_{t-1\rightarrow t}\) using \(\mathbf {w}_{t \rightarrow t\!-\!1}\): \(\widehat{\mathbf {w}}_{t \rightarrow t+1} \!=\! \mathcal {W} (\mathbf {w}_{t\!-\!1 \rightarrow t}; \mathbf {w}_{t \rightarrow t\!-\!1})\), where \(\mathcal {W} (\mathbf {x}; {\mathbf {w}})\) denotes warping the input \(\mathbf {x}\) using the flow \(\mathbf {w}\).
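The warping operator \(\mathcal {W}\) can be implemented with bilinear sampling. Below is a minimal sketch in PyTorch, assuming flows are stored as \((B, 2, H, W)\) tensors in pixel units; the function name and tensor layout are our own conventions and not prescribed by the method.

```python
import torch
import torch.nn.functional as F

def backward_warp(x, flow):
    """W(x; flow): sample x at positions displaced by `flow` (in pixels)."""
    b, _, h, w = flow.shape
    # Pixel-coordinate grid, shifted by the flow to get the sampling positions.
    ys, xs = torch.meshgrid(torch.arange(h, dtype=flow.dtype, device=flow.device),
                            torch.arange(w, dtype=flow.dtype, device=flow.device),
                            indexing="ij")
    coords = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow   # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                        # (B, H, W, 2)
    return F.grid_sample(x, grid, mode="bilinear", align_corners=True)

# The warped candidate is then:
#   w_hat_t_tp1 = backward_warp(w_tm1_t, w_t_tm1)
```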

This gives us two candidate flows for the same frame pair, \(\widehat{\mathbf {w}}_{t \rightarrow t+1}\) and \(\mathbf {w}_{t \rightarrow t+1}\). To combine them, we take inspiration from the work of Ilg et al. [7], who perform optical flow fusion in the spatial domain for two-frame flow estimation, and extend their approach to the temporal domain. Our fusion network takes as input the two flow estimates \(\widehat{\mathbf {w}}_{t \rightarrow t+1}\) and \(\mathbf {w}_{t \rightarrow t+1}\), the corresponding brightness constancy errors \(E_{\widehat{\mathbf {w}}} = |\mathbf {I}_t - \mathcal {W} ({\mathbf {I}}_{t+1}; \widehat{\mathbf {w}}_{t \rightarrow t+1})|~\text {and}~ E_\mathbf {w} = |\mathbf {I}_t - \mathcal {W} ({\mathbf {I}}_{t+1}; \mathbf {w}_{t \rightarrow t+1})|\), and the current frame \(\mathbf {I}_t\). The network structure is visualized in Fig. 1; dotted lines indicate that two sub-networks share the same weights, and double vertical lines denote feature concatenation.
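For concreteness, the fusion-network input described above could be assembled as follows. This is a hedged sketch that reuses the `backward_warp` helper from the previous listing; the channel ordering of the concatenation is an assumption on our part, not the paper's specification.

```python
import torch  # backward_warp is defined in the previous sketch

def fusion_input(I_t, I_tp1, w_cur, w_hat):
    """I_t, I_tp1: (B, 3, H, W) frames; w_cur, w_hat: (B, 2, H, W) flow candidates."""
    # Brightness constancy error of each candidate flow.
    E_w    = (I_t - backward_warp(I_tp1, w_cur)).abs()
    E_what = (I_t - backward_warp(I_tp1, w_hat)).abs()
    # Concatenate flows, their errors, and the reference frame along channels.
    return torch.cat([w_cur, w_hat, E_w, E_what, I_t], dim=1)   # (B, 13, H, W)
```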

Fig. 1. Architecture of the proposed fusion approach.

We also propose two deep-learning baseline methods, shown in Fig. 2. FlowNetS++: FlowNetS [5] is a standard U-Net; we copy the encoded features from the previous image pair to the current one. FlowNetS \(+\) GRU: we use GRU-RCN [1] to extract abstract representations from video and to propagate the encoded features of previous frames through time. We preserve the overall U-Net structure and apply GRU-RCN units at different levels of the encoder, each with a different spatial resolution; the encoded features at the sixth level have the smallest resolution. A minimal sketch of such a recurrent unit is given below.
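The sketch shows a convolutional GRU cell in the spirit of GRU-RCN [1]; the kernel size, gating layout, and zero initialization of the hidden state are illustrative choices and may differ from the configuration actually used in FlowNetS \(+\) GRU.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: the hidden state is a feature map, the gates are convolutions."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)  # update/reset gates
        self.conv_h  = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)      # candidate state

    def forward(self, x, h=None):
        if h is None:  # initialize the hidden state with zeros at t = 0
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
        z, r = torch.sigmoid(self.conv_zr(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.conv_h(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde  # new hidden state, passed to the next time step
```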

3 Experimental Results

We test two architectures as building blocks: FlowNetS [5] for its wide adoption, and PWC-Net [11] for its efficiency and performance on standard benchmarks. We follow Sun et al. [11] in designing our training procedure and loss function. For consistency among the different multi-frame algorithms, we use three frames as input.

The fusion network has a structure similar to that of FlowNet2 [7], except for the first convolution layer, because our fusion input has a different number of channels. For the single optical flow prediction output by the fusion network, we set \(\alpha =0.005\) in the loss function of [11] and use a learning rate of 0.0001 for fine-tuning.
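Since the fusion network produces a single full-resolution flow, a plausible reading of the setting above is the sketch below, where \(\alpha = 0.005\) weights the loss on that single prediction; the robust-penalty parameters follow common practice for fine-tuning in [11] and are assumptions on our part, not values stated here.

```python
import torch

def fusion_loss(w_pred, w_gt, alpha=0.005, fine_tune=True, eps=0.01, q=0.4):
    """Loss on the single fused flow prediction; both tensors are (B, 2, H, W)."""
    diff = w_pred - w_gt
    if fine_tune:
        # Robust penalty (|.|_1 + eps)^q, in the spirit of the fine-tuning loss of [11].
        per_pixel = (diff.abs().sum(dim=1) + eps) ** q
    else:
        # Plain L2 endpoint error, as used for pre-training in [11].
        per_pixel = torch.norm(diff, p=2, dim=1)
    return alpha * per_pixel.sum(dim=(1, 2)).mean()
```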

We perform an ablation study of the two-frame and multi-frame methods using the virtual KITTI and Monkaa datasets, as summarized in Table 1. The fusion approach consistently outperforms all other methods, including those using GRU units. On the MPI Sintel [4] and KITTI [6] benchmarks, PWC-Fusion outperforms all two-frame optical flow methods, including the state-of-the-art PWC-Net (Tables 2 and 3). This is also the first time a multi-frame optical flow algorithm consistently outperforms two-frame approaches across different datasets. We provide some visual results in Fig. 3.

Fig. 2. Baseline network structures.

Table 1. Ablation study on the virtual KITTI dataset.
Table 2. Results on the MPI Sintel benchmark [4].
Table 3. Results on the KITTI benchmark [6].
Fig. 3. Visual results of our fusion method. Green in the indication map means that PWC-Fusion is more accurate than PWC-Net, and red means the opposite.

4 Conclusions

We have presented a simple and effective fusion approach for multi-frame optical flow estimation. Multiple frames provide information beyond what is available from two adjacent frames, in particular for occluded and out-of-boundary pixels, so we fuse the warped previous flow with the current flow estimate. Extensive experiments demonstrate the benefit of our approach: it outperforms both two-frame baselines and sensible multi-frame baselines based on GRUs. Moreover, it is top-ranked among all published flow methods on the MPI Sintel and KITTI 2015 benchmarks.