1 Introduction

Warping has been used in variational methods [1, 6] and neural network models [4, 7, 8] to iteratively refine optical flow estimates in a multi-stage framework. The first stage covers large displacements and outputs a rough estimate. The second image (or its feature maps) is then warped by the roughly estimated optical flow, so that pixels with large displacements in the second image are moved closer to their correspondences in the first image. As a result, the next stage, which receives the original first image and the warped second image as inputs, only needs to handle smaller displacements to refine the estimate.

Let \(I: \mathbb {R}^2 \rightarrow \mathbb {R}^3\) denote the first image, \(J: \mathbb {R}^2 \rightarrow \mathbb {R}^3\) denote the second image and \(F: \mathbb {R}^2 \rightarrow \mathbb {R}^2\) denote the optical flow field of the first image. The warped second image is defined as

$$\begin{aligned} \tilde{J}(\mathbf {p}) = J(\mathbf {p} + F(\mathbf {p})) \end{aligned}$$
(1)

for image location \(\mathbf {p} \in \mathbb {R}^2\) [4].
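For concreteness, the backward warping of Eq. (1) can be sketched as a bilinear sampling operation in PyTorch; the function name warp and the tensor layout below are our own illustrative choices, not code from the cited models.

```python
# A minimal sketch of Eq. (1): J~(p) = J(p + F(p)) via bilinear sampling.
import torch
import torch.nn.functional as F


def warp(J, flow):
    """J:    second image, shape (B, C, H, W)
    flow: optical flow of the first image, shape (B, 2, H, W), with
          flow[:, 0] the horizontal and flow[:, 1] the vertical displacement."""
    B, _, H, W = J.shape
    # Base sampling grid: pixel coordinates p.
    ys, xs = torch.meshgrid(torch.arange(H, dtype=J.dtype),
                            torch.arange(W, dtype=J.dtype), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(B, -1, -1, -1)
    # Shifted coordinates p + F(p), normalized to [-1, 1] for grid_sample.
    coords = grid + flow
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack((gx, gy), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(J, sample_grid, mode='bilinear', align_corners=True)
```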

Fig. 1. Artifacts of using image warping. From (d), we can see the duplicates of the dragon head and wings. The images and the ground truth optical flow are from the Sintel dataset [2]. Warping is done with the function image.warp() in the Torch-image toolbox.

The warping operation creates a transformed image reasonably well if the new pixel locations \(\mathbf {p}+F(\mathbf {p})\) do not occlude or collide with each other. This is the case, for example, for an affine transform \(F(\mathbf {p}) = \mathbf {Ap} + \mathbf {t}\), where \(\mathbf {A}\) and \(\mathbf {t}\) are the transformation parameters. However, in real-world images, occlusions are common (e.g. when an object moves while the background stays still). If an image is warped with an optical flow field that induces occlusions, duplicates are created.

The effect is demonstrated in Fig. 1. The artifacts cannot be removed simply by subtracting the first or the second image from the warped image, as shown in Fig. 1(e) and (f). Intuitively, imagine a pixel that is moved by warping to a new location. If no other pixel is moved to fill in its old location, the pixel will appear twice in the warped image. Mathematically, consider the following example. Assume the value of \(J(\mathbf {p}_1)\) is unique in J, that is, \(J(\mathbf {p}) \ne J(\mathbf {p}_1)\) for all \(\mathbf {p} \ne \mathbf {p}_1\). Then for an optical flow field in which

$$\begin{aligned} F(\mathbf {p}_1) = 0, \quad F(\mathbf {p}_2) = \mathbf {p}_1-\mathbf {p}_2, \end{aligned}$$
(2)

we have

$$\begin{aligned} \tilde{J}(\mathbf {p}_1)&= J(\mathbf {p}_1 + F(\mathbf {p}_1)) \end{aligned}$$
(3)
$$\begin{aligned}&= J(\mathbf {p}_1 + 0) = J(\mathbf {p}_1), \end{aligned}$$
(4)
$$\begin{aligned} \tilde{J}(\mathbf {p}_2)&= J(\mathbf {p}_2 + F(\mathbf {p}_2)) \end{aligned}$$
(5)
$$\begin{aligned}&= J(\mathbf {p}_2 + \mathbf {p}_1 - \mathbf {p}_2) = J(\mathbf {p}_1). \end{aligned}$$
(6)

Therefore \(\tilde{J}(\mathbf {p}_1) = \tilde{J}(\mathbf {p}_2) = J(\mathbf {p}_1)\). Since the value of \(J(\mathbf {p}_1)\) is unique in the image J but not in \(\tilde{J}\), a duplicate is created in the warped second image \(\tilde{J}\).
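The following toy 1-D example, with hypothetical pixel values of our own choosing, makes the duplicate of Eqs. (2)-(6) concrete: the flow samples position \(\mathbf {p}_1\) twice, so \(J(\mathbf {p}_1)\) appears at both locations.

```python
# A toy 1-D illustration of the duplicate in Eqs. (2)-(6); arrays are hypothetical.
import numpy as np

J = np.array([10., 20., 30., 40.])   # J(p1) = 10 is unique in J (p1 = 0)
flow = np.array([0., -1., 0., 0.])   # F(p1) = 0, F(p2) = p1 - p2 with p2 = 1
coords = np.arange(4) + flow         # sampling positions p + F(p)
J_warped = J[coords.astype(int)]     # integer flow, so plain indexing suffices
print(J_warped)                      # [10. 10. 30. 40.] -> J(p1) appears twice
```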

2 Deformable Cost Volume

Let I denote the first image, J denote the second image and \(f_I: \mathbb {R}^2 \rightarrow \mathbb {R}^d\) and \(f_J: \mathbb {R}^2 \rightarrow \mathbb {R}^d\) denote their feature maps of dimensionality d, respectively. The standard cost volume is defined as

$$\begin{aligned} C(\mathbf {p},\mathbf {v}) = \Vert f_I(\mathbf {p}) -f_J(\mathbf {p}+\mathbf {v}) \Vert , \end{aligned}$$
(7)

for image location \(\mathbf {p} \in \mathbb {R}^2\), neighbor \(\mathbf {v} \in [-\frac{k-1}{2},\frac{k-1}{2}]^2\) of neighborhood size k and a given vector norm \(\Vert \cdot \Vert \).
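As a sketch, the standard cost volume of Eq. (7) can be computed by shifting \(f_J\) over the \(k \times k\) neighborhood; the choice of the L2 norm, the zero padding at the borders, and the tensor shapes below are our own assumptions.

```python
# A minimal sketch of the standard cost volume in Eq. (7).
import torch


def cost_volume(fI, fJ, k):
    """fI, fJ: feature maps of shape (B, d, H, W); k: odd neighborhood size.
    Returns C of shape (B, k*k, H, W), one channel per displacement v."""
    B, d, H, W = fI.shape
    r = (k - 1) // 2
    # Zero-pad fJ so that p + v stays inside the map.
    fJ_pad = torch.nn.functional.pad(fJ, (r, r, r, r))
    costs = []
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            shifted = fJ_pad[:, :, r + dy:r + dy + H, r + dx:r + dx + W]
            costs.append((fI - shifted).norm(dim=1))  # ||fI(p) - fJ(p+v)||
    return torch.stack(costs, dim=1)
```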

The cost volume gives an explicit representation of displacements. To reduce the computational burden of constructing fully connected cost volumes, one can embed the cost volume in a multi-scale representation and use warping to propagate the flow between two stages. However, as discussed in Sect. 1, warping induces artifacts and distortion. To avoid the drawbacks of warping, we propose a new neural network module, the deformable cost volume. The key idea is: instead of deforming images or their feature maps, as done with warping, we deform the cost volume and leave the images and the feature maps unchanged.

Fig. 2. Cost volumes.

The proposed deformable cost volume is defined as

$$\begin{aligned} C(\mathbf {p},\mathbf {v},r,F) = \Vert f_I(\mathbf {p}) -f_J(\mathbf {p}+r\cdot \mathbf {v} + F(\mathbf {p})) \Vert \end{aligned}$$
(8)

where r is the dilation factor and \(F(\cdot )\) is an external flow field. The dilation factor r is introduced to enlarge the neighborhood so that large displacements can be handled without significantly increasing computation. This is inspired by the dilated convolution [3, 9], which enlarges its receptive field in a similar way. \(F(\cdot )\) can be obtained from the optical flow estimated at a previous stage or by an external algorithm. If \(F(\mathbf {p})=0\) for all \(\mathbf {p}\) and \(r=1\), the deformable cost volume reduces to the standard cost volume. For non-integer \(F(\mathbf {p})\), bilinear interpolation is used. The deformable cost volume is illustrated in Fig. 2.
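A minimal sketch of Eq. (8), assuming PyTorch tensors and reusing bilinear sampling (as in the warp sketch above) for the non-integer coordinates \(\mathbf {p}+r\cdot \mathbf {v} + F(\mathbf {p})\); the function name and shapes are illustrative.

```python
# A minimal sketch of the deformable cost volume in Eq. (8).
import torch
import torch.nn.functional as F


def deformable_cost_volume(fI, fJ, flow, k, r):
    """fI, fJ: (B, d, H, W); flow: external flow field (B, 2, H, W);
    k: odd neighborhood size; r: dilation factor.
    Returns C of shape (B, k*k, H, W)."""
    B, d, H, W = fI.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=fI.dtype),
                            torch.arange(W, dtype=fI.dtype), indexing='ij')
    base = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # p + F(p)
    half = (k - 1) // 2
    costs = []
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            # Sampling coordinates p + r*v + F(p) for this neighbor v.
            coords = base + torch.tensor([r * dx, r * dy],
                                         dtype=fI.dtype).view(1, 2, 1, 1)
            gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
            gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
            grid = torch.stack((gx, gy), dim=-1)
            sampled = F.grid_sample(fJ, grid, mode='bilinear',
                                    align_corners=True)
            costs.append((fI - sampled).norm(dim=1))
    return torch.stack(costs, dim=1)
```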

Since the deformable cost volume does not distort \(f_I\) or \(f_J\), the artifacts associated with warping are not created. Optical flow can be inferred from the deformable cost volume alone, without resorting to the feature maps of the first image to counteract duplicates.

The deformable cost volume is differentiable with respect to \(f_I(\mathbf {p})\) and \(f_J(\mathbf {p}+r\cdot \mathbf {v} + F(\mathbf {p}))\) for each image location \(\mathbf {p}\). Due to the bilinear interpolation, it is also differentiable with respect to \(F(\mathbf {p})\), using the same technique as in [4, 5]. Therefore, the deformable cost volume can be inserted into a neural network for end-to-end optical flow learning.
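As a quick, hypothetical check that gradients indeed reach \(F(\mathbf {p})\) through the bilinear interpolation, one can reuse the deformable_cost_volume sketch above with random feature maps:

```python
import torch

fI = torch.randn(1, 32, 8, 8)
fJ = torch.randn(1, 32, 8, 8)
flow = torch.zeros(1, 2, 8, 8, requires_grad=True)   # external flow field F
C = deformable_cost_volume(fI, fJ, flow, k=3, r=2)
C.sum().backward()
print(flow.grad.shape)  # torch.Size([1, 2, 8, 8]): the flow receives gradients
```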

3 Deformable Volume Network

Our proposed model is the deformable volume network (Devon), as illustrated in Fig. 3. Compared to previous neural network models, Devon has several major differences: (1) All feature maps in Devon have the same resolution. (2) Each stage computes on the undistorted images. No warping is used. (3) The decoding module only receives inputs from the relation module. (4) All stages share the feature extraction module.

Fig. 3. Deformable Volume Network (Devon) with three stages. I denotes the first image, J denotes the second image, f denotes the shared feature extraction module (a fully convolutional network), \(R_t\) denotes the relation module (a concatenation of several deformable cost volumes), \(g_t\) denotes the decoding module (a fully convolutional network) and \(F_t\) denotes the estimated optical flow at stage t.
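The staged computation of Fig. 3 can be summarized by the following schematic forward pass; the callables f, relation_modules and decoders stand in for the paper's fully convolutional modules and are placeholders, not the actual architecture.

```python
# A schematic of Devon's multi-stage forward pass as described in Fig. 3.
import torch


def devon_forward(I, J, f, relation_modules, decoders):
    """f: shared feature extraction module; relation_modules[t] builds R_t,
    a concatenation of deformable cost volumes; decoders[t] is g_t."""
    fI, fJ = f(I), f(J)  # shared features, all at the same resolution
    B, _, H, W = I.shape
    flow = torch.zeros(B, 2, H, W, dtype=I.dtype)  # zero flow for the first stage
    flows = []
    for R_t, g_t in zip(relation_modules, decoders):
        volume = R_t(fI, fJ, flow)  # deformable cost volumes around current flow
        flow = g_t(volume)          # decode the refined flow F_t
        flows.append(flow)
    return flows                    # [F_1, F_2, F_3] for a three-stage Devon
```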