1 Introduction

Warping has been used in variational methods [1, 6] and neural network models [4, 7, 8] to iteratively refine optical flow estimates in a multi-stage framework. The first stage covers large displacements and outputs a rough estimate. The second image (or its feature maps) is then warped by the roughly estimated optical flow, so that pixels with large displacements in the second image are moved closer to their correspondences in the first image. As a result, the next stage, which receives the original first image and the warped second image as inputs, only needs to handle smaller displacements to refine the estimate.

Let \(I: \mathbb {R}^2 \rightarrow \mathbb {R}^3\) denote the first image, \(J: \mathbb {R}^2 \rightarrow \mathbb {R}^3\) denote the second image and \(F: \mathbb {R}^2 \rightarrow \mathbb {R}^2\) denote the optical flow field of the first image. The warped second image is defined as

$$\begin{aligned} \tilde{J}(\mathbf {p}) = J(\mathbf {p} + F(\mathbf {p})) \end{aligned}$$
(1)

for image location \(\mathbf {p} \in \mathbb {R}^2\) [4].
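For concreteness, the backward warping of Eq. (1) can be sketched as a bilinear sampling operation in PyTorch; the function name warp and the tensor layout below are our own illustrative choices, not code from the cited models.

```python
# A minimal sketch of Eq. (1): J~(p) = J(p + F(p)) via bilinear sampling.
import torch
import torch.nn.functional as F


def warp(J, flow):
    """J:    second image, shape (B, C, H, W)
    flow: optical flow of the first image, shape (B, 2, H, W), with
          flow[:, 0] the horizontal and flow[:, 1] the vertical displacement."""
    B, _, H, W = J.shape
    # Base sampling grid: pixel coordinates p.
    ys, xs = torch.meshgrid(torch.arange(H, dtype=J.dtype),
                            torch.arange(W, dtype=J.dtype), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(B, -1, -1, -1)
    # Shifted coordinates p + F(p), normalized to [-1, 1] for grid_sample.
    coords = grid + flow
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack((gx, gy), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(J, sample_grid, mode='bilinear', align_corners=True)
```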

Fig. 1. Artifacts of using image warping. From (d), we can see the duplicates of the dragon head and wings. The images and the ground truth optical flow are from the Sintel dataset [2]. Warping is done with the function image.warp() in the Torch-image toolbox.

The warping operation creates a transformed image reasonably well if the new pixel locations \(\mathbf {p}+F(\mathbf {p})\) do not occlude or collide with each other. This is the case, for example, for an affine transform \(F(\mathbf {p}) = \mathbf {Ap} + \mathbf {t}\), where \(\mathbf {A}\) and \(\mathbf {t}\) are the transformation parameters. However, in real-world images, occlusions are common (e.g. when an object moves while the background stays still). If an image is warped with an optical flow field that induces occlusions, duplicates are created.

The effect is demonstrated in Fig. 1. The artifacts cannot be removed simply by subtracting the first or the second image from the warped image, as shown in Fig. 1(e) and (f). Intuitively, imagine a pixel that is moved by warping to a new location. If no other pixel is moved to fill in its old location, the pixel will appear twice in the warped image. Mathematically, consider the following example. Assume the value of \(J(\mathbf {p}_1)\) is unique in J, that is, \(J(\mathbf {p}) \ne J(\mathbf {p}_1)\) for all \(\mathbf {p} \ne \mathbf {p}_1\). Then for an optical flow field in which

$$\begin{aligned} F(\mathbf {p}_1) = 0, \quad F(\mathbf {p}_2) = \mathbf {p}_1-\mathbf {p}_2, \end{aligned}$$
(2)

we have

$$\begin{aligned} \tilde{J}(\mathbf {p}_1)&= J(\mathbf {p}_1 + F(\mathbf {p}_1)) \end{aligned}$$
(3)
$$\begin{aligned}&= J(\mathbf {p}_1 + 0) = J(\mathbf {p}_1), \end{aligned}$$
(4)
$$\begin{aligned} \tilde{J}(\mathbf {p}_2)&= J(\mathbf {p}_2 + F(\mathbf {p}_2)) \end{aligned}$$
(5)
$$\begin{aligned}&= J(\mathbf {p}_2 + \mathbf {p}_1 - \mathbf {p}_2) = J(\mathbf {p}_1). \end{aligned}$$
(6)

Therefore \(\tilde{J}(\mathbf {p}_1) = \tilde{J}(\mathbf {p}_2) = J(\mathbf {p}_1)\). Since the value of \(J(\mathbf {p}_1)\) is unique in the image J but not in \(\tilde{J}\), a duplicate is created in the warped second image \(\tilde{J}\).
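The following toy 1-D example, with hypothetical pixel values of our own choosing, makes the duplicate of Eqs. (2)-(6) concrete: the flow samples position \(\mathbf {p}_1\) twice, so \(J(\mathbf {p}_1)\) appears at both locations.

```python
# A toy 1-D illustration of the duplicate in Eqs. (2)-(6); arrays are hypothetical.
import numpy as np

J = np.array([10., 20., 30., 40.])   # J(p1) = 10 is unique in J (p1 = 0)
flow = np.array([0., -1., 0., 0.])   # F(p1) = 0, F(p2) = p1 - p2 with p2 = 1
coords = np.arange(4) + flow         # sampling positions p + F(p)
J_warped = J[coords.astype(int)]     # integer flow, so plain indexing suffices
print(J_warped)                      # [10. 10. 30. 40.] -> J(p1) appears twice
```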

2 Deformable Cost Volume

Let I denote the first image, J denote the second image and \(f_I: \mathbb {R}^2 \rightarrow \mathbb {R}^d\) and \(f_J: \mathbb {R}^2 \rightarrow \mathbb {R}^d\) denote their feature maps of dimensionality d, respectively. The standard cost volume is defined as

$$\begin{aligned} C(\mathbf {p},\mathbf {v}) = \Vert f_I(\mathbf {p}) -f_J(\mathbf {p}+\mathbf {v}) \Vert , \end{aligned}$$
(7)

for image location \(\mathbf {p} \in \mathbb {R}^2\), neighbor \(\mathbf {v} \in [-\frac{k-1}{2},\frac{k-1}{2}]^2\) of neighborhood size k and a given vector norm \(\Vert \cdot \Vert \).
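As a sketch, the standard cost volume of Eq. (7) can be computed by shifting \(f_J\) over the \(k \times k\) neighborhood; the choice of the L2 norm, the zero padding at the borders, and the tensor shapes below are our own assumptions.

```python
# A minimal sketch of the standard cost volume in Eq. (7).
import torch


def cost_volume(fI, fJ, k):
    """fI, fJ: feature maps of shape (B, d, H, W); k: odd neighborhood size.
    Returns C of shape (B, k*k, H, W), one channel per displacement v."""
    B, d, H, W = fI.shape
    r = (k - 1) // 2
    # Zero-pad fJ so that p + v stays inside the map.
    fJ_pad = torch.nn.functional.pad(fJ, (r, r, r, r))
    costs = []
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            shifted = fJ_pad[:, :, r + dy:r + dy + H, r + dx:r + dx + W]
            costs.append((fI - shifted).norm(dim=1))  # ||fI(p) - fJ(p+v)||
    return torch.stack(costs, dim=1)
```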

The cost volume gives an explicit representation of displacements. To reduce the computational burden of constructing fully connected cost volumes, one can embed the cost volume in a multi-scale representation and use warping to propagate the flow between two stages. However, as discussed in Sect. 1, warping induces artifacts and distortion. To avoid the drawbacks of warping, we propose a new neural network module, the deformable cost volume. The key idea is: instead of deforming images or their feature maps, as done with warping, we deform the cost volume and leave the images and the feature maps unchanged.

Fig. 2. Cost volumes.

The proposed deformable cost volume is defined as

$$\begin{aligned} C(\mathbf {p},\mathbf {v},r,F) = \Vert f_I(\mathbf {p}) -f_J(\mathbf {p}+r\cdot \mathbf {v} + F(\mathbf {p})) \Vert \end{aligned}$$
(8)

where r is the dilation factor and \(F(\cdot )\) is an external flow field. The dilation factor r is introduced to enlarge the neighborhood so that large displacements can be handled without significantly increasing computation. This is inspired by the dilated convolution [3, 9], which enlarges its receptive field in a similar way. \(F(\cdot )\) can be obtained from the optical flow estimated at a previous stage or by an external algorithm. If \(F(\mathbf {p})=0\) for all \(\mathbf {p}\) and \(r=1\), the deformable cost volume reduces to the standard cost volume. For non-integer \(F(\mathbf {p})\), bilinear interpolation is used. The deformable cost volume is illustrated in Fig. 2.
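A minimal sketch of Eq. (8), assuming PyTorch tensors and reusing bilinear sampling (as in the warp sketch above) for the non-integer coordinates \(\mathbf {p}+r\cdot \mathbf {v} + F(\mathbf {p})\); the function name and shapes are illustrative.

```python
# A minimal sketch of the deformable cost volume in Eq. (8).
import torch
import torch.nn.functional as F


def deformable_cost_volume(fI, fJ, flow, k, r):
    """fI, fJ: (B, d, H, W); flow: external flow field (B, 2, H, W);
    k: odd neighborhood size; r: dilation factor.
    Returns C of shape (B, k*k, H, W)."""
    B, d, H, W = fI.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=fI.dtype),
                            torch.arange(W, dtype=fI.dtype), indexing='ij')
    base = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # p + F(p)
    half = (k - 1) // 2
    costs = []
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            # Sampling coordinates p + r*v + F(p) for this neighbor v.
            coords = base + torch.tensor([r * dx, r * dy],
                                         dtype=fI.dtype).view(1, 2, 1, 1)
            gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
            gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
            grid = torch.stack((gx, gy), dim=-1)
            sampled = F.grid_sample(fJ, grid, mode='bilinear',
                                    align_corners=True)
            costs.append((fI - sampled).norm(dim=1))
    return torch.stack(costs, dim=1)
```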

Since the deformable cost volume does not distort \(f_I\) or \(f_J\), the artifacts associated with warping are not created. Optical flow can be inferred from the deformable cost volume alone, without resorting to the feature maps of the first image to counteract duplicates.

The deformable cost volume is differentiable with respect to \(f_I(\mathbf {p})\) and \(f_J(\mathbf {p}+r\cdot \mathbf {v} + F(\mathbf {p}))\) for each image location \(\mathbf {p}\). Due to the bilinear interpolation, it is also differentiable with respect to \(F(\mathbf {p})\), using the same technique as in [4, 5]. Therefore, the deformable cost volume can be inserted into a neural network for end-to-end optical flow learning.
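As a quick, hypothetical check that gradients indeed reach \(F(\mathbf {p})\) through the bilinear interpolation, one can reuse the deformable_cost_volume sketch above with random feature maps:

```python
import torch

fI = torch.randn(1, 32, 8, 8)
fJ = torch.randn(1, 32, 8, 8)
flow = torch.zeros(1, 2, 8, 8, requires_grad=True)   # external flow field F
C = deformable_cost_volume(fI, fJ, flow, k=3, r=2)
C.sum().backward()
print(flow.grad.shape)  # torch.Size([1, 2, 8, 8]): the flow receives gradients
```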

3 Deformable Volume Network

Our proposed model is the deformable volume network (Devon), as illustrated in Fig. 3. Compared to previous neural network models, Devon has several major differences: (1) All feature maps in Devon have the same resolution. (2) Each stage computes on the undistorted images. No warping is used. (3) The decoding module only receives inputs from the relation module. (4) All stages share the feature extraction module.

Fig. 3. Deformable Volume Network (Devon) with three stages. I denotes the first image, J denotes the second image, f denotes the shared feature extraction module (a fully convolutional network), \(R_t\) denotes the relation module (a concatenation of several deformable cost volumes), \(g_t\) denotes the decoding module (a fully convolutional network) and \(F_t\) denotes the estimated optical flow at stage t.
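The staged computation of Fig. 3 can be summarized by the following schematic forward pass; the callables f, relation_modules and decoders stand in for the paper's fully convolutional modules and are placeholders, not the actual architecture.

```python
# A schematic of Devon's multi-stage forward pass as described in Fig. 3.
import torch


def devon_forward(I, J, f, relation_modules, decoders):
    """f: shared feature extraction module; relation_modules[t] builds R_t,
    a concatenation of deformable cost volumes; decoders[t] is g_t."""
    fI, fJ = f(I), f(J)  # shared features, all at the same resolution
    B, _, H, W = I.shape
    flow = torch.zeros(B, 2, H, W, dtype=I.dtype)  # zero flow for the first stage
    flows = []
    for R_t, g_t in zip(relation_modules, decoders):
        volume = R_t(fI, fJ, flow)  # deformable cost volumes around current flow
        flow = g_t(volume)          # decode the refined flow F_t
        flows.append(flow)
    return flows                    # [F_1, F_2, F_3] for a three-stage Devon
```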