1 Introduction

An increasingly popular approach to representation learning is to use proxy tasks that do not require manual annotations. In this paper, we explore using motion cues, represented as optical flow, to formulate a proxy task for self-supervision. Inspired by the Gestalt principle of common fate, we develop a framework that groups pixels which exhibit “coherent” motion. Crucially, this grouping is obtained by looking at a single image only; the optical flow is used only in the loss function. At test time, the model can therefore be deployed without video or flow information. The underlying assumption is that a segment containing an object exhibits “coherent” motion, so a network trained with our objective learns to segment objects or object parts. We call this framework the Self-Supervised Segmentation-CNN or S3-CNN. An illustration is provided in Fig. 1a.

Our formulation can be easily extended to the case where motion is induced by action/ego-motion. This extension is more expensive to experiment with and hence we restrict ourselves to offline videos.

Fig. 1.

(a) We propose to learn a neural network operating on images using temporal information contained in videos as supervision. The learning goal is to predict regions that are likely to have “coherent” optical flow. Flow can be observed by the loss, but not by the CNN. It encourages the network to learn about object-part-like regions in images. (b) Affine Motion Loss: Optical flow within each region is approximated using an affine transformation (\(A_1, \cdots , A_M\)). These are recombined to give a reconstructed flow which is compared against ground truth.

2 Related Work

Self-supervised Learning. S3-CNN is a self-supervised pre-training scheme to learn a feature extractor that can be fine-tuned for other tasks. We review closely related prior work, grouping it by the nature of the pre-training loss.

The first group comprises methods that predict an auxiliary input \(\mathbf {y}\) given an image \(\mathbf {x}\). For example, [1] use RNNs to predict future frames in videos. Similarly, colorization [2, 3] predicts colour given grayscale input. A generalization to arbitrary pairs of modalities was proposed in [4]. Recent work has explored the geometric target of surface normals [5]. Closely related to our work is the use of video segmentation by [6]: they use an off-the-shelf video segmentation method to construct a foreground-background segmentation dataset to pretrain a CNN. We differ from them in that we do not require a sophisticated pre-existing pipeline to extract video segments, but use optical flow directly.

The second group of self-supervised methods reconstructs (properties of) the image \(\mathbf {x}\) given an incomplete or corrupted version of the same. For example, [7] solve the inpainting problem, where part of the image is occluded. Alternative low-dimensional targets have also been explored: [8] learn to predict the global image rotation, [9] predict the relative position of two patches extracted from an image, [10, 11] solve a jig-saw puzzle problem, and [12] improve upon such context-based methods. The temporal analogues of these are methods that predict the correct ordering of frames [13,14,15] or embed frames using temporal cues [16,17,18,19,20]. [21] train a siamese-style convolutional neural network to predict the transformation between two images. [22] use videos along with spatial context pretraining [9] to construct an image graph, exploiting transitivity in the graph to learn representations.

Our approach borrows from both paradigms. We predict a property of image \(\mathbf {x}\) – a grouping of its pixels. At the same time, we supervise these segments using auxiliary data. This adds richer supervision than can be obtained by looking at cues contained in image \(\mathbf {x}\) alone.

Segmentation Cues. Our method is based on using various motion cues to evaluate image regions and in this way relates to classical work [23,24,25]. These methods, however, use motion at test/inference time while we use it only at training time for supervision.

3 Method: Self-supervised Grouping Losses

Our idea is to learn a CNN that predicts a segmentation \(\varPhi : \mathbf {x}\mapsto \mathbf {m}\in \{1,\dots ,L\}^{H\times W}\) of the image. Pixels \(u \in [1,\dots ,H]\times [1,\dots ,W]\) within each region l are assumed to be i.i.d. with respect to a simple parametric distribution \(p(f_u|\theta _l)\), where \(f_u\) is the flow at pixel u. Marginalizing over the region parameters \(\theta _l\), with prior \(p(\theta _l)\), results in the model:

$$\begin{aligned} p(\mathbf {f}|\mathbf {m}) = \prod _{l=1}^L \int \left[ \prod _{u : m_u = l} p(f_u | \theta _l) \right] p(\theta _l)\,d\theta _l. \end{aligned}$$
(1)

Crucially, due to the marginalization, network \(\varPhi \) is not tasked with predicting the transformation parameters \(\theta \), but only the regions \(\mathbf {m}\). As a simpler alternative to marginalizing by integration, in the rest of this extended abstract we marginalize the model parameters by maximization and drop the prior on the parameters, so that the probability density for a region is written as:

$$\begin{aligned} p(\mathbf {f}|\mathbf {m}) = \prod _{l=1}^L \max _{\theta _l} \prod _{u : m_u = l} p(f_u | \theta _l). \end{aligned}$$
(2)

We further adapt the formulation to soft segments \(\mathbf {m}\in [0,1]^{H \times W \times L}\). We experiment with two choices for \(\theta _l\): affine transforms and flow-magnitude histograms.
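One natural reading of the soft-segment adaptation, which we state here as an assumption for concreteness, is to weight each pixel’s negative log-likelihood by its soft assignment \(m_{u,l}\):

$$\begin{aligned} -\log p(\mathbf {f}|\mathbf {m}) \approx \sum _{l=1}^L \min _{\theta _l} \sum _{u} m_{u,l}\left( -\log p(f_u|\theta _l)\right) , \qquad \sum _{l=1}^L m_{u,l} = 1 \text { for every } u. \end{aligned}$$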

Affine Transformations: We fit an affine motion model to the optical flow within each segment. This “fit” corresponds to the \(\max \) operation in Eq. (2) and is computed by solving a weighted least squares problem. As a proxy for the likelihood in Eq. (2), our loss function is a robust residual between the affine approximation and the optical flow \(\mathbf {f}\). This motion-based self-supervision loss conveys a notion of coherent motion within each segment based on an affine approximation of its optical flow. Computing the loss requires solving a weighted least squares problem online in the network’s forward pass, which amounts to matrix arithmetic and a matrix inverse, all of which are differentiable.
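A minimal sketch of this loss, assuming PyTorch tensors with soft masks of shape \([L, H, W]\) and flow of shape \([2, H, W]\); the ridge term and the Huber penalty are illustrative choices:

```python
# Minimal sketch of the affine motion loss: fit an affine flow model per soft
# segment by weighted least squares, recombine the per-segment predictions,
# and penalize the robust residual against the observed flow.
import torch

def affine_motion_loss(m, f, eps=1e-4):
    L, H, W = m.shape
    # Affine design matrix X = [x, y, 1] per pixel, shape [H*W, 3].
    ys, xs = torch.meshgrid(torch.arange(H, dtype=f.dtype, device=f.device),
                            torch.arange(W, dtype=f.dtype, device=f.device),
                            indexing="ij")
    X = torch.stack([xs.flatten(), ys.flatten(),
                     torch.ones(H * W, dtype=f.dtype, device=f.device)], dim=1)
    F = f.reshape(2, -1).t()                      # [H*W, 2] flow targets
    recon = torch.zeros_like(F)
    for l in range(L):
        w = m[l].flatten()                        # soft weights for segment l
        Xw = X * w.unsqueeze(1)                   # weighted design matrix
        # Weighted least squares: theta = (X^T W X + eps I)^(-1) X^T W F, shape [3, 2].
        A = Xw.t() @ X + eps * torch.eye(3, dtype=f.dtype, device=f.device)
        theta = torch.linalg.solve(A, Xw.t() @ F)
        # Recombine: mask-weighted affine prediction for every pixel.
        recon = recon + w.unsqueeze(1) * (X @ theta)
    # Robust residual between reconstructed and observed flow (Huber as an example).
    return torch.nn.functional.huber_loss(recon, F)
```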

Low Entropy Motion Loss. Instead of fitting parametric motion models to regions, histograms offer a general non-parametric alternative. We compute a histogram of the flow magnitude within each segment. The histogram itself constitutes \(\theta _l\) (Eq. (2)), and \(\mathbf {f}\) is the flow magnitude rather than the 2D flow vectors. The entropy of this histogram is used as a loss, again as a proxy for the likelihood in Eq. (2). We assume that a segment straddling different independently movable objects will produce a high-entropy histogram; in other words, the histogram entropy loss encourages the separation of independently movable objects.
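A minimal sketch of this loss under the same tensor conventions; the bin count and magnitude range are illustrative assumptions:

```python
# Minimal sketch of the flow-magnitude histogram entropy loss: build a soft
# (mask-weighted) histogram of flow magnitudes per segment and penalize its entropy.
import torch

def histogram_entropy_loss(m, f, num_bins=16, max_mag=20.0, eps=1e-8):
    L = m.shape[0]
    mag = f.norm(dim=0).flatten()                            # flow magnitude per pixel
    # Hard-assign each magnitude to a bin; the flow is fixed supervision, so the
    # loss remains differentiable with respect to the soft masks.
    bins = torch.clamp((mag / max_mag * num_bins).long(), max=num_bins - 1)
    onehot = torch.nn.functional.one_hot(bins, num_bins).to(m.dtype)   # [H*W, B]
    weights = m.reshape(L, -1)                               # [L, H*W]
    hist = weights @ onehot                                  # soft counts, [L, B]
    hist = hist / (hist.sum(dim=1, keepdim=True) + eps)      # normalize per segment
    entropy = -(hist * (hist + eps).log()).sum(dim=1)        # per-segment entropy
    return entropy.mean()
```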

4 Experiments

We show qualitative results as sample image segmentations generated by our S3-CNN. We then assess its capability to pre-train for image recognition. In these experiments, we use a Fully Convolutional Network (FCN-8s) [26] model built on VGG-16 [27]. FCN scores are mapped to soft segmentation masks as in [28]. Parameter-free batch normalization [29] is used after every convolutional and fully connected layer in the pretraining stage.
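A minimal sketch of these two details in PyTorch; the per-pixel softmax is our reading of the score-to-mask mapping, and the module choice for parameter-free batch normalization is an assumption:

```python
# Scores from the FCN head are mapped to soft masks with a per-pixel softmax
# over the L output channels; "parameter-free" batch normalization is realized
# here as BatchNorm2d without learnable scale and shift.
import torch

def scores_to_soft_masks(scores):            # scores: [N, L, H, W]
    return torch.softmax(scores, dim=1)      # per-pixel soft assignment over segments

parameter_free_bn = torch.nn.BatchNorm2d(num_features=64, affine=False)
```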

Qualitative Results: First, we demonstrate our method on a toy problem. The data consists of synthetic videos of a single translating and rotating 3D textured cube (Fig. 2a); paired with the corresponding optical flow field. Cubes are imaged under an orthographic camera, so that the affine motion model of Sect. 3 applies to each cube face. We train a network to predict 5 segments with self-supervision from five sequences containing 99 frames each. As seen in Fig. 2a, the network learns to correctly group together the pixels in each cube face.

Fig. 2.

Predicted regions are visualized by a colour map.

Next we consider Sintel [30], which contains videos from an animated 3D movie, and use the affine flow model to learn a grouping of image regions. While this model offers only a loose approximation of the complex motions in these videos, informative regions can still be learned, as the affine approximation is quite good for body parts and other small objects. The results obtained, on training and validation images, by the model trained with the affine flow loss on 20 training sequences from Sintel are shown in Fig. 2b, where several objects and parts are highlighted. Notice in particular that even bodies and heads are picked up despite their non-planar structure.

Real-world data exhibits large systematic noise in automatically computed optical flow, and we find that the histogram entropy loss works best in this setting. Figure 2c shows qualitative results on frames from the YouTube Objects dataset [31, 32], predicted by our model trained on frames extracted from YFCC100m [33] and supervised using the flow-magnitude histogram entropy loss (Sect. 3). The cat boundaries align well with segments in the first column, and the bird in the middle is segmented out. Each segment also caters to one spatial region: the teal coloured region is always in the middle left, whereas the light green region is always in the top right corner.

Pre-training for Object Recognition: Our approach can also be used as a proxy to pre-train a generic feature extractor, whose features can then be fine-tuned for other tasks such as image classification. To test this, we follow the protocol of [34] to evaluate on Pascal VOC 2007 classification. Batch normalization moments are absorbed into the convolution filters and biases before fine-tuning, as sketched below.
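A minimal sketch of this absorption step, assuming PyTorch conv/BN module pairs; this is standard batch-norm folding, with the parameter-free case corresponding to the learnable scale and shift being absent:

```python
# Fold batch-normalization running statistics into the preceding convolution's
# weights and bias, so the fine-tuned network no longer needs the BN layer.
import torch

@torch.no_grad()
def fold_bn_into_conv(conv, bn):
    # BN(conv(x)) = scale * (conv(x) - mean) + shift, with scale = gamma / sqrt(var + eps).
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = (bn.weight / std) if bn.weight is not None else (1.0 / std)
    conv.weight.mul_(scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    new_bias = (bias - bn.running_mean) * scale
    if bn.bias is not None:                      # absent for parameter-free BN
        new_bias = new_bias + bn.bias
    conv.bias = torch.nn.Parameter(new_bias)
    return conv
```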

We first pre-train our S3-CNN model on optical flow and frames extracted from videos in the YFCC100m dataset. We use 150k videos and compute optical flow between the first and fifth frame of each using EpicFlow [35], with initial matches given by FlowFields [36]. This yields a dataset of 150k frames.

Table 1. We fine-tune our model for VOC-07 classification (% mAP on test split).

Table 1 lists methods that report results on VOC-07 classification using a VGG-16 based model. We observe that our S3-CNN model performs better than a non-pretrained VGG-16. We are competitive with state-of-the-art models for VOC-07 classification: 76.35% mAP compared to the 77.2% mAP of [2], despite using only 150k pre-training pairs compared to their pretraining dataset of 3.7M images. Lastly, we trained an AlexNet model akin to that of [6] by constructing an AlexNet FCN S3-CNN. We compare with them on VOC-07 classification and obtain 57.37% mAP versus their 61% mAP. This is promising given that we use 150k images versus their dataset of 1.6M images.

5 Conclusions

We have presented the S3 framework, which allows supervising neural network architectures for general-purpose feature extraction using optical flow.