1 Introduction

Optical flow is fundamentally pointwise local motion on an imaging plane (retina) [1,2,3,4]. This pointwise motion is low-level information for the perception of global motion [5,6,7,8,34,35]. In this paper, we introduce a model for the extraction of cues for the perception of global motion from optical flow fields using the temporal transportation [9, 10] of optical flow fields over time.

Flow vectors locally extract point correspondences between a pair of successive images on the retina [1]. These local correspondences are applied to motion tracking, because the temporal evolution of a correspondence describes the temporal trajectory of a point in a video stream of images.

In the motion-perception pathway of the human visual cortex, independent components of the optical flow field on the retina [11, 36] are transmitted from the middle temporal area (MT) to the medial superior temporal area (MST) [11,12,13,14,15,16]. There, pointwise local motion is transformed into intermediate-level information for motion cognition. Flying insects also control motion using optical flow. Honeybees navigate using optical flow [17,18,19,20,21]. The compound eyes [18, 19, 38, 39] of insects perceive spherical optical flow fields [38, 39]. The divergence of the spherical optical flow field indicates the direction of flight in the global environment [19]. Disparities between the optical flow fields on the left and right hemispheres control the direction of flight in the local environment. Therefore, temporal optical flow fields generated on the spherical retina of an omnidirectional camera system provide cues for navigation [39]. These geometric properties of optical flow fields on the spherical retina are the basis of insect-inspired visual navigation. Geometrical processing of optical flow fields on the spherical retina yields syntactical information for robot navigation [37,38,39].

Autonomous vehicles navigate using images captured on a planar retina [40, 41]. We have developed an algorithm for the generation of motion semantics from optical flow fields generated on a planar retina, which is the common imaging process for non-compound-eye systems.

In a previous paper, we introduced a model for the extraction of cues for recognising global spatial motion from scene flow fields using the temporal transportation of the vector field [40]. As a comparative study with our previous results, we apply the same idea to optical flow fields on a planar retina. This comparative study implies that, for global motion perception, optical flow fields, which are computed from a monocular image sequence, possess properties similar to those of scene flow fields.

2 Metric for Optical Flow Fields

Setting \(\varvec{u}(\varvec{x})=(u(\varvec{x}),v(\varvec{x}))^\top \) for \(\varvec{x}=(x,y)^\top \in \mathbf{R}^2\) to be the optical flow field on the two-dimensional Euclidean space, the directional histogram [22] of \(\varvec{u}(\varvec{x})\) is obtained by integrating the magnitude of \(\varvec{u}(\varvec{x})\) over the region of interest (ROI), that is,

$$\begin{aligned} h_{\varvec{x}}(\theta ;\varvec{u})=\frac{1}{|\varOmega (\varvec{x})|} \int _{\left\{ \varvec{y}\in \varOmega (\varvec{x}) \, : \, \frac{\varvec{u}(\varvec{y})}{|\varvec{u}(\varvec{y})|} =(\cos \theta , \sin \theta )^\top \right\} }|\varvec{u}(\varvec{y})| \, d\varvec{y}, \end{aligned}$$
(1)

where \(\varOmega (\varvec{x})\), \(|\varOmega (\varvec{x})|\) and \(\varvec{x}\in \mathbf{R}^2\) are the ROI, the area measure of the ROI and the reference point of the ROI, respectively.
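A minimal NumPy sketch of the discrete counterpart of Eq. (1), assuming the ROI is the whole array and directions are quantised into N cyclic bins (the function name and the choice N = 16 are ours):

```python
import numpy as np

def directional_histogram(u, v, N=16):
    """Cyclic directional histogram of a flow field (cf. Eq. (1)):
    accumulate flow magnitude per quantised direction and normalise
    by the area of the ROI."""
    mag = np.sqrt(u ** 2 + v ** 2)
    theta = np.arctan2(v, u) % (2 * np.pi)            # direction in [0, 2*pi)
    bins = (theta / (2 * np.pi) * N).astype(int) % N  # quantise to N directions
    h = np.zeros(N)
    np.add.at(h, bins.ravel(), mag.ravel())           # integrate |u(y)| per bin
    return h / u.size                                 # divide by |Omega(x)|
```

Applied to the two component arrays of a flow field restricted to an ROI, this yields the discrete cyclic histogram \(\{f_{mn}(p)\}_{p=0}^{N-1}\) used below.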

The distance between two optical flow fields \(\varvec{u}(\varvec{x})\) and \(\varvec{v}(\varvec{x})\) in the region \(\varLambda \) is defined as

$$\begin{aligned} D(\varvec{u},\varvec{v}) =\left( \int _{\varLambda } \left( \min _{\phi } \min _{c_{\varvec{x}}(\theta ,\theta ')} \int _0^{2\pi } \!\! \int _0^{2\pi } |h_{\varvec{x}}(\theta -\phi ;\varvec{u})-h_{\varvec{x}}(\theta ';\varvec{v})|^2 c_{\varvec{x}}(\theta ,\theta ') \, d\theta \, d\theta ' \right) d\varvec{x} \right) ^{\frac{1}{2}}, \end{aligned}$$
(2)

where

$$\begin{aligned} \int _{0}^{2\pi } c_{\varvec{x}}(\theta ,\theta ')\, d\theta \le h_{\varvec{x}}(\theta ';\varvec{v}), \, \, \, \int _{0}^{2\pi } c_{\varvec{x}}(\theta ,\theta ')\, d\theta ' \le h_{\varvec{x}}(\theta ;\varvec{u}), \end{aligned}$$
(3)

for \(c_{\varvec{x}}(\theta ,\theta ')\ge 0\), using the transportation [9] of the directional histograms [22] of the fields.

For the discrete optical flow field \(\varvec{u}_{mn}=(u_{mn},v_{mn})^\top \) at the point \((m,n)^\top \) on the discrete plane \(\mathbf{Z}^2\), let \(\{f_{mn}(p)\}_{p=0}^{N-1}\) be the cyclic directional histogram for the directions \(\varvec{\omega }_{p}=(\cos \frac{2\pi p}{N}, \sin \frac{2\pi p}{N})^\top \). For the discrete cyclic histograms \(F_{mn}=\{f_{mn}(i)\}_{i=0}^{N-1}\) and \(G_{mn}=\{g_{mn}(i)\}_{i=0}^{N-1}\), such that \(f_{mn}(i+N)=f_{mn}(i)\) and \(g_{mn}(i+N)=g_{mn}(i)\), we define the transportation between the histograms as

$$\begin{aligned} d_{mn}(F_{mn},G_{mn})= \left( \min _k \min _{c_{ij}^{mn}(k)} \sum _{i=0}^{N-1} \sum _{j=0}^{N-1} |f_{mn}(i)-g_{mn}(j-k)|^2 \, c_{ij}^{mn}(k) \right) ^{\frac{1}{2}},\end{aligned}$$
(4)
$$\begin{aligned} \sum _{i=0}^{N-1}c_{ij}^{mn}(k)\le g_{mn}(j-k), \, \, \, \sum _{j=0}^{N-1}c_{ij}^{mn}(k)\le f_{mn}(i), \, \, \, c_{ij}^{mn}(k)\ge 0. \end{aligned}$$
(5)

Setting \(A_{ij}^{mn}(k)=|f_{mn}(i)-g_{mn}(j-k)|^2\), the minimisation of \(J_{mn}(k)\)

$$\begin{aligned} J_{mn}(k)=\min _{c_{ij}^{mn}(k)} \sum _{i=0}^{N-1}\sum _{j=0}^{N-1}A_{ij}^{mn}(k)\, c_{ij}^{mn}(k), \end{aligned}$$
(6)

with the constraints of Eq. (5) is solved by linear programming for each \(k=0,1,\cdots , N-1\). Then, we define the metric between discrete vector fields \(\varvec{u}_{mn}\) and \(\varvec{v}_{mn}\) in the ROI \(\varLambda \) on the two-dimensional discrete plane \(\mathbf{Z}^2\) as

$$\begin{aligned} d(\varvec{u},\varvec{v})=\sqrt{\sum _{(m,n)^\top \in \varLambda } d_{mn}(F_{mn},G_{mn})^2}, \, \, \, d_{mn}(F_{mn},G_{mn})=\min _k \sqrt{J_{mn}(k)}. \end{aligned}$$
(7)
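The minimisation of Eq. (6) under the constraints of Eq. (5) can be sketched with an off-the-shelf linear-programming solver. Note that the constraints of Eq. (5) alone admit the trivial solution \(c_{ij}^{mn}(k)=0\); following the maximum-flow interpretation of Fig. 1, the sketch below adds a total-mass equality constraint, which is our assumption:

```python
import numpy as np
from scipy.optimize import linprog

def transport_distance(f, g):
    """Cyclic transportation distance between histograms f and g
    (cf. Eqs. (4)-(7)). A total-mass equality constraint is added so
    that the maximum flow is moved (our assumption; Eq. (5) alone
    admits the trivial solution c = 0)."""
    N = len(f)
    best = np.inf
    for k in range(N):                            # try every cyclic shift
        gk = np.roll(g, -k)                       # g(j - k)
        A = (f[:, None] - gk[None, :]) ** 2       # cost A_ij(k) of Eq. (6)
        A_ub, b_ub = [], []
        for i in range(N):                        # sum_j c_ij <= f(i)
            row = np.zeros(N * N); row[i * N:(i + 1) * N] = 1
            A_ub.append(row); b_ub.append(f[i])
        for j in range(N):                        # sum_i c_ij <= g(j - k)
            col = np.zeros(N * N); col[j::N] = 1
            A_ub.append(col); b_ub.append(gk[j])
        A_eq = [np.ones(N * N)]                   # move the maximum flow
        b_eq = [min(f.sum(), gk.sum())]
        res = linprog(A.ravel(), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
        if res.success:
            best = min(best, res.fun)             # J_mn(k)
    return np.sqrt(best)                          # d_mn = min_k sqrt(J_mn(k))
```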

Figure 1 shows the process of transportation between a pair of circular histograms P and Q. Panels (a) and (b) show two probability distributions on a circle and their samples on the circle. The top row in (c) shows the residual values after the maximum flows have moved from each bin of P to the bins of Q. The bottom row in (c) shows the flows moved from P to Q as the maximum flow between the histograms.

Fig. 1.

Examples of the computation of the transportation distances. (a) and (b) show two probability distributions on circles and their samples on the circles. The top row in (c) shows the residual values after the maximum flows move from each bin of P to the bins of Q. The bottom row in (c) shows the flows that move from P to Q as the maximum flow between samples. (d) The state at the end of the computation: all sampled values of P have been moved to the bins of Q.

3 Symbolisation of Global Motion

The temporal trajectory of the distance between a successive pair of optical flow fields \(\varvec{u}(\varvec{x},t+1)\) and \(\varvec{u}(\varvec{x},t)\) of the spatiotemporal image \(f(\varvec{x},t)\) is

$$\begin{aligned} H(t;f)=d(\varvec{u}(\varvec{x},t+1), \varvec{u}(\varvec{x},t)). \end{aligned}$$
(8)
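Reusing the two sketches above, the trajectory of Eq. (8) for a single ROI can be computed as follows; flow_seq is a hypothetical list of per-frame flow-component arrays, and the sum over ROIs of Eq. (7) is omitted for brevity:

```python
import numpy as np

def H_trajectory(flow_seq, N=16):
    """Temporal trajectory H(t; f) of Eq. (8) for one ROI, built from the
    sketches directional_histogram and transport_distance given above."""
    hists = [directional_histogram(u, v, N) for (u, v) in flow_seq]
    return np.array([transport_distance(hists[t + 1], hists[t])
                     for t in range(len(hists) - 1)])
```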

Setting \(H_t(t;f)\) and \(H_{tt}(t;f)\) to be the first and second derivatives, respectively, of \(H(t;f)\), we define the interval \(I_i=[t_i,t_{i+1}]\) along the time axis t using a pair of successive zero-crossings of \(H_{tt}(t;f)\). Using the \(l_1\) linear approximation of \(H(t;f)\) such that

$$\begin{aligned} \bar{H}(t;f)=a_it+b_i, \end{aligned}$$
(9)

which minimises the criterion

$$\begin{aligned} J(a_i,b_i)=\sum _{i=1}^n\sum _{j=1}^{n(i)}|H(t_{i(j)};f)-(a_it_{i(j)}+b_i)|, \end{aligned}$$
(10)

where \(t_{i(j)}\in I_i\), we allocate symbols to spatial motion.

From the sign of \(a_i\), we define the symbols of motion of \(f(\varvec{x},t)\) in the interval \(I_i=[t_i,t_{i+1}]\) as \(\{\nearrow , \rightarrow , \searrow \}\), where

$$\begin{aligned} S(H(t;f))=\left\{ \begin{array}{ll} \nearrow &{} \text{ if } a_i>0 \text{ for } t\in I_i,\\ \rightarrow &{} \text{ if } a_i=0 \text{ for } t\in I_i,\\ \searrow &{} \text{ if } a_i<0 \text{ for } t\in I_i, \end{array} \right. \, \, \, \, S(\log H(t;f))=\left\{ \begin{array}{ll} \nearrow &{} \text{ if } a_i>0 \text{ for } t\in I_i,\\ \rightarrow &{} \text{ if } a_i=0 \text{ for } t\in I_i,\\ \searrow &{} \text{ if } a_i<0 \text{ for } t\in I_i. \end{array} \right. \end{aligned}$$
(11)
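A sketch of the symbolisation: the \(l_1\) fit of Eq. (10) is posed as a linear programme with slack variables, and the sign of the slope yields the symbol of Eq. (11). The tolerance eps deciding \(a_i=0\) is our assumption:

```python
import numpy as np
from scipy.optimize import linprog

def l1_line_fit(t, y):
    """Least-absolute-deviation line fit y ~ a*t + b (cf. Eq. (10)),
    with slack variables e_j >= |y_j - (a*t_j + b)|."""
    n = len(t)
    c = np.concatenate([[0.0, 0.0], np.ones(n)])   # minimise sum of slacks
    A_ub, b_ub = [], []
    for j in range(n):
        e = np.zeros(n); e[j] = -1.0
        A_ub.append(np.concatenate([[ t[j],  1.0], e]))  #  (a t + b) - y <= e
        A_ub.append(np.concatenate([[-t[j], -1.0], e]))  # -(a t + b) + y <= e
        b_ub += [y[j], -y[j]]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None), (None, None)] + [(0, None)] * n)
    return res.x[0], res.x[1]                      # slope a_i, intercept b_i

def symbol(a, eps=1e-3):
    """Map the slope a_i to a symbol as in Eq. (11); eps is our tolerance."""
    return '→' if abs(a) <= eps else ('↗' if a > 0 else '↘')
```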

4 Numerical Examples

Table 1. Status of the three sequences from the KITTI Scene Flow Dataset 2015 [44].
Fig. 2.

Examples of the Wasserstein distances

Table 2. Extracted events

For the numerical experiments, three image sequences from the left images of the KITTI Scene Flow Dataset 2015 [44] are selected.

For event extraction using Eq. (11), we employ both \(S(H(t;f))\) and \(S(\log H(t;f))\), since \(S(\log H(t;f))\) allows us to detect symbols from small perturbations of \(H(t;f)\).

Table 1 lists the status of the images. Figure 2 shows the temporal trajectories of the transportation of the vector fields. Table 2 shows the event strings extracted by linear approximation using Eq. (10) and symbolisation using Eq. (11). These experiments show that the algorithm extracts symbol strings that describe the states in front of driving cars in various environments.

5 Dictionary Generation

Tables 3 and 4 show the speed of objects and the geometry of the scenes, respectively, in the synthetic video sequences. Figure 3 shows top views of the geometric configurations of objects in the synthetic video sequences. Table 5 shows combinations of events, as symbol strings, captured by a vehicle-mounted camera in a synthetic world. In Figs. 4, 5 and 6, (a) and (b) show a frame view of the image sequence and its optical flow field, respectively. In Figs. 4, 5 and 6 (c), \(\log H(t;f)\) and \(\log \bar{H}(t;f)\) are drawn as the blue curve and the red polygonal curve, respectively.

Tables 6 and 7 show the strings \(S(\log H(t;f))\) and \(S(H(t;f))\) detected by the algorithm using \(\log H(t;f)\) and \(H(t;f)\), respectively. Since both \(\log H(t;f)\) and \(H(t;f)\) are approximated by polygonal curves for the extraction of symbol strings, events are described by using \(\vee \), \(\wedge \) and M based on the semi-local shapes of the curves.

Table 3. Speed of objects in synthetic image sequences.
Table 4. Geometry in synthetic image sequences.
Fig. 3.

Top views of geometric configurations of objects in synthetic video sequences simulating city environments. The blue car is the ego-vehicle for experiments. The green car is the object-vehicle for experiments. The lane width is 2.8 m. The pavement width is 1 m–1.5 m.

Fig. 4.

Motion of synthetic image No. 7. (a) A frame view of the image sequence. (b) Optical flow field. (c) \(\log H(t;f)\) and \(\log \bar{H}(t;f)\) are drawn in the blue curve and red polygonal curve, respectively. (Color figure online)

Fig. 5.

Motion of synthetic image No. 8. (a) A frame view of the image sequence. (b) Optical flow field. (c) \(\log H(t;f)\) and \(\log \bar{H}(t;f)\) are drawn in the blue curve and red polygonal curve, respectively. (Color figure online)

Fig. 6.

Motion of synthetic image No. 16. (a) A frame view of the image sequence. (b) Optical flow field. (c) \(\log H(t;f)\) and \(\log \bar{H}(t;f)\) are drawn in the blue curve and red polygonal curve, respectively. (Color figure online)

Table 5. Events in synthetic data
Table 6. Symbol strings extracted from \(\log H(t;f)\)
Table 7. Symbol strings extracted from H(tf)

The five pairs 1 and 2, 5 and 7, 6 and 8, 13 and 14, and 15 and 16 provide the same environments with and without oncoming vehicles. These examples show that pairs 1 and 2, 5 and 7, 6 and 8, and 13 and 14 yield the same symbol strings. Pairs 1 and 2, and 13 and 14, imply that the temporal transportation of optical flow vector fields achieves recognition of oncoming vehicles. The algorithm detects acceleration and deceleration of the ego-vehicle.

The results observed for the pair 7 and 8 show that additional information is required for the detection of the direction of turning, since the optical flow fields for left and right turns possess the same statistical properties.

The difference between the results observed for the pair 15 and 16 depends on the background structure caused by trees, since the correspondences between a pair of natural scenes contain ambiguities. Moreover, for the detection of oncoming vehicles, the pointwise optical flow vectors are required.

The algorithm does not distinguish left and right turns, since the temporal trajectories of the distance between the two fields possess the same shape profiles. However, it is possible to detect the starting frame of a turn, since the symbol \(\wedge \) is detected at that frame. For the detection of straight motion from real sequences, both symbol strings \(S(H(t;f))\) and \(S(\log H(t;f))\) are necessary, since in real sequences of straight motion temporal local perturbations of the optical flow vectors are detected. These local perturbations induce perturbations in \(H(t;f)\) and \(\log H(t;f)\).

6 Discussion

For the function \(f(\varvec{x},t)\) defined in \(\mathbf{R}^n\), the total derivative with respect to the variable t is

$$\begin{aligned} \frac{df}{dt}=\nabla f^\top \frac{d\varvec{x}}{dt} +f_t. \end{aligned}$$
(12)

Mathematically, optical flow is the solution of the linear equation \(\frac{df}{dt}=0\). This under-determined linear equation is solved by regularisation [3, 23] or by local geometric constraints [1, 33].
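For concreteness, the classical Horn–Schunck regularisation [3] can be sketched as follows; the parameters alpha and iters are illustrative choices, not values from this paper:

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(f1, f2, alpha=1.0, iters=100):
    """Regularised optical flow between frames f1 and f2 (float arrays):
    iterate the Horn-Schunck update derived from df/dt = 0 plus a
    smoothness term."""
    kx = 0.25 * np.array([[-1.0, 1.0], [-1.0, 1.0]])
    ky = 0.25 * np.array([[-1.0, -1.0], [1.0, 1.0]])
    fx = convolve(f1, kx) + convolve(f2, kx)        # spatial derivatives
    fy = convolve(f1, ky) + convolve(f2, ky)
    ft = convolve(f2 - f1, 0.25 * np.ones((2, 2)))  # temporal derivative
    avg = np.array([[1, 2, 1], [2, 0, 2], [1, 2, 1]]) / 12.0
    u = np.zeros_like(f1); v = np.zeros_like(f1)
    for _ in range(iters):
        u_bar, v_bar = convolve(u, avg), convolve(v, avg)
        d = (fx * u_bar + fy * v_bar + ft) / (alpha ** 2 + fx ** 2 + fy ** 2)
        u, v = u_bar - fx * d, v_bar - fy * d       # Horn-Schunck update
    return u, v
```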

In medical volumetric-image analysis, for instance the motion analysis of moving organs, we are required to deal with volumetric images defined in three-dimensional Euclidean space \(\mathbf{R}^3\). In computer vision, optical flow is usually computed from planar images.

For motion analysis with range data, setting \(f(x,y,t)\) to be a grey-level image, we deal with the following system of equations

$$\begin{aligned} \frac{df}{dt}= & {} f_x \frac{dx}{dt}+f_y \frac{dy}{dt}+f_t=0, \nonumber \\ \frac{dg}{dt}= & {} h_x \frac{dx}{dt}+h_y \frac{dy}{dt}-\frac{dz}{dt}+h_t=0, \end{aligned}$$
(13)

where \(g(x,y)=h(x,y,t)-z\) for the depth z of the temporal range image \(h(x,y,t)\) [24].

For colour and multi-channel images, the system of equations

$$\begin{aligned} \frac{df^\alpha }{dt} = f_x^\alpha \frac{dx}{dt} +f_y^\alpha \frac{dy}{dt} +f_t^\alpha =0, \, \, \, \, \alpha =1,2,\cdots ,k \end{aligned}$$
(14)

is derived from the k-channel images [25, 26].

For the left image \(f(x_l,y_l,t)\) and the right image \(g(x_r,y_r,t)\) of temporal stereo-pair images, the system of equations

$$\begin{aligned} \frac{df}{dt}= & {} f_x u_l+f_y v_l+f_t=0\end{aligned}$$
(15)
$$\begin{aligned} \frac{dg}{dt}= & {} g_x u_r+g_y v_r+g_t=0 \end{aligned}$$
(16)

derive the optical flow vectors \(\varvec{u}_l=(u_l,v_l)^\top \) and \(\varvec{u}_r=(u_r,v_r)^\top \) on the left and right images, respectively. After establishing correspondences between \(\varvec{x}_l\) and \(\varvec{x}_r\) and between \(\varvec{x}_l+\varvec{u}_l\) and \(\varvec{x}_r+\varvec{u}_r\), the stereo reconstruction algorithm computes the scene flow \(\dot{\varvec{X}}\) in space using disparities between temporal stereo-pair images. The estimation of correspondences is established by solving the system of equations

$$\begin{aligned} f(x+d,y,t)=g(x,y,t), \, \, f(x+u_l+d'_1,y+v_l+d'_2,t)=g(x+u_r,y+v_r,t) \end{aligned}$$
(17)

for the displacement \(\varvec{d}=(d,0)^\top \) and \(\varvec{d}'=(d'_1,d'_2)^\top \).

For images on a manifold \(\mathcal {M}\), the optical flow vector field is the solution of the equation

$$\begin{aligned} \frac{df}{dt} =\nabla _{\mathcal {M}} f^\top \frac{d\varvec{\nu }}{dt}+f_t=0 \end{aligned}$$
(18)

where \(\nabla _{\mathcal {M}}\) is the gradient operation on the manifold. For example, if \(\mathcal {M}\) is the unit sphere \(S^2\) in three-dimensional Euclidean space \(\mathbf{R}^3\), the gradient operation is

$$\begin{aligned} \nabla _{\mathcal {M}}f = \left( \frac{\partial }{\partial \theta } f, \frac{1}{\cos \theta } \frac{\partial }{\partial \phi }f \right) ^\top . \end{aligned}$$
(19)

Equation (18) allows us to compute the optical flow vectors on a spherical retina, which is the mathematical model of compound eyes.

In this paper, we have shown a method to extract intermediate cues for motion perception from optical flow fields on the plane [34,35,36, 41]. It is possible to apply the event-extraction method based on the transportation of optical flow fields to scene flow fields [40] and to optical flow fields on a non-planar retina [38]. In reference [38], we have shown a method to extract intermediate cues for motion perception from optical flow fields on a sphere.

Moreover, we have developed a method to decompose the optical flow fields [27, 28] on the surfaces of moving organs [42] employing three-dimensional optical flow computation.

The optical flow fields between pairs of successive images in a sequence provide cues for image alignment. Aligning images along the time axis achieves the tracking of images in a video sequence [2]. Therefore, tracking is a sequential alignment. Multiple alignment in space by deformation fields derives the deformation-based average of images.

For a collection of images \(\{f_i(\varvec{x})\}_{i=1}^m\), setting \(\varvec{u}_i(\varvec{x})\) to be the deformation fields, the minimiser f of the energy functional

$$\begin{aligned} J=\sum _{i=1}^m \int _{\mathbf{R}^n} (f(\varvec{x}+\varvec{u}_i(\varvec{x}))-f_i(\varvec{x}))^2d\varvec{x} \end{aligned}$$
(20)

with appropriate constraints derives the deformation-based average of the collection of images \(\{f_i(\varvec{x})\}_{i=1}^m\) [29, 43]. The deformation-based average was applied to the motion analysis of a volumetric beating-heart sequence.
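A sketch of the deformation-based average for fixed deformation fields \(\varvec{u}_i\): the stationary f of Eq. (20) transports each \(f_i\) through \(\varvec{x} \mapsto \varvec{x}+\varvec{u}_i(\varvec{x})\); below we approximate the inverse warp by \(-\varvec{u}_i\) and average, which is an assumption (exact only for volume-preserving deformations):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def deformation_average(images, flows):
    """Approximate minimiser of Eq. (20) for fixed fields u_i: pull each
    f_i back through the (approximately inverted) deformation and average.
    `flows` is a list of (ux, uy) component arrays, one pair per image."""
    H, W = images[0].shape
    yy, xx = np.mgrid[0:H, 0:W].astype(float)
    acc = np.zeros((H, W))
    for f_i, (ux, uy) in zip(images, flows):
        coords = [yy - uy, xx - ux]        # x - u_i(x): inverse-warp estimate
        acc += map_coordinates(f_i, coords, order=1, mode='nearest')
    return acc / len(images)
```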

The directional gradient of an image \(f(\varvec{x})\) at the point \(\varvec{x} = (x,y)^{\top }\) in the direction of \(\varvec{\omega } = (\cos \theta , \sin \theta )^{\top }\) is computed as \(\varvec{\omega }^\top \nabla f\). The directional gradient evaluates the steepness, smoothness and flatness of \(f(\varvec{x})\) along the direction of the vector \(\varvec{\omega }\). Setting F to be an injective mapping for the gradient, the gradient-based feature constructed by F satisfies \(F(\nabla f)=0\) if \(f=0\), and \(F(\nabla f)=F(\nabla g)\) if \(f=g+a\) for a constant a.

The census transform is computed by

$$\begin{aligned} s(\varvec{x})=\frac{1}{2\pi }\int _{0}^{2\pi }u(\varvec{\omega }^\top \nabla f)d\theta \end{aligned}$$
(21)

where u is the Heaviside function. The directional histogram (DH) is computed by

$$\begin{aligned} h_{\varvec{x}}(\theta ) = \frac{G_f(\theta ,\varvec{x})}{ \int _0^{2\pi }G_f(\theta ,\varvec{x})d\theta }, \, \, \, G_f(\theta ,\varvec{x})=\int _{\varOmega (\varvec{x})} \varvec{\omega }^{\top }\nabla f(\varvec{y})d\varvec{y}, \end{aligned}$$
(22)

such that \(h_{\varvec{x}}(\theta +2\pi ) = h_{\varvec{x}}(\theta )\), where \(\varvec{x}\in \mathbf{R}^2\) is the centre of the region \(\varOmega (\varvec{x})\). The vector \(\varvec{x}\) is used as the index of the DH. We call \(h_{\varvec{x}}(\theta )\) the HoG signature of f.
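A discrete sketch of the HoG signature of Eq. (22) for one ROI, given the gradient-component arrays fx and fy; the bin count N and the clamping of negative directional mass to zero are our assumptions, since Eq. (22) integrates the signed directional derivative:

```python
import numpy as np

def hog_signature(fx, fy, N=16):
    """HoG signature (cf. Eq. (22)): directional gradients integrated over
    the ROI and normalised over direction."""
    thetas = 2 * np.pi * np.arange(N) / N
    G = np.array([(np.cos(t) * fx + np.sin(t) * fy).sum() for t in thetas])
    G = np.maximum(G, 0.0)      # clamp negative mass (our assumption)
    return G / G.sum()          # normalise as in Eq. (22)
```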

The census transform encodes the local geometric properties of the gradient vector field as a scalar function. The HoG signature encodes the semi-global geometric properties of the gradient vector field as a scalar function. These encoded features are used for image matching and motion detection [30]. Our transform in Eq. (1) encodes the global geometric properties of motions on the retina as a scalar function using optical flow vector fields. Then, using this encoded motion vector field, we define a metric between a pair of motion fields for the extraction of events in video streams.

Since \(\varvec{v}=\frac{-f_t}{|\nabla f|^2}\nabla f\) is a solution of \(\frac{df}{dt}=0\), the optical flow vector is expressed as \(\varvec{u}=\frac{-f_t}{|\nabla f|^2}\nabla f+\alpha \nabla f^\perp \) for an appropriate scalar \(\alpha \), where \(\nabla f^\top (\nabla f^\perp )=0\). If the motion perpendicular to the gradient of the edges of the segments is small, that is, \(\alpha \) is small, then \(\varvec{u}\sim \mu \nabla f\) for an appropriate real number \(\mu \). This relation between the optical flow field and the gradient field implies that the events in an image stream detected by the features encoded by Eq. (1) are those caused by temporal fluctuations of the gradient of the foreground.
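A sketch of this decomposition, computing the observable normal-flow component from the image derivatives; the guard eps against division by zero is our addition:

```python
import numpy as np

def normal_flow(fx, fy, ft, eps=1e-8):
    """Normal-flow component v = -f_t * grad(f) / |grad(f)|^2; the tangential
    component alpha * grad(f)-perp is unobservable (aperture problem)."""
    g2 = fx ** 2 + fy ** 2 + eps
    return -ft * fx / g2, -ft * fy / g2
```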

In ref. [32], an on-line algorithm based on the randomised Hough transform was proposed for the detection of a polygonal curve from the time signal of a string of conversation dialogues. This algorithm is a pre-processing step for the construction of the syntactic trees of conversation dialogues. The event detection from video sequences is an extension of the syntactic analysis of dialogue signals to image sequences.

In pedestrian detection, annotated data for designing classifiers are generated using an artificially generated virtual world [31]. It is possible to extend this idea to event detection from images observed by a vehicle-mounted camera system. We generated symbol sequences from events in a virtual world. The events detected from the generated symbol strings coincide with the events detected from real-world test sequences.

7 Conclusions

We proposed a method for the symbolisation of the temporal transitions of environments using statistical analysis of the flow field. The algorithm allows us to interpret a sequence of images as a string of events.

A machine can control a car to avoid incidents by detecting abnormalities using event strings stored in a dictionary. The symbolisation of temporal optical flow fields is suitable for the generation of entries in such a dictionary.

We have introduced a framework for the syntactical interpretation of dynamic scenes using the temporal transportation of optical flow fields. Our future work is to derive the semantics of the motion fields from strings of symbols. Multiscale image analysis of dynamic scenes provides hierarchies of the motions [32] in the scenes, from temporal local deformations to global fluctuations. Therefore, these hierarchies of motions would define the syntactic structure and semantic meaning of a dynamic scene. The optical flow fields are important cues for the linguistic analysis of dynamic scenes.