1 Introduction

Autonomous mobile robots are expected to operate in dynamic environments alongside other agents such as humans, robots and pets. This is especially true for service robots that work in a home environment, as in RoboCup@Home, and for outdoor robots that need to navigate through traffic and pedestrians. Moving objects in such environments add to the complexity of tasks such as navigation, scene understanding and action monitoring; however, motion can also be an important cue for interpreting the environment and understanding the effects of actions.

While motion can be detected with infrared sensors, radar and similar hardware, vision-based motion detection is a practical and cheap solution for robots, and is a well-studied field, especially in the context of video surveillance systems. For a stationary camera, the methods used in video surveillance systems can be applied directly to detect motion. However, robots often need to detect motion while they themselves are moving (egomotion); in such cases, methods that rely on a static camera cannot be used directly because, from the robot's point of view, the entire scene appears to move.

Consider the two images in Fig. 1, which include forward motion of the camera and a falling toy. As humans, our attention is naturally drawn to the toy even when we are moving, since it provides us with a visually salient stimulus [1, p. 41]. However, from a 2D image perspective, the entire scene has changed; the red arrows indicate the direction of apparent motion in the image due to camera motion and the green arrow indicates the direction of motion of the toy. The apparent motion of the scene follows a pattern, while the motion of the toy can be seen as an anomaly in this pattern (i.e. it is contrary to the expected egomotion-induced optical flow), and hence an indicator for the robot that some action is required. Detecting this motion is a challenging task for a robot, as it involves distinguishing between global changes in the scene caused by its own motion and local changes due to the movement of other objects.

Fig. 1. Camera motion and independent motion (yellow circle) between frames. (Color figure online)

This paper tackles the problem of using 2D vision methods for detecting independent motions in the environment when the observer (a robot) is also moving through the environment. We approach this problem by combining existing methods: the Fourier-Mellin transform (FMT) [2] for compensating camera motion and temporal differencing for motion detection. The algorithm is able to run close to real-time on a robot.

The paper is structured as follows: Sect. 2 discusses related work, Sect. 3 explains the proposed approach, Sect. 4 discusses the results, along with a comparison to a feature-based method, and finally conclusions and future work are discussed in Sect. 5.

2 Related Work

The majority of the literature regarding motion detection considers scenes from a static camera, with some approaches allowing for slight camera motions; however, there has been an increase in vision-related work with camera motion due to the importance of autonomous driving and driver assistance systems [3].

For static cameras, methods such as background subtraction, temporal differencing and optical flow are well known for motion detection. However, none of these methods, used in isolation, are applicable for moving cameras since the majority of the changes in the scene would be caused by camera motion. Some methods [4,5,6] compensate for camera motion before applying differencing or optical flow to detect motion. Camera motion is typically compensated by first calculating optical flow or tracking features between frames, followed by selection of inlier vectors that best describe the dominant motion. The transformation estimated using the inlier vectors is then used to align the frames, minimizing the differences between the two frames due to camera motion.
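As an illustration (not a reproduction of any of the cited systems), the following sketch shows this compensate-then-difference pattern with OpenCV: sparse Lucas-Kanade tracking, a RANSAC fit of the dominant (camera) motion, warping of the previous frame and differencing against the current one. The function name, tracker parameters and threshold are illustrative choices.

```python
# A minimal sketch of "compensate camera motion, then difference".
import cv2
import numpy as np

def motion_mask(prev_gray, curr_gray, diff_thresh=30):
    # Sparse features in the previous frame, tracked into the current frame
    pts0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                   qualityLevel=0.01, minDistance=7)
    pts1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts0, None)
    good0 = pts0[status.ravel() == 1]
    good1 = pts1[status.ravel() == 1]

    # RANSAC keeps the inlier vectors that describe the dominant motion
    M, _ = cv2.estimateAffinePartial2D(good0, good1, method=cv2.RANSAC)
    if M is None:
        return np.zeros_like(curr_gray)

    # Align the previous frame to the current one, then difference
    h, w = curr_gray.shape
    aligned = cv2.warpAffine(prev_gray, M, (w, h))
    diff = cv2.absdiff(aligned, curr_gray)
    _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    return mask
```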

Badino et al. [6] estimate 6D egomotion using optical flow calculated from stereo cameras. The computed egomotion is used to compensate the flow vectors such that only moving objects have large vectors. Kim et al. [7] use the Lucas-Kanade tracker both for camera motion compensation and for continuously updating a background model, and subsequently use background subtraction for motion detection. Meier et al. [8] use inertial-optical flow for egomotion and object motion estimation. Tracked features are classified as outliers and inliers based on whether they agree with the global motion seen in the image. The authors use the outliers specifically to detect and characterize the motion of independently moving objects. Kumar et al. [9] use machine learning to predict optical flow statistics (mean and covariance) given the head and eye velocities of the iCub robot. Moving objects are detected by comparing their flow vectors to the learned statistics.

Convolutional Neural Networks (CNNs) have also been used for compensating egomotion: Tokmakov et al. [10] train a network that uses ground truth optical flow to detect independently moving objects from a moving camera. Rezazadegan et al. [11] perform action recognition on moving cameras by first centering the person of interest. Agrawal et al. [12] compute the transformation between sequential images due to camera motion by using a Top-CNN which receives input from two identical Base-CNNs (one for each image).

A common theme in the related work is the use of optical flow or feature tracking methods, which rely on detecting good features. Apart from the work by Kumar et al. [9], there are no published results of extensive evaluation in different scenes or with varying parameters. Additionally, the hardware, frame rate and image size are often not reported, making it difficult to determine whether the methods can be run on a mobile robot.

Adjacent fields, such as visual odometry and structure-from-motion, essentially try to solve a similar problem: that of estimating camera motion. Most approaches in these fields also rely on detecting and tracking good features, but other methods have also been explored. For example, FMT has been used for visual odometry [13, 14] and found to be as good as or better than optical flow methods. The advantage of Fourier-based methods is that they do not depend on finding good features and are robust to some lighting changes and frequency-dependent noise. The FMT approach estimates a similarity transform, a simplification of the affine and perspective transforms, which also makes it a better candidate for running on a robot with limited resources. We assume, as in other related work, that moving objects constitute less than half the image; i.e. the apparent motion of the scene due to camera motion is the dominant motion.

3 Approach

Since FMT has already been used successfully in visual odometry, we use it to compensate for camera motion. Additionally, in order to avoid computing features or optical flow, we use temporal differencing to detect independent motions. Our approach is a vision pipeline with two stages: (1) FMT is used to compute the transform between consecutive frames, which is then used to align them, hence compensating camera motion. (2) Temporal differencing is performed on the aligned frames to detect moving objects.

I. Image Registration Using FMT: Image registration is the process of transforming an image to geometrically align it with a reference image. FMT was first used for image registration by Reddy and Chatterji [2]. It is an extension of the phase correlation method and can simultaneously retrieve the rotation, scale and translation between images; i.e. it estimates the similarity transform for a given pair of images. The steps of the method can be seen in Fig. 2.

Fig. 2. Top: FMT image registration pipeline. Bottom: Phase correlation block which is used twice in the registration pipeline.

FMT-based image registration itself is a two-step process; the first step estimates the rotation and scale and the second estimates the translation. In both steps, phase correlation is used to find the translation between two images. In the first step, the inputs to the phase correlation block are the log-polar transforms of the magnitudes of the discrete Fourier transforms of the grayscale images (\(im0\) and \(im1\)) to be registered. The estimated shift \((x,y)\) in the log-polar images is converted into a rotation \(\theta \) and scale \(s\), which are used to transform \(im0\).

This transformed image \(im2\) and the reference image \(im1\) are the inputs to the second phase correlation step. The estimated translation from the second phase correlation step is used to transform \(im2\) again. This resultant image \(im3\) is now registered with the reference image \(im1\).
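A minimal sketch of this two-step registration using OpenCV and NumPy is given below; it is not the implementation used in this work. The helper names are ours, and the signs of the recovered angle and scale depend on the axis conventions of warpPolar, phaseCorrelate and getRotationMatrix2D, so they should be verified on a synthetically rotated and scaled test image.

```python
# Sketch of FMT-based registration: im0 is registered to the reference im1.
import cv2
import numpy as np

def log_polar_magnitude(gray, shape):
    # Centred Fourier magnitude spectrum, then log-polar resampling
    mag = np.abs(np.fft.fftshift(np.fft.fft2(gray))).astype(np.float32)
    center = (shape[1] / 2.0, shape[0] / 2.0)
    max_radius = min(center)
    lp = cv2.warpPolar(mag, (shape[1], shape[0]), center, max_radius,
                       cv2.INTER_LINEAR + cv2.WARP_POLAR_LOG)
    return lp, max_radius

def register_fmt(im0, im1):
    h, w = im1.shape
    lp0, max_r = log_polar_magnitude(im0, im1.shape)
    lp1, _ = log_polar_magnitude(im1, im1.shape)

    # Step 1: rotation and scale from the shift between the log-polar spectra
    (dx, dy), _ = cv2.phaseCorrelate(lp0, lp1)
    angle = 360.0 * dy / h                   # rows correspond to angle
    scale = np.exp(dx * np.log(max_r) / w)   # columns correspond to log radius
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, scale)
    im2 = cv2.warpAffine(im0, M, (w, h))

    # Step 2: translation between the rotation/scale-corrected image and im1
    (tx, ty), _ = cv2.phaseCorrelate(np.float32(im2), np.float32(im1))
    T = np.float32([[1, 0, tx], [0, 1, ty]])
    im3 = cv2.warpAffine(im2, T, (w, h))
    return im3
```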

Fourier-Mellin Transform: The FMT of a function \(f(r, \theta )\) is given by [13]:

$$\begin{aligned} M_f(\varvec{u}, v) = \frac{1}{2\pi }\int _{0}^{\infty } \int _{0}^{2\pi } f(r, \theta )\varvec{r^{-ju}}e^{-jv\theta }d\theta \varvec{\frac{dr}{r}} \end{aligned}$$
(1)

where the elements in bold are the Mellin transform parameters and the remaining ones are the Fourier transform parameters.

By substituting \(r = e^{\rho }\), the FMT can be expressed as just a Fourier transformation [13]:

$$\begin{aligned} M_f(u, v) = \frac{1}{2\pi }\int _{-\infty }^{\infty } \int _{0}^{2\pi } f(e^\rho , \theta )e^{-ju\rho }e^{-jv\theta }d\theta d\rho \end{aligned}$$
(2)

Log-Polar Transform: In practice, the variable substitution is realized using the log-polar transform. The log-polar transform is performed by remapping points from the 2D Cartesian coordinate system \((x,y)\) to the 2D log-polar coordinate system \((\rho ,\theta )\) [15]:

$$\begin{aligned} \begin{aligned}&\rho = \log (\sqrt{(x - x_c)^2 + (y - y_c)^2}) \\&\theta = {{\mathrm{atan2}}}(y - y_c, x - x_c) \end{aligned} \end{aligned}$$
(3)

where \(\rho \) is the logarithm of the distance of a given point, \((x,y)\), from the centre, \((x_c,y_c)\), and \(\theta \) is the angle of the line through the point and the centre. This transform converts rotation and scaling in the Cartesian coordinate system to translations in the log-polar coordinate system.
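To make this explicit: if \(f_2\) is \(f_1\) rotated by \(\theta _0\) and scaled by \(s\) about the centre, then in polar coordinates \(f_2(r, \theta ) = f_1(r/s, \theta - \theta _0)\), and after the substitution \(\rho = \log r\),

$$\begin{aligned} f_2(\rho , \theta ) = f_1(\rho - \log s, \theta - \theta _0) \end{aligned}$$

i.e. the rotation and scale become a pure translation \((\log s, \theta _0)\) in \((\rho , \theta )\), which phase correlation can recover.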

Phase Correlation: Phase correlation, introduced by Kuglin et al. [16], is a global method that can retrieve the translation between two images. The method is based on the Fourier shift theorem, which states that shifting a signal by \(\tau \) in the time/space domain multiplies its Fourier transform by \(e^{-j\omega \tau }\). The phase difference can be calculated using the normalized Cross Power Spectrum (CPS) given in Eq. 4 [2],

$$\begin{aligned} \begin{aligned} f_2(x,y) = f_1(x - t_x, y - t_y) \\ F_2(\xi , \eta ) = e^{-j2\pi (\xi t_x + \eta t_y)}F_1(\xi , \eta ) \\ CPS = e^{-j2\pi (\xi t_x + \eta t_y)} = \frac{F_2(\xi , \eta )F_1^*(\xi , \eta )}{|F_1(\xi , \eta )F_2(\xi , \eta )|} \end{aligned} \end{aligned}$$
(4)

where \(F_1\) and \(F_2\) are the Fourier transforms of \(f_1\) and \(f_2\), and \(\xi \) and \(\eta \) are the spatial frequencies in \(x\) and \(y\).

The term \(e^{-j2\pi (\xi t_x + \eta t_y)}\) is equivalent to the Fourier transform of a shifted Dirac delta function; hence, if we take the inverse Fourier transform of CPS, the result is a signal with a peak at \((t_x, t_y)\). By finding the location of the peak we retrieve the translation between the two images.
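A direct NumPy sketch of this procedure is given below: it forms the normalized cross power spectrum of Eq. 4, takes its inverse transform and reads the translation off the peak location. The function name is ours, and sub-pixel refinement is omitted.

```python
# Integer-pixel phase correlation between two equally sized grayscale images.
import numpy as np

def phase_correlation(f1, f2):
    F1 = np.fft.fft2(f1)
    F2 = np.fft.fft2(f2)
    # Normalized cross power spectrum (Eq. 4); epsilon avoids division by zero
    cps = (F2 * np.conj(F1)) / (np.abs(F1 * np.conj(F2)) + 1e-12)
    corr = np.abs(np.fft.ifft2(cps))
    ty, tx = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks past the midpoint correspond to negative shifts (FFT wrap-around)
    if tx > f1.shape[1] // 2:
        tx -= f1.shape[1]
    if ty > f1.shape[0] // 2:
        ty -= f1.shape[0]
    return tx, ty
```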

Rotation Ambiguity: Since the Fourier magnitude plots are conjugate symmetric, there is an ambiguity in the recovered rotation. If the calculated rotation angle is \(\theta \), the actual rotation of the image could be \(\theta \) or \(\theta + \pi \). For this application, we assume that consecutive frames are never rotated by more than \(\pi \) radians and hence do not perform an extra step to resolve the ambiguity.

High-pass Filtering: During the phase-correlation step, apart from the peak at the actual rotation angle, there are additional peaks at multiples of 90\(^{\circ }\). Sometimes the peak at 0\(^{\circ }\) is higher than the peak at the required rotation angle. Both [17] and [2] suggest applying a high-pass filter in the Fourier domain to prevent the false peak at 0\(^{\circ }\).
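As an illustration, a sketch of a high-pass emphasis filter of the form proposed in [2] is given below, applied to the centred magnitude spectrum before the log-polar step; the exact filter shape is an assumption here, and any filter that attenuates low frequencies similarly can be used.

```python
# High-pass emphasis filter H = (1 - X)(2 - X), X = cos(pi*xi)*cos(pi*eta),
# evaluated over normalized frequencies in [-0.5, 0.5).
import numpy as np

def highpass_filter(shape):
    eta = np.fft.fftshift(np.fft.fftfreq(shape[0]))[:, None]
    xi = np.fft.fftshift(np.fft.fftfreq(shape[1]))[None, :]
    x = np.cos(np.pi * xi) * np.cos(np.pi * eta)
    return (1.0 - x) * (2.0 - x)

# Usage: mag = np.abs(np.fft.fftshift(np.fft.fft2(gray))) * highpass_filter(gray.shape)
```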

II. Motion Detection Using Temporal Differencing: We use temporal differencing for motion detection, and additional operations, such as thresholding, edge masking and clustering of contours, are used to eventually output a set of bounding boxes representing independently moving objects in the scene (see Fig. 3). Temporal differencing is performed by taking the absolute difference of the pixel intensities between the registered frame \(im3\) and the current frame \(im1\). A binary threshold is applied on the difference image with intensities above the threshold being set to 255 (white) and those below being set to 0 (black), hence selecting pixels where a large difference is seen.
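A minimal sketch of the differencing and thresholding step using OpenCV is shown below; the threshold value and function name are illustrative.

```python
# Absolute difference between the registered previous frame (im3) and the
# current frame (im1), followed by a binary threshold.
import cv2

def difference_mask(im3, im1, thresh=40):
    diff = cv2.absdiff(im3, im1)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask   # white (255) where a large difference remains
```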

Fig. 3. Motion detection pipeline

Edge Mask: Edges are regions in the image where there are discontinuities in the pixel intensities. If the images are imprecisely registered, the edges do not overlap exactly, which results in large values in the difference image that are likely to survive thresholding (such as the stripes of the sofa in Fig. 4b). In order to remove these false detections, we construct an edge mask by applying Canny edge detection to the thresholded image, fitting contours to the detected edges, and then fitting oriented rectangles to the contours; a rectangle is classified as an edge if its aspect ratio is very high or very low. The rectangles classified as edges are masked out, as seen in Fig. 4c. It is worth noting that this process also degrades the detection of the moving stuffed toy; this is discussed further in the evaluation.
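A sketch of this edge-mask heuristic is given below (assuming OpenCV 4); the aspect-ratio limit, Canny thresholds and function name are illustrative choices rather than the values used in our implementation.

```python
# Remove long, thin blobs from the thresholded mask; they are most likely
# misregistered edges rather than independently moving objects.
import cv2
import numpy as np

def remove_edge_artifacts(mask, max_aspect=8.0):
    edges = cv2.Canny(mask, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cleaned = mask.copy()
    for c in contours:
        (cx, cy), (w, h), angle = cv2.minAreaRect(c)
        if min(w, h) == 0:
            continue
        if max(w, h) / min(w, h) > max_aspect:     # very elongated -> edge
            box = cv2.boxPoints(((cx, cy), (w, h), angle)).astype(np.int32)
            cv2.fillPoly(cleaned, [box], 0)        # mask the region out
    return cleaned
```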

Fig. 4. Edge mask on a scene with egomotion and independent motion

Clustering: In order to separate the resulting thresholded image into a set of regions, we cluster the white pixels based on their distance from each other. This is done by first finding contours and then applying Euclidean clustering to the contour points.

Small clusters are discarded and a bounding box is fitted to each remaining cluster. A bounding box is ignored if the ratio of white to black pixels within it is low. The intermediate outputs of the motion detection pipeline can be seen in Fig. 5.
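The clustering step can be sketched as follows, using DBSCAN from scikit-learn as a stand-in for Euclidean clustering (and OpenCV 4 for contours); the distance threshold, minimum cluster size and function name are illustrative, and the white-to-black-ratio filter is omitted for brevity.

```python
# Group white contour points by spatial proximity and fit bounding boxes.
import cv2
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_bounding_boxes(mask, eps=15.0, min_samples=5, min_cluster_size=30):
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return []
    points = np.vstack([c.reshape(-1, 2) for c in contours])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)

    boxes = []
    for label in set(labels) - {-1}:           # -1 marks noise points
        cluster = points[labels == label]
        if len(cluster) < min_cluster_size:    # discard small clusters
            continue
        x, y, w, h = cv2.boundingRect(cluster)
        boxes.append((x, y, w, h))
    return boxes
```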

Fig. 5. The main steps of the motion detection pipeline: Top: previous, current and registered frame. Bottom: thresholded image, edge-masked image, bounding box of clustered contour points

Runtime: An open-source Python implementation of the FMT image registration method was found to be too slow for our application (1.6 Hz). Our C++ port is able to process 320\(\,\times \,\)240 images at 14.5 Hz on an Intel Core i3, 1.7 GHz processor, while the overall motion detection pipeline runs at 11 Hz.

4 Evaluation

For evaluation, we collected fifteen image sequences and annotated them with ground truth (GT) bounding boxes of moving objects. A Care-O-bot-3 with a head-mounted ASUS Xtion Pro 3D camera was used for recording the sequences. All sequences involve robot egomotion (linear: 0–0.3 m/s, angular: 0–0.3 rad/s), and the set of moving objects in the scene includes humans, doors and a stuffed toy. The experiments are run on the recorded sequences, but the algorithm can also run on the robot itself via a ROS (Robot Operating System) wrapper.

In [18], the authors discuss evaluation methods and metrics for motion detection algorithms, suggesting the following for object-based metrics:

  • True Positive (TP): “A detected foreground blob which overlaps a GT bounding box, where the area of overlap is greater than a proportion \(\varOmega _b\) of the blob area and greater than a proportion \(\varOmega _g\) of the GT box area.”

  • False Negative (FN): “A GT bounding box not overlapped by any detected object.”

  • False Positive (FP): “A detected foreground object which does not overlap a GT bounding box.”

We use these definitions and choose \(\varOmega _b = \varOmega _g = 0.5\). We regard a single GT object that is overlapped by multiple detected bounding boxes as a true positive (given that the total overlap proportion criterion holds). Since the definitions have been modified in this way, \(TP + FN\) does not necessarily equal the number of GT objects, as it usually would. A TP in this case is an object that has been reliably detected, whereas an FN is an object that has not been detected at all. This gives us two ways of interpreting the results: one which considers objects reliably detected (true positive rate, TPR) and one which considers objects not detected at all (false negative rate, FNR). These two metrics and the false discovery rate (FDR) are defined as:

$$\begin{aligned} TPR = \frac{N_{TP}}{N_{gt}}, \quad FNR = \frac{N_{FN}}{N_{gt}}, \quad FDR = \frac{N_{FP}}{N_{d}} \end{aligned}$$
(5)

where \(N_{TP}\), \(N_{FN}\), \(N_{FP}\), \(N_{gt}\) and \(N_{d}\) are the number of true positives, false negatives, false positives, ground truth objects and detected objects.
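A sketch of how these object-level metrics can be computed from GT and detected boxes is given below; it treats a GT object as detected if at least one detection satisfies the overlap criterion (i.e. it does not aggregate multiple overlapping detections into a single total overlap), and the function names are ours.

```python
# Boxes are (x, y, w, h) tuples in pixels.
def overlap_area(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(w, 0) * max(h, 0)

def evaluate(gt_boxes, det_boxes, omega=0.5):
    def matched(g, d):
        o = overlap_area(g, d)
        # Overlap must exceed omega of both the detection and the GT area
        return o > omega * d[2] * d[3] and o > omega * g[2] * g[3]

    tp = sum(any(matched(g, d) for d in det_boxes) for g in gt_boxes)
    fn = sum(all(overlap_area(g, d) == 0 for d in det_boxes) for g in gt_boxes)
    fp = sum(all(overlap_area(g, d) == 0 for g in gt_boxes) for d in det_boxes)
    tpr = tp / len(gt_boxes) if gt_boxes else 0.0
    fnr = fn / len(gt_boxes) if gt_boxes else 0.0
    fdr = fp / len(det_boxes) if det_boxes else 0.0
    return tpr, fnr, fdr
```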

4.1 Experiments

Six sequences are chosen for the first two experiments since they cover both translational and rotational robot motion, and different types of object motions.

Experiment 1: Frame Rate. For this experiment, the frame rate of the camera is kept constant at 30 Hz, but, by skipping frames, the effective frame rate of the algorithm is altered. This results in larger differences between frames (due to motion) at slower frame rates without additional effects like motion blur. As seen in Fig. 6, the frame rate has a large impact on the performance, most significantly when halving the frame rate to 15 Hz. Although in most cases 6 Hz results in the best TPR, the FDR also rises significantly at this frame rate. The object-level annotation is also a cause of the poor performance at higher frame rates: at 30 Hz, the small motions of parts of objects are detected instead of the motion of the entire object, which is observable at lower frame rates. Overall, this result suggests that the frame rate of the algorithm should be dynamically altered based on the speed of the robot and expected speed of the objects.

Fig. 6. Impact of altering time between consecutive frames

Experiment 2: Edge Mask. In general, applying the edge mask to the thresholded image reduces both the TPR and FDR, as seen in Fig. 7. Depending on the application, reducing the false detection rate to nearly zero at the expense of a decrease in true positive rate can be an acceptable compromise.

Fig. 7. Impact of applying an edge mask on the difference image

Comparison with a Feature-Based Approach. Here, we compare the results of the FMT method to a feature-based method. The monocular camera code from LIBVISO2 [19] is used for feature extraction and matching, and the inlier vectors are used to estimate an affine transformation between consecutive frames. This affine transform is used to register the frames, after which the same temporal differencing pipeline is used for motion detection. For both methods, we use the same parameters for temporal differencing, use frames of size 320\(\,\times \,\)240, and skip every two frames (10 Hz effective frame rate). The experiments are run on a PC with an Intel Core i3, 1.7 GHz processor and 4 GB of RAM. The results of the comparison are shown in Table 1.

Table 1. Detection rates for the FMT and feature-based methods

The TPR and FNR are comparable for the two methods; the sequences in which the motions are far away or short and quick tend to have low TPR and high FNR in both cases. The reason for the higher FDR of the feature-based method is evident from Fig. 8, which shows the \(x\) and \(y\) translations for sequence 10. The sequence consists of the robot rotating to the left at a constant speed (0.3 rad/s), with some people moving in the scene. The FMT method shows a relatively constant translation in \(x\) and no translation in \(y\), whereas the feature-based method is considerably noisier, with some large spikes; these spikes most likely account for the increased FDR. Similar behaviour was seen in the other sequences. The FMT method runs at about 11 Hz, while the feature-based method runs at 12 Hz, with comparable CPU and memory usage. This suggests that the FMT method is more robust while using roughly the same amount of resources.

Fig. 8. Pixel translations for sequence 10 along the x-axis (left) and y-axis (right) for the FMT and feature-based methods.

5 Conclusions and Future Work

In this paper, we combined existing approaches for independent motion detection from a moving camera: Fourier-Mellin-based image registration for egomotion compensation and temporal differencing for motion detection. Unlike other methods, FMT does not rely on the detection of good features, which is one of its advantages. For the set of sequences evaluated, a frame rate of 10–15 Hz was found to be ideal for the detection rate; at 30 Hz, motions between frames are sometimes too small to be detected. The algorithm processes frames at 11 Hz, within this ideal range, allowing it to run close to real-time on a robot. In comparison to a feature-based method, it performs better in terms of the robustness of the registration and the false discovery rate. A more systematic evaluation is required to determine the limits on object and camera motion speed, the depth variance of the scene and the depth of the moving objects. Dynamically varying the frame rate based on the speed of the robot, and applying the motion detection output to a robot task such as safe navigation or turning towards a waving person, are left for future work.