
1 Introduction

Nowadays, robots and small-scale UAVs are used in several fields, such as SAR [1, 2], environment monitoring and inspection [3,4,5,6], and precision agriculture [7, 8], due to their low cost and ease of deployment. These devices are equipped with different sensors, such as gyroscopes, accelerometers, compasses, and GPS, which make it possible to know the state of the UAV (e.g., breakdown occurrences, travel speed, etc.) with very high precision. During the execution of a task, it may happen that the UAV loses the connection with the satellites, thus making it impossible to retrieve data such as the flight height, or to use the GPS coordinates to reconstruct the overflown route. In robotics and computer vision, the most common approaches to determine both the position and the orientation of a robot are Simultaneous Localization and Mapping (SLAM) [9] and Visual Odometry (VO) [10,11,12]. The simplest sensor that can be used to perform SLAM and VO is the RGB camera. Works using an RGB camera fall into two main categories: methods based on monocular cameras [13,14,15,16,17,18] and methods based on stereo cameras [19,20,21]. The major difference between the two is that a stereo camera allows the distance from objects within the scene to be perceived, as if the image had a third dimension. Human vision is stereoscopic by nature due to the binocular view and to our brain, which is capable of synthesizing an image with stereoscopic depth. It is important to notice that a stereoscopic camera must have at least two sensors to produce a stereoscopic image or video, while the monoscopic camera setup is typically composed of a single camera with a \(360^\circ \) mirror. Whenever the scene-to-camera distance is much larger than the stereo baseline, stereo VO degrades to the monocular case and stereo vision becomes ineffectual [22].

The authors in [11] present the first real-time large-scale VO system with a monocular camera, based on a feature tracking approach and random sample consensus (RANSAC) for outlier rejection; each new camera pose is computed through 3D-to-2D camera-pose estimation. The work in [16] leverages a monocular VO algorithm that tracks feature points on the world ground plane surrounding the vehicle, rather than applying a traditional tracking approach on the perspective camera image coordinates. Two real-time methods for simultaneous localization and mapping with a freely-moving monocular camera are proposed by the LSD-SLAM [17] and ORB-SLAM [18] algorithms. In [13], FAST corners and optical flow are used to perform motion estimation and, subsequently, a mapping thread is executed through a depth filter formalized as a Bayesian estimation problem. Other works, such as [14, 15], propose robust frameworks that make direct use of the pixel intensities, without requiring a feature extraction step.

In this paper, a feature-based SLAM algorithm for small-scale UAVs with a nadir view is proposed. In detail, a first calibration step is performed to derive the relation between the pixels/meters ratio and the flight height. Then, during the flight, keypoints extracted from the video stream are exploited to detect flight height changes, while the centers of mass of the frames are used for route estimation. Exhaustive experiments performed on the recently released UMCD dataset highlight the robustness and reliability of the proposed approach. To the best of our knowledge, there are no works in the literature that estimate both the trajectory and the flight height of a UAV. Hence, no comparisons with other SLAM algorithms are provided, and the obtained results are meant to serve as a baseline for future works.

The remainder of the paper is structured as follows. In Sect. 2, the proposed method is described in detail. In Sect. 3, the performed experiments and the obtained results are discussed. Finally, Sect. 4 concludes the paper.

2 Proposed Method

The main idea of the proposed method is to exploit keypoint matching between two consecutive video frames received from the UAV in order to determine the flight height, while the centers of mass of the frames are used to determine the overflown route. A necessary condition for the correctness of the algorithm is that features must always be matched between consecutive frames; otherwise, it is not possible to estimate the correct flight height.

2.1 System Calibration

In order to find the relation between the spatial resolution of the RGB sensor and the flight height, a calibration step is required. To perform this calibration, a marker of known dimensions (e.g., \(1 \times 1\) m) is placed on the ground and then acquired at a known height (e.g., 10 m) through the UAV sensors. During this process, the GPS sensor of the UAV is used to know the exact height. Markers have been chosen for this step due to their robustness and ease of recognition within the observed environment [23, 24]. With this procedure, it is possible to compute the pixels/meters ratio needed to initialize the system. In Fig. 1, the marker detected during the calibration step is depicted.

Fig. 1. System calibration step example. By knowing both the UAV height and the marker size, it is possible to estimate the pixels/meters ratio which is a requirement for the algorithm.

In case information about the camera focal length f is not available, this calibration step also allows its estimation. Let h be the height of the UAV during the calibration step, w the real size of the marker in meters, and p its size in pixels as observed at height h. Then, f can be computed as follows:

$$\begin{aligned} f = \frac{(h \times p)}{w} \end{aligned}$$
(1)

By knowing how to compute the focal length, the height estimation step can be performed with any kind of sensor, as shown in the next Section.
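As an illustration, a minimal calibration sketch is reported below. The numeric values, the function name, and the use of Python are assumptions made for this example; the method only requires the pixels/meters ratio and the focal length f of Eq. 1.

```python
# Minimal calibration sketch (Sect. 2.1). Values and names are illustrative.

def calibrate(marker_size_m: float, marker_size_px: float, height_m: float):
    """Return the pixels/meters ratio at height_m and the focal length f (Eq. 1)."""
    px_per_meter = marker_size_px / marker_size_m       # spatial resolution at height_m
    f = (height_m * marker_size_px) / marker_size_m     # Eq. 1: f = (h * p) / w
    return px_per_meter, f

# Example: a 1 x 1 m marker appearing 50 px wide in a frame acquired at 10 m
px_per_meter, f = calibrate(marker_size_m=1.0, marker_size_px=50.0, height_m=10.0)
```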

2.2 Flight Height Estimation

For the flight height estimation, keypoints extracted from the video stream frames are used. The A-KAZE [25, 26] feature extractor is adopted due to its performance, allowing for a faster feature extraction with respect to SIFT, SURF, and ORB [27]. In more detail, features are extracted and matched between two consecutive frames \(f_{t-1}\) and \(f_{t}\), creating two sets of keypoints \(K_{t-1}\), \(K_t\) and a set of matches \(\varTheta _t\) (a minimal extraction-and-matching sketch is given after the following list). Subsequently, the affine transformation matrix is computed from these matches and used to determine the scale changes between keypoints. More precisely, an increasing scale corresponds to a zoom-in operation, meaning that the UAV is lowering its flight height; conversely, a decreasing scale corresponds to a zoom-out operation, meaning that the UAV is increasing its flight height. In the flight height estimation, two goals are pursued:

  • To filter the identified matches and exclude keypoints belonging to the foreground component (i.e., dynamic elements within the scene) during the drone movement estimation, so that moving objects do not negatively influence the height estimation;

  • To estimate all altitude variations.
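As referenced above, the following sketch illustrates a possible A-KAZE extraction-and-matching step between two consecutive frames. The OpenCV API, the Hamming-distance brute-force matcher, and the ratio test are implementation assumptions; the paper only prescribes that A-KAZE features are extracted and matched between \(f_{t-1}\) and \(f_t\) to form \(\varTheta _t\).

```python
import cv2

def extract_and_match(frame_prev, frame_curr, ratio=0.75):
    """Build K_{t-1}, K_t and the match set Theta_t between two consecutive frames."""
    akaze = cv2.AKAZE_create()
    kp_prev, des_prev = akaze.detectAndCompute(frame_prev, None)
    kp_curr, des_curr = akaze.detectAndCompute(frame_curr, None)

    # Brute-force matching on binary descriptors, with a ratio test to keep
    # only distinctive correspondences (an assumed filtering choice).
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    matches = []
    for pair in matcher.knnMatch(des_prev, des_curr, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            matches.append(pair[0])
    return kp_prev, kp_curr, matches
```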

The first goal is achieved using the set of matches \(\varTheta _t\) and the homography matrix \(H_t\) that maps the coordinates of a keypoint \(k \in K_{t-1}\) into the coordinates of a keypoint \(\hat{k} \in K_t\). The matrix \(H_t\) is computed by applying the RANSAC algorithm [28] to the matches contained in \(\varTheta _t\). The re-projection error in \(H_t\) can be minimized through the Levenberg-Marquardt optimization [29]. To find the keypoints belonging to the moving objects present in the scene, the following check is performed for each match \((k,\hat{k}) \in \varTheta _t\):

$$\begin{aligned} \gamma = \left\{ \begin{array}{ll} 1 &{} \text {if } \sqrt{(k- \hat{k})^2} - \sqrt{(k-(Hk))^2} \ge \rho \\ 0 &{} \text {otherwise}.\end{array} \right. \end{aligned}$$
(2)

where \(\rho \) is a tolerance applied to the difference between the displacement observed in the match and the displacement predicted by the homography. If \(\rho \) is set to a low value, the check becomes very strict and many background keypoints are classified as moving, producing false positives in the background keypoint estimation. Conversely, if \(\rho \) is set to a high value, the estimation of the keypoint movements is less restrictive, but a large number of false negatives can occur. According to [30], the value of \(\rho \) has been fixed to 2.0. In more detail, if \(\gamma =0\), then the keypoint \(\hat{k}\) is a background keypoint, otherwise \(\hat{k}\) is a foreground keypoint. Finally, all background keypoint matches are used to compose a new filtered set of matches called \(\hat{\varTheta }_t\).
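A possible implementation of this filtering step is sketched below. The use of cv2.findHomography with RANSAC and the variable names are assumptions; the check itself follows Eq. 2 with \(\rho = 2.0\).

```python
import cv2
import numpy as np

def filter_background_matches(kp_prev, kp_curr, matches, rho=2.0):
    """Split Theta_t into background matches (Theta_hat_t) using Eq. 2."""
    src = np.float32([kp_prev[m.queryIdx].pt for m in matches])  # k in K_{t-1}
    dst = np.float32([kp_curr[m.trainIdx].pt for m in matches])  # k_hat in K_t

    H, _ = cv2.findHomography(src, dst, cv2.RANSAC)              # H_t
    if H is None:
        return [], None
    proj = cv2.perspectiveTransform(src.reshape(-1, 1, 2), H).reshape(-1, 2)  # H*k

    background = []
    for m, k, k_hat, hk in zip(matches, src, dst, proj):
        # gamma = 1 (foreground) when the observed displacement deviates from
        # the homography-predicted one by at least rho; otherwise background.
        if np.linalg.norm(k - k_hat) - np.linalg.norm(k - hk) < rho:
            background.append(m)
    return background, H
```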

In order to achieve the second goal, an affine transformation matrix A is computed. Given three pairs of matches \((k_a, k_b)\), \((k_c, k_d)\), and \((k_e, k_f)\) \(\in \hat{\varTheta }_t\), with \(k_a\), \(k_c\), \(k_e\) \(\in K_{t-1}\) and \(k_b\), \(k_d\), \(k_f\) \(\in K_{t}\), the matrix A can be calculated as follows:

$$\begin{aligned} A = \begin{bmatrix} \lambda _x &{} 0 &{} \tau _x \\ 0 &{} \lambda _y &{} \tau _y \\ 0 &{} 0 &{} 1 \end{bmatrix} = \begin{bmatrix} x_{k_b} &{} x_{k_d} &{} x_{k_f} \\ y_{k_b} &{} y_{k_d} &{} y_{k_f} \\ 1 &{} 1 &{} 1 \end{bmatrix} \begin{bmatrix} x_{k_a} &{} x_{k_c} &{} x_{k_e} \\ y_{k_a} &{} y_{k_c} &{} y_{k_e} \\ 1 &{} 1 &{} 1 \end{bmatrix}^{-1} \end{aligned}$$
(3)

The translations on the x and y axes are indicated by \(\tau _x\) and \(\tau _y\), respectively. Drone altitude variations are estimated using \(\lambda _x\) and \(\lambda _y\), which represent the scale variations on the x and y axes. Once the \(\lambda _x\), \(\lambda _y\) values are computed, they can be multiplied by the original pixels/meters ratio to determine the UAV flight height variation. Notice that altitude changes cause zoom-in (or zoom-out) operations in the frames acquired by the drone and, in those cases, \(\lambda _x = \lambda _y\). Also recall that, in order to know the altitude variation, there must always be a match between two consecutive frames, so that it is possible to estimate the transformation matrix and the \(\lambda _x\), \(\lambda _y\) values. Otherwise, it is unfeasible to correctly estimate the variation.
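A sketch of this scale-extraction step is given below. Instead of inverting Eq. 3 on three exact matches, it fits an affine transform robustly over all matches in \(\hat{\varTheta }_t\) with cv2.estimateAffine2D (whose default robust method is RANSAC); this substitution, as well as reading the scales from the column norms, is an implementation assumption.

```python
import cv2
import numpy as np

def estimate_scale_and_translation(kp_prev, kp_curr, background_matches):
    """Estimate lambda_x, lambda_y, tau_x, tau_y from the filtered matches."""
    src = np.float32([kp_prev[m.queryIdx].pt for m in background_matches])
    dst = np.float32([kp_curr[m.trainIdx].pt for m in background_matches])

    A, _ = cv2.estimateAffine2D(src, dst)   # 2x3 affine matrix, RANSAC by default
    if A is None:
        return None                         # no reliable match: variation not estimable

    lam_x = float(np.linalg.norm(A[:, 0]))  # scale on the x axis (lambda_x)
    lam_y = float(np.linalg.norm(A[:, 1]))  # scale on the y axis (lambda_y)
    tau_x, tau_y = A[:, 2]                  # translations (tau_x, tau_y)
    return lam_x, lam_y, tau_x, tau_y
```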

Fig. 2. Example of route estimated with the proposed method. In 2(a), the mosaic of the overflown area is shown, while in 2(b) the route estimated through the frames' centers of mass is depicted.

By using Eq. 1, it is possible to estimate the flight height \(h'\) through the triangle similarity:

$$\begin{aligned} w' = \lambda \times w \end{aligned}$$
(4)
$$\begin{aligned} h' = \frac{w' \times f}{p} \end{aligned}$$
(5)

where \(\lambda \) can be either \(\lambda _x\) or \(\lambda _y\).
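For completeness, a direct transcription of Eqs. 4 and 5 is given below; the choice of \(\lambda = \lambda _x\) in the usage example and the variable names are assumptions.

```python
def update_height(lam, w, p, f):
    """Flight height after a scale change lam, via triangle similarity (Eqs. 4-5)."""
    w_prime = lam * w             # Eq. 4
    h_prime = (w_prime * f) / p   # Eq. 5
    return h_prime

# Example with the calibration values of Sect. 2.1 and lambda = lambda_x
# h_new = update_height(lam_x, w=1.0, p=50.0, f=f)
```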

2.3 Route Estimation

Concerning the UAV route estimation, the centers of mass of the received video stream frames are used. In order to know where each new center of mass must be positioned with respect to the others, a reference coordinate system is needed. In the proposed method, we use the mosaic of the area overflown by the UAV as the reference for the centers of mass. By following the steps described in [31], the mosaic is built incrementally and in real time as follows (a compact sketch is provided after the list):

  1. Frame Correction: In this step, the radial and tangential distortions are removed (if needed) from the received frame. To perform this step, a matrix containing the calibration values of the camera is required, which is computed using well-known methods [32];

  2. Feature Extraction and Matching: In this step, keypoints are extracted both from the current video frame and from the partial mosaic built up to the previous algorithm iteration. Then, the features are matched together and a similarity transformation matrix is generated;

  3. Frame Transformation: The similarity transformation matrix generated at the previous step is used to scale, rotate, and translate the received frame in order to align it with the partial mosaic;

  4. Stitching: The last step consists in seamlessly merging the frame and the partial mosaic, using well-known techniques such as multiband blending [33].
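A compact, single-iteration sketch of steps 1-4 is reported below. The OpenCV calls, the similarity model fitted with cv2.estimateAffinePartial2D, and the naive overwrite used in place of multiband blending are all assumptions; the full algorithm is the one described in [31].

```python
import cv2
import numpy as np

def add_frame_to_mosaic(mosaic, frame, camera_matrix, dist_coeffs):
    """One mosaicking iteration: correct, match, transform, and stitch a frame."""
    # 1. Frame correction: remove lens distortion (if calibration data is available)
    frame = cv2.undistort(frame, camera_matrix, dist_coeffs)

    # 2. Feature extraction and matching against the partial mosaic
    akaze = cv2.AKAZE_create()
    kp_m, des_m = akaze.detectAndCompute(mosaic, None)
    kp_f, des_f = akaze.detectAndCompute(frame, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_f, des_m)

    src = np.float32([kp_f[m.queryIdx].pt for m in matches])
    dst = np.float32([kp_m[m.trainIdx].pt for m in matches])
    S, _ = cv2.estimateAffinePartial2D(src, dst)   # similarity transform (RANSAC)
    if S is None:
        return mosaic, None

    # 3. Frame transformation: align the frame with the partial mosaic
    # (the canvas is assumed large enough to contain the warped frame)
    warped = cv2.warpAffine(frame, S, (mosaic.shape[1], mosaic.shape[0]))

    # 4. Stitching: naive overwrite where the warped frame has content,
    # in place of seamless multiband blending
    mask = warped.sum(axis=2) > 0
    mosaic[mask] = warped[mask]
    return mosaic, S
```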

For each new received frame, the coordinates of all the centers of mass are recomputed. This is due to the fact that, when a new frame is added to the partial mosaic, space must be allocated for it within the mosaic canvas. This operation is performed by appropriately translating the partial mosaic, as well as the centers of mass of the frames composing it, in the new mosaic image. Notice that the centers of mass can be associated with the real GPS coordinates of the frame acquisition, making it possible to map the estimated route to the real world. In Fig. 2, an example of mosaic and the corresponding estimated route is shown. To summarize, Algorithm 1 shows all the steps performed for both route and flight height estimation.
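The route bookkeeping described above can be sketched as follows: the center of mass of each new frame is mapped into mosaic coordinates through its similarity transform, and all previously stored centers are shifted whenever the mosaic canvas is enlarged. The function names and the offset convention are illustrative assumptions.

```python
import numpy as np

def warp_center_of_mass(frame_shape, S):
    """Map the frame center of mass into mosaic coordinates via the 2x3 similarity S."""
    h, w = frame_shape[:2]
    center = np.array([w / 2.0, h / 2.0, 1.0])
    return S @ center                      # (x, y) in mosaic coordinates

def shift_route(route, offset_xy):
    """Re-align all stored centers of mass after the mosaic canvas has been enlarged."""
    offset = np.asarray(offset_xy, dtype=float)
    return [center + offset for center in route]
```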

Algorithm 1. Route and flight height estimation.
Fig. 3. Example of paths used for testing the proposed method. In 3(a), (b) and (c), the ground truth data is reported, while in 3(d), (e) and (f) the estimated data is shown. On the x and y axes the coordinates of the centers of mass are represented, while on the z axis the flight height is shown.

3 Experiments

In this section, the results obtained in the performed experiments are reported.

3.1 Dataset

In our experiments, the recently released UMCD dataset [34] is used. The latter provides 50 geo-referenced aerial videos that can be used for mosaicking and change detection tasks at very low altitudes. Together with the videos and the GPS coordinates, the authors provide a basic mosaicking algorithm that has been used in our experiments as ground truth. In addition to the dataset, we acquired 12 new videos, following the same protocol of the dataset in order to obtain homogeneous testing data. Moreover, the same drone used for building the UMCD dataset, i.e., the DJI Phantom 3, has been used. Since the dataset contains no videos with a ground marker, the calibration step for those videos has been performed by using the objects of the change detection task, whose real size is provided by the authors together with the videos. Finally, through the provided GPS file, it is possible to know the UAV flight height when in proximity of an object, allowing the pixels/meters ratio needed for the calibration to be computed.

3.2 Qualitative Results

For each test, a mosaic of the overflown area has been built to extract the center of mass of each frame and to estimate the flight route. Since the proposed method relies on the mosaicking algorithm, whenever the mosaic generation failed, only a partial route and flight height estimation was obtained. In Fig. 3, some experimental results are shown. Figures 3(a), (b) and (c) show the ground truth for both flight height and route, while Figs. 3(d), (e) and (f) present the results obtained with the proposed method. In detail, Figs. 3(a) and (d) depict the route over an area covered by our own acquisitions, while the other figures show two paths provided in the UMCD dataset. As expected, the results obtained with the proposed algorithm approximately reflect the ground truth data. While the estimated route and the ground truth route are almost identical, the estimated flight height presents more variations. This is due to the feature matching step being sensitive to outliers, as well as to feature mismatches. While for Figs. 3(a), (d) and (b), (e) the estimated height and the ground truth height are similar, this is not true for Figs. 3(c) and (f). This is due to the fact that, in this specific path, the GPS sensor failed in acquiring data, which highlights the potential of the proposed algorithm.

Fig. 4. (a) Comparison between raw data (blue bars) and estimated data (orange bars), and (b) difference between raw and estimated data. (Color figure online)

3.3 Quantitative Results

In Fig. 4, the ground truth and the estimated data, together with their difference, are reported. As shown in Figs. 4(a) and (b), the results obtained with the proposed method are very close to the raw data obtained through the sensors. From Fig. 4(a), it is possible to notice that, on average, the estimated data slightly overestimates the ground truth. An exception is the third flight path, which corresponds to the example shown in Fig. 3(c): in this case, the distance is larger because the UAV lost the GPS signal during the experiments.

Concerning the execution time, the proposed method strongly depends on the mosaicking algorithm, since both the keypoints and the centers of mass are computed during that process. This means that using new generation hardware and optimizing the algorithm for multicore CPUs or GPUs makes it possible to reach real-time performance.

4 Conclusion

In this paper, a feature-based SLAM algorithm for small-scale UAVs with a nadir view has been presented. The proposed method exploits a state-of-the-art mosaicking algorithm to estimate the UAV flight route, while image features, in conjunction with an affine transformation, are used to estimate the flight height. Experiments performed on our aerial acquisitions and on the recently released UMCD dataset show the effectiveness of the proposed approach.