1 Introduction

The prototypical problem in geometric computer vision is the so-called Structure from Motion (SfM) problem [12, 24], the objective of which is to recover the scene geometry and camera poses from a collection of images of a scene. The SfM problem has, in some form or other, been studied since the very earliest days of photography, and many fundamental aspects of SfM were well understood already by the end of the 19th century [23]. Solving SfM problems of meaningful size and with actual image data, however, has been made possible only through the computerisation efforts that were commenced in the late 1970s, and which have since led to increasingly automatic methods for SfM. Modern SfM systems, e.g. Bundler [22] and other systems under the wider BigSFM banner [1, 8], have managed to produce impressive city-scale reconstructions from large unordered and unlabelled sets of images.

A major paradigm in SfM, which has proven hugely successful, is Bundle Adjustment (BA) [26], which treats SfM as a large optimisation problem. With a parameterisation describing the scene geometry and the cameras, BA employs numerical optimisation techniques to find parameter values which best explain the observed images. Here, ‘best’ is determined by evaluating a cost function which is often, but not always, chosen as the sum of squared geometric reprojection errors. The BA formulation of the SfM problem puts it in a unified framework which still has extensive model flexibility, e.g. with regard to (a) assumptions on the camera calibration, (b) different cost functions, and (c) different parameterisations of the cameras and the scene geometry, including implicit and explicit constraints to enforce a particular motion model.

While camera based Simultaneous Localisation and Mapping (SLAM) and Visual Odometry (VO) can be thought of as special classes of SfM, the computational effort to approach SLAM and VO via BA has traditionally been prohibitive, and for this reason, BA has mostly been used in offline batch processing systems such as the BigSFM systems mentioned earlier. During the last two decades, however, SLAM and VO systems have started incorporating regular BA steps to improve the consistency of the reconstruction and the precision of the camera pose estimation. Performance improvements across the spectrum (the algorithms, their implementation, and the hardware) are paving the way for application specific BA to make its entrance in the area of real-time systems.

Especially in the case of visual SLAM, there are a number of factors which can be exploited to alleviate the computational burden compared to a more generic SfM system. The images are acquired in an ordered sequence, and this can significantly speed up the search for correspondences by avoiding the expensive ‘all-vs-all’ matching. Additionally, a suitable motion model may often be incorporated in a SLAM system, which can be used e.g. (a) to further speed up the search for correspondences by predicting feature locations in subsequent images [5, 6], (b) to facilitate faster and more accurate local motion estimation via nonholonomic constraints [20, 21, 39] or other constraints which reduce the set of parameters [29, 33], or (c) to enforce globally a planar motion assumption on the camera motion [10, 18, 32].

In this paper, we present a BA approach to visual SLAM for the case of a stereo rig, where the cameras do not necessarily have an overlapping field of view, and where each of the two cameras moves in parallel to a common ground plane. The present paper is an extension of the system described earlier in [30], to which a more extensive experimental evaluation has been added. In particular, we have investigated how initialisation using planar motion compatible homographies based on minimal [33] or non-minimal [29] polynomial solvers affects the final reconstruction.

2 Related Work

Planar Motion is a frequently occurring constrained camera motion, which arises naturally when cameras are attached to a ground vehicle operating on a planar ground surface. As mentioned in the introduction, deliberately enforcing planar motion can help to improve the quality of the reconstruction.

An early SfM approach to plane constrained visual navigation was proposed by Wiles and Brady [34, 35]. They suggested a hierarchical framework of camera parameterisations, and explored in detail the remaining structural ambiguity for each of these. The lasting contribution of this work lies chiefly in its classification and description of the different modes of motion. The least ambiguous level in the case of planar motion—which they called \(\alpha \)-structure—contains only an arbitrary global scaling ambiguity and an arbitrary planar Euclidean transformation parallel to the ground plane, and is precisely the level aimed at in the present paper.

If the optical axis of the camera is either orthogonal or parallel to the ground plane, the parameterisation can be much simplified compared to the general case described by Wiles and Brady. This situation can of course also be achieved if the camera tilt is known with sufficient precision to allow a transformation to, e.g., an overhead view. An approach for this case by Ortín and Montiel parameterises the essential matrix explicitly in the motion parameters, and then estimates the parameters using either a linear three-point method or a non-linear two-point method [18]. Scaramuzza used essentially the same parameterisation of the essential matrix, but combined it with an additional nonholonomic constraint based on the assumption that the local motion is a circular motion [20, 21]. Because of this additional constraint, the local motion can be computed from only one point correspondence, and this allows for an exceptionally efficient outlier removal scheme based on histogram voting.

Since the essential matrix is a homogeneous entity, it does not capture the length of the translation, and maintaining a consistent global scale then requires some additional information. One possibility for this, explored by Chen and Liu, is to add a second camera [4]. This allows the length of the local translation to be computed in terms of the distance between the two cameras, and since this remains constant, it provides a way to prevent scale drift.

If the camera is oriented such that it views a reasonable part of the ground plane, an alternative to using the essential matrix is to instead use homographies for the local motion estimation. This has the advantage that the length of the translation between frames can be expressed in terms of the height above the ground plane, which thus defines the global scale. The homography based approach by Liang and Pears is based on an eigendecomposition of the homography matrix, and it is shown that the rotation about the vertical axis can be determined from the eigenvalues, regardless of the camera tilt [14]. Hajjdiab and Laganière parameterised the homography matrix under the assumption of only one tilt angle, and then transformed the images into a synthetic overhead view to compute the residual rigid body motion in the plane [10].

A more recent homography based method by Wadenbäck and Heyden, which also exploits a decoupling of the camera tilt and the camera motion, uses an alternating iterative estimation scheme to compute the two tilt angles and the three motion parameters [31, 32]. Zienkiewicz and Davison solved the same 5-DoF problem through a joint non-linear optimisation over all five parameters to achieve a dense matching of successive views, with the implementation running on a GPU to reach very high frame rates [39].

Valtonen Örnhag and Heyden extended the general 5-DoF situation to handle a binocular setup, where the two cameras are connected by a fixed (but unknown) rigid body motion in 3D, and where the fields of view do not necessarily overlap [27, 28].

Bundle Adjustment is used to optimise a set of structure and motion parameters, and is typically performed over several camera views. Triggs et al. give an excellent overview [26]. Since the number of parameters optimised over is in most cases very large, naïve implementations will not work, and care must be taken to exploit the problem structure (e.g. the sparsity pattern of the Jacobian).

Generic software packages for bundle adjustment, which use sparsity of the Jacobian matrix together with Schur complementation to speed up the computations, include SBA (Sparse Bundle Adjustment) by Lourakis and Argyros, sSBA (Sparse Sparse Bundle Adjustment) by Konolige, and SSBA (Simple Sparse Bundle Adjustment) by Zach [13, 16, 37].

Additional performance gains may sometimes be obtained through parallelisation. GPU accelerated BA systems using parallelised versions of the Levenberg–Marquardt algorithm [11] and the conjugate gradients method [36] have been presented e.g. by Hänsch et al. and by Wu et al. More recently, distributed approaches by e.g. Eriksson et al. and by Zhang et al. have employed splitting methods to make very large SfM problems tractable [7, 38].

The present paper extends the sparse bundle adjustment system for the binocular planar motion case by Valtonen Örnhag and Wadenbäck. The aim of our approach is to exploit the particular structure in the Jacobian which arises due to the planar motion assumption for the two cameras. We demonstrate how this particular situation can be attacked via the use of nested Schur complementations when solving the normal equations. In comparison to the earlier paper [30], we have significantly extended the experimental evaluation of the system. Additionally, we have investigated the effect of enforcing the planar motion assumption earlier on a local level, by using homographies estimated such that they are compatible with this assumption [29, 33].

3 Theory

3.1 Problem Geometry

The geometrical situation we consider in this paper is that of two cameras which have been rigidly mounted onto a mobile platform. Due to this setup, which is illustrated in Fig. 1, the cameras are connected by a rigid body motion which remains constant over time but which is initially not known. Each camera is assumed to be mounted in such a way that it can view a portion of the ground plane, but it is not a requirement that the cameras have any portion of their fields of view in common. The world coordinate system is chosen such that the ground plane is positioned at \(z=0\), whereas the cameras move in the planes \(z=a\) and \(z=b\), respectively. We may also, without loss of generality, assume that the centre of rotation of the mobile platform coincides with the centre of the first camera.

Fig. 1. Illustration of the problem geometry considered in this paper. Two cameras are assumed to be rigidly mounted on a mobile platform, and may be positioned at different heights above the ground floor, hence moving in the planes \(z=a\) and \(z=b\). Due to the rigidity assumption, the relative orientation between them is constant, and so is the overhead tilt. Figure reproduced from [30].

3.2 Camera Parameterisation

We shall adopt the camera parameterisation for internally calibrated monocular planar motion that was introduced in [31]. With this parameterisation, the camera matrix associated with the image taken at position j will be

$$\begin{aligned} \varvec{P}^{(j)} = \varvec{R}_{\psi \theta }\varvec{R}_\varphi ^{(j)}[\varvec{I}\;|\;-\varvec{t}^{(j)}], \end{aligned}$$
(1)

where \(\varvec{R}_{\psi \theta }\) is a rotation \(\theta \) about the y-axis followed by a rotation of \(\psi \) about the x-axis. The motion of the mobile platform contains for each frame a rotation \(\varphi ^{(j)}\) about the z-axis, encoded as \(\varvec{R}_\varphi ^{(j)}\), and a vector \(\varvec{t}^{(j)}\) for the translational part. The second camera, which is related to the first camera through a constant rigid body motion, uses the parameterisation

$$\begin{aligned} \varvec{P}'^{(j)} = \varvec{R}_{\psi '\theta '}\varvec{R}_{\eta }\varvec{T_\tau }(b)\varvec{R}_\varphi ^{(j)}[\varvec{I}\,|\,-\varvec{t}^{(j)}], \end{aligned}$$
(2)

introduced in [27]. Here, \(\psi '\) and \(\theta '\) are the tilt angles (defined in the same way as for the first camera), \(\varvec{\tau }\) is the relative translation between the camera centres and \(\eta \) is the constant rotation about the z-axis relative to the first camera. We do not assume any prior knowledge of these constant parameters. Define the translation matrix \(\varvec{T}_{\varvec{\tau }}(b)\) as \(\varvec{T}_{\varvec{\tau }}(b)= \varvec{I}-\varvec{\tau n}^{\intercal }/b\), where \(\varvec{n}\) is a floor normal and b is the height above the ground floor. The global scale ambiguity allows us to set \(a=1\) without any loss of generality.
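As an illustration, the camera matrices (1) and (2) can be assembled literally from their factors. The sketch below is a minimal NumPy version under stated assumptions: the sign conventions of the elementary rotations, the choice \(\varvec{n}=(0,0,1)^{\intercal }\) for the floor normal, and the function names `first_camera` and `second_camera` are ours and are not fixed by the text.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def tilt(psi, theta):
    # Rotation theta about the y-axis followed by psi about the x-axis.
    return rot_x(psi) @ rot_y(theta)

def motion(phi, t):
    # R_phi [I | -t]: the per-frame part shared by both cameras.
    t = np.asarray(t, dtype=float).reshape(3, 1)
    return rot_z(phi) @ np.hstack([np.eye(3), -t])

def first_camera(psi, theta, phi, t):
    # Eq. (1).
    return tilt(psi, theta) @ motion(phi, t)

def second_camera(psi2, theta2, eta, tau, b, phi, t):
    # Eq. (2), with T_tau(b) = I - tau n^T / b and n = (0, 0, 1)^T (assumed).
    n = np.array([0.0, 0.0, 1.0])
    T = np.eye(3) - np.outer(np.asarray(tau, dtype=float), n) / b
    return tilt(psi2, theta2) @ rot_z(eta) @ T @ motion(phi, t)
```

With \(\varvec{\tau }=\varvec{0}\), \(\eta =0\) and identical tilt angles, the two functions return the same matrix, which is a quick sanity check of the construction.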

4 Prerequisites

4.1 Geometric Reprojection Error

The particular BA problem considered in this paper concerns the minimisation of the geometric reprojection error in the two views over the entire motion sequence. In order to write down this cost function explicitly we need to introduce some additional notation.

For this purpose, let the two cameras at a particular position j be given by the expressions in (1) and (2), respectively. We use the homogeneous representation \(\varvec{X}_i\) to parameterise the estimate of the i:th 3D point, corresponding to the measured image point with inhomogeneous representations \(\varvec{x}_i^{(j)}\) in the first camera and \(\varvec{x}_i'^{(j)}\) in the second. Let \(\hat{\bar{\varvec{x}}}_i^{(j)}\) and \(\hat{\bar{\varvec{x}}}_i'^{(j)}\) be the inhomogeneous representations for the projections into the two views, i.e. the inhomogeneous counterparts of the homogeneous projections

$$\begin{aligned} \hat{\varvec{x}}_i^{(j)} \sim \varvec{P}^{(j)}\varvec{X}_i \qquad \text {and} \qquad \hat{\varvec{x}}_i'^{(j)} \sim \varvec{P}'^{(j)}\varvec{X}_i. \end{aligned}$$
(3)

Given N stereo camera locations and M scene points, the geometric reprojection error that we seek to minimise can now be written concisely as

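$$\begin{aligned} E(\varvec{\beta }) = \sum _{i=1}^M\sum _{j=1}^N \left\| \varvec{r}_{ij}\right\| ^2 + \left\| \varvec{r}'_{ij}\right\| ^2, \end{aligned}$$
(4)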

where \(\varvec{\beta }\) is the parameter vector consisting of the camera parameters and the scene point parameters, and where \(\varvec{r}_{ij}\) and \(\varvec{r}'_{ij}\) are the residuals

$$\begin{aligned} \varvec{r}_{ij} = \varvec{x}_i^{(j)}-\hat{\bar{\varvec{x}}}_i^{(j)} \qquad \text {and} \qquad \varvec{r}'_{ij} = \varvec{x}_i'^{(j)}-\hat{\bar{\varvec{x}}}_i'^{(j)}. \end{aligned}$$
(5)
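The cost can be evaluated directly from the residuals in (5). The following is a minimal sketch, assuming that every scene point is observed in every view (as the double sum suggests) and that the cost is the sum of squared residual norms. The array layouts and the names `project` and `reprojection_cost` are hypothetical.

```python
import numpy as np

def project(P, X):
    """Project a homogeneous scene point X (4,) with camera P (3, 4)
    and return the inhomogeneous image point (2,)."""
    x = P @ X
    return x[:2] / x[2]

def reprojection_cost(cams1, cams2, points, obs1, obs2):
    """Sum of squared reprojection errors over both cameras.

    cams1, cams2: sequences of N camera matrices, each (3, 4)
    points:       (M, 4) homogeneous scene points
    obs1, obs2:   (M, N, 2) measured image points, indexed [i, j]
    """
    cost = 0.0
    for i, X in enumerate(points):
        for j, (P, P2) in enumerate(zip(cams1, cams2)):
            r = obs1[i, j] - project(P, X)     # residual r_ij
            r2 = obs2[i, j] - project(P2, X)   # residual r'_ij
            cost += r @ r + r2 @ r2
    return cost
```

In a practical system the loops would be vectorised and restricted to the points actually visible in each view.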

4.2 The Levenberg–Marquardt Algorithm

We will in this approach use the Levenberg–Marquardt algorithm (LM) when minimising (4). There are of course other alternatives to the LM algorithm, e.g. the dog-leg solver [15] and preconditioned CG [3]; however, LM is one of the most commonly used algorithms for BA, and is used in major modern systems such as SBA [16] and sSBA [13]. Note that these systems do not account for the particular problem geometry that we consider in this paper, which forces some extrinsic parameters to be shared among all camera matrices.

We will not go into details of the LM algorithm here—please refer to more extensive treatments in e.g. [26] and [16] for a more complete discussion—but for future reference we simply recall that it works by iteratively solving the augmented normal equations

$$\begin{aligned} \left( \varvec{J}^{\intercal }\varvec{J}+\mu \varvec{I}\right) \varvec{\delta } = \varvec{J}^{\intercal }\varvec{\varepsilon } \end{aligned}$$
(6)

until some convergence criteria have been met. Here \(\varvec{J}\) is the Jacobian associated with the cost function (4), \(\varvec{\varepsilon }\) is the residual vector, and \(\mu \ge 0\) is the iteratively adjusted damping parameter of the LM algorithm.
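To make the iteration concrete, a small dense LM loop built around (6) is sketched below. It is a toy illustration rather than the solver used in this paper. We assume here that \(\varvec{J}\) is the Jacobian of the predicted measurements and that \(\varvec{\varepsilon }\) is measurement minus prediction, so that \(\varvec{\delta }\) is added to the parameters. The accept/reject rule, the factor 10 for updating \(\mu \), and the exponential test model are our own choices.

```python
import numpy as np

def levenberg_marquardt(predict, jacobian, beta0, y, mu=1e-3, max_iter=50, tol=1e-12):
    """Minimise ||y - predict(beta)||^2 with a basic LM iteration."""
    beta = np.asarray(beta0, dtype=float)
    eps = y - predict(beta)
    cost = eps @ eps
    for _ in range(max_iter):
        J = jacobian(beta)
        # Augmented normal equations (J^T J + mu I) delta = J^T eps, cf. (6).
        delta = np.linalg.solve(J.T @ J + mu * np.eye(beta.size), J.T @ eps)
        eps_new = y - predict(beta + delta)
        cost_new = eps_new @ eps_new
        if cost_new < cost:
            # Accept the step and reduce the damping.
            decrease = cost - cost_new
            beta, eps, cost = beta + delta, eps_new, cost_new
            mu /= 10.0
            if decrease < tol:
                break
        else:
            # Reject the step and increase the damping.
            mu *= 10.0
    return beta, cost

# Toy usage: fit y = a * exp(b * x) to noise-free samples.
x = np.linspace(0.0, 1.0, 20)

def model(b):
    return b[0] * np.exp(b[1] * x)

def model_jac(b):
    return np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])

y = model(np.array([2.0, -1.0]))
beta, cost = levenberg_marquardt(model, model_jac, [1.0, 0.0], y)
```

For BA-sized problems the dense solve is replaced by a structured solve that exploits the sparsity of \(\varvec{J}\).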

4.3 Obtaining an Initial Solution for the Camera Parameters

Homographies can be estimated in a number of different ways; however, the classical approach is to compute point correspondences from matching robust feature points in subsequent images. Popular feature extraction algorithms include SIFT [17] and SURF [2], but many more are available and implemented in various computer vision software. When the putative point correspondences have been matched, a popular choice is to use RANSAC (or similar frameworks) to robustly estimate a homography. Such an approach is suitable in order to discard mismatched feature points. A well-known method is the Direct Linear Transform (DLT); however, it requires four point correspondences, and does not generate a homography compatible with the general planar motion model. A good rule of thumb is to use a minimal number of point correspondences, since the probability of finding a set of points containing only inliers decreases with each additional point that is used. However, as e.g. Pham et al. point out, for very severely noisy data it may in some cases still be preferable to use a non-minimal set [19].
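For reference, the standard DLT estimate mentioned above can be sketched in a few lines. This is the generic textbook formulation, assuming noise-free or already inlier-filtered correspondences. It omits coordinate normalisation and the robust (RANSAC-type) loop, and it does not enforce compatibility with the general planar motion model. The function name `dlt_homography` is ours.

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate H (3, 3) such that dst ~ H src from n >= 4 point pairs.

    src, dst: (n, 2) arrays of corresponding inhomogeneous image points.
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear equations in the
        # nine entries of H.
        rows.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])
        rows.append([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y, v])
    # The solution is the right singular vector belonging to the
    # smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / np.linalg.norm(H)
```

The result is defined only up to scale and sign, so comparisons between homographies should be made after fixing a common normalisation.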

In [33] a minimal solver compatible with the general planar motion model was studied. It was shown that a homography compatible with the general planar motion model must fulfil 11 quartic constraints, and that a minimal solver only requires 2.5 point correspondences. In a recent paper, a variety of different non-minimal polynomial solvers are considered, partly because of execution time, but also because of sensitivity to noise [29]. These non-minimal solvers enforce a subset of the necessary and sufficient conditions for compatibility with the general planar motion model, thus enforcing a weaker form of it. By carefully making a trade-off between fitting the model constraints (i.e. using more model constraints) and tuning to data (i.e. using more point correspondences), one can increase the performance for noisy data. It is important to note that the assumption of constant tilt parameters cannot be enforced by only considering a single homography, and, therefore, pre-optimisation in an early step of the complete SfM pipeline is not guaranteed to yield better performance.

Once the homographies are obtained, one may enforce the constant tilt constraint by employing the method proposed by Wadenbäck and Heyden [32], to obtain a good initial solution for the monocular case. The method starts by computing the overhead tilt \(\varvec{R}_{\psi \theta }\) from an arbitrary number of homographies, followed by estimating the translation and orientation about the floor normal.

The method by Valtonen Örnhag and Heyden [27] extends this approach to the stereo case. It starts off by treating the two stereo trajectories individually, estimating the tilt parameters with the monocular method described in the previous paragraph. Once the monocular parameters are known for the individual tracks, the relative pose can be extracted by minimising an algebraic error in the relative translation between the cameras, followed by estimating the relative orientation about the floor normal.

4.4 Obtaining an Initial Solution for the Scene Points

Linear triangulation of scene points does not guarantee that all points lie in a plane, and the resulting initial solution would not be compatible with the general planar motion model. In order to obtain a physically meaningful solution we make use of the fact that there is a homography relating the measured points and the ground plane positioned at \(z=0\).

Given a camera \(\varvec{P}\), an image point \(\varvec{x}\) and the corresponding scene point \(\varvec{X}\sim (X,Y,0,1)^{\intercal }\), they are related by \(\varvec{x}\sim \varvec{PX}=\varvec{H}\tilde{\varvec{X}}\), where \(\varvec{H}\) is the sought homography. By denoting the i:th column of \(\varvec{P}\) by \(\varvec{P}_i\), it may be expressed as \(\varvec{H}=[\varvec{P}_1\;\varvec{P}_2\;\varvec{P}_4]\), where \(\tilde{\varvec{X}}\sim (X,Y,1)^{\intercal }\) contains the unknown scene point coordinates. It follows that the corresponding scene point can be extracted from \(\tilde{\varvec{X}}\sim \varvec{H}^{-1}\varvec{x}\).

In the presence of noise, using more than one camera results in different scene points, all of which lie in the plane \(z=0\). In order to triangulate the points we compute the centre of mass; such an approach is computationally inexpensive, but it is not robust to outliers, which have to be excluded in order to get a reliable result.
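The initialisation described in this section amounts to intersecting each viewing ray with the plane \(z=0\) and averaging the results. A minimal sketch is given below, assuming the homography is formed from columns 1, 2 and 4 of the camera matrix (the scene point has \(z=0\)) and that outliers have already been removed. The function names are hypothetical.

```python
import numpy as np

def backproject_to_ground(P, x):
    """Intersect the viewing ray of image point x (2,) with the plane z = 0.

    P is a (3, 4) camera matrix; the returned point is (X, Y).
    """
    H = P[:, [0, 1, 3]]                          # columns 1, 2 and 4 of P
    X = np.linalg.solve(H, np.array([x[0], x[1], 1.0]))
    return X[:2] / X[2]

def triangulate_on_ground(cams, obs):
    """Centre of mass of the per-view back-projections of one scene point.

    cams: sequence of (3, 4) camera matrices
    obs:  sequence of (2,) image points, one per camera
    """
    pts = [backproject_to_ground(P, x) for P, x in zip(cams, obs)]
    return np.mean(pts, axis=0)
```

Since the mean is sensitive to gross errors, a median or an inlier test could be substituted where outliers cannot be excluded beforehand.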

5 Planar Motion Bundle Adjustment

5.1 Block Structure of the Jacobian

Denote the unknown and constant parameters for the first camera path by \(\varvec{\gamma }\) and the second camera path by \(\varvec{\gamma }'\). Furthermore, let the nonconstant parameters for position j be denoted by \(\varvec{\xi }_j\). Given N stereo camera positions and M scene points, the following highly structured Jacobian \(\varvec{J}\) is obtained

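$$\begin{aligned} \varvec{J} = \begin{bmatrix}
\varvec{\varGamma }_{11} & \varvec{0} & \varvec{A}_{11} & & & \varvec{B}_{11} & & \\
\vdots & \vdots & & \ddots & & \vdots & & \\
\varvec{\varGamma }_{1N} & \varvec{0} & & & \varvec{A}_{1N} & \varvec{B}_{1N} & & \\
\vdots & \vdots & \vdots & & \vdots & & \ddots & \\
\varvec{\varGamma }_{M1} & \varvec{0} & \varvec{A}_{M1} & & & & & \varvec{B}_{M1} \\
\vdots & \vdots & & \ddots & & & & \vdots \\
\varvec{\varGamma }_{MN} & \varvec{0} & & & \varvec{A}_{MN} & & & \varvec{B}_{MN} \\
\varvec{0} & \varvec{\varGamma }'_{11} & \varvec{A}'_{11} & & & \varvec{B}'_{11} & & \\
\vdots & \vdots & & \ddots & & \vdots & & \\
\varvec{0} & \varvec{\varGamma }'_{1N} & & & \varvec{A}'_{1N} & \varvec{B}'_{1N} & & \\
\vdots & \vdots & \vdots & & \vdots & & \ddots & \\
\varvec{0} & \varvec{\varGamma }'_{M1} & \varvec{A}'_{M1} & & & & & \varvec{B}'_{M1} \\
\vdots & \vdots & & \ddots & & & & \vdots \\
\varvec{0} & \varvec{\varGamma }'_{MN} & & & \varvec{A}'_{MN} & & & \varvec{B}'_{MN}
\end{bmatrix}, \end{aligned}$$
(7)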

where we use the following notation for the derivative blocks

$$\begin{aligned} \begin{aligned} \varvec{A}_{ij} = \dfrac{\partial \varvec{r}_{ij}}{\partial \varvec{\xi }_j},&\quad \varvec{B}_{ij} = \dfrac{\partial \varvec{r}_{ij}}{\partial \tilde{\varvec{X}}_i},&\quad \varvec{\varGamma }_{ij} = \dfrac{\partial \varvec{r}_{ij}}{\partial \varvec{\gamma }}, \\ \varvec{A}'_{ij} = \dfrac{\partial \varvec{r}'_{ij}}{\partial \varvec{\xi }_j},&\quad \varvec{B}'_{ij} = \dfrac{\partial \varvec{r}'_{ij}}{\partial \tilde{\varvec{X}}_i},&\quad \varvec{\varGamma }'_{ij} = \dfrac{\partial \varvec{r}'_{ij}}{\partial \varvec{\gamma }'\!\!}, \end{aligned} \end{aligned}$$
(8)

where \(\tilde{\varvec{X}}_i\) are the unknown scene coordinates. This can be written in a more compact manner as

$$\begin{aligned} \varvec{J} = \begin{bmatrix} \varvec{\varGamma } &{} \varvec{0} &{} \varvec{A} &{} \varvec{B} \\ \varvec{0} &{} \varvec{\varGamma }' &{} \varvec{A}' &{} \varvec{B}' \end{bmatrix}. \end{aligned}$$
(9)

5.2 Utilising the Sparse Structure

In SfM, the number of scene points is often significantly larger than the number of cameras, which makes Schur complementation tractable, and can significantly decrease the execution time. Standard Schur complementation is, however, not directly applicable due to the constant parameters giving rise to the blocks \(\varvec{\varGamma }\) and \(\varvec{\varGamma }'\). We will, however, show in this section, that it is indeed possible to use nested Schur complements, i.e. to recursively apply Schur complements to different parts, and that, in fact, several of the intermediate computations can be stored, thus drastically decreasing the computational time. First, note that the approximate Hessian \(\varvec{J}^{\intercal }\varvec{J}\), in compact form, can be written

$$\begin{aligned} \varvec{J}^{\intercal }\varvec{J} = \begin{bmatrix} \varvec{C} &{} \varvec{E} \\ \varvec{E}^{\intercal } &{} \varvec{D} \end{bmatrix}. \end{aligned}$$
(10)

Here the contributions from the constant parameters are stored in \(\varvec{C}\), the contributions from the nonconstant parameters and the scene points are stored in \(\varvec{D}\), and the mixed contributions are stored in \(\varvec{E}\). Furthermore, the matrix \(\varvec{D}\) can be written as

$$\begin{aligned} \varvec{D}= \begin{bmatrix} \varvec{U} &{} \varvec{W} \\ \varvec{W}^{\intercal } &{} \varvec{V} \end{bmatrix}, \end{aligned}$$
(11)

with block diagonal matrices \(\varvec{U}=\mathrm {diag}(\varvec{U}_1,\ldots ,\varvec{U}_N)\) and \(\varvec{V} = \mathrm {diag}(\varvec{V}_1,\ldots ,\varvec{V}_M)\), where

$$\begin{aligned} \begin{aligned} \varvec{U}_j&= \sum _{i=1}^M\varvec{A}_{ij}^{\intercal }\varvec{A}_{ij}+\varvec{A}_{ij}^{\prime \intercal }\varvec{A}_{ij}', \\ \varvec{V}_i&= \sum _{j=1}^N\varvec{B}_{ij}^{\intercal }\varvec{B}_{ij}+\varvec{B}_{ij}^{\prime \intercal }\varvec{B}_{ij}', \\ \varvec{W}_{ij}&= \varvec{A}_{ij}^{\intercal }\varvec{B}_{ij}+\varvec{A}_{ij}^{\prime \intercal }\varvec{B}_{ij}'. \end{aligned} \end{aligned}$$
(12)
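The block expressions in (12) can be checked numerically against a densely assembled Jacobian. The sketch below does this for the \(\varvec{A}\) and \(\varvec{B}\) columns only. The block sizes, the ordering of the residual rows (by scene point, then by position), and the function names are assumptions made for the illustration.

```python
import numpy as np

def hessian_blocks(A, A2, B, B2):
    """Blocks U_j, V_i and W_ij of (12).

    A, A2: arrays indexed [i, j] holding the blocks A_ij and A'_ij
    B, B2: arrays indexed [i, j] holding the blocks B_ij and B'_ij
    """
    M, N = A.shape[:2]
    U = [sum(A[i, j].T @ A[i, j] + A2[i, j].T @ A2[i, j] for i in range(M))
         for j in range(N)]
    V = [sum(B[i, j].T @ B[i, j] + B2[i, j].T @ B2[i, j] for j in range(N))
         for i in range(M)]
    W = {(i, j): A[i, j].T @ B[i, j] + A2[i, j].T @ B2[i, j]
         for i in range(M) for j in range(N)}
    return U, V, W

def dense_jacobian(A, A2, B, B2):
    """Dense [A B; A' B'] part of the Jacobian, rows ordered by (i, j)."""
    M, N, dr, dx = A.shape
    dp = B.shape[3]

    def rows(Ablk, Bblk):
        J = np.zeros((M * N * dr, N * dx + M * dp))
        for i in range(M):
            for j in range(N):
                r = (i * N + j) * dr
                J[r:r + dr, j * dx:(j + 1) * dx] = Ablk[i, j]
                J[r:r + dr, N * dx + i * dp:N * dx + (i + 1) * dp] = Bblk[i, j]
        return J

    return np.vstack([rows(A, B), rows(A2, B2)])
```

Forming `J.T @ J` from `dense_jacobian` reproduces the block diagonal \(\varvec{U}\) and \(\varvec{V}\) and the sparse coupling \(\varvec{W}\), which is the structure that standard Schur complementation relies on.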

First, note that the system \((\varvec{D}+\mu \varvec{I})\varvec{\delta }=\varvec{\varepsilon }\), where \(\varvec{D}\) is defined as in (11), is not affected by the constant parameters. Such a system reduces to that of the unconstrained case, which can be solved using standard SfM frameworks, such as SBA, or other packages utilising Schur complementation.

We will now show how to efficiently treat the decomposition of (10) as nested Schur complements, by reducing the problem to a series of subproblems of the form used in SBA and other computer vision software packages. In order to do so, consider the augmented normal equations (6) in block form

$$\begin{aligned} \begin{bmatrix} \varvec{C}^{*} &{} \varvec{E} \\ \varvec{E}^{\intercal } &{} \varvec{D}^{*} \end{bmatrix} \begin{bmatrix} \varvec{\delta }_c\\ \varvec{\delta }_d \end{bmatrix} = \begin{bmatrix} \varvec{\varepsilon }_c\\ \varvec{\varepsilon }_d \end{bmatrix}, \end{aligned}$$
(13)

where \(\varvec{C}^{*}=\varvec{C}+\mu \varvec{I}\) and \(\varvec{D}^{*}=\varvec{D}+\mu \varvec{I}\) denote the augmented matrices, with the added contribution from the damping factor \(\mu \), as in (6). Now, utilising Schur complementation yields

$$\begin{aligned} \begin{bmatrix} \varvec{C}^{*}-\varvec{ED}^{*-1}\varvec{E}^{\intercal } &{} \varvec{0} \\ \varvec{E}^{\intercal } &{} \varvec{D}^{*} \end{bmatrix} \begin{bmatrix} \varvec{\delta }_c\\ \varvec{\delta }_d \end{bmatrix} = \begin{bmatrix} \varvec{\varepsilon }_c-\varvec{ED}^{*-1}\varvec{\varepsilon }_d \\ \varvec{\varepsilon }_d \end{bmatrix}. \end{aligned}$$
(14)

Let us take a step back and reflect on the consequences of the above equation. First, note that \(\varvec{D}^{*-1}\) is present in (14) twice, and is infeasible to compute explicitly. This can be avoided by introducing the auxiliary variable \(\varvec{\delta }_{\text {aux}}\), defined as

$$\begin{aligned} \varvec{D}^{*}\varvec{\delta }_{\text {aux}} = \varvec{\varepsilon }_d. \end{aligned}$$
(15)

Again, such a system is not affected by the constraints of the constant parameters, and can be solved with standard computer vision software. Furthermore, we may introduce \(\varvec{\varDelta }_{\text {aux}}\) and solve the system \(\varvec{D}^{*}\varvec{\varDelta }_{\text {aux}}=\varvec{E}^{\intercal }\) in a similar manner by iterating over the columns of \(\varvec{E}^{\intercal }\). Since the number of constant parameters is low, such an approach is highly feasible, but the performance can be further boosted by storing the Schur complement and the intermediate matrices not depending on the right-hand side, from the previous computations of obtaining \(\varvec{\delta }_{\text {aux}}\) from (15).

When the auxiliary variables have been obtained, we proceed to compute \(\varvec{\delta }_c\) from

$$\begin{aligned} \left( \varvec{C}^{*}-\varvec{E\varDelta }_{\text {aux}}\right) \varvec{\delta }_c = \varvec{\varepsilon }_c-\varvec{E}\varvec{\delta }_{\text {aux}}, \end{aligned}$$
(16)

and, lastly, \(\varvec{\delta }_d\) by back-substitution

$$\begin{aligned} \varvec{D}^{*}\varvec{\delta }_d = \varvec{\varepsilon }_d-\varvec{E}^{\intercal }\varvec{\delta }_c. \end{aligned}$$
(17)

Again, by storing the computation of the Schur complement and intermediate matrices, these can be reused to solve (17) efficiently.
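The steps (15) to (17) can be summarised in a short dense sketch. It is meant only to illustrate the order of the computations and the reuse of the inner factorisation. Dense inverses stand in for the sparse block operations of a real BA package, and the names `make_inner_solver` and `nested_schur_step` are ours.

```python
import numpy as np

def make_inner_solver(U, W, V):
    """Return a reusable solver for [[U, W], [W^T, V]] z = rhs.

    The inner Schur complement is formed once and reused for every
    right-hand side (vector or matrix).
    """
    V_inv = np.linalg.inv(V)        # V is block diagonal in BA, so this is cheap
    Y = W @ V_inv
    S_inv = np.linalg.inv(U - Y @ W.T)   # inner Schur complement of V
    nu = U.shape[0]

    def solve(rhs):
        ru, rv = rhs[:nu], rhs[nu:]
        zu = S_inv @ (ru - Y @ rv)
        zv = V_inv @ (rv - W.T @ zu)
        return np.concatenate([zu, zv])

    return solve

def nested_schur_step(C, E, U, W, V, eps_c, eps_d, mu):
    """Solve the augmented normal equations (13) via (15), (16) and (17)."""
    C_star = C + mu * np.eye(C.shape[0])
    solve_D = make_inner_solver(U + mu * np.eye(U.shape[0]), W,
                                V + mu * np.eye(V.shape[0]))
    delta_aux = solve_D(eps_d)                       # (15)
    Delta_aux = solve_D(E.T)                         # D* Delta_aux = E^T
    delta_c = np.linalg.solve(C_star - E @ Delta_aux,
                              eps_c - E @ delta_aux)  # (16)
    delta_d = solve_D(eps_d - E.T @ delta_c)          # (17)
    return delta_c, delta_d
```

The outer system in (16) has only as many unknowns as there are constant parameters, so its cost is negligible next to the three inner solves, all of which share the same stored quantities.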

6 Experiments

6.1 Initial Solution

The inter-image homographies were estimated using the MSAC algorithm [25] from point correspondences by extracting SURF keypoints and applying a KNN algorithm to establish the matches. In the first experiment, we use the standard DLT solver, the minimal 2.5 pt solver [33] and the four different polynomial solvers studied in [29].

In all experiments we use all available homographies, and extract the monocular parameters using the method proposed in [32]. Similarly, the binocular parameters were extracted using [27]. When all motion parameters have been estimated, the camera path is reconstructed by aligning the first camera position to the origin, and the estimated camera poses are used to triangulate the scene points as in Sect. 4.4.

Fig. 2. Errors before applying BA. The angles are measured in degrees, and the translation in pixels.

Fig. 3. Errors after applying BA. The angles are measured in degrees, and the translation in pixels.

6.2 Impact of Pre-processing Steps

In this section we work with synthetic data in order to have access to accurate ground truth data. We generate an image sequence from a high-resolution image depicting a floor, which is the typical use case for the algorithm. This is done by constructing a path compatible with the general planar motion model, projecting the visible part of the floor through the camera, and extracting the corresponding image. The resulting images are \(400\times 400\) pixels, and all cameras are set to a field of view of 90\(^{\circ }\), with parameters \(\psi =-2^\circ \), \(\theta =-4^\circ \), \(\psi '=6^\circ \), \(\theta '=4^\circ \), , \(\eta =20^\circ \) and \(b=1\). In total, the image sequence consists of 20 images. Lastly, to simulate image noise, we add Gaussian noise with a standard deviation of five pixels, where the pixel depth allows 256 different intensities per channel.

In order to study the difference in accuracy for the constant parameters, we proceed by obtaining homographies as described in Sect. 6.1, using the minimal 2.5 point solver [33], four non-minimal solvers [29] and the DLT equations (4 point). The accuracy, over 50 iterations, is reported before BA, in Fig. 2, and after BA, in Fig. 3. In general, the overall performance of the solvers is almost equal; however, some tendencies are present. The minimal solver performs worse than the others before BA, and this deviation is smaller, although still present, after BA. One possible explanation is that the general planar motion model is enforced too early in the pipeline. Since it is enforced between two consecutive image pairs only, it does not guarantee that the overhead tilt is constant throughout the entire sequence, and thus, in the presence of noise, the error propagates differently, compared to the other methods that partially (non-minimal) or completely (DLT) tune to the data.

Fig. 4. Mean reprojection error vs execution time (s) over 50 iterations.

Overall, the performance is acceptable after BA, regardless of how the homographies are obtained. Hence, the differentiating factors come down to convergence rates. For the same problem instances as in the previous section we also save the convergence history in terms of the mean reprojection error and the execution time in seconds. The results are shown in Fig. 4. It is clear that the execution time for reaching convergence increases with the number of point correspondences required by the polynomial solvers. This suggests that one can make a trade-off between speed and accuracy when designing a planar motion compatible BA framework by choosing different solvers, in order to suit one's specific needs. Note, however, that the implementation used in this paper is a native Matlab implementation, and that the absolute timings can be greatly improved by careful implementation; the relative execution time between the solvers will, however, be similar.

6.3 Bundle Adjustment Comparison

In this section we compare the qualitative difference between enforcing the general planar motion model versus the general unconstrained six degree of freedom model on a real dataset. Currently, there is no well-established dataset compatible with the general planar motion model, and as a substitute, we use the KITTI Visual Odometry/SLAM benchmark [9]. Since many sequences or subsequences depict urban environments with paved roads, the general planar motion model can roughly be applied. In case of clear violation of the general planar motion model, we proceed to use only subsequences where the model is applicable. As we are only interested in the road in front of the vehicle, and not the sky and other objects by the roadside, we proceed to crop a part of the image prior to estimating the homography. An example of this is shown in Fig. 5.

Fig. 5. Images from the KITTI Visual Odometry/SLAM benchmark, Sequence 01 (left) and 03 (right). Since the algorithm is homography-based, the images are cropped a priori in order to contain a significant portion of planar or near-planar surface. Such an assumption is not valid for all sequences of the dataset; however, certain cases, such as the highway of Sequence 01 (left), are good candidates. There are several examples where occlusions occur, such as the car in Sequence 03 (right). These situations typically occur at crossroads and turns. Image credit: KITTI dataset [9].

We use SBA [16] to enforce the general 6-DoF model from the initial trajectory obtained using the traditional 4-point DLT solver, and apply our proposed BA algorithm to the same initial trajectory. The same thresholds for absolute and relative errors, termination control and damping factors are used for both methods. Furthermore, we do not match features between the stereo views, in order to demonstrate that enforcing the model is enough to increase the overall performance. The results are shown in Fig. 6.

Fig. 6. Estimated trajectories of subsequences of Sequence 01, 03, 04 and 06. In order to align the estimated paths with the ground truth, Procrustes analysis has been carried out. N.B. the different aspect ratio in (c), which is intentionally added in order to clearly visualise the difference. Figure reproduced from [30].

In most cases the proposed method is preferable to the general 6-DoF method using SBA. Furthermore, note that irregularities that are present in the initial trajectory are often transferred to the solutions obtained by SBA, thus producing physically improbable solutions. These irregularities are rarely seen using the proposed method, which results in smooth realistic trajectories under general conditions, regardless of whether the initial solution contains irregularities or not.

In fact, it is interesting to see what happens in cases where the general planar motion model is violated. Such an instance occurs in Fig. 6(b) depicting Sequence 03, and is due to the car approaching a crossroads, where a passing vehicle enters the field of view. The observed car, and the surroundings, are highly non-planar, and one would perhaps expect such a clear violation to result in completely unreliable output. However, the only inconsistency in comparison to the ground truth is that the resulting turn is too sharp, and the remaining path is consistent with the ground truth. This is not true for the general 6-DoF model, where several obvious inconsistencies are present.
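The Procrustes alignment mentioned in the caption of Fig. 6 can be sketched as a least-squares similarity fit between an estimated planar trajectory and the ground truth. The version below is one common closed-form variant (scale, rotation and translation, without reflection). Whether the alignment used for Fig. 6 has exactly this form is an assumption, as is the function name `procrustes_align`.

```python
import numpy as np

def procrustes_align(est, gt):
    """Align est (n, 2) to gt (n, 2) with a similarity transform.

    Returns the transformed copy of est that minimises the sum of
    squared distances to gt over scale, rotation and translation.
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    # Cross-covariance between the centred point sets.
    U, s, Vt = np.linalg.svd(G.T @ E)
    # Guard against reflections so that R is a proper rotation.
    d = np.sign(np.linalg.det(U @ Vt))
    signs = np.array([1.0, d])
    R = U @ np.diag(signs) @ Vt
    scale = (s * signs).sum() / (E ** 2).sum()
    return scale * E @ R.T + mu_g
```

Because the scale and the planar Euclidean transformation are ambiguities of the reconstruction, removing them in this way leaves only the shape difference between the two trajectories.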

7 Conclusion

In this paper a novel bundle adjustment method has been devised, which enforces the general planar motion model. We provide an efficient implementation scheme that exploits the sparse structure of the Jacobian, and, additionally, avoids recomputing unnecessary quantities, making it highly attractive for real-time computations.

The performance of different polynomial solvers is studied, in terms of both accuracy and speed, taking the entire bundle adjustment framework into account. We discuss how enforcing different polynomial constraints, through planar motion compatible homography solvers, in an early part of the bundle adjustment framework affects the end results. Furthermore, we discuss which trade-offs between speed and accuracy can be made to suit one's specific priorities.

The proposed method has been tested on real data and compared to state-of-the-art methods for sparse bundle adjustment, against which it performs well, giving physically plausible solutions despite some model assumptions not being fulfilled.