1 Introduction

Augmented Reality (AR) is a computer technology that combines virtual, computer-generated 3D graphics with the real environment as perceived by the user. It has been widely applied in fields such as medical surgery, computer-aided education, the military, industry, and entertainment [2, 22, 25]. The key problem in an AR system is accurate and robust 3D registration, which entails aligning virtual objects with the real environment in 3D coordinates. In general, the registration process can be realized in three steps: positioning, rendering, and merging [18]. Positioning transforms and rotates the virtual objects relative to the observer's location. Rendering computes the projected 2D image of the 3D model, i.e. the image observed by the user. Merging is an image processing procedure that blends the virtual objects with the real environment so that they appear to be real parts of the scene.

A real-time registration method for a markerless AR system is proposed in this paper, which combines fixed-region tracking and perspective motion estimation. A texture-rich region is first selected manually and stored as the reference template during initialization. While the camera moves freely, the region is tracked in real time, and its position and pose are estimated using the camera's intrinsic parameters. For computational efficiency, a set of sparse points is randomly sampled to represent the region.

Hence the proposed registration process can be divided into two parts. First, tracking is achieved by minimizing the sum-of-squared differences (SSD) between the reference template and the target region. The next step is to compute the homography of the tracked region and estimate its position and pose. With robust 3D registration, the generated virtual model can be rendered with the correct pose and orientation and merged seamlessly into the real scene. The AR system and its working flow are also presented.

The main contributions of the presented registration algorithm and AR system are: (1) the use of an informative tracked region to replace the manual markers of previous AR systems [1, 2, 11, 17, 19, 25], which enhances the applicability and robustness of the system; (2) illumination-insensitive tracking, obtained by incorporating the estimation of illumination parameters into the tracking process; (3) an integrated solution for 3D registration based on template tracking and the camera's intrinsic parameters.

The rest of this paper is organized as follows. Related work is reviewed in section 2. Section 3 explains the principle of 3D registration, including illumination-insensitive template tracking and perspective motion estimation, and section 4 describes the AR system built on this registration approach. Experimental results with comparisons are shown in section 5, and conclusions are drawn in section 6.

2 Related works

3D registration for an AR system is a challenging problem, particularly when the scene lacks manual markers. Registration methods fall into three categories: registration based on direction-finding equipment, such as GPS (Global Positioning System) and gyrometers, for computing the 6DOF (degrees of freedom) pose of a user [4, 16, 24]; registration based on computer vision methods [3, 7, 11, 18, 19, 23]; and hybrid registration that combines the virtues of both hardware and computer vision [7, 16, 24]. Each of these registration methods has its own advantages and shortcomings. For example, registration with GPS and a gyrometer is fast and robust but generally has low accuracy. Many computer vision registration methods used in AR systems require placing markers in the scene beforehand. These markers are designed for easy detection, by giving them distinctive photometric characteristics (e.g. colored markers [16], LEDs [19], or special shapes such as squares [17, 19], dot codes [11], or even 2D barcodes [1]). Such approaches have been successfully applied in some AR systems [14, 23]. However, manual markers narrow the applicability of AR systems, and simply designed markers are often not robust enough against environmental noise. In addition, such systems sometimes fail when some of the markers are occluded.

In the last few years, researchers have attempted to track patches of the natural scene as landmarks to achieve markerless registration [5, 10, 13, 16, 17, 20].

Traditional region tracking approaches can be divided into two categories: approaches using local independent correspondences (feature-based approaches [11, 16, 17, 20]) and approaches using template correspondences (template-based approaches [5, 13]). The former use local features, such as key points, line segments, and structural primitives, while the latter use texture-rich image patches as a whole. Although feature-based methods have the advantage of fast computation, the strength of template-based methods lies in their ability to handle complex patterns that cannot be modeled by local features.

For template-based approaches, an L2 norm is generally used to measure the error between a reference template and a candidate region [9]. Historically, a brute-force search was used to match the template [16, 18]. However, this strategy is impractical under perspective transformations, which involve higher-dimensional parameter spaces than simple 2D translation. More recent methods treat the problem as a nonlinear optimization using Newton-type or Levenberg–Marquardt algorithms [12]. However, these cannot be implemented in real time due to their expensive nonlinear computations.

The landmarks correspond to regions that are "good features" according to the criteria of Shi and Tomasi [21]. Recently, several region-tracking-based registration methods [9, 18–20] have been proposed for both outdoor and indoor AR systems, and they achieve real-time performance with fixed small regions. However, they have two non-trivial weaknesses: (1) the tracking error is estimated from the gray-level intensity of the tracked region (template), so the registration is sensitive to illumination; (2) they only perform well for 2D virtual object augmentation, because they solve the registration by estimating an affine motion transformation of the tracked region.

3 Principle of 3D registration

We first introduce the fixed template tracking method, then describe projective motion estimation via region tracking, and finally present the 3D environment registration.

3.1 Fixed template tracking

Since a small displacement of the camera results in a change in pixel intensities, we can determine the motion of the moving region from these intensity variations, assuming that the capture rate of the camera is high enough.

We first define some notation. Let I(X,t) be the brightness value at the location X = (x, y)^t in an image acquired at time t and \(\nabla_{\mathbf{X}} \mathbf{I}(\mathbf{X},t)\) be its intensity gradient. Let R = {X_1, X_2, X_3, ..., X_N} be the set of N image locations that define a target region. I(R,t) = (I(X_1,t), I(X_2,t), ..., I(X_N,t)) is the vector of brightness values of the target region at time t, and I(R,t_0) is referred to as the reference template, i.e. the template to be tracked, where t_0 is the initial time (t = 0). During tracking, the relative motion between the camera and the scene deforms the tracked target. Therefore, a model f(X;μ) is adopted to describe the motion of the target, where μ is a vector of n parameters; obviously f(X;0) = X and N > n. Tracking thus reduces to computing the motion vector in every frame from the intensity variations. Suppose μ*(t) is the true value of the motion vector at time t, with μ*(t_0) = 0; then at an arbitrary time t > t_0 we have:

$$\mathbf{I}\left( \mathbf{X}, t_0 \right) = \mathbf{I}\left( f\left( \mathbf{X}, \boldsymbol{\mu}^*(t) \right), t \right)$$
(1)

Eq. (1) expresses the gray-scale invariance assumption used in region tracking. Least squares can be used to estimate the motion parameters at time \(t > t_0\) as:

$$O\left( {\mathbf{\mu }} \right) = \left\| {\left( {{\mathbf{I}}\left( {f\left( {{\mathbf{{\rm X}}},{\mathbf{\mu }}} \right),t} \right) - {\mathbf{I}}\left( {{\mathbf{{\rm X}}},t_0 } \right)} \right)} \right\|^2 $$
(2)

In order to simplify the notation, I(f(X,μ),t) is denoted by I(μ,t), which describes the intensity of the target at time t > t_0. Assuming μ = 0 at time t_0, Eq. (2) can be simplified to:

$$O\left( {\mathbf{\mu }} \right) = ||\left( {{\mathbf{I}}\left( {{\mathbf{\mu }},t} \right) - {\mathbf{I}}\left( {0,t_0 } \right)} \right)||^2 $$
(3)

In general, Eq. (3) is a non-convex objective function. Without a good starting point, this problem usually requires some kind of time-consuming global optimization procedure [12]. In the tracking problem, however, an accurate estimate of the target's position and orientation is available before tracking starts, and at time t > t_0 the target motion can be described by μ(t). Therefore, the problem can be recast as computing the increment δμ of the motion parameters, where \({\mathbf{\mu }}\left( {t + \tau } \right) = {\mathbf{\mu }}\left( t \right) + \delta {\mathbf{\mu }}\). In this case, Eq. (3) can be transformed to

$$O\left( {\delta {\mathbf{\mu }}} \right) = \left\| {{\mathbf{I}}\left( {{\mathbf{\mu }}\left( t \right) + \delta {\mathbf{\mu }},t + \tau } \right) - {\mathbf{I}}\left( {0,t_0 } \right)} \right\|^2 $$
(4)

Using a high-capture-rate camera ensures that the deformation between frames is small, which means that the increment δμ is also small. Thus, linearization can be carried out by expanding \({\mathbf{I}}\left( {{\mathbf{\mu }}\left( t \right) + \delta {\mathbf{\mu }},t + \tau } \right)\) in a Taylor series as:

$${\mathbf{I}}\left( {{\mathbf{\mu }} + \delta {\mathbf{\mu }},t + \tau } \right) = {\mathbf{I}}\left( {{\mathbf{\mu }},t} \right) + {\mathbf{M}}\left( {{\mathbf{\mu }},t} \right)\delta {\mathbf{\mu }} + \tau \times {\mathbf{I}}_t \left( {{\mathbf{\mu }},t} \right) + {\text{h}}{\text{.o}}{\text{.t}}$$
(5)

where h.o.t. denotes the higher-order terms of the expansion and M(μ,t) is the Jacobian matrix of the captured image, relating the motion parameters to the intensity variations of the target. M(μ,t) is an N × n matrix of partial derivatives, because the number of sparse points is N and the dimension of the motion parameter vector is n. Each element of this matrix is given by

$$m_{ij} = {\mathbf{I}}_{{\mathbf{\mu }}_j } \left( {f\left( {{\mathbf{{\rm X}}}_i ;{\mathbf{\mu }}} \right),t} \right) = \nabla _f {\mathbf{I}}\left( {f\left( {{\mathbf{{\rm X}}}_i ;{\mathbf{\mu }}} \right),t} \right)^t f_{\mathbf{\mu }} \left( {{\mathbf{{\rm X}}}_i ;{\mathbf{\mu }}} \right)$$
(6)

where \(\nabla _f {\mathbf{I}}\) is the gradient of the target with respect to the motion model.

Combining Eq. (5) with Eq. (4) and neglecting the higher-order terms, we have:

$$O\left( {\delta {\mathbf{\mu }}} \right) = \left\| {{\mathbf{I}}\left( {{\mathbf{\mu }},t} \right) + {\mathbf{M}}\delta {\mathbf{\mu }} + \tau \times {\mathbf{I}}_t - {\mathbf{I}}\left( {0,t_0 } \right)} \right\|^2 $$
(7)

Letting \(\tau \times {\mathbf{I}}_t \left( {{\mathbf{\mu }},t} \right) \approx {\mathbf{I}}\left( {{\mathbf{\mu }},t + \tau } \right) - {\mathbf{I}}\left( {{\mathbf{\mu }},t} \right)\), Eq. (7) can be simplified to:

$$O\left( {\delta {\mathbf{\mu }}} \right) \approx \left\| {{\mathbf{I}}({\mathbf{\mu }},t + \tau ) + {\mathbf{M}}\delta {\mathbf{\mu }} - {\mathbf{I}}\left( {0,t_0 } \right)} \right\|^2 $$
(8)

Minimizing the right side of Eq. (8) with respect to δμ gives

$$\delta {\mathbf{\mu }} = - \left( {{\mathbf{M}}^t {\mathbf{M}}} \right)^{ - 1} {\mathbf{M}}^t \left[ {{\mathbf{I}}\left( {{\mathbf{\mu }},t + \tau } \right) - {\mathbf{I}}\left( {0,t_0 } \right)} \right]$$
(9)

Eq. (9) is the basic model for target tracking. In this model, the Jacobian matrix must be available in every frame to compute δμ. Because M(μ,t) depends on time-varying quantities, it would appear that it must be recomputed at each time step, which is computationally expensive. Therefore, Eq. (6) is analyzed in another way, so as to separate out the constant part of M(μ,t) and compute it before the tracking process.
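For illustration, a single update of Eq. (9) amounts to an ordinary linear least-squares solve. The following numpy sketch is our own minimal rendering of that step (the argument names are placeholders, not the authors' implementation):

```python
import numpy as np

def delta_mu_update(M, I_current, I_template):
    """One least-squares step of Eq. (9).

    M           : (N, n) Jacobian of target intensities w.r.t. the motion parameters
    I_current   : (N,)   intensities of the warped target at time t + tau
    I_template  : (N,)   intensities of the reference template at time t_0
    Returns the parameter increment delta_mu (shape (n,)).
    """
    residual = I_current - I_template
    # Solving min ||M * d_mu + residual||^2 gives d_mu = -(M^t M)^-1 M^t residual,
    # i.e. exactly Eq. (9); lstsq avoids forming the explicit inverse.
    delta_mu, *_ = np.linalg.lstsq(M, -residual, rcond=None)
    return delta_mu
```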

The chain rule relates \(\nabla_f \mathbf{I}\) to the template gradient, which simplifies M(μ,t), as shown in Eq. (10):

$$\nabla_{\mathbf{X}} \mathbf{I}\left( \mathbf{X}, t_0 \right) = f_{\mathbf{X}}\left( \mathbf{X}; \boldsymbol{\mu} \right)^t \nabla_f \mathbf{I}\left( f\left( \mathbf{X}; \boldsymbol{\mu} \right), t \right)$$
(10)

By substituting Eq. (10) into Eq. (6), we have:

$$\mathbf{M}\left( \boldsymbol{\mu} \right) = \begin{pmatrix} \nabla_{\mathbf{X}} \mathbf{I}\left( \mathbf{X}_1, \boldsymbol{\mu}_0 \right)^t f_{\mathbf{X}}\left( \mathbf{X}_1, \boldsymbol{\mu} \right)^{-1} f_{\boldsymbol{\mu}}\left( \mathbf{X}_1, \boldsymbol{\mu} \right) \\ \nabla_{\mathbf{X}} \mathbf{I}\left( \mathbf{X}_2, \boldsymbol{\mu}_0 \right)^t f_{\mathbf{X}}\left( \mathbf{X}_2, \boldsymbol{\mu} \right)^{-1} f_{\boldsymbol{\mu}}\left( \mathbf{X}_2, \boldsymbol{\mu} \right) \\ \vdots \\ \nabla_{\mathbf{X}} \mathbf{I}\left( \mathbf{X}_N, \boldsymbol{\mu}_0 \right)^t f_{\mathbf{X}}\left( \mathbf{X}_N, \boldsymbol{\mu} \right)^{-1} f_{\boldsymbol{\mu}}\left( \mathbf{X}_N, \boldsymbol{\mu} \right) \end{pmatrix}$$
(11)

where f_X(X_i;μ) is the partial derivative of the motion model with respect to X_i and \(f_{\boldsymbol{\mu}}\left( \mathbf{X}_i; \boldsymbol{\mu} \right)\) is the partial derivative of the motion model with respect to μ. Obviously, \(\nabla_{\mathbf{X}} \mathbf{I}\left( \mathbf{X}_i, \boldsymbol{\mu}_0 \right)^t\) is constant over time, while \(f_{\boldsymbol{\mu}}\left( \mathbf{X}_i; \boldsymbol{\mu} \right)\) is time-varying. However, the product \(f_{\mathbf{X}}\left( \mathbf{X}_i; \boldsymbol{\mu} \right)^{-1} f_{\boldsymbol{\mu}}\left( \mathbf{X}_i; \boldsymbol{\mu} \right)\) is only partially time-dependent, and its constant part is denoted Γ(X_i). Eq. (11) can then be rewritten as:

$$\mathbf{M}\left( \boldsymbol{\mu} \right) = \begin{pmatrix} \nabla_{\mathbf{X}} \mathbf{I}\left( \mathbf{X}_1, \boldsymbol{\mu}_0 \right)^t \Gamma\left( \mathbf{X}_1 \right) \\ \nabla_{\mathbf{X}} \mathbf{I}\left( \mathbf{X}_2, \boldsymbol{\mu}_0 \right)^t \Gamma\left( \mathbf{X}_2 \right) \\ \vdots \\ \nabla_{\mathbf{X}} \mathbf{I}\left( \mathbf{X}_N, \boldsymbol{\mu}_0 \right)^t \Gamma\left( \mathbf{X}_N \right) \end{pmatrix} \Sigma\left( \boldsymbol{\mu} \right) = \mathbf{M}_0\, \Sigma\left( \boldsymbol{\mu} \right)$$
(12)

According to Eq. (12), M_0 is the constant part, which encodes the prior information of the tracking target and describes each pixel's change in gray value due to motion. This part can be computed in the initialization phase, while \(\Sigma\left( \boldsymbol{\mu} \right)\) depends on the motion vector μ and must be recomputed in every frame. Substituting this decomposition of M(μ,t) into Eq. (9) gives

$$\delta \boldsymbol{\mu} = - \Sigma\left( \boldsymbol{\mu} \right)^{-1} \left( \mathbf{M}_0^t \mathbf{M}_0 \right)^{-1} \mathbf{M}_0^t \left[ \mathbf{I}\left( f\left( \mathbf{X}, \boldsymbol{\mu} \right), t_n \right) - \mathbf{I}\left( \mathbf{X}, t_0 \right) \right]$$
(13)

Based on the above analysis, the tracking algorithm can be divided into two parts: one completed in the initialization step and the other executed in real time during tracking. The flow of the tracking algorithm is shown in Fig. 1. The motion model used for tracking is described in section 3.3.

Fig. 1
figure 1

The work flow of template tracking algorithm
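To make the two-part structure of Fig. 1 concrete, the sketch below outlines one possible offline/online split under our own naming; `warp_points`, `sample_intensities`, and `sigma_of_mu` are hypothetical helpers (a concrete Σ(μ) for the projective model appears in section 3.3), and this is not the authors' code:

```python
import numpy as np

def precompute_M0(template_grads, gammas):
    """Offline part: row i of M0 is grad_I(X_i, t0)^t * Gamma(X_i)   (Eq. (12))."""
    # template_grads: (N, k)    spatial gradients at the sampled template points
    # gammas:         (N, k, m) constant factors Gamma(X_i)
    return np.einsum('nk,nkm->nm', template_grads, gammas)

def track_frame(mu, M0, sigma_of_mu, frame, points, template_vals,
                warp_points, sample_intensities, n_iters=5):
    """Online part: iterate the incremental least-squares update (cf. Eq. (24)) on one frame."""
    for _ in range(n_iters):
        Sigma = sigma_of_mu(mu)                      # time-varying factor of M(mu, t)
        M = M0 @ Sigma                               # full Jacobian, Eq. (12)
        warped = warp_points(points, mu)             # f(X_i; mu) for every sampled point
        current = sample_intensities(frame, warped)  # I(f(X, mu), t_n)
        residual = current - template_vals           # difference to the reference template
        delta_mu, *_ = np.linalg.lstsq(M, -residual, rcond=None)
        mu = mu + delta_mu                           # mu(t + tau) = mu(t) + delta_mu
    return mu
```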

3.2 Illumination insensitive tracking

The incremental estimation step effectively computes a structured optical flow, and optical flow methods are well known to be sensitive to illumination changes. In real environments, brightness and contrast changes are unavoidable and cannot always be controlled. It follows that modeling lighting changes is necessary for a visual tracker to operate in general situations. As described in [10, 15], we estimate the photometric parameters in each frame to ensure tolerance to illumination changes.

Consider a light source L in 3D space and suppose we are observing a smooth surface S, as shown in Fig. 2. The intensity value of a point P on the image plane depends on the portion of incoming light from the source L that is reflected by the surface S, and is described by the bidirectional reflectance distribution function (BRDF). Assuming that the surface S is Lambertian and planar within the region U, the BRDF simplifies considerably and the intensity observed at the point P(X) can be modeled as:

$${\mathbf{I}}\left( {\mathbf{x}} \right) = \lambda E\left( {\mathbf{X}} \right) + \delta ,\quad \forall {\mathbf{x}} \in W\left( U \right)$$
(14)
Fig. 2
figure 2

Image formation process when illumination is taken into account: photometric parameters via Lambertian assumption

where E(X) is the albedo function of the surface S, W(U) is the image region associated with the surface region U, and λ and δ can be thought of as parameters that represent the contrast and brightness changes of the image, respectively. Therefore, we can incorporate the illumination changes into our model and rewrite Eq. (13) as:

$$\delta \boldsymbol{\mu} = - \Sigma\left( \boldsymbol{\mu} \right)^{-1} \left( \mathbf{M}_0^t \mathbf{M}_0 \right)^{-1} \mathbf{M}_0^t \left[ \lambda(t_n)\, \mathbf{I}\left( f\left( \mathbf{X}, \boldsymbol{\mu} \right), t_n \right) - \mathbf{I}\left( \mathbf{X}, t_0 \right) + \delta(t_n) \right]$$
(15)
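One simple way to obtain λ(t_n) and δ(t_n) before applying Eq. (15) is a per-frame linear fit between the warped target intensities and the template. The sketch below is an assumption of ours and not necessarily the estimator used in [10, 15]:

```python
import numpy as np

def estimate_illumination(I_current, I_template):
    """Fit I_template ~ lam * I_current + delta in the least-squares sense.

    The fitted contrast/brightness pair (lam, delta) photometrically aligns the
    current observation with the template before the residual of Eq. (15) is formed.
    """
    A = np.column_stack([I_current, np.ones_like(I_current, dtype=float)])
    (lam, delta), *_ = np.linalg.lstsq(A, I_template, rcond=None)
    return lam, delta
```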

3.3 Projective motion model of region tracking

The purpose of the proposed algorithm is to obtain the registration information of the environment, and thus camera motion recovery is necessary. The motion model used during tracking is defined as a projective transformation [6].

Let X = (u,v)^t be the Cartesian coordinates and X_h = (r,s,t)^t be the corresponding projective (homogeneous) coordinates, as shown in Fig. 3.

Fig. 3
figure 3

Projective model of target in camera coordinate

The relationship between them is:

$$\mathbf{X}_h = \begin{pmatrix} r \\ s \\ t \end{pmatrix} \;\rightarrow\; \mathbf{X} = \begin{pmatrix} r/t \\ s/t \end{pmatrix} = \begin{pmatrix} u \\ v \end{pmatrix}, \quad \forall\, \mathbf{X} \in \mathbf{I}\left( \mathbf{X} \right)$$
(16)

Because we assume the tracking target is planar, the motion model for tracking can be defined as a projective transformation [6], as shown in Eq. (17):

$$f\left( \mathbf{X}; \boldsymbol{\mu} \right) = \mathbf{P}\, \mathbf{X}_h = \begin{pmatrix} a & d & g \\ b & e & h \\ c & f & 1 \end{pmatrix} \begin{pmatrix} r \\ s \\ t \end{pmatrix}$$
(17)

In this case, the motion parameter vector can be defined as \(\boldsymbol{\mu} = \left( a, b, c, d, e, f, g, h \right)^t\). After substituting it into the Jacobian matrix, we have:

$$\nabla_{\mathbf{X}} \mathbf{I}\left( \mathbf{X}, \boldsymbol{\mu} \right)^t = \left( \frac{\partial \mathbf{I}}{\partial u},\; \frac{\partial \mathbf{I}}{\partial v},\; -\left( u\frac{\partial \mathbf{I}}{\partial u} + v\frac{\partial \mathbf{I}}{\partial v} \right) \right)$$
(18)
$$f_{\mathbf{{\rm X}}} \left( {{\mathbf{{\rm X}}},{\mathbf{\mu }}} \right)^{ - 1} = {\mathbf{P}}^{ - 1} $$
(19)
$$f_{\boldsymbol{\mu}}\left( \mathbf{X}, \boldsymbol{\mu} \right) = \begin{pmatrix} r & 0 & 0 & s & 0 & 0 & t & 0 \\ 0 & r & 0 & 0 & s & 0 & 0 & t \\ 0 & 0 & r & 0 & 0 & s & 0 & 0 \end{pmatrix}$$
(20)

Combining Eq. (19) with Eq. (20), the result is:

$$f_{\mathbf{X}}\left( \mathbf{X}, \boldsymbol{\mu} \right)^{-1} f_{\boldsymbol{\mu}}\left( \mathbf{X}, \boldsymbol{\mu} \right) = \left( r\mathbf{P}^{-1} \,\middle|\, s\mathbf{P}^{-1} \,\middle|\, t\mathbf{P}_{12}^{-1} \right) = \Gamma\left( \mathbf{X} \right) \Sigma\left( \boldsymbol{\mu} \right)$$
(21)

where \(\mathbf{P}_{12}^{-1}\) denotes the first two columns of P^{-1} and Γ(X) is constant over time. Thus, we have:

$$\Gamma \left( {\mathbf{{\rm X}}} \right) = \left( {r{\mathbf{I}}_{3 \times 3} \left| {s{\mathbf{I}}_{3 \times 3} } \right|t{\mathbf{I}}_{3 \times 3} } \right)$$
(22)
$$\Sigma\left( \boldsymbol{\mu} \right) = \begin{pmatrix} \mathbf{P}^{-1} & 0 & 0 \\ 0 & \mathbf{P}^{-1} & 0 \\ 0 & 0 & \mathbf{P}_{12}^{-1} \end{pmatrix}$$
(23)

\(\Sigma\left( \boldsymbol{\mu} \right)\) is a 9 × 8 matrix of full column rank, and by combining it with Eq. (15), the target tracking model based on the projective transformation can be written as:

$$\delta \boldsymbol{\mu} = - \left( \Sigma^t \mathbf{M}_0^t \mathbf{M}_0 \Sigma \right)^{-1} \Sigma^t \mathbf{M}_0^t \left[ \lambda(t_n)\, \mathbf{I}\left( f\left( \mathbf{X}, \boldsymbol{\mu} \right), t_n \right) - \mathbf{I}\left( \mathbf{X}, t_0 \right) + \delta(t_n) \right]$$
(24)
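For the projective model, the constant factor Γ(X) of Eq. (22) and the time-varying factor Σ(μ) of Eq. (23) can be assembled directly from the formulas; the helper names below are ours, and the sketch is illustrative rather than the authors' implementation:

```python
import numpy as np

def gamma_of_point(Xh):
    """Gamma(X) = (r*I3 | s*I3 | t*I3) for a homogeneous point Xh = (r, s, t)  (Eq. (22))."""
    r, s, t = Xh
    I3 = np.eye(3)
    return np.hstack([r * I3, s * I3, t * I3])        # shape (3, 9)

def sigma_of_mu(mu):
    """Sigma(mu) = blockdiag(P^-1, P^-1, P12^-1) for mu = (a, ..., h)  (Eq. (23))."""
    a, b, c, d, e, f, g, h = mu
    P = np.array([[a, d, g],
                  [b, e, h],
                  [c, f, 1.0]])
    P_inv = np.linalg.inv(P)
    Sigma = np.zeros((9, 8))
    Sigma[0:3, 0:3] = P_inv
    Sigma[3:6, 3:6] = P_inv
    Sigma[6:9, 6:8] = P_inv[:, :2]                    # P12^-1: first two columns of P^-1
    return Sigma
```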

3.4 3D environment registration

Correct tracking of the target motion yields accurate positions for each of the sparse points that compose the target, so the correspondences between frames can be computed. Thus we have:

$$\mathbf{X}_{1..N}(t) = \mathbf{H}_0^n\, \mathbf{X}_{1..N}(t_0)$$
(25)

where \(\mathbf{H}_0^n\) denotes the homography relating the sparse point set at time t_n to that at time t_0. Therefore, the target's position and orientation in camera coordinates can be estimated from the homography. The relationship between the camera model and the tracking target is shown in Fig. 4. To simplify the projection equations, the world coordinate frame is defined as the tracking target's coordinate frame.

Fig. 4
figure 4

The relationship between target and camera in tracking process

Let (x_P, y_P) be the target's true coordinates in the world frame, (x_0, y_0) its coordinates at time t_0 in the camera's projective plane, and (x_n, y_n) its coordinates at time t_n in the camera's projective plane, as shown in Fig. 3. The relationships among them can be written as:

$$\begin{pmatrix} x_0 \\ y_0 \\ 1 \end{pmatrix} = \mathbf{P}_w^0 \begin{pmatrix} x_P \\ y_P \\ 1 \end{pmatrix}$$
(26)
$$\begin{pmatrix} x_n \\ y_n \\ 1 \end{pmatrix} = \mathbf{P}_w^n \begin{pmatrix} x_P \\ y_P \\ 1 \end{pmatrix}$$
(27)
$$\begin{pmatrix} x_n \\ y_n \\ 1 \end{pmatrix} = \mathbf{H}_0^n \begin{pmatrix} x_0 \\ y_0 \\ 1 \end{pmatrix}$$
(28)

The projection matrix \(\mathbf{P}_w^0\) at time t_0 can be computed in the initialization phase. In the literature, a projection matrix can be computed from four known point correspondences [8], and more points can be used for higher accuracy. The vision projection equation [6] is as follows,

$$\begin{pmatrix} x_n \\ y_n \\ 1 \end{pmatrix} = \lambda K \left[ R \,|\, t \right] \begin{pmatrix} x_P \\ y_P \\ z_P \\ 1 \end{pmatrix}$$
(29)

where λ is a scale factor, K is the intrinsic parameter matrix of the camera, and \(\left[ R \,|\, t \right]\) denotes the rotation and translation relative to the real scene, which compose the extrinsic parameters of the camera. \(\left[ R \,|\, t \right]\) is also the pose of the environment in camera coordinates, because the world coordinate frame has been defined with respect to the tracking target. Let the target plane be the Z = 0 plane of the world frame, so that z_P = 0; then by substituting Eqs. (26), (27), and (28) into Eq. (29) we have:

$$\mathbf{H}_0^n = \mathbf{P}_w^n \left( \mathbf{P}_w^0 \right)^{-1} = \lambda K \left[ R \,|\, t \right] \left( \mathbf{P}_w^0 \right)^{-1}$$
(30)

From Eq. (25), \(\mathbf{H}_0^n\) can be computed from the result of the tracking process. Thus, the camera pose \(\left[ R \,|\, t \right]\) can be fully recovered by solving Eq. (30).
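Given the full plane-to-image homography \(\mathbf{P}_w^n = \mathbf{H}_0^n \mathbf{P}_w^0\) of Eq. (30) and the intrinsics K, one common way to recover [R|t] is the standard planar-homography decomposition sketched below; this is a generic recipe under the z_P = 0 assumption, not necessarily the authors' exact procedure:

```python
import numpy as np

def pose_from_plane_homography(H_plane, K):
    """Recover [R|t] from a homography mapping world plane points (z_P = 0) to the image."""
    A = np.linalg.inv(K) @ H_plane
    lam = 1.0 / np.linalg.norm(A[:, 0])   # scale making the first rotation column unit length
    r1 = lam * A[:, 0]
    r2 = lam * A[:, 1]
    t = lam * A[:, 2]
    r3 = np.cross(r1, r2)
    R = np.column_stack([r1, r2, r3])
    # Project onto the nearest rotation matrix (via SVD) to compensate for noise in H.
    U, _, Vt = np.linalg.svd(R)
    R = U @ Vt
    return R, t
```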

4 Architecture of marker-less AR system

In the literature, an AR system is composed of the following key components: a real-scene capture module, registration, rendering, merging, and stereo vision. Based on the registration algorithm presented in section 3, a marker-less AR system can be designed; its structure is shown in Fig. 5.

Fig. 5
figure 5

Architecture of AR system

A calibrated, high-capture-rate camera is used for real-scene capture and is mounted on the user's head. The real-scene video is transferred to a portable computer, where the registration algorithm is executed. Rendering and merging based on the registration are then performed, and finally the augmented image sequences are transferred to the head-mounted display (HMD).

The whole working flow of the proposed marker-less AR system is as follows (a schematic sketch of the loop is given after the list):

  1. Initialization
     (a) Establish the computer-generated 3D models.
     (b) Calibrate the camera and obtain the intrinsic parameters K.
     (c) Manually select a region as the tracking target and store it as the reference template.
     (d) Compute the initial parameters.
  2. Registration, rendering, and merging
     (a) Track the target with the camera and estimate its pose and position.
     (b) Render the 3D model into a virtual image with respect to the target's pose.
     (c) Merge the virtual image with the real-time image sequence.
  3. Transfer the augmented scene to the HMD.
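Schematically, the working flow above corresponds to a loop of the following shape. All objects (camera, renderer, hmd, tracker) are hypothetical interfaces standing in for whatever capture, graphics, and display back ends are used; they are not part of the paper:

```python
def ar_main_loop(camera, renderer, hmd, K, tracker):
    """Skeleton of the working flow listed above; every interface here is a placeholder."""
    # Step 1: initialization (3D models, calibration K, template selection, initial parameters).
    tracker.initialize(camera.grab(), K)
    # Step 2: registration, rendering and merging for every incoming frame.
    while camera.is_running():
        frame = camera.grab()
        mu = tracker.track(frame)                   # fixed-region tracking (Secs. 3.1-3.3)
        R, t = tracker.estimate_pose(mu)            # 3D registration via Eq. (30) (Sec. 3.4)
        virtual = renderer.render(R, t)             # render the 3D model with the estimated pose
        augmented = renderer.merge(frame, virtual)  # overlay the virtual image on the real one
        hmd.display(augmented)                      # Step 3: transfer the augmented scene to the HMD
```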

5 Experiment

The tracking-based registration algorithm has been implemented to run in real time on a common workstation (Pentium 4, 3.0 GHz CPU). The experiments are discussed in two phases: (1) planar region tracking in the scene, where tracking accuracy is evaluated with the pixel residues between the tracked region and the template; (2) 3D registration and augmented reality via fixed region tracking.

5.1 Tracking evaluation

In phase one, a plane with complex texture moves and a region of interest is randomly defined as the target region. We test a 400-frame sequence in which the tracked region returns to its original position at the end. In the tracking sequence, we use an "intruding" object (a pen) and adjust the environmental illumination conditions to test the robustness of the algorithm. The errors between the rectified target and the template are recorded. During the experiment, 120 sparse points are selected to represent the target region. The results are shown as follows: the template shown in Fig. 6(a) is defined during the initialization phase and stored, while Fig. 6(b)-(e) illustrate the registration process. The illumination change is shown in Fig. 6(c).

Fig. 6
figure 6

Tracking a fixed region in real time. (a) Template of the target. (b) Camera moving at the 70th frame. (c) Camera moving at the 120th frame with illumination change. (d) Camera moving at the 180th frame. (e) Camera moving at the 380th frame

To quantitatively illustrate the accuracy of the algorithm, the tracked region is rectified with the motion parameters and compared with the template to obtain the tracking residue. Following [10], we define the residue as

$$\widehat{\operatorname{Re}} = \frac{1}{Z}\sum_{\mathbf{X}} \left\| \mathbf{I}_{\mathbf{X}}\left( \boldsymbol{\mu}(t) + \delta \boldsymbol{\mu},\, t + \tau \right) - \mathbf{I}_{\mathbf{X}}\left( 0, t_0 \right) \right\|^2$$
(31)

where Z is a normalization term and X denotes a tracked point. We test three indoor video sequences (400 frames each), which are similar to the scenes in Fig. 6. The plot in Fig. 7 (red curve) shows that the error between the registered patch and our target remains well controlled and the region is matched exactly. We also test the Shi–Tomasi tracker [10], shown in Fig. 7 (blue curve), which fails to track the region after approximately 90 frames.
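With Z taken as the number of tracked points (one plausible choice of normalization), the residue of Eq. (31) reduces to a mean squared intensity difference over the sampled points; a minimal sketch is:

```python
import numpy as np

def tracking_residue(I_rectified, I_template):
    """Normalized residue of Eq. (31) between the rectified region and the template."""
    diff = np.asarray(I_rectified, dtype=float) - np.asarray(I_template, dtype=float)
    return float(np.sum(diff ** 2) / diff.size)
```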

Fig. 7
figure 7

Curves of residue errors between tracked regions and the template. Red curve: our algorithm handles the illumination changes and the "intruding" object well, and the region is matched exactly. Blue curve: the Shi–Tomasi tracker fails to track the region at around frame 90

5.2 3D Registration evaluation

In phase two, the environment registration is performed on top of the tracking step, which allows us to merge the virtual object with the real scene.

Two representative experimental images are shown in Fig. 8; the image resolution is 320 × 240. Figure 8(a) and (b) show the augmented scene observed through the HMD. Intuitively, the augmentation looks natural and seamless owing to the high registration accuracy. Our registration method can also be used for outdoor AR. Experimental results from our other project, a mobile augmented reality (MAR) system for historic-site navigation, are shown in Fig. 9. The details of the MAR technology will be discussed in future work.

Fig. 8
figure 8

(a), (b) Two examples of the augmented scenes

Fig. 9
figure 9

Mobile augmented reality based on the proposed registration method

To further demonstrate the advantages of the system in registration accuracy and efficiency, we compare it with two state-of-the-art AR registration methods [16, 18]. To evaluate the accuracy of 3D registration, we utilize the VICON system (http://www.vicon.com/support), a well-known pose and motion capture platform. We treat the output of the VICON system as the ground-truth benchmark, and the registration accuracy is then normalized via the average errors over the six degrees of freedom, following the definition in [11, 17].

We first test the registration performance under random illumination changes and compare with the color-marker-based method. Note that it is difficult to measure the illumination condition precisely, so we manually adjust the lighting, including the spot light and the global light. As shown in Fig. 10(a), the red and green curves denote the 3D registration accuracy of our method and of the color-marker-based method [16], respectively. Along the horizontal axis, the illumination change becomes gradually more drastic. The dashed nodes on the curves mark the critical value for visual perception, that is, below this registration accuracy the virtual objects can no longer be matched well with the real objects (scene). The next experiment shows the registration performance as random occlusion increases. As shown in Fig. 10(b), the red curve denotes our result and the blue curve denotes the result of the FFT-based template matching method [18]. Both curves show that the accuracy decreases as the random occlusion increases, and the dashed nodes on the curves again mark the critical point of human visual perception.

Fig. 10
figure 10

Curves of 3D registration accuracy with comparison. (a) Performance under random illumination changes; (b) performance under increasing random occlusion

In practice, to enhance the robustness of the 3D registration, multiple fixed regions can be tracked simultaneously. Since we randomly sample points from the fixed regions as tracking correspondences, tracking multiple regions is handled in the same way as tracking a single region. Hence, the system efficiency should be evaluated in terms of the number of sampled points used for tracking. The computation time curve is illustrated in Fig. 11; it shows that the time cost per iterative step increases with the number of points. In the proposed AR system, we empirically fix around 150 sampled points from three tracked regions.

Fig. 11
figure 11

Computation time curve for each computing step

6 Summary

In this paper, a novel 3D registration approach based on planar region tracking has been proposed. The algorithm combines fixed-region tracking and perspective motion estimation. A texture-rich region is first selected manually and stored as the reference template during initialization. While the camera moves freely, the region is tracked in real time, and its position and pose are estimated using the camera's intrinsic parameters. Based on this registration method, an AR system is proposed.

However, one major limitation of this algorithm is that the template needs to be defined manually. This could be improved by adopting planar region localization algorithms. In addition, the template need not be defined only by the intensities of the sampled point set; discriminative features such as object shape or patch gradient histograms could also be used.