1 Introduction

Machine vision is increasingly applied in many fields, such as robot vision navigation and positioning and 3-D reconstruction, and camera calibration is the most important part of machine vision. Traditional camera calibration methods are generally based on precision-machined 2-D or 3-D targets. The direct linear transformation (DLT) calibration method [1] establishes the relationship between the two-dimensional image plane and object coordinates in three-dimensional space. It is based on the ideal pinhole model; it involves relatively few parameters and its calculation is simple and fast, but it does not account for the nonlinear distortion introduced during imaging, so its precision is poor. Tsai proposed a two-step method based on the radial alignment constraint [2], combining the traditional linear method with nonlinear optimization, but the high-precision three-dimensional calibration block it requires is difficult to machine and maintain in practice. In [3], the Kruppa equation is established from epipolar and curvilinear geometry and applied to camera self-calibration; the calibration is fast and suitable for on-site use, but its precision is low and its stability poor. Zhang proposed a flexible calibration method [4] that solves the camera parameter model from images of a planar target shot from different locations. This method is widely used, but it still requires a complex mathematical model.

Considering that the mathematical models of traditional camera calibration methods are complex to construct, complicated to operate and limited in precision, this paper proposes a camera calibration method based on an improved RBF neural network, which offers high precision and strong real-time performance and can be applied in more complex environments.

2 Camera Imaging Model

The projection of a three-dimensional spatial point D onto the camera imaging plane is usually described by the ideal pinhole model, which can be written as

$$\begin{aligned} s\left[ \begin{array}{c} u\\ v\\ 1 \end{array} \right] = \left[ \begin{array}{cccc} f_u & 0 & u_0 & 0\\ 0 & f_v & v_0 & 0\\ 0 & 0 & 1 & 0 \end{array} \right] \left[ \begin{array}{cc} R & T\\ 0^T & 1 \end{array} \right] \left[ \begin{array}{c} X\\ Y\\ Z\\ 1 \end{array} \right] = A\left[ \begin{array}{cc} R & T\\ 0^T & 1 \end{array} \right] \left[ \begin{array}{c} X\\ Y\\ Z\\ 1 \end{array} \right] \end{aligned}$$
(1)

where \(X = {(X_w^T,1)^T}\) is the homogeneous coordinate of the three-dimensional spatial point D in the world coordinate system, \((u,v,1)\) is the homogeneous coordinate of its projection on the image plane, R and T are the rotation matrix and translation vector that transform the world coordinate system into the camera coordinate system, \(({u_0},{v_0})\) is the optical center of the camera, and \({f_u}\) and \({f_v}\) are the scale factors along the u and v axes. Owing to the optical distortion of the camera lens, as shown in Fig. 1, the actual imaging point is not the intersection of the image plane with the line joining the three-dimensional point and the optical center, but is offset from it by \((\varDelta u,\varDelta v)\). Camera distortion mainly comprises radial distortion and tangential distortion.
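To make the model concrete, the following is a minimal numpy sketch of Eq. (1); the intrinsic and extrinsic values are illustrative assumptions, not parameters from this paper.

```python
import numpy as np

# Illustrative intrinsics/extrinsics (assumed values, not from this paper).
f_u, f_v = 800.0, 800.0            # scale factors along u and v (pixels)
u0, v0 = 640.0, 512.0              # optical center (pixels)
A = np.array([[f_u, 0.0,  u0, 0.0],
              [0.0, f_v,  v0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])   # 3x4 intrinsic matrix of Eq. (1)

R = np.eye(3)                          # world -> camera rotation (assumed)
T = np.array([0.0, 0.0, 1000.0])       # translation (assumed, in mm)
Rt = np.vstack([np.hstack([R, T[:, None]]),
                [0.0, 0.0, 0.0, 1.0]]) # 4x4 block matrix [R T; 0^T 1]

X = np.array([100.0, 50.0, 200.0, 1.0])  # homogeneous world point (X_w^T, 1)^T
s_uv = A @ Rt @ X                        # s * (u, v, 1)^T, Eq. (1)
u, v = s_uv[:2] / s_uv[2]                # ideal (distortion-free) pixel coords
```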

Fig. 1. Camera imaging model of monocular vision

Fig. 2. Camera imaging model of binocular vision

The imaging principle of binocular stereo vision is shown in Fig. 2. The binocular camera imaging equations are expressed as

$$\begin{aligned} {s_l}\left[ \begin{array}{c} u_l\\ v_l\\ 1 \end{array} \right] = {A_l}\left[ \begin{array}{cc} R_l & T_l \end{array} \right] \left[ \begin{array}{c} X_w\\ Y_w\\ Z_w\\ 1 \end{array} \right] , \quad {s_r}\left[ \begin{array}{c} u_r\\ v_r\\ 1 \end{array} \right] = {A_r}\left[ \begin{array}{cc} R_r & T_r \end{array} \right] \left[ \begin{array}{c} X_w\\ Y_w\\ Z_w\\ 1 \end{array} \right] \end{aligned}$$
(2)

where s is the scale factor and the subscripts l and r denote the left and right cameras, respectively. For binocular camera calibration, in addition to the internal parameters of each camera, the rotation matrix and translation vector between the two cameras must also be calibrated.

As mentioned above, traditional calibration methods based on a mathematical camera imaging model involve a large number of unknown parameters, must account for many factors and are complex to construct, whereas calibration methods based on neural networks are able to overcome these problems [5, 6].

Fig. 3. The structure of RBF neural network

3 RBF Neural Network Based on K-means and Gradient Method

3.1 The Structure of RBF Neural Network

An RBF neural network generally has a three-layer structure. As shown in Fig. 3, the structure of the RBF neural network is \(n-h-m\); that is, the network has n inputs, h hidden nodes and m outputs. \(x = {({x_1},{x_2}, \cdots ,{x_n})^T} \in {R^n}\) is the network input vector, \(\omega \in {R^{h \times m}}\) is the output weight matrix and \(y = {\left[ {{y_1},{y_2}, \cdots ,{y_m}} \right] ^T}\) is the network output vector. \({\phi _i}( \bullet )\) is the activation function of the i-th hidden node. In this paper, the Euclidean distance, denoted \(\left\| \bullet \right\| \), is used as the basis function argument and the Gauss function is used as the activation function, which is as follows

$$\begin{aligned} \phi (u) = {\mathrm{{e}}^{ - \frac{{{u^2}}}{{{\delta ^2}}}}} \end{aligned}$$
(3)

where the spread constant of the RBF satisfies \(\delta > 0\). The smaller \(\delta \) is, the smaller the width of the RBF is and the more selective it becomes. In Fig. 3, the j-th output of the RBF neural network can be given as

$$\begin{aligned} {y_j} = \sum \limits _{i = 1}^h {{\omega _{ij}}{\phi _i}} (\left\| {x - {c_i}} \right\| ) , \quad 1 \le j \le m \end{aligned}$$
(4)
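As a sketch, the forward pass defined by Eqs. (3) and (4) can be written directly in numpy (the function and variable names are ours):

```python
import numpy as np

def rbf_forward(x, centers, deltas, W):
    """Forward pass of an n-h-m RBF network, Eqs. (3)-(4).

    x: (n,) input, centers: (h, n) data centers c_i,
    deltas: (h,) spread constants, W: (h, m) output weight matrix omega.
    """
    u = np.linalg.norm(x - centers, axis=1)    # ||x - c_i|| for each hidden node
    phi = np.exp(-(u ** 2) / (deltas ** 2))    # Gauss activation, Eq. (3)
    return phi @ W                             # y_j = sum_i omega_ij phi_i, Eq. (4)
```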

3.2 The Learning Algorithm of RBF Neural Network

The clustering method is the most classic RBF neural network learning algorithm [7, 8]. The idea is to use unsupervised learning to obtain the data centers of the hidden nodes of the RBF neural network and to determine the spread constants from the distances between the data centers, and then to use supervised learning to train the output weight values of the hidden nodes.

In this paper, we propose a novel method for selecting the clustering number, based on the error law of k-means clustering, to obtain the number of hidden nodes and the data centers. During training, we construct an objective function based on the multi-output error and compute its gradient to update the data centers, spread constants and weight values dynamically.

The k-means algorithm is based on the error sum of squares [9]. If \({N_i}\) is the sample size of the i-th class \({\varGamma _i}\) and \({m_i}\) is the mean of the samples in \({\varGamma _i}\), i.e. \({m_i} = \frac{1}{{{N_i}}}\sum \limits _{y \in {\varGamma _i}} y \), then the error sum of squares of each class \({\varGamma _i}\) is computed and summed over all classes: \({J_e} = \sum \limits _{i = 1}^c {\sum \limits _{y \in {\varGamma _i}} {{{\left\| {y - {m_i}} \right\| }^2}} } \). \({J_e}\) is called the clustering criterion based on the error sum of squares. The basic premise of the k-means algorithm is that the clustering number is given in advance, which is not satisfied in unsupervised learning problems such as camera calibration.

For the initial partition after a given clustering number, good results can be obtained by partitioning according to the natural order of the samples (the order of the feature points on the calibration board).

Obviously, the clustering error function decreases monotonically as the clustering number increases. When the clustering number equals the number of samples, each sample becomes a class of its own and \({J_e}(c) = 0\). If the data contain \({c^ * }\) compact clusters, \({J_e}(c)\) decreases rapidly as c increases from 1 to \({c^ * }\), but the rate of decrease slows significantly once c exceeds \({c^ * }\), because samples that are already densely clustered are then split apart [10]. Figure 4 shows the curve of \({J_e}(c)\) as a function of c. Based on this law, in order to obtain a good clustering number, we can start from 1 and increase c until a preferable number is selected. Formula (5) is given below as the criterion of judgment

$$\begin{aligned} \left| {{J_e}({c^*}) - {J_e}({c^*} - 1)} \right| \le \alpha \bullet {J_e}({c^*} - 1) , \quad {c^*} = 2, \cdots ,c \end{aligned}$$
(5)

where \(\alpha \) is the convergence factor, typically in the range \(\alpha = 0.2\)–\(0.3\). When formula (5) is satisfied, the constructed neural network achieves the required fitting accuracy without so many hidden nodes that overfitting occurs and generalization ability suffers.
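A possible implementation of this selection rule is sketched below. It uses scikit-learn's k-means (with k-means++ initialization rather than the natural-order initial partition described above, so results may differ slightly), and the function name and default values are our assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_cluster_number(samples, alpha=0.25, c_max=50):
    """Choose the clustering number c* by the criterion of Eq. (5).

    samples: (N, n) training inputs; alpha: convergence factor in [0.2, 0.3].
    """
    J_prev = None
    for c in range(1, c_max + 1):
        km = KMeans(n_clusters=c, n_init=10, random_state=0).fit(samples)
        J = km.inertia_                        # J_e(c), the error sum of squares
        if J_prev is not None and abs(J - J_prev) <= alpha * J_prev:
            return c, km.cluster_centers_      # Eq. (5) satisfied at c* = c
        J_prev = J
    return c_max, km.cluster_centers_          # fallback if never satisfied
```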

Fig. 4. The law of clustering error

By using the k-means clustering algorithm that selects the number of clusters automatically, the hidden nodes and data centers of the RBF neural network can be determined. The spread constants are then given by \(\delta = \frac{d}{{\sqrt{2h} }}\), where d is the maximum distance between the data centers and h is the clustering number of the RBF neural network.
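A short sketch of this initialization, reading d as the maximum pairwise distance between the data centers:

```python
import numpy as np

def initial_spreads(centers):
    """delta = d / sqrt(2h), with d taken as the maximum distance between the
    data centers (our interpretation of "maximum distance of all classes")."""
    h = len(centers)
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1).max()
    return np.full(h, d / np.sqrt(2 * h))
```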

Once the hidden nodes, data centers and spread constants have been initially determined, the gradient method is adopted to train the output weight values of the hidden nodes [11]. The gradient method used in this paper updates not only the output weight values but also the data centers and spread constants dynamically on the basis of the k-means result, to ensure global optimization of the parameters.

We construct an objective function based on the errors of multiple output nodes. The objective function is defined as

$$\begin{aligned} \varepsilon = \frac{1}{2}\sum \limits _{n = 1}^N {\left\| {{e_n}} \right\| _2^2} \end{aligned}$$
(6)

where N is the number of training samples in the learning process and \({e_n}\) is the error signal of the n-th sample over the J output nodes, a \(J \times 1\) vector.

The j-th component of the error vector (that is, the error of the j-th output node) is defined as

$$\begin{aligned} {}_j{e_n} = {}_j{d_n} - {}_j\left[ {\sum \limits _{i = 1}^I {{\omega _{ij}}{\phi _i}(\left\| {{X_n} - {c_i}} \right\| )} } \right] \end{aligned}$$
(7)

where I is the number of hidden nodes, d represents the expected output and the left subscript denotes the component of a vector.

We aim at finding the parameters \({\omega _{ij}}\), \({c_i}\) and \({\delta _i}\) that minimize \(\varepsilon \). For the Gauss activation function used in this paper, the gradient of the objective function to each parameter is given as

$$\begin{aligned} \left\{ \begin{array}{l} \frac{\partial \varepsilon (m)}{\partial \omega _{ij}(m)} = - \sum \limits _{n = 1}^N {}_j e_n(m)\, \phi _i(\left\| X_n - c_i(m) \right\| ) \\ \frac{\partial \varepsilon (m)}{\partial c_i(m)} = - \sum \limits _{j = 1}^J \sum \limits _{n = 1}^N \frac{2\omega _{ij}(m)}{\delta _i^2(m)}\, {}_j e_n(m)\, \phi _i(\left\| X_n - c_i(m) \right\| )\left( X_n - c_i(m) \right) \\ \frac{\partial \varepsilon (m)}{\partial \delta _i(m)} = - \sum \limits _{j = 1}^J \sum \limits _{n = 1}^N \frac{2\omega _{ij}(m)}{\delta _i^3(m)}\, {}_j e_n(m)\, \phi _i(\left\| X_n - c_i(m) \right\| ) \left\| X_n - c_i(m) \right\| ^2 \end{array} \right. \end{aligned}$$
(8)

The parameters are then updated by gradient descent as

$$\begin{aligned} \left\{ \begin{array}{l} \omega _{ij}(m + 1) = \omega _{ij}(m) - \eta _1 \frac{\partial \varepsilon (m)}{\partial \omega _{ij}(m)}\\ c_i(m + 1) = c_i(m) - \eta _2 \frac{\partial \varepsilon (m)}{\partial c_i(m)}\\ \delta _i(m + 1) = \delta _i(m) - \eta _3 \frac{\partial \varepsilon (m)}{\partial \delta _i(m)} \end{array} \right. \end{aligned}$$
(9)

where m represents the number of iterations. \({\eta _1}\), \({\eta _2}\) and \({\eta _3}\) are different convergence factors.
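One gradient iteration of Eqs. (8)–(9) might be sketched as follows; the vectorized form and the learning-rate defaults are our assumptions:

```python
import numpy as np

def gradient_step(X, D, centers, deltas, W, eta=(1e-6, 1e-6, 1e-6)):
    """One iteration of Eqs. (8)-(9) over all N training samples.

    X: (N, n) inputs, D: (N, m) desired outputs, centers: (h, n),
    deltas: (h,), W: (h, m). The learning rates eta are assumed values.
    """
    diff = X[:, None, :] - centers[None, :, :]     # X_n - c_i, shape (N, h, n)
    u = np.linalg.norm(diff, axis=-1)              # ||X_n - c_i||, shape (N, h)
    phi = np.exp(-(u ** 2) / (deltas ** 2))        # Gauss activations, (N, h)
    E = D - phi @ W                                # error signals e_n, (N, m)

    # Gradients of Eq. (8)
    grad_W = -(phi.T @ E)                                   # (h, m)
    coef = (E @ W.T) * phi * (2.0 / deltas ** 2)            # (N, h)
    grad_c = -np.einsum('nh,nhk->hk', coef, diff)           # (h, n)
    grad_d = -np.sum((E @ W.T) * phi * (2.0 / deltas ** 3) * u ** 2, axis=0)

    # Update law of Eq. (9): move against the gradient of epsilon
    return (W - eta[0] * grad_W,
            centers - eta[1] * grad_c,
            deltas - eta[2] * grad_d)
```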

In the camera calibration experiment based on the improved RBF neural network, we take the image coordinates of the feature points, \(({u_l},{v_l})\) and \(({u_r},{v_r})\), as inputs and the spatial coordinates in the world coordinate system, \(({X_W},{Y_W},{Z_W})\), as outputs. An RBF neural network with 4 input nodes and 3 output nodes is thus constructed. The hidden nodes and data centers are determined automatically, and the data centers, spread constants and weight values are updated dynamically until the parameters satisfy the precision requirement.
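Putting the pieces together, a hypothetical training loop for this 4-input, 3-output calibration network, reusing the helper functions sketched above; the synthetic data, iteration count and learning rates are placeholders standing in for the extracted feature points and tuned settings:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1280, size=(280, 4))   # (u_l, v_l, u_r, v_r) in pixels
D_train = rng.uniform(0, 120, size=(280, 3))    # (X_W, Y_W, Z_W) in mm

h, centers = select_cluster_number(X_train, alpha=0.25)  # hidden layer size
deltas = initial_spreads(centers)                        # delta = d / sqrt(2h)
W = rng.normal(scale=0.1, size=(h, 3))                   # initial weights (assumed)

# Dynamic updates of Eq. (9); iterate until the precision requirement is met.
for m in range(2000):
    W, centers, deltas = gradient_step(X_train, D_train, centers, deltas, W)

XYZ_pred = np.array([rbf_forward(x, centers, deltas, W) for x in X_train])
```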

4 Experiment and Analysis

The calibration experiment uses the binocular vision system shown in Fig. 5. The cameras used in the measurement experiment are from the EoSens 3CL series of Mikrotron, model MC3010, with a resolution of 1280 × 1024 pixels. The pixel size is 0.008 mm/pixel and the lens is an AF Zoom-Nikkor 24–85 mm 1:2.8–4D.

Fig. 5. Binocular vision system

A flat plate with round feature points is used as the calibration board; it carries \(10 \times 9\) points arranged in a fixed order. We place the calibration board vertically on a linear motion platform and establish the X- and Y-axes of the world coordinate system in the plane of the calibration board, with the Z-axis perpendicular to that plane. The calibration board is moved along the Z-axis to 30 mm, 60 mm, 90 mm and 120 mm, and the two cameras simultaneously capture 5 pairs of images at 0, 30 mm, 60 mm, 90 mm and 120 mm. The images acquired by the left and right cameras at Z = 0 are shown in Fig. 6.

Fig. 6. The images acquired by left camera and right camera

The Zernike moments algorithm [12] is used to extract the feature points, and 370 sets of data are obtained. These were divided into two groups: 280 sets were used as training data for the network and the other 80 sets were used for testing.

An RBF network with a 4-10-3 structure is trained on these data. The error of each spatial point is calculated as

$$\begin{aligned} e = \sqrt{(\tilde{X}_W - X_W)^2 + (\tilde{Y}_W - Y_W)^2 + (\tilde{Z}_W - Z_W)^2} \end{aligned}$$
(10)

where \((\tilde{X}_W, \tilde{Y}_W, \tilde{Z}_W)\) denotes the coordinates of the spatial point reconstructed by the trained RBF neural network. Reconstruction results for some of the points are shown in Table 1.

Table 1. The test result of some points (Unit: mm)

The average error over all 80 testing points is 0.1719 mm. It can be seen from Table 1 that the reconstruction precision in the X and Y directions is higher than in the Z direction, because of installation error of the calibration board on the motion platform and because the precision of the motion platform is lower than that of the calibration board.
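The 0.1719 mm figure is the mean of Eq. (10) over the test set; as a sketch (array names are ours):

```python
import numpy as np

def reconstruction_error(P_rec, P_true):
    """Per-point Euclidean error of Eq. (10); inputs are (N, 3) arrays of
    reconstructed and true world coordinates in mm."""
    return np.linalg.norm(P_rec - P_true, axis=-1)

# mean error over the 80 test points (hypothetical arrays):
# reconstruction_error(XYZ_reconstructed, XYZ_true).mean()
```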

The precision of our method is further verified by measuring the lengths of the segments between pairs of feature points, as shown in Fig. 7.

Fig. 7. Three segments that need to be measured

We place the calibration board at arbitrary positions within the field of view and take the average of five measurements of the actual length of each segment. Zhang's method [4] is used for comparison. The results are shown in Table 2. It can be seen that the precision of our new method is better than that of Zhang's method.

Table 2. Comparison of two methods (Unit: mm)

5 Conclusion

The camera calibration method based on the improved RBF neural network does not need to consider the impact of lens distortion or environmental factors. It can reduce the error caused by the imperfect mathematical models of traditional calibration methods and thereby improve measurement precision. The reconstruction error of the spatial points in this paper is 0.1719 mm. The measurement experiment shows that the precision of our new method is better than that of Zhang's method. The automatic data-center selection and the gradient-based dynamic learning of the data centers, spread constants and weight values can further improve the precision.