1 Introduction

With the rise of the Industry 4.0 strategy, combining vision with robotic systems has become an important means of improving the intelligence of robots [1]. In current industrial practice, 2D vision technology is often used together with robots, but a two-dimensional image loses almost all of the depth information of the object, making it difficult to obtain the three-dimensional information of the target. It is therefore necessary to reconstruct the three-dimensional information of the target from two-dimensional images in order to reflect real objects more comprehensively and faithfully, and to further improve the intelligence of the robot system [2]. Because binocular vision offers high efficiency, high precision, non-contact measurement and depth information, it is widely applicable to target recognition and positioning, and is of great significance for the precise positioning of mass-produced workpieces.

In this paper, we propose a recognition and localization algorithm based on HALCON binocular stereo vision. The main contributions of this paper are threefold. First, stereo rectification of the target images is achieved through binocular system calibration. Second, considering linear and nonlinear illumination changes as well as occlusion, we propose a pyramid search strategy combined with a subpixel shape-based template matching algorithm with scaling to extract feature points accurately. Because it relies on the gradient correlation of object edges, this algorithm copes effectively with various linear and nonlinear illumination changes and is robust to occlusion and partially missing edges. Lastly, stereo matching of the feature points is completed quickly with a grayscale template matching algorithm based on normalized cross-correlation (NCC), even when the illumination varies between images. The experimental results show that the proposed algorithm achieves high-precision positioning of the target object while remaining real-time and efficient, and its feasibility is verified by experiments.

2 Binocular Stereo Vision Positioning Principle

Binocular stereo vision perceives the depth of the three-dimensional world by imitating human eyes. Given the imaging positions of a spatial point in the left and right cameras, the matching relationship between the feature points and the principle of triangulation allow the parallax to be calculated and the three-dimensional information of the object to be recovered [3].

In this experiment, we use a parallel-axis system consisting of two cameras. The physical setup is shown in Fig. 1a, and the binocular system model in Fig. 1b. Assume that both cameras have focal length \( f \) and that the distance between their projection centers is \( b \) (the baseline). \( O_{l} uv \) and \( O_{r} uv \) are the two imaging-plane coordinate systems, whose coordinate directions coincide with the x-axis and y-axis, respectively. To simplify the calculations, we take the coordinate system of the left camera as the world coordinate system \( O - XYZ \). The image coordinates of the spatial point \( P(x,y,z) \) on the left and right imaging planes are \( P(u_{l} ,v_{l} ) \) and \( P(u_{r} ,v_{r} ) \), and its coordinates in the left and right camera coordinate systems are \( P(x_{l} ,y_{l} ,z_{l} ) \) and \( P(x_{r} ,y_{r} ,z_{r} ) \), respectively. Since the imaging planes of the two cameras lie in the same plane, the image coordinates of \( P(x,y,z) \) share the same \( v \) coordinate, that is, \( v_{l} = v_{r} = v \). From the triangular geometry we have:

Fig. 1.
figure 1

Binocular stereo system

$$ u_{l} = f\frac{x_{l}}{z_{l}} \quad u_{r} = f\frac{{(x_{l} - b)}}{{z_{l} }}\quad v_{l} = v_{r} = f\frac{{y_{l} }}{{z_{l} }} $$
(1)

Since the left and right images lie in the same plane, the disparity of a corresponding point pair is defined as the difference between the column coordinates of the point in the left and right images. Hence, the disparity can be computed as:

$$ D = u_{l} - u_{r} = \frac{b \times f}{{z_{l} }} $$
(2)

Combining (1) and (2), the three-dimensional coordinates of the spatial point can be computed by Eq. (3):

$$ x = x_{l} = \frac{b \, u_{l}}{D}\quad y = y_{l} = \frac{b \, v}{D}\quad z = z_{l} = \frac{b \, f}{D} $$
(3)

Where \( z = z_{l} \) is the depth of the spatial point \( P \). From the above formulas, we can see that before the three-dimensional coordinates of a spatial point can be obtained, stereo vision positioning must solve two tasks: determining the camera parameters and finding the conjugate (corresponding) points.
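As a quick illustration of Eqs. (1)–(3), the following minimal NumPy sketch recovers the 3D coordinates of a point from a rectified correspondence; the function name and the numeric values are purely illustrative and are not taken from the paper's experimental setup.

```python
import numpy as np

def triangulate_from_disparity(u_l, v, u_r, f, b):
    """Recover the 3D coordinates of a point (Eq. 3) from a rectified correspondence.

    u_l, v : image coordinates of the point in the left image
    u_r    : column coordinate of the matched point in the right image
    f      : focal length (in pixels), b : baseline between the projection centers
    """
    d = u_l - u_r                     # disparity, Eq. (2)
    if d <= 0:
        raise ValueError("non-positive disparity")
    return np.array([b * u_l / d,     # x
                     b * v / d,       # y
                     b * f / d])      # z (depth)

# Hypothetical numbers: f = 1200 px, baseline 60 mm, disparity 30 px -> depth 2400 mm
print(triangulate_from_disparity(u_l=400.0, v=250.0, u_r=370.0, f=1200.0, b=60.0))
```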

3 Binocular System Calibration and Stereo Rectification

3.1 System Calibration

Camera calibration is a crucial step in stereo imaging [4]. Calibration of a binocular system is similar to single-camera calibration: first, the internal and external parameters of the two cameras are obtained by single-camera calibration, and then the positional relationship between the two cameras is derived from their external parameters.

Camera calibration establishes the relationship between the pixel coordinates of the camera image and the three-dimensional coordinates of the scene points. Based on the camera model, the internal and external parameters are solved from the image coordinates and the world coordinates of known feature points [5]. In other words, a camera imaging model is established and its parameters are solved through the projection relationship, experiments and calculation. Over the whole projection process, the transformation between the image pixel coordinate system and the world coordinate system is:

$$ \begin{aligned} Z_{c} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} & = Z_{c} \begin{bmatrix} \frac{1}{d_{x}} & 0 & u_{0} \\ 0 & \frac{1}{d_{y}} & v_{0} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{1}{d_{x}} & 0 & u_{0} \\ 0 & \frac{1}{d_{y}} & v_{0} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_{c} \\ Y_{c} \\ Z_{c} \\ 1 \end{bmatrix} \\ & = \begin{bmatrix} \frac{1}{d_{x}} & 0 & u_{0} \\ 0 & \frac{1}{d_{y}} & v_{0} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} \varvec{R} & \varvec{t} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_{w} \\ Y_{w} \\ Z_{w} \\ 1 \end{bmatrix} \\ & = \begin{bmatrix} \frac{f}{d_{x}} & 0 & u_{0} & 0 \\ 0 & \frac{f}{d_{y}} & v_{0} & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} \varvec{R} & \varvec{t} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_{w} \\ Y_{w} \\ Z_{w} \\ 1 \end{bmatrix} = M_{1} M_{2} \begin{bmatrix} X_{w} \\ Y_{w} \\ Z_{w} \\ 1 \end{bmatrix} \end{aligned} $$
(4)

Where \( uv \) is the image pixel coordinate system, \( d_{x} \) and \( d_{y} \) are the physical dimensions of a pixel along the two axes, and \( f \) is the focal length. \( O_{c} X_{c} Y_{c} Z_{c} \) is the camera coordinate system, \( O_{w} X_{w} Y_{w} Z_{w} \) is the world coordinate system, and \( M_{1} \), \( M_{2} \) are the internal and external parameter matrices of the camera, respectively.
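As a check on Eq. (4), the following NumPy sketch assembles \( M_{1} \) (the 3 × 4 internal matrix) and \( M_{2} \) (the 4 × 4 external matrix) and projects a world point to pixel coordinates. All numerical values are placeholders, not the calibrated parameters of Tables 1 and 2.

```python
import numpy as np

# Placeholder parameters (not the calibrated values from Tables 1 and 2):
# focal length f [m], pixel sizes dx, dy [m], principal point (u0, v0) [px].
f, dx, dy, u0, v0 = 0.016, 5e-6, 5e-6, 640.0, 480.0
R = np.eye(3)                                   # hypothetical extrinsic rotation
t = np.array([0.05, 0.0, 0.2])                  # hypothetical extrinsic translation [m]

M1 = np.array([[f / dx, 0.0,    u0, 0.0],
               [0.0,    f / dy, v0, 0.0],
               [0.0,    0.0,    1.0, 0.0]])     # internal parameter matrix M1 (3x4)
M2 = np.eye(4)
M2[:3, :3], M2[:3, 3] = R, t                    # external parameter matrix M2 (4x4)

def project(Pw):
    """Eq. (4): Zc * [u, v, 1]^T = M1 @ M2 @ [Xw, Yw, Zw, 1]^T; returns (u, v)."""
    uvw = M1 @ M2 @ np.append(np.asarray(Pw, dtype=float), 1.0)
    return uvw[:2] / uvw[2]                     # divide by Zc

print(project([0.1, 0.05, 1.0]))                # hypothetical world point about 1 m away
```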

This experiment uses Zhang's calibration method to calibrate the system. We use two identical cameras and lenses: M-1614MP2 industrial lenses and MV-VS120 CCD color industrial cameras from Vision Digital Image Technology Co., Ltd. Calibration is carried out on the HALCON software platform using its algorithm library. The calibration plate processing is shown in Fig. 2. The internal and external parameters of the two cameras and their relative positional relationship are determined by averaging 20 calibrations. The internal parameter calibration results are listed in Table 1, and Table 2 shows the external parameters of the two cameras before and after calibration.

Fig. 2.
figure 2

Calibration plate image processing

Table 1. Calibration results of internal parameter
Table 2. Calibration results of external parameter

The above data show that the error is below one pixel, so the calibration accuracy is high. After rectification, the row coordinates of corresponding pixels in the two images are equal, and the right image is shifted relative to the left image only along the X-axis. This shows that the rectified binocular positioning system satisfies the standard epipolar geometry [6], which greatly reduces the time required for stereo matching.

3.2 Stereo Rectification

After the system calibration, two rectification maps can be computed from the internal and external parameters of the two cameras and their relative positional relationship. These maps are then applied to the acquired stereo image pairs with the map_image() operator to rectify them into the epipolar standard geometry. The rectified images of the two cameras are shown in Fig. 3.

Fig. 3.
figure 3

Stereo rectification
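The paper performs this step with HALCON's rectification maps and the map_image() operator. Purely as a hedged illustration of the same idea, an equivalent OpenCV pipeline is sketched below; the calibration data, baseline and image contents are placeholders and this is a stand-in, not the authors' implementation.

```python
import cv2
import numpy as np

# Placeholder calibration data; in the paper these come from the HALCON calibration.
K = np.array([[1200.0, 0.0, 640.0],
              [0.0, 1200.0, 480.0],
              [0.0,    0.0,   1.0]])
dist = np.zeros(5)                               # distortion assumed negligible here
R = np.eye(3)                                    # rotation between the two cameras
T = np.array([-60.0, 0.0, 0.0])                  # assumed 60 mm baseline along X
size = (1280, 960)                               # image size (width, height)

# Rectifying rotations/projections and the two pixel maps for each camera.
R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K, dist, K, dist, size, R, T)
map_lx, map_ly = cv2.initUndistortRectifyMap(K, dist, R1, P1, size, cv2.CV_32FC1)
map_rx, map_ry = cv2.initUndistortRectifyMap(K, dist, R2, P2, size, cv2.CV_32FC1)

# Synthetic stand-ins for the acquired stereo pair; real images would be loaded instead.
img_l = np.zeros((960, 1280, 3), np.uint8)
img_r = np.zeros((960, 1280, 3), np.uint8)
rect_l = cv2.remap(img_l, map_lx, map_ly, cv2.INTER_LINEAR)
rect_r = cv2.remap(img_r, map_rx, map_ry, cv2.INTER_LINEAR)
```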

4 Target Recognition and Feature Point Extraction

To improve the accuracy and speed of matching during object identification and position detection, this paper applies subpixel-accurate template matching with scaling together with an image pyramid algorithm, which is robust to occlusion, clutter, nonlinear illumination changes and global contrast inversion.

4.1 Subpixel Edge Extraction

For subpixel-accurate contour extraction, edges are extracted with a combination of the Canny operator and subpixel edge detection, which takes the image as input and returns XLD contours. In the edges_sub_pix() operator, the Canny filter detects gradient edges; the gray values are replicated at the image border, and the filter width is controlled by the Alpha parameter. This preserves robustness to noise while enhancing the ability to detect fine details. The edge detection result is shown in Fig. 4.

Fig. 4.
figure 4

Edge detection
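The short sketch below is a pixel-level stand-in for this step using OpenCV's Canny detector on a synthetic image; the subpixel XLD refinement performed by edges_sub_pix() is not reproduced, and the thresholds and image content are illustrative only.

```python
import cv2
import numpy as np

# Synthetic stand-in for the workpiece image; a real image would be loaded instead.
img = np.zeros((480, 640), np.uint8)
cv2.circle(img, (320, 240), 100, 255, -1)

# Pixel-level Canny edge map; hysteresis thresholds are illustrative. HALCON's
# edges_sub_pix() additionally refines such edges to subpixel XLD contours.
edges = cv2.Canny(img, 50, 150, apertureSize=3)
contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)  # OpenCV 4.x
print(f"{len(contours)} edge contour(s) extracted")
```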

In this paper, the Tukey weight function is used for robust fitting over three iterations. The Tukey weight function is defined as:

$$ \omega (\sigma ) = \left\{ {\begin{array}{*{20}l} {\left[ 1 - \left( \frac{\sigma }{\tau } \right)^{2} \right]^{2} ,} \hfill & {\left| \sigma \right| \le \tau } \hfill \\ {\frac{\tau }{\left| \sigma \right|},} \hfill & {\left| \sigma \right| > \tau } \hfill \\ \end{array} } \right. $$
(5)

Where the parameter \( \tau \) is the distance threshold. When the distance from a point to the fitted circle is no greater than the threshold, the weight is the square of one minus the squared ratio of the distance to the threshold; when the distance exceeds the threshold, the weight equals the threshold divided by the distance.
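A minimal NumPy implementation of the weight function of Eq. (5), as it would be used inside the iterative reweighted fit, might look as follows; the residuals and threshold are hypothetical.

```python
import numpy as np

def tukey_weight(sigma, tau):
    """Weight for a residual `sigma` (distance from an edge point to the fitted
    circle) with clipping threshold `tau`, following Eq. (5): biweight inside the
    threshold, reciprocal weighting outside."""
    sigma = np.abs(np.asarray(sigma, dtype=float))
    inner = (1.0 - (sigma / tau) ** 2) ** 2      # |sigma| <= tau
    outer = tau / np.maximum(sigma, 1e-12)       # |sigma| >  tau (guard against /0)
    return np.where(sigma <= tau, inner, outer)

# Hypothetical residuals of edge points to a fitted circle, threshold tau = 2 px
print(tukey_weight([0.1, 1.0, 3.5], tau=2.0))
```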

4.2 Shape Template Matching and Feature Point Extraction

To improve the speed and accuracy of matching while supporting scaling in the X/Y directions and nonlinear illumination changes, this paper applies subpixel-accurate shape-based template matching with scaling to detect the position. The algorithm defines a similarity measure based on the direction vectors of edge points obtained by Sobel filtering and, combined with an image pyramid hierarchical search strategy, uses the shape information for template matching.

The similarity measure of the shape-based template matching algorithm is the average of the dot products between the gradient vectors of the template points and the gradient vectors of the corresponding image points; the similarity measure \( s \) is:

$$ s = \frac{1}{n}\sum\limits_{i = 1}^{n} {d_{i}^{T} e_{q + p_{i}}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left( t_{i} v_{r + r_{i} ,c + c_{i}} + u_{i} w_{r + r_{i} ,c + c_{i}} \right)} $$
(6)

Where \( d_{i} = (t_{i} ,u_{i} )^{T} \) is the gradient vector of the \( i \)-th template point and \( e_{r,c} = (v_{r,c} ,w_{r,c} )^{T} \) is the gradient vector at the corresponding point in the image.

The normalized similarity measure \( s \) is:

$$ s = \frac{1}{n}\sum\limits_{i = 1}^{n} {\frac{d_{i}^{T} e_{q + p_{i}}}{\left\| d_{i} \right\|\left\| e_{q + p_{i}} \right\|}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\frac{t_{i} v_{r + r_{i} ,c + c_{i}} + u_{i} w_{r + r_{i} ,c + c_{i}}}{\sqrt {t_{i}^{2} + u_{i}^{2}} \sqrt {v_{r + r_{i} ,c + c_{i}}^{2} + w_{r + r_{i} ,c + c_{i}}^{2}}}} $$
(7)

Since the gradient vectors are normalized, the similarity measure \( s \) always returns a value less than or equal to 1. When \( s = 1 \), the template corresponds exactly to the image at that position.
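The normalized measure of Eq. (7) amounts to averaging the cosine of the angle between template and image gradients. A small sketch, with purely illustrative gradient arrays, is:

```python
import numpy as np

def normalized_shape_score(template_grads, image_grads):
    """Normalized similarity measure of Eq. (7): mean cosine of the angle between the
    template gradient d_i and the image gradient e at the corresponding point.

    template_grads, image_grads : (n, 2) arrays of gradients (t_i, u_i) and (v, w)
    already sampled at the displaced positions (r + r_i, c + c_i).
    """
    d = np.asarray(template_grads, dtype=float)
    e = np.asarray(image_grads, dtype=float)
    dots = np.sum(d * e, axis=1)
    norms = np.linalg.norm(d, axis=1) * np.linalg.norm(e, axis=1)
    return float(np.mean(dots / np.maximum(norms, 1e-12)))

# Identical gradient directions give s = 1; opposite directions give s = -1.
g = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
print(normalized_shape_score(g, g), normalized_shape_score(g, -g))
```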

In the image matching process, a normalized score would have to be computed at every potential position of the search image, which is computationally very expensive. To speed up the algorithm, the number of poses examined and the number of template points must both be reduced; the pyramid hierarchical search strategy reduces both at the same time and thus improves the running speed [7]. The image pyramid is shown in Fig. 5.

Fig. 5.
figure 5

Pyramid image

When the template is created, the same edge detection and filtering are performed on each pyramid layer; the search then proceeds from the top layer downwards, layer by layer, until the similarity measure exceeds the threshold, finally yielding the row and column coordinates of the matched template. The results of workpiece recognition and center extraction are shown in Fig. 6a, where the cross mark is the central feature point. The fitted XLD edges, consisting of a large number of subpixel edge coordinates, are segmented into basic geometric elements such as straight lines and arcs by the Ramer [8] algorithm. The resulting edge feature points are shown in Fig. 6b.

Fig. 6.
figure 6

Feature point extraction
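As a rough sketch of the coarse-to-fine pyramid search described above, the following code matches exhaustively at the coarsest level and then refines only a small neighborhood at each finer level. It uses OpenCV's normalized gray-value score as a stand-in for the shape similarity of Eq. (7), so it illustrates only the search strategy, not the shape-based matcher itself; all names and sizes are assumptions.

```python
import cv2
import numpy as np

def pyramid_search(image, template, levels=3):
    """Coarse-to-fine search: exhaustive matching at the top pyramid level, then
    refinement in a small window at every finer level. cv2.TM_CCOEFF_NORMED is
    used here as a stand-in for the shape score of Eq. (7)."""
    imgs, tmps = [image], [template]
    for _ in range(levels - 1):
        imgs.append(cv2.pyrDown(imgs[-1]))
        tmps.append(cv2.pyrDown(tmps[-1]))

    # Exhaustive search at the coarsest level; minMaxLoc returns (x, y) = (col, row).
    res = cv2.matchTemplate(imgs[-1], tmps[-1], cv2.TM_CCOEFF_NORMED)
    _, _, _, (c, r) = cv2.minMaxLoc(res)

    # Propagate the candidate down the pyramid, searching only a small window.
    for lvl in range(levels - 2, -1, -1):
        r, c = r * 2, c * 2
        img, tmp = imgs[lvl], tmps[lvl]
        r0 = min(max(r - 4, 0), img.shape[0] - tmp.shape[0])
        c0 = min(max(c - 4, 0), img.shape[1] - tmp.shape[1])
        r1 = min(r0 + tmp.shape[0] + 8, img.shape[0])
        c1 = min(c0 + tmp.shape[1] + 8, img.shape[1])
        res = cv2.matchTemplate(img[r0:r1, c0:c1], tmp, cv2.TM_CCOEFF_NORMED)
        _, _, _, (dc, dr) = cv2.minMaxLoc(res)
        r, c = r0 + dr, c0 + dc
    return r, c   # top-left row/column of the best match at full resolution

# Tiny synthetic check: the template is a crop of the image, so the search should
# recover (approximately) the crop's top-left corner (100, 60).
img = np.random.default_rng(1).integers(0, 255, (256, 256)).astype(np.uint8)
print(pyramid_search(img, img[100:180, 60:140].copy()))
```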

5 Target Three-Dimensional Position

5.1 Stereo Matching

Stereo matching is the most critical step in a binocular vision algorithm. Its main task is to find, across images taken from different viewpoints, the correspondence of the same spatial point [9]. To complete the stereo matching of the feature points quickly, we use the epipolar constraint together with a region matching algorithm based on normalized cross-correlation (NCC). Because the two cameras have different viewing angles, the illumination in the two images also differs slightly, so a measure that is invariant to illumination changes is needed, namely the normalized cross-correlation (NCC) [7]. The NCC is defined as:

$$ NCC(x,y,d) = \frac{1}{(2n + 1)(2m + 1)}\frac{{\sum\limits_{i = - n}^{n} {\sum\limits_{j = - m}^{m} {([I_{L} (x + i,y + j) - \overline{{I_{L} (x,y)}} ][I_{R} (x + i,y + j + d) - \overline{{I_{R} (x,y + d)}} ])} } }}{{\sqrt {\delta^{2} (I_{L} ) \times \delta^{2} (I_{R} )} }} $$
(8)

Here, \( \overline{I(x,y)} = \frac{{\sum\limits_{i = - n}^{n} {\sum\limits_{j = - m}^{m} {I(x + i,y + j)} } }}{(2n + 1)(2m + 1)} \) is the average gray value of all pixels in the neighborhood of the current search point, and \( \delta (I) = \sqrt {\frac{{\sum\limits_{i = - n}^{n} {\sum\limits_{j = - m}^{m} {I^{2} (x + i,y + j)} } }}{(2n + 1)(2m + 1)} - \overline{I(x,y)}^{2} } \) is the standard deviation. The value of NCC lies in \( [ - 1,1] \): the larger its absolute value, the more closely the sub-window matches the neighborhood of the search point. \( NCC = 1 \) indicates that the two windows have the same polarity, and \( NCC = - 1 \) that their polarities are opposite; in either case the normalized correlation coefficient is unaffected by linear illumination changes [10].
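After cancelling the common window-size factors, Eq. (8) reduces to the familiar zero-mean normalized cross-correlation. A minimal sketch, with illustrative parameter names and synthetic data, is:

```python
import numpy as np

def ncc(left, right, x, y, d, n, m):
    """NCC of Eq. (8) between the (2n+1)x(2m+1) window centered at (x, y) in the
    left image and the window shifted by the disparity candidate d in the right
    image. Here x indexes rows and y columns, so d shifts the column as in Eq. (8).
    After cancelling the window-size factors this is the zero-mean NCC."""
    wl = left[x - n:x + n + 1, y - m:y + m + 1].astype(float)
    wr = right[x - n:x + n + 1, y + d - m:y + d + m + 1].astype(float)
    wl -= wl.mean()
    wr -= wr.mean()
    denom = np.sqrt((wl ** 2).sum() * (wr ** 2).sum())
    return 0.0 if denom == 0 else float((wl * wr).sum() / denom)

# Identical windows give 1.0; an inverted (opposite-polarity) window gives -1.0.
L = np.random.default_rng(0).integers(0, 255, (40, 40)).astype(np.uint8)
print(ncc(L, L, 20, 20, 0, 5, 5), ncc(L, 255 - L, 20, 20, 0, 5, 5))
```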

The stereo matching procedure can be summarized in the following steps. First, a rectangular sub-window \( B_{L} \) centered on the pixel \( P_{L} \) to be matched is extracted from the left image. Then the image to be matched is searched, from left to right and from top to bottom, for the window \( B_{R} \) whose gray values are most similar to those of \( B_{L} \). Finally, the center \( P_{R} \) of the window \( B_{R} \) is computed, which gives the matched pixel pair \( P_{L} \) and \( P_{R} \) in the left and right images (see Fig. 7).

Fig. 7.
figure 7

Template matching schematic

Following the above steps, the corresponding point of the first feature point of the left image is searched for, then that of the second feature point, and so on, until all feature points of the left image have been traversed and the stereo matching task is completed. The stereo matching result is shown in Fig. 8.

Fig. 8.
figure 8

Stereo matching
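To illustrate the epipolar-constrained search just described, the sketch below matches a left-image feature point only along its own row of the rectified right image; the window size, disparity range, function names and synthetic data are assumptions made for this illustration.

```python
import cv2
import numpy as np

def match_along_epipolar(left, right, pl_row, pl_col, half=10, max_disp=128):
    """Find the right-image correspondence of a left-image point by normalized
    template matching restricted to the same rows of the rectified right image."""
    tmpl = left[pl_row - half:pl_row + half + 1,
                pl_col - half:pl_col + half + 1]
    # Epipolar constraint: search only a strip at the same rows, and only to the
    # left of pl_col (positive disparity), at most max_disp pixels away.
    c0 = max(pl_col - half - max_disp, 0)
    strip = right[pl_row - half:pl_row + half + 1, c0:pl_col + half + 1]
    res = cv2.matchTemplate(strip, tmpl, cv2.TM_CCOEFF_NORMED)
    _, score, _, (dx, _) = cv2.minMaxLoc(res)
    # Center column of the best window; the row is unchanged thanks to rectification.
    return pl_row, c0 + dx + half, score

# Synthetic check: the right image is the left image shifted by a 20 px disparity,
# so the point at (100, 150) should be matched at column 130 with score close to 1.
rng = np.random.default_rng(2)
left = rng.integers(0, 255, (200, 300)).astype(np.uint8)
right = np.roll(left, -20, axis=1)
print(match_along_epipolar(left, right, 100, 150))
```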

5.2 Stereo Reconstruction and 3D Affine Transformation

After system calibration, stereo rectification and stereo matching are completed, the next task is to calculate the depth information of the target. Three-dimensional reconstruction with a binocular system acquires two images of the scene simultaneously with the two cameras, finds the matching point pairs of each spatial point in the two images, and then obtains the three-dimensional coordinates of the point from the principle of binocular imaging [11]. The three-dimensional point cloud map is shown in Fig. 9.

Fig. 9.
figure 9

The three-dimensional point cloud map

To evaluate the accuracy of the positioning system, the 3D point-cloud coordinates obtained from the reconstruction are subjected to a 3D affine transformation and converted into the world coordinate system, taking the left camera frame as the reference. The principle is expressed as:

$$ \begin{pmatrix} Q_{x} \\ Q_{y} \\ Q_{z} \\ 1 \end{pmatrix} = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} \begin{pmatrix} P_{x} \\ P_{y} \\ P_{z} \\ 1 \end{pmatrix}, \qquad \begin{pmatrix} Q_{x} \\ Q_{y} \\ Q_{z} \end{pmatrix} = R \begin{pmatrix} P_{x} \\ P_{y} \\ P_{z} \end{pmatrix} + T $$
(9)

Where \( (P_{x} ,P_{y} ,P_{z} ) \) is the input point and \( (Q_{x} ,Q_{y} ,Q_{z} ) \) is the resulting transformed point.
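A minimal sketch of Eq. (9) applied to an array of reconstructed points, with a hypothetical rotation and translation, is:

```python
import numpy as np

def transform_points(points, R, T):
    """Apply the rigid transformation of Eq. (9), Q = R @ P + T, to an (N, 3)
    array of reconstructed points; R is 3x3 and T a 3-vector."""
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = np.asarray(T)
    P_h = np.hstack([points, np.ones((len(points), 1))])   # homogeneous coordinates
    return (H @ P_h.T).T[:, :3]

# Hypothetical pose of the reference (left-camera/world) frame: 90 deg about Z plus a shift.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T = np.array([10.0, 0.0, 5.0])
print(transform_points(np.array([[1.0, 2.0, 3.0]]), R, T))   # -> [[8. 1. 8.]]
```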

6 Experimental Results and Analysis

To detect the size of the target object, the distance between each reconstructed edge point and the corresponding center is computed with the spatial distance formula \( l = \sqrt {(x_{1} - x_{2} )^{2} + (y_{1} - y_{2} )^{2} + (z_{1} - z_{2} )^{2} } \), which completes the detection of the workpiece radius. The experimental results are shown in Fig. 10.

Fig. 10.
figure 10

3D spatial information and detection results
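As an illustration of the radius check, the sketch below averages the distances between reconstructed edge points and the reconstructed center; the coordinates are made up for the example and do not come from the experiment.

```python
import numpy as np

def workpiece_radius(edge_points, center):
    """Estimate the workpiece radius as the mean Euclidean distance between the
    reconstructed 3D edge points and the reconstructed center point."""
    edge_points = np.asarray(edge_points, dtype=float)
    center = np.asarray(center, dtype=float)
    radii = np.linalg.norm(edge_points - center, axis=1)
    return radii.mean(), radii.std()

# Hypothetical reconstructed points (mm): three edge points around a center at z = 300.
edges = [[20.1, 0.0, 300.0], [0.0, 19.9, 300.2], [-20.0, 0.1, 299.9]]
print(workpiece_radius(edges, [0.0, 0.0, 300.0]))
```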

To analyze the reconstruction accuracy of the algorithm, we compute the errors between the detected values of four targets and the actual values; Table 3 shows the comparison. The measured radius error of the workpiece is less than 0.4 mm, with an error rate below 3%; the depth error is less than 0.2 mm, with an error rate below 0.5%. This shows that the proposed positioning method can locate the target object accurately and with high precision, which verifies the feasibility and accuracy of the method under the given conditions.

Table 3. Comparison of test results with actual results

7 Conclusion

In this paper, we study the whole process of target recognition and localization based on binocular vision. In the target recognition and feature point extraction stage, Canny subpixel edge extraction, edge fitting and subpixel shape-based template matching with scaling are used to locate the target region, which considerably improves the speed of feature point extraction. In the target positioning stage, the epipolar constraint and an NCC-based gray-correlation region matching algorithm are used to complete the stereo matching of the feature points, which solves the problem of mismatches between the target feature points in the left and right images. Finally, the three-dimensional position of the target object is obtained through three-dimensional reconstruction. The experimental results show that the whole pipeline improves the speed of target recognition and the accuracy of feature point extraction, reduces the possibility of erroneous or repeated matches, and achieves accurate positioning of the target object. The achieved accuracy satisfies the requirements within the robot's workspace, which benefits three-dimensional reconstruction and positioning in robot vision systems.

Given the current state of this experimental research and the complex, varied manufacturing environments in which workpieces are handled, further research and experiments are needed to realize a more general and effective image recognition and localization algorithm, to improve the existing positioning results, and to obtain more accurate three-dimensional position information of the target object.