Keywords

1 Introduction

Corner detection is an important problem in many image processing applications including edge detection, object detection and pattern recognition [1]. It is a fundamental step in image processing. Recent years, with the development of embedded devices or high-performance computing, the real-time computing plays a crucial role in many applications, such as video game, communication app and media player. Especially in the area of computer vision, applications always require that the system can be request clients in a few seconds. As an indispensable corner detection algorithm, Harris corner detector has been successfully used in the image processing [25], such as feature selection or edge detection. It is also accelerated based on different strategy or various compute devices. However, much of them ignore the limitations of computing resources like embedded device, and they do not fully take advantage of the heterogeneous architecture.

Over past decades, the performance of computing device has achieved a significant development. Many large-scale computing tasks are benefited from modern processors like GPU, CPU or FPGA. Especially growing in many-core processors, massive algorithms have been parallelled and implemented on the many-core processor which could improve the efficiency of computing [11]. The general purpose computing on GPU pushed the revolution of many applications like machine learning, and more and more algorithms are transplanted to the many-core compute platforms. GPU also push the improvement of machine learning research. Many methods would be benefited from the high-performance of GPU [7, 10, 15, 19,20,21,22, 24].

However, large-scale computing task is suitable for the host or server devices. For the embedded devices, the limited computing resources cannot satisfy the complexity of massive data processing or the real-time reaction. For example, some image applications on the Android or IOS which should be reacted in a few seconds. Thus, how to fully utilize the limited computation resource is a key problem which is needed to solve urgently. Two types of strategy are used to speed up. One is reducing the complexity of an algorithm, and the other is optimizing based on the architecture of computing device. In real applications, the implementation is always combined this two idea to optimize the software.

In this paper, we parallel the Harris corner detection algorithm and implement it in an environment of heterogeneous architecture which is composed of many-core and multi-core processors. We also adopt some optimization for methods basing on this unique design. We implement the algorithm by OpenCL, which is an open source parallel library working for heterogeneous architecture and it is commonly used in cross computation platforms. Experimental results prove that our implementation is accuracy and efficiency. The rest paper is organized as follow: Sect. 2 introduces the background and Harris corner detection, Sect. 3 makes an instruction of heterogeneous architecture under the cross-platform software library OpenCL and the related work of parallel Harris corner algorithm implementation. Section 4 introduces details of our implementation and optimization. Section 5 lists the accuracy of detection and computing efficiency. At last, we give the conclusion and explanation.

2 Background of Harris Corner Detection

Harris corner detector is developed basing on Moravec corner detection to mark the location of corner points precisely [5]. It is a corner detection operator which is widely used in computer vision algorithms to extract corners and infer features of an image [23]. It also contributes to the area of computer vision [8]. At the rest of this section, we give an overview of the formulation of the Harris corner detection and its algorithm.

A corner is defined as the intersection of two edges. The main idea of Harris algorithm is that the corner would emerge when the value of an ROI (region of interest) variant dynamically with the shift to nearby regions [2]. The algorithm set a window scan the ROI in all directions; if it has a high gradient, we can infer that there may be corners in this region. We define \(I\left( x,y\right) \) as a pixel in the input image, \(\left( u,v\right) \) is the offset of shifted region from the ROI. \(w\left( x,y\right) \) is represented a convolution function which is Gaussian filter here. The function of the variable is defined as follow:

$$\begin{aligned} E\left( u,v \right) = \sum _{x,y}w\otimes \left( x,y\right) \left[ I\left( x+u,y+v \right) -I\left( x,y \right) \right] \end{aligned}$$
(1)

where \(\otimes \) is represented as a convolution operator. And then we make an approximation with shifted ROI value based on Taylor series expansion equation.

$$\begin{aligned} I\left( x+u,y+v \right) \approx I\left( x,y \right) +I_x\left( x,y \right) u+I_y\left( x,y \right) v \end{aligned}$$
(2)

By substituting (2) into (1) and approximate the result can be converted to matrix form:

$$\begin{aligned} E\left( u,v \right)&\approx \begin{bmatrix} u&v \end{bmatrix} \sum _{x,y}w\left( x,y \right) \otimes \begin{bmatrix} I_x^2\left( x,y \right)&I_x\left( x,y \right) I_y\left( x,y \right) \\ I_x\left( x,y \right) I_y\left( x,y \right)&I_y^2\left( x,y \right) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} \end{aligned}$$
(3)
$$\begin{aligned}&= \begin{pmatrix} u&v \end{pmatrix} w\left( x,y \right) \otimes \varvec{M} \begin{pmatrix} u \\ v \end{pmatrix} \end{aligned}$$
(4)

The matrix \(H\) which named Harris matrix is defined as:

$$\begin{aligned} H=\sum \limits _{x,y}w\left( x,y\right) \otimes \begin{bmatrix} I_x^2\left( x,y \right)&I_x\left( x,y \right) I_y\left( x,y \right) \\ I_x\left( x,y \right) I_y\left( x,y \right)&I_y^2\left( x,y \right) \end{bmatrix} \end{aligned}$$
(5)

To determine whether the pixel is a corner point or not, we need to compute pixel criterion score \(c\left( x,y\right) \) for each pixel. The function is given by

$$\begin{aligned} c\left( x,y\right)&=det\left( H\right) -k\left( trace\left( H\right) \right) ^2 \end{aligned}$$
(6)
$$\begin{aligned}&=\lambda _{1}\lambda _{2}-k\left( \lambda _{1}+\lambda _{2}\right) ^2 \end{aligned}$$
(7)

where \(\lambda _{1}, \lambda _{2}\) are the eigenvalues of the Harris matrix H. At the last step, we calculate the criterion score \(c\left( x,y\right) \) for each pixel, if the score higher than the threshold and it is the maximum value in the scan area, we mark this pixel as a corner point. The description of Harris corner detection algorithm is list in Algorithm 1.

figure a

3 Heterogeneous Architecture and Related Work

3.1 Heterogeneous Architecture

Since the improving requirement of complexity for large-scale computing, the performance of processors become more efficiently. Many-core and multi-core processors make a significant contribution to many fields [9]. CPU specialize in logic operation, and contrast, GPU does well in float or integer computing. These two kinds processors cooperate each other to enhance the computing speed. This structure of CPU-GPU is a typical kind of heterogeneous architecture. Figure 1 shows an example of heterogeneous architecture.

Fig. 1.
figure 1

Multi-core and many-core heterogeneous architecture. There are several compute units in the GPU and each of them contains SIMD (single instruction multi data) unit, register stack and local data store. Most square of CPU is used to be memory, like cache and register.

However, some factors limit the development of processors, including memory access and power wall, particularly the finite square of the chip for the requirement of embedded devices. With the popularity of embedded devices, the square wall of a chip is a limitation. Thus, how to fully utilize resource on-chip, like register, local memory and compute units, is a critical problem in future.

In this paper, we consider the heterogeneous architecture, which is composed of a GPU and a CPU. For implementation, we adopt a parallel open source library named OpenCL that can be performed on various devices. It is a popular framework for programming in the heterogeneous environment. It abstracts compute devices into the same structure and constructs a communication function among compute units or devices. The most advantage of OpenCL is cross-platform. Figure 2 shows the abstract structure in OpenCL.

Fig. 2.
figure 2

OpenCL open source library abstracts computing devices in a unified framework [18]. The compute units are organized in clusters, compute devices are highest level contain several compute units which are composed by dozens of process element. Memory resources are organized in a multi-level style. The nearest from process elements are register, then in the order of local memory, global memory and host memory.

3.2 Related Works

Corner detection techniques are being widely used in many computer vision applications for example in object recognition and motion detection to find suitable candidate points for feature registration and matching. High-speed feature detection is a requirement for many real-time multimedia and computer vision applications. Harris corner detector (HCD) as one of many corner detection algorithm has become a viable solution for meeting real-time requirements of the applications. There are many works to improve the efficiency of the algorithm, and some parallel implementations has been developed on different platforms. In previous work, several implementation have been proposed which target a specific device or some particular aspects of the algorithm. Saidani et al. [16] used the Harris algorithm for the detection of interest points in an image as a benchmark to compare the performance of several parallel schemes on a Cell processor. To attain further speedup, Phull et al. [13] proposed the implementation of this low complexity corner detector algorithm on a parallel computing architecture, a GPU software library namely Compute Unified Device Architecture (CUDA). Paul and his co-author [12] present a new resource-aware Harris corner-detection algorithm for many-core processors. The novel algorithm can adapt itself to the dynamically varying load on a many-core processor to process the frame within a predefined time interval. The HDC algorithm was implemented as a hardware co-processor on the FPGA portion of the SoC, by Schulz et al. [17]. Haggui et al. [3] study a direct and explicit implementation of common and novel optimization strategies, and provide a NUMA-aware parallelization. Moreover, Jasani et al. [6] proposed a bit-width optimization strategy for designing hardware-efficient HCD that exploits the thresholding step in the algorithm. Han et al. [4] implement the HCD using OpenCL and perform it on the desktop level GPU and gain a 77 times speedup.

4 Harris Corner Detection OpenCL Implementation

In this section, we introduce our strategy of parallelization for Harris corner detection in OpenCL implementation. As shown in Sect. 2, there are many operators based on the pixel level. Thus, we design our parallel implementation in pixel grain size. We parallel the step of Gaussian blur convolution, Gradient X, Y computing and Harris matrix construction which are implemented on GPU. The step of eigenvalues computes and corner response are implemented on CPU.

We divide algorithm into two kernel function. One is the construction of Harris matrix, and another is pixel score. Compared with other implementation, we decrease the number of the kernels. We integrate the function into one kernel as far as possible for the reason that it can reduce the time of communication between host and kernel device, like host memory and graphics memory. It also increases the ratio of data reuse and speeds up the program. In our design, we assume that the computing resource is limited, such as register, shared memory or computing unit, and our primary target is speeding up our program in the limited resource.

4.1 Kernel of Convolution and Matrix Construction

The compute of Gaussian blur convolution, image gradient and Harris matrix are merged into one kernel. For this kernel, we construct a computing space which is the same dimension as an input image. Every thread deals a pixel task and output one Harris matrix. All outputs in threads compose a complete Harris matrix. For a thread In this kernel, we first compute the gradient X \(I_{x}\left( x,y \right) \)and gradient Y \(I_{y}\left( x,y \right) \) of this pixel and then compute its own \(I_{x}^2\left( x,y \right) \), \(I_{y}^2\left( x,y \right) \) and \(I_{x}\left( x,y \right) I_{y}\left( x,y \right) \). Finally, we use the operation of Gaussian blur convolution to filter the pixel with its neighbourhood. The procedure description of this kernel is shown in Fig. 3.

Fig. 3.
figure 3

The figure indicates the process for the algorithm of Harris corner detection.

Optimization strategy: The pixel level computing is beneficial for many-core architecture since its high parallelism and numerical value compute. We utilize this advantage of convolution that every thread compute a mask filter. However, in the process of convolution or gradient compute, it exists many memory access. It is low efficiency when read data from global memory to compute unit frequently. To solve this problem, we move pixels nearby target to shared memory on-chip first. This method could improve the local data repetition rate and make computing units access data which are stored in the consecutive address, namely combination access. In our implementation, we set the local pixel to the size of local computing space.

4.2 Kernel of Corner Response

After the first kernel computing, we get the corner score for every pixel in the ROI which we defined. These corner scores can report the probability of a corner point existing in the corresponding ROI. If a corner score is a negative value, it means there may be an edge in this region, and a small value indicates this area may be a flat region. Thus, we need to get the score values which are larger than the threshold, which indicate that there exists a corner in the ROI of this pixel. At last, we adopt the non-maximum suppression (NMS) stage which is aim to get the local maximum value. We set the pixel which have local maximum value as a corner point.

In our implementation, we fix a \(3*3\) window to search the neighborhood nearby the pixel. Every thread in the computing space is assigned a \(3*3\) region, and if the corner score is larger than the threshold and it is the maximum value of this region, we set this pixel as a corner point. Similar to kernel convolution, we store consecutive data together from global memory to the local data memory on-chip. For limited store resource like register, we prefer the search window as little as possible.

5 Experimental Results

In this section, we will introduce experimental results for our implementation regarding accuracy and effectiveness on our heterogeneous hardware architecture.

5.1 Detection Accuracy

We use the function HarrisCorner in OpenCV as our benchmark of serial implementation. OpenCV is an open source software library, and it is utilized in image processing and computer vision. Similar with OpenCL, it can take advantage of the cross-platform and hardware acceleration based on heterogeneous compute device [14]. Figure 4 shows the results of corner detection.

Fig. 4.
figure 4

The experimental results are shown in this figure. The corners detected by algorithms are in the red circles. The left image for each of sub-images is the detection result of baseline method, which is the function in OpenCV. The right image for each of sub-images is the results of our paralleled method. Contrast, our method is more stable and more precisely. (Color figure online)

5.2 Performance Results

To evaluate our implementation, we perform our experiments on MacOS with OpenCL 1.2. The hardware configure is a CPU of 2.6 GHz Intel Core i5 and a many-core processor namely Intel Iris. Iris is a lightweight GPU with limited compute units and memory, which provides 40 stream processors. It is a typically many-core processor with limited computing resource.

Comparing with OpenCV function HarrisCorner, our implementation (image size: \(640\times 480\)) on the CPU-GPU architecture could get speedup of 11.7. With the ROI increasing, the speedup is improved. It proves that our design is efficiency. The experimental results are lists in Table 1.

Table 1. We change the size of ROI to test the compute time. This table list the compute time on CPU and heterogeneous device. When the size of ROI augment, the speedup is increasing.

6 Conclusion

In this paper, we have paralleled the Harris corner detection algorithm and implemented it on the heterogeneous architecture using OpenCL. Our implementation has achieved an acceleration compared with open library function in OpenCV. Our design considers the utilization of memory resource. It increases memory reuse ratio as possible. We implement Harris corner detection on a limited resource device and gain a speedup.