
1 Introduction

Augmented reality (AR) is a technology that combines the real world with artificial elements. These elements are usually visual, but the augmentation does not have to be limited to that; audio augmentation, for example, is easy to imagine (e.g. [1] and [2]). AR combines well with mobile phones: they are ubiquitous devices with considerable computational power and many different sensors, including a camera, which makes them a natural platform for AR applications (e.g. [3] and [4]).

A typical AR system uses a camera and an associated display to capture information about the real world. The obtained information is then used to augment the image. AR systems always need some information about the real world in order to provide their function, and marker detection is one possible approach (e.g. [5] and [6]). Marker detection has been a widely researched topic in the past, as there are many possible uses for it, and it has gained widespread attention with the advent of AR in recent years.

There is also an effort to develop marker-less tracking (e.g. [7] and [8]). Marker-less methods extract information from the real scene only, which is usually more difficult and requires more computation time. Marker-based tracking, in contrast, registers artificial markers placed in the real scene; the marker is known upfront and is therefore easily recognized.

Our goal is to create a marker-based detection algorithm that is fast, precise and flexible. A fast algorithm leaves more time for the actual application that uses the result of the detection. Our solution should be able to detect a change in the position of the camera and, at the same time, should work with any possible shape of the marker, including shapes with holes. It should also not be limited to a single color, because different scenes might require different colors in order to make the marker less obtrusive.

2 Related Work

There are many papers focused on marker detection in a raster image. Belghit et al. [5] developed a technique that uses color marker detection to estimate the perspective transformation matrix for the current view of the camera. They used four markers of different colors, and the detection was achieved by repeated application of a Gaussian filter and the morphological operators of closing and opening. Their solution has certain limitations, however. Mainly, no other objects of the given colors may be present in the scene, as they would influence the detection. Since the authors used blue, red, green and yellow, this limitation proved to be quite significant.

Liu et al. [9] present a color marker-based method for tracking objects in real time. The method is based on contour patterns extracted using adaptive thresholding, and each marker in the image is recognized by its template pattern. They claim that their method can work in environments with various illumination conditions.

Saaidon et al. [10] developed a solution for optical tracking using a color marker, meant for detection and tracking in a medical navigation system. The detection algorithm is designed using the OpenCV library.

Liu et al. [11] proposed a color marker algorithm that is supposed to be robust to occlusion of the marker which sometimes happens in dynamic scenes. They used Hamming check codes to restore parts of the occluded squares in the marker.

A system for marker detection called ARTag is presented in [6]. The solution is meant to encode information inside the marker and read that information with a camera. The goal was to provide an algorithm that is precise, reliable and has a low false positive rate.

ARToolKit [12] is a system similar to ARTag. The user has to set up the system by capturing all markers that are going to be used in the scene; ARToolKit is then able to recognize these markers. The processing time increases as the number of possible markers grows. This marker detection system is widely used in other AR applications because it is easy to use and freely available. The authors of the ARTag system consider the main problem of this solution to be that it often falsely detects areas without any marker [6]. An updated version of the system, ARToolKit Plus, was developed in response and resolved some problems of the previous version.

Paper [13] is devoted solely to comparing these three systems (ARToolKit, ARTag, ARToolKit Plus). The comparison is based on immunity to occlusion and lighting variation, on inter-marker confusion rates and on false negative rates.

Zhang et al. [14] did an evaluation of several marker systems – namely ARToolKit, IGD, HOM and SCR. They focused on usability, efficiency, accuracy and reliability. Presented tests included processing time, recognition of multiple markers and image position accuracy with respect to distance and angle of the camera.

None of the presented solutions solves the marker detection problem sufficiently from our point of view. They either focus on a different output, such as reading more complicated information from the marker, or solve different problems of the detection itself. Their common limitation is that they require the marker to be of a specific shape.

Our previous work [15] presented a marker detection algorithm and its GPU implementation. That algorithm also looked for a marker in the input image. The solution provided good results, but we were aware that there was considerable room to make the actual detection more precise. The paper also contained several prototype applications that demonstrated the usage of the detected marker. These applications were also used to test whether the implementation of the proposed marker detection algorithm is precise and fast enough, and they give a good preview of how these kinds of applications can work.

3 Marker Detection

As explained in the previous part of this contribution, our goal is to design and implement an algorithm for marker detection that is fast and gives precise results. The position of the marker is obtained from image data, which in our use cases is usually a video feed. It is therefore a good idea to develop the solution for a graphics card, because graphics cards are heavily optimized for this kind of processing.

3.1 Algorithm Description

The algorithm that we propose consists of two steps (see Fig. 1). The input is an image representing one frame of the video sequence. In the first step, the number of selected pixels is calculated for each row and each column. Pixels are selected by the similarity of their color to a given predefined color of the marker that we are looking for. The color of a pixel can be thresholded by intervals in the three RGB channels or by the Euclidean distance in RGB space. Another possibility is to transform the RGB values into another color model, for example HSV or HLS, and compare those values. The numbers of selected pixels in each row and column form the output of the first step.

Fig. 1. Illustration of the proposed two-step algorithm
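To make the principle of the first step concrete, the following JavaScript sketch counts the matching pixels per row and per column on the CPU. It is only a reference illustration of the Euclidean-distance variant of the color test; the function and variable names are ours and do not come from the actual implementation.

// Reference sketch of the first step (CPU version, for illustration only).
// imageData: Uint8ClampedArray with RGBA values, e.g. from canvas getImageData();
// markerColor: {r, g, b} in the 0-255 range; maxDistance: RGB distance threshold.
function countMatchingPixels(imageData, width, height, markerColor, maxDistance) {
  const rowCounts = new Uint32Array(height);
  const columnCounts = new Uint32Array(width);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const i = (y * width + x) * 4;
      const d = Math.hypot(
        imageData[i] - markerColor.r,
        imageData[i + 1] - markerColor.g,
        imageData[i + 2] - markerColor.b
      );
      if (d < maxDistance) {        // pixel is similar enough to the marker color
        rowCounts[y]++;
        columnCounts[x]++;
      }
    }
  }
  return { rowCounts, columnCounts };
}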

The second step takes as its input the two one-dimensional arrays with the numbers of pixels selected in all rows and columns. A weighted mean is calculated over all values of these arrays: the coordinate of a given column or row is weighted by the number of selected pixels in it. The output of this step is just two numbers, each holding one of the two coordinates of the marker. The principle is quite simple, but a robust and fast implementation is necessary to achieve real-time processing; it is presented in the next section.
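The second step can be sketched in the same reference style; the only computation is the weighted mean of the indices, weighted by the counts produced by the first step (returning -1 when nothing matched is our own illustrative convention):

// Weighted mean of the indices, weighted by the per-row or per-column counts.
// Returns -1 when no pixel matched at all (an illustrative convention).
function weightedMean(counts) {
  let weightedSum = 0;
  let total = 0;
  for (let i = 0; i < counts.length; i++) {
    weightedSum += counts[i] * i;
    total += counts[i];
  }
  return total > 0 ? weightedSum / total : -1;
}

// const markerX = weightedMean(columnCounts);
// const markerY = weightedMean(rowCounts);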

3.2 Improved Versions

The proposed algorithm is robust and can be enhanced in many ways. Probably the most obvious way of obtaining a better result is to look at the pixels in a slightly different way. Normally, every pixel that meets the color requirements is simply counted in. An improved version looks at the pixel more closely and calculates how much it differs from the exact color that we are looking for. This similarity can be defined by the Euclidean distance in RGB space. If the pixel has a color that is almost the same as the color of the marker, the pixel gains more weight in the resulting coordinate. On the other hand, if the color of the pixel is not very close but still appears to be the color of the marker, the pixel has less weight. False positive detections can be reduced by this approach.
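One possible form of such a weighting, assuming the Euclidean RGB distance mentioned above (the linear fall-off towards the threshold is only an illustrative choice):

// Possible weighting of a single pixel: 1 for an exact color match,
// falling linearly to 0 at the rejection threshold (illustrative choice).
function pixelWeight(pixel, markerColor, maxDistance) {
  const d = Math.hypot(
    pixel.r - markerColor.r,
    pixel.g - markerColor.g,
    pixel.b - markerColor.b
  );
  return d < maxDistance ? 1 - d / maxDistance : 0;
}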

Another improvement removes all insignificant rows and columns from the result, as these can slightly disrupt it under some circumstances. The change is made in the second step of the algorithm: when reading the numbers of pixels, it skips all rows and columns whose count is less than a certain threshold. If the marker is big enough, these small disruptions do not have a big impact; the enhancement makes sense mainly when the scene may contain additional pixels of the same color as the marker.
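In code, this filtering could be expressed as follows, reusing the weightedMean helper from the sketch above (the concrete threshold value is hypothetical and would have to be tuned for the scene):

// Skip insignificant rows/columns before computing the weighted mean.
const minCount = 5;                                   // hypothetical threshold
const filteredCounts = counts.map(c => (c >= minCount ? c : 0));
const coordinate = weightedMean(filteredCounts);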

An additional improvement comes from the very core of this design. The observed scene can contain more than one marker; for instance, if there are two markers in the image, our algorithm finds the coordinates of the point exactly in between them. The marker can also contain holes, and it can even be a circle – our algorithm still finds its center, as shown in the results section.

None of the mentioned improvements has any influence on the rendering speed. Especially the first two can have a positive impact on the accuracy and stability of the result.

4 Implementation

Our target platform is web browsers. Just a decade ago it would have been nearly impossible to implement this kind of algorithm in the environment of a web browser, but the WebGL standard changed that: it provides access to the GPU, which excels at parallelization and image processing. By targeting web browsers, our implementation can easily be used not only on desktop systems but also directly on mobile phones, without installing any additional software – everything needed is provided by the web browser. Moreover, all mobile phones nowadays have a camera, which can easily be used as the input for applications.
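For illustration, a minimal sketch of how the camera can serve as the input of the detection in a browser, using the standard getUserMedia API and a WebGL texture upload (the variable names are ours):

// Illustrative camera setup: each video frame becomes the input texture of the detection.
const video = document.createElement('video');
navigator.mediaDevices.getUserMedia({ video: { facingMode: 'environment' } })
  .then(stream => { video.srcObject = stream; return video.play(); });

// Called once per frame before dispatching the two detection passes.
function uploadFrame(gl, texture) {
  gl.bindTexture(gl.TEXTURE_2D, texture);
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, video);
}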

The GPU implementation is designed as a two-pass algorithm. Each pass is processed by the classical visualization pipeline composed of a vertex and a fragment shader, or by a compute shader in the case of a newer WebGL version. Each step renders its result into a texture, so the size of the rendering viewport must be defined accordingly to obtain the appropriate result.
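A minimal render-to-texture setup for one pass could look as follows. It assumes a WebGL 2 context with the EXT_color_buffer_float extension so that the per-row and per-column counts can be written into a float texture; the names are illustrative and do not come from the repository code:

// Render-to-texture target for one pass (WebGL 2 + EXT_color_buffer_float assumed).
function createPassTarget(gl, width, height) {
  const texture = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_2D, texture);
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA32F, width, height, 0, gl.RGBA, gl.FLOAT, null);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
  const framebuffer = gl.createFramebuffer();
  gl.bindFramebuffer(gl.FRAMEBUFFER, framebuffer);
  gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, texture, 0);
  return { texture, framebuffer };
}

// Before drawing a pass, bind its framebuffer and set the viewport to its size,
// e.g. max(width, height) x 2 for the first step and 2 x 1 for the second one.
gl.bindFramebuffer(gl.FRAMEBUFFER, target.framebuffer);
gl.viewport(0, 0, targetWidth, targetHeight);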

4.1 First Step of the Algorithm

The first step calculates the number of pixels with the appropriate marker color. This color is defined in the fragment shader file. The rendered output image is a texture with two rows. The number of output columns is determined by the bigger of the width and the height of the input image. This ensures that each row and each column has a free cell to write its resulting sum into. The output size can be determined by

$$ 2\,*\,max(width,\, height) $$
(1)

Figure 2 illustrates this step with example values. For each pixel in the output texture, the fragment shader sums the pixels with the correct color in the appropriate column or row. This sum is then written as the output value into the texture. The input image is passed in as an ordinary uniform texture sampler. The red zeros mean that there are no rows available for these output pixels – in the example, the image width is bigger than the height.

Fig. 2. A figure illustrating the first step of the algorithm; see the explanation in the text above (Color figure online)
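The fragment shader of the first step could be sketched roughly as below (GLSL ES 3.00 embedded as a JavaScript string). This is a simplified illustration of the idea, not the exact shader from the repository; it assumes that the counts are rendered into a float texture and that the first output row holds the column sums while the second row holds the row sums:

const firstPassShaderSource = `#version 300 es
precision highp float;
uniform sampler2D u_image;      // input frame
uniform vec3 u_markerColor;     // marker color in the 0..1 range
uniform float u_threshold;      // maximum allowed RGB distance
uniform ivec2 u_size;           // width and height of the input image
out vec4 outColor;

void main() {
  ivec2 cell = ivec2(gl_FragCoord.xy);   // cell.y == 0: column sums, cell.y == 1: row sums
  float count = 0.0;
  if (cell.y == 0 && cell.x < u_size.x) {
    for (int y = 0; y < u_size.y; y++) {
      vec3 c = texelFetch(u_image, ivec2(cell.x, y), 0).rgb;
      if (distance(c, u_markerColor) < u_threshold) count += 1.0;
    }
  } else if (cell.y == 1 && cell.x < u_size.y) {
    for (int x = 0; x < u_size.x; x++) {
      vec3 c = texelFetch(u_image, ivec2(x, cell.x), 0).rgb;
      if (distance(c, u_markerColor) < u_threshold) count += 1.0;
    }
  }                                      // otherwise the cell stays zero ("red zeros" in Fig. 2)
  outColor = vec4(count, 0.0, 0.0, 1.0);
}`;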

The first step has a time complexity of \( {\mathcal{O}}\left( {max\left( {width, height} \right)} \right) \), i.e. linear time complexity. The cost of this step – the number of cores that the algorithm requires multiplied by the running time of the algorithm [16] – is \( {\mathcal{O}}\left( {(width + height) \,*\, max\left( {width, height} \right)} \right) \), which means \( {\mathcal{O}}\left( {n^{2} } \right) \). It should be mentioned that the sum of the width and the height is usually at most a few thousand, and modern graphics cards have hundreds to thousands of cores, which means that if there are enough cores or the image is small enough, the cost can be considered linear – the number of cores is then effectively a constant.

4.2 Second Step of the Algorithm

The second step of the algorithm uses the output of the first step and is illustrated by Fig. 3. The output of the second step is a texture of just two pixels, where each pixel contains one coordinate of the detected marker. The coordinate is calculated as a weighted mean, see (2), in which the coordinate of a given column or row is weighted by the count of pixels found there. The result of this step is the coordinates where the center of the marker most probably lies.

Fig. 3. A figure illustrating the second step of the algorithm; see the explanation in the text above (Color figure online)

$$ \frac{{\mathop \sum \nolimits_{i = 0}^{w} \left( {value_{i} * i} \right)}}{{\mathop \sum \nolimits_{i = 0}^{w} value_{i} }} $$
(2)
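A shader-level sketch of this weighted mean, again GLSL ES 3.00 in a JavaScript string and again only illustrative; it assumes the two-pixel output is laid out as one pixel per coordinate and that an empty count array is signalled with a negative value:

const secondPassShaderSource = `#version 300 es
precision highp float;
uniform sampler2D u_counts;   // two-row texture produced by the first step
uniform int u_length;         // max(width, height), the number of columns in u_counts
out vec4 outColor;

void main() {
  int which = int(gl_FragCoord.x);        // 0 -> marker x, 1 -> marker y
  float weightedSum = 0.0;
  float totalCount = 0.0;
  for (int i = 0; i < u_length; i++) {
    float count = texelFetch(u_counts, ivec2(i, which), 0).r;
    weightedSum += count * float(i);
    totalCount += count;
  }
  float coordinate = totalCount > 0.0 ? weightedSum / totalCount : -1.0;
  outColor = vec4(coordinate, 0.0, 0.0, 1.0);   // -1 signals that no marker was found
}`;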

The time complexity is the same as in the first step – \( {\mathcal{O}}\left( {max\left( {width, height} \right)} \right) \), i.e. linear. This step always requires just two cores, and every GPU has more than two cores. Therefore, the cost of this step is \( {\mathcal{O}}\left( n \right) \).

This step can be directly followed by a rendering step that uses the result of the second step. With this approach, the data do not have to leave the GPU memory, which is necessary for fast rendering. The second step can also be merged with a possible third step into one, reducing the overhead of another pipeline call.

4.3 Source Code

The source code of the implementation is publicly available in a GitHub repository: https://github.com/milankostak/Marker-detection/tree/v2.0.

5 Results

The implemented algorithm was tested on preset images as well as in the real world. Selected results of this testing are shown in the following screenshots taken on a mobile phone (Figs. 4, 5, 6 and 7).

Fig. 4. Detected marker in a test environment; basic green shapes of a square and a hexagon are tested. Detection works well in these situations. (Color figure online)

Fig. 5. Detected marker in a test environment with more complicated shapes. On the left is a blue triangle demonstrating that the detection can be set to any color. On the right is a green circle, which shows that our solution works even with shapes that have holes in them. (Color figure online)

Fig. 6. Detected marker in a test environment. On the left is an icon of a person, and our algorithm finds its center. On the right are two markers printed on a paper; the algorithm finds the point in between them. (Color figure online)

Fig. 7. Marker detection in a real-world scene. The marker was drawn on a piece of paper by hand with an ordinary pen, and half of it is covered by a shadow. The algorithm was able to detect the correct marker position even in this difficult scene. (Color figure online)

6 Testing

The proposed algorithm has been thoroughly tested, and the values were compared with our previous approach to marker detection. Testing was done on a laptop with an 8th generation Intel Core i5 CPU, an Nvidia GeForce GTX 1060 GPU, an integrated Intel UHD 630 GPU and 8 GB of RAM. The laptop was running the latest stable version of Windows 10 (version 1803). Testing was done in the Firefox and Chrome browsers. The mobile phone used for testing was a Huawei Honor 9 with a Kirin 960 chipset, running Android 8.0. On the phone, testing was also done in the mobile versions of Firefox and Chrome.

6.1 Time Measurement Methodology

Times were recorded with the help of WebGL extensions because the default WebGL context does not support direct measurement of rendering time. The WebGL 1 extension is called \( {\texttt{EXT\_disjoint\_timer\_query}} \) and for WebGL 2 it is \( {\texttt{EXT\_disjoint\_timer\_query\_webgl2}} \); they both work in the same way. The appropriate \( {\texttt{beginQuery}} \) function is called before the draw (or any other) operation and \( {\texttt{endQuery}} \) is called after it. The elapsed time in nanoseconds is then obtained via a \( {\texttt{getQueryParameter}} \) function call.

With this approach, the query is put into the GPU command queue, so the start time is recorded right before the rendering operation starts and the end time right after it ends. The recording is completely under the control of the GPU, and it therefore captures the exact times when the GPU started and finished the rendering operation. It is then necessary to wait until the result of the query becomes available to the main JavaScript thread running on the CPU. Once the CPU issues draw or query calls, it immediately continues executing the program regardless of whether the GPU has finished. Therefore, the program must wait until the GPU makes the results of its operations available; for this purpose, the \( {\texttt{QUERY\_RESULT\_AVAILABLE\_EXT}} \) parameter is used to check whether the result is already available.
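A minimal usage sketch of the WebGL 2 variant of the extension follows; the draw call being measured is only indicated by a comment, and the result is polled once per animation frame:

const ext = gl.getExtension('EXT_disjoint_timer_query_webgl2');
const query = gl.createQuery();

gl.beginQuery(ext.TIME_ELAPSED_EXT, query);
// gl.drawArrays(...) – the measured rendering operation goes here
gl.endQuery(ext.TIME_ELAPSED_EXT);

// Later (typically in a following frame), poll until the GPU result is ready.
function readElapsedTime() {
  const available = gl.getQueryParameter(query, gl.QUERY_RESULT_AVAILABLE);
  const disjoint = gl.getParameter(ext.GPU_DISJOINT_EXT);
  if (available && !disjoint) {
    const nanoseconds = gl.getQueryParameter(query, gl.QUERY_RESULT);
    console.log(nanoseconds / 1e6, 'ms');
  } else if (!available) {
    requestAnimationFrame(readElapsedTime);   // not ready yet, try again later
  }
}
requestAnimationFrame(readElapsedTime);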

Unfortunately, these extensions are not widely supported, and on mobile phones in particular we were not able to measure the times in any tested browser. On the desktop, the extensions are only supported by browsers based on the Chromium layout engine; neither Firefox nor Edge supports them on the testing device.

Every recorded measurement is an average of one thousand render loops; in all experiments, the first thousand loops are ignored as a warm-up and the second thousand are used. All presented values are rounded to whole microseconds and shown in thousandths of milliseconds for better readability.

6.2 Algorithm Times

The measured times of the basic version of the described algorithm are presented in Table 1. As explained earlier, only the Chromium layout engine supports the necessary WebGL extensions, so testing was done only in it. A comparison of running the same code with WebGL 1 and WebGL 2 contexts can be seen in the same table. The first and the second step of the algorithm are measured. Testing was done on both available GPUs – the integrated Intel UHD 630 and the dedicated NVIDIA GTX 1060 (laptop version). The table shows results for an input image 1280 pixels wide and 720 pixels high, i.e. 921,600 pixels in total. Table 1 also shows that there is no significant difference between WebGL 1 and WebGL 2; the biggest measured difference is around 5%. Therefore, all further values are reported only for WebGL 2.

Table 1. Times required to finish rendering for given steps in the algorithm, input image: 1280 × 720 pixels

The same measurement is shown in Table 2 for a smaller input image, in this case 640 pixels wide and 360 pixels high, i.e. 230,400 pixels in total. That is four times fewer pixels than in the previous case – each dimension is halved. This measurement was done in order to observe the effect of the input size on the rendering times. Table 3 presents how much faster the rendering is. For the less powerful GPU, the times are consistently about two times shorter; for the more powerful graphics card, the rendering of the smaller input is much faster.

Table 2. Times required to finish rendering for given steps in the algorithm, input image: 640 × 360 pixels
Table 3. Comparison of times for different input sizes – width and height are halved, so the resulting image has one quarter of the total pixel count

6.3 Comparison with Previous Solution

Our previous approach to marker detection was implemented in a different way. The key idea was to divide the picture into small areas of 3 × 3 or 4 × 4 pixels and to count pixels independently in these areas; this operation was repeated until the reduced image was small enough. We also wanted to measure whether the new algorithm achieves a shorter detection time. However, our previous approach was measured in a slightly different way and on other devices, as described earlier and published in [15]. In order to make a fair comparison of the two approaches, the previous solution was re-measured with the same methodology as the new one.

Times of the previous solution comparable with the new one, for both the big and the small input image, are presented in Table 4; the measured times are slightly shorter. The reason lies in the design of the previous solution, where every fragment shader performs fewer operations and the number of operations per shader is always constant, although more cores are required for it to function properly. This is more an advantage than a disadvantage, because modern GPUs count on that fact. However, the previous solution always requires reading data from the GPU, because it does not output the coordinates directly; they have to be searched for in the output of its second step, which was done after reading the data from the GPU to the CPU. As explained later, the reading operation is slow. This means that if the values do not have to be read from the GPU, the new solution is faster. For more information about fetching data from the GPU see Sect. 6.4.

Table 4. Measured times of our previous solution with the new measurement methodology for different resolutions of the input image

This approach also does not provide a correct result under all circumstances, because it divides the image into areas that are completely independent. If the marker is split between those areas, the found position may be slightly shifted or, in some extreme cases, the marker may not be found at all.

6.4 CPU Times and Reading from GPU

Reading data from GPU memory is always a slow operation, because it requires synchronization of the main memory and the graphics memory. The operation stalls the execution of the main program until the data are available.

Reading from the GPU is not always needed. If the result is to be used on the same device that performs the marker detection, the information about the found marker does not have to leave the graphics memory and can be reused directly. In some use cases, however, the values have to be obtained and possibly sent to a completely different device. That is the approach we used in our old solution, which employed a mobile phone as an interactive device in an augmented reality system with example applications, and it is the main reason why the previous approach required the marker coordinates to be searched for in the output of its second step. The new solution supports both options: the data can be kept in the graphics memory, or it can be read from the GPU to the CPU as a two-pixel value.
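Reading the two-pixel result back could be done roughly like this, assuming a WebGL 2 context and that the second pass rendered into a framebuffer backed by a 2 × 1 floating-point texture (the variable names are illustrative):

// Read the 2-pixel result of the second step back to the CPU.
gl.bindFramebuffer(gl.FRAMEBUFFER, resultFramebuffer);  // framebuffer of the second pass
const pixels = new Float32Array(2 * 4);                  // 2 pixels, RGBA each
gl.readPixels(0, 0, 2, 1, gl.RGBA, gl.FLOAT, pixels);    // stalls until the GPU is done
const markerX = pixels[0];   // red channel of the first pixel
const markerY = pixels[4];   // red channel of the second pixel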

Our old solution always returns a number of pixels equal to the input number of pixels divided by 144 – each dimension is reduced by a factor of 4 and then by a factor of 3, i.e. by a factor of 12, which gives 144 for both dimensions together. The obtained data then need to be iterated through to find the information about the marker. The new solution, on the other hand, always returns just two pixels.

The times required for these operations are presented in the following tables (Tables 5 and 6). Apart from the device used for testing the GPU performance, a mobile phone was also used to test the CPU performance. Dispatching the algorithm itself is always faster in the new solution. Reading is a trickier operation: it seems to depend on the GPU used, and it may be hard to predict how the reading will behave on other devices. Still, the differences are small, and the overall times remain suitable for smooth rendering at 60 frames per second.

Table 5. CPU times of the new solution for input image with width 1280 px and height 720 px
Table 6. CPU times of the previous solution for input image with width 1280 px and height 720 px

7 Conclusion and Future Work

The proposed algorithm provides a good solution for situations where simple color marker detection is required. Its main advantage lies in its parallelization: modern GPUs usually have hundreds of cores that are heavily optimized for fast image and texture rendering. Choosing WebGL for the implementation ensures that the algorithm can be used in all modern browsers and on all devices, including mobile phones. The implementation of the algorithm proved that the proposed solution is robust and fast enough for real-time video processing. The correctly detected positions of markers of different shapes, printed on paper and observed under real lighting conditions, demonstrated that the detection is precise even under harder circumstances.

A thorough time measurement of the new algorithm was performed and compared with our previous approach; both rendering times and reading times were measured. The rendering times of the new algorithm are slightly longer than those of the previous solution. However, the previous solution always required reading the rendered texture back to the CPU, which proved to be a more than ten times slower operation. The new algorithm does not require it, which makes it faster overall than the old one.

Both versions of WebGL currently lack support for other kinds of shaders; they support only vertex and fragment shaders. Only a couple of months ago, in December 2018, support for WebGL 2 compute shaders was presented for the first time. It is still in an early phase and is only available in so-called nightly builds of the Chrome browser. As soon as this support stabilizes, our algorithm could easily be implemented with this technology, possibly gaining better performance.