# Blind Spot Obstacle Detection from Monocular Camera Images with Depth Cues Extracted by CNN


## Abstract

Images from a monocular camera can be processed to obtain depth information about obstacles in the blind spot area captured by the side-view camera of a vehicle. The depth information is given as a classification result, “near” or “far,” obtained when two blocks in the image are compared with respect to their distances, and it can be used for blind spot area detection. In this paper, the depth information is inferred from a combination of blur cues and texture cues and is estimated by comparing the features of two image blocks selected within a single image. A preliminary experiment demonstrates that a convolutional neural network (CNN) model trained by deep learning on a set of relatively ideal images achieves good accuracy. The same CNN model is applied to distinguish near and far obstacles in the vehicle blind spot area according to a specified threshold, and promising results are obtained. The proposed method uses a standard blind spot camera and can improve safety without additional sensing devices. Thus, the proposed approach has the potential to be applied in vehicular applications for detecting objects in the driver’s blind spot.

## Keywords

Coarse-to-fine analysis · Convolutional neural network · Blind spot detection · Principal component analysis · Discrete cosine transformation

## Abbreviations

- CNN: Convolutional neural network
- BSD: Blind spot detection
- ADAS: Advanced driver assistance systems
- PCA: Principal component analysis
- DCT: Discrete cosine transformation

## 1 Introduction

In the real world, advanced driver assistance systems (ADAS) increasingly act and interact with complex environments to support driving tasks. For this purpose, more and more complex environmental detection techniques are being developed with multiple sensors. These sensors can include sonar, radar, LiDAR, and cameras. To implement proper ADAS function behavior, the functional system is often based on a model that is combined with the sensor signal inputs.

Sensor technology provides the basic external information for the functional module system. In a vehicle, imaging data and ranging data (i.e., distance and speed) are collected by sensors. The reactions to objects are appropriately calculated using the model algorithm. Because of the complexity of vehicle dynamics and the ego vehicle environment, the determined detection algorithm is very often adaptively calibrated. The object detection algorithm is often insufficient for the overwhelming complexity of the real world. These functional insufficiencies can lead to unintended system behavior. To make the detection algorithm more robust, artificial intelligence approaches [1] and models such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and deep belief networks (DBNs) are increasingly used for deep learning and optimization.

In the area of ADAS, low-cost monocular cameras are widely used to provide driving assistance functions. Such camera sensors can accomplish many tasks, including the detection of lane marks, vehicles, and pedestrians, as well as determining the distance from the ego vehicle to an obstacle. The image data returned by a monocular camera sensor can be used to issue lane departure warnings, blind spot obstacle warnings, collision warnings, or to design a lane keep assist control system. In conjunction with other sensors, it can also be used to implement an emergency braking system and other safety–critical features.

In the real world, many fatal accidents are caused by blind spot ignorance, especially in the case of long trucks or buses negotiating corners. In Europe and Japan, the car rearview and side-view mirrors can be replaced by camera monitoring systems. Therefore, commercial mirror-less cars can be expected soon, and the applied ADAS functions based on camera image processing in such monitoring systems are of vital importance.

Normally, whether a car is close to the blind spot zone is judged using 2D image information. For example, a classifier has been used to determine whether an image region is a vehicle or non-vehicle based on a feature vector [2], and vehicle shadow/light location information in the image can be used to perform blind spot detection [3].

In this paper, we use depth information inferred from the 2D image, together with relative position, to make BSD more robust. Depth information is the key to recognizing near and far areas [4, 5]. In the field of computer vision, constructing a depth map from a single 2D image is a challenging task [6]. It is difficult to obtain precise depth information when a single 2D image offers no other reference parameter. The depth information discussed here is therefore a relative scale rather than an absolute value. Conventional methods of estimating depth from images rely on structure from motion, binocular stereo, and multi-view stereo. These methods require multiple views of the scene under different lighting conditions. To overcome this limitation, monocular depth estimation formulated as a supervised learning task attempts to predict the depth of each pixel in an image directly with an off-line trained model. Learning-based methods have proved effective for depth estimation from single images. For example, a CNN training objective has been proposed that learns single-image depth estimation through an image reconstruction loss computed on generated disparity images [7]. However, this approach uses binocular stereo footage to enforce consistency between the disparities produced relative to the left and right images.

In this paper, the monocular depth cue plays the key role. Supposing that the focus of the monocular camera is set at infinity, the depth information is inferred from a combination of blur cues [8, 9] and texture cues [10, 11, 12]. Both the blur cues and the texture cues are extracted from the image. The texture cues are the density of the edges inside a block, while the blur cues are the degree of blur and the sharpness of the edges. We propose to extract these local cues from the monocular image. The depth estimate is derived by comparing the features of two image blocks selected within a single image. The proposed algorithm uses the following methods.

*Coarse-to-fine analysis* is used to increase the probability of feature extraction from the limited image block area, because the information of interest generated from large-scale features may be lost in high-resolution image blocks, but will be included in lower-resolution image blocks. The low-resolution level is ideal for an overview of the image scene, whereas more detail can be found at higher- or finer-resolution levels [13].

*Principal component analysis (PCA)* is applied to extract the most important information from the image data [14]. PCA is a statistical technique used for image data compression and data structure analysis. It is used to extract the edge lines as texture cues from a single image. To classify the edge line orientations for higher spatial frequency density, we configure the PCA results into four categories. This allows us to obtain clearer texture cues with higher spatial frequency density along the edge line orientation.

*Discrete cosine transform (DCT)* is applied to extract the features of texture cues and blur cues from the image. DCT detects the edge line density allocated in the spatial frequency domain with regard to the texture cues. For the blur cues, DCT detects the sharpness of the edges.

*Convolutional neural network (CNN)* is applied to classify the depth cues from the DCT analysis results. Through deep learning, a better trained neural network model with acceptable accuracy is developed.

The effectiveness of the proposed approach is evaluated through a series of case studies [15]. In this paper, the outline of the proposed approach is described, and one of the test results is shown. The application for BSD purposes is evaluated using real road traffic image data.

The proposed method is found to be able to detect depth cue information for the purpose of BSD. The adaptation to more complex environments and improved algorithm performance will be considered in future work.

## 2 Algorithm

The basic theory and methods discussed in this paper are now briefly introduced. The proposed method uses both texture cues and blur cues to obtain depth information, whereas existing methods only utilize one of these cues [12]. Depth perception allows us to perceive the world around us in three dimensions and to estimate the distance between ourselves and other objects. One of the techniques for depth perception involves monocular cues. When we perceive the world around us, many of these monocular cues work together to contribute to our estimate of depth. In computer vision, the image taken by a monocular camera is what the computer sees. When the monocular camera is focused on an object located far away, the image content near the camera becomes blurred, while the more distant area shows clear texture cues. This correlation between blur cues and texture cues can be interpreted as a depth cue. The monocular depth cues are thus derived from an image that is characterized by texture cues and blur cues. The texture cues refer to the density of the edges inside the block, whereas the blur cues are the degree of blur and the sharpness of the edges. In this study, we mainly focus on the local cues to simplify the computations.

Starting from this point, we attempt to enhance the effectiveness of the extraction of texture and blur cues. Several relevant concepts are now introduced.

The multi-resolution image representation of coarse-to-fine analysis is discussed first. PCA is not described in detail, as it is a popular method of extracting features from image data. PCA is used for image data preparation in the proposed method, with the edge lines extracted through eigenvalue and vector analysis of the image covariance matrix. The DCT is applied to identify the feature distribution density of the edge lines for the object image. The proposed approach uses the combined application of PCA and DCT.

### 2.1 Coarse-to-Fine Analysis

### 2.2 PCA for Image Processing

#### 2.2.1 Principal Component Analysis

Consider a set of data points in the *x*–*y* plane. From the point data, we can calculate the variance \( c\left( {x,\;x} \right) \) in the *x*-direction and the variance \( c\left( {y,\;y} \right) \) in the *y*-direction. However, the horizontal and vertical spread of the data does not explain the clear diagonal correlation. If the *x*-value of a data point increases, the *y*-value also increases, resulting in a positive correlation. This correlation can be captured by extending the notion of variance to what is called the “covariance” of the data: \( c\left( {x,\;x} \right) \), \( c\left( {y,\;y} \right) \), \( c\left( {x,\;y} \right) \) and \( c\left( {y,\;x} \right) \). These four values can be summarized in a matrix, which is called the covariance matrix:

\[ \begin{pmatrix} c\left( {x,\;x} \right) & c\left( {x,\;y} \right) \\ c\left( {y,\;x} \right) & c\left( {y,\;y} \right) \end{pmatrix} \]

From this matrix, two eigenvalues \( \lambda_{1} \) and \( \lambda_{2} \) and two eigenvectors \( v_{1} \) and \( v_{2} \) can be obtained. The line characteristics can then be derived from the eigenvectors and eigenvalues. The PCA process is illustrated in Fig. 3, where the eigenvalues and eigenvectors are also listed.

The eigenvector associated with the largest eigenvalue of the covariance matrix always lies along the direction of the largest variance of the data, and that eigenvalue gives the variance along this direction. The eigenvector associated with the second-largest eigenvalue is always orthogonal to the first and lies along the direction of the second-largest spread of the data.
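The relationship between the covariance matrix and the dominant line direction can be illustrated with a minimal NumPy sketch. The synthetic point cloud and the helper name `principal_axes` are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def principal_axes(points):
    """Eigen-decompose the 2x2 covariance matrix of 2D points.

    Returns eigenvalues in descending order, the matching unit
    eigenvectors as columns, and the covariance matrix itself.
    """
    pts = np.asarray(points, dtype=float)
    # Rows are samples, columns are x and y:
    # cov = [[c(x,x), c(x,y)], [c(y,x), c(y,y)]]
    cov = np.cov(pts, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order], cov

# Hypothetical data: points scattered around a line of slope 1
rng = np.random.default_rng(0)
t = rng.normal(size=500)
points = np.column_stack([t, t + 0.1 * rng.normal(size=500)])

lam, vecs, cov = principal_axes(points)
# Orientation of the dominant direction, folded into [0, 180) degrees
angle_deg = np.degrees(np.arctan2(vecs[1, 0], vecs[0, 0])) % 180.0
```

For this diagonal cloud the first eigenvector points along the diagonal, and the ratio of the two eigenvalues indicates how line-like the data are.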

#### 2.2.2 Four Proposed Categories

\( e_{1} \) and eigenvalue \( \lambda_{1} \):

Accordingly, the edge lines can be classified into the following four categories (see Fig. 5).

### 2.3 Edge Line Density Extraction by Discrete Cosine Transform

#### 2.3.1 Discrete cosine transform

To determine the edge line density and sharpness in the frequency domain, a transformation is required to deal with the image in the spatial domain. The DCT helps separate the image into parts of differing importance with respect to visual quality. The DCT is similar to the discrete Fourier transform: it transforms an image from the spatial domain to the frequency domain, but it can approximate lines well with fewer coefficients using only real numbers. DCTs operate on real data with even symmetry.

The 1D DCT (*N* data items) can be written as:

The 2D DCT (*N* × *M* image) is:
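As a reference sketch, assuming the widely used orthonormal DCT-II convention, the 1D and 2D transforms can be expressed as follows. The normalization and the symbols \( f \), \( F \), and \( \alpha \) are assumptions and may differ from the convention intended here; \( u \) indexes the vertical frequency and \( v \) the horizontal frequency.

```latex
% 1D DCT of N data items f(0), ..., f(N-1)
F(u) = \alpha_N(u) \sum_{i=0}^{N-1} f(i)\,
       \cos\!\left[\frac{\pi (2i+1)\, u}{2N}\right],
\qquad
\alpha_N(u) =
\begin{cases}
  \sqrt{1/N}, & u = 0 \\
  \sqrt{2/N}, & 1 \le u \le N-1
\end{cases}

% 2D DCT of an N x M image f(i, j)
F(u, v) = \alpha_N(u)\,\alpha_M(v)
          \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} f(i, j)\,
          \cos\!\left[\frac{\pi (2i+1)\, u}{2N}\right]
          \cos\!\left[\frac{\pi (2j+1)\, v}{2M}\right]
```

With this normalization the transform is orthogonal, so the total energy of the block is preserved in the spectrum.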

#### 2.3.2 Line Density Expressed by DCT Spectrum

Because the binarized image texture produced by the PCA process is transformed by the DCT, the edge line density in the image can be expressed by the spectral density of the spatial frequency. In other words, the DCT transforms the edge distribution of the image into a frequency distribution.

The DCT is used to calculate the spatial frequency of the selected image block. According to the properties of the DCT, low frequencies are concentrated in the top left of the spectrum, and high frequencies are concentrated in the bottom right.
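The following NumPy sketch illustrates this property on two synthetic binarized blocks. It assumes an orthonormal DCT-II. The helper `high_freq_ratio` and the choice of the top-left quarter as the "low-frequency" region are illustrative assumptions, not part of the proposed method.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix D, so that D @ x is the 1D DCT of x."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)  # the zero-frequency row has its own scale
    return d

def dct2(block):
    """2D DCT-II: transform along the rows axis, then along the columns axis."""
    h, w = block.shape
    return dct_matrix(h) @ block @ dct_matrix(w).T

def high_freq_ratio(block):
    """Share of non-DC spectral energy outside the top-left quarter."""
    spec = dct2(np.asarray(block, dtype=float)) ** 2
    spec[0, 0] = 0.0            # ignore the mean brightness
    h, w = spec.shape
    total = spec.sum()
    low = spec[: h // 2, : w // 2].sum()
    return 0.0 if total == 0 else 1.0 - low / total

# Hypothetical binarized blocks: one single edge vs. densely packed edges
sparse_edges = np.zeros((64, 64))
sparse_edges[:, :32] = 1.0
dense_edges = np.tile([1.0, 0.0], (64, 32))   # one-pixel-wide stripes

ratio_sparse = high_freq_ratio(sparse_edges)
ratio_dense = high_freq_ratio(dense_edges)
```

The block with a single edge keeps almost all of its energy near the top left of the spectrum, whereas the densely striped block pushes its energy toward the high-frequency side.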

In Fig. 6, *k* denotes the wave number period in length units, and *v* and *u* denote the horizontal and vertical frequencies of 2D waves, respectively.

DCT produces more obvious features. In the DCT results, the low-frequency density part is concentrated in the top left of the DCT map, and the high-frequency density part is concentrated in the bottom right [16]. This frequency distribution feature is fed to CNN for deep learning.

The PCA image data processing algorithm and DCT spatial frequency analysis method have been described in this section. CNN is described in the next section for the estimation of the depth cues.

## 3 Depth Cues Derived by CNN

- (1) It reduces the number of weights in the calculation process;
- (2) It is robust in terms of object recognition under slight distortions, deformations, and other interference;
- (3) It has the characteristics of automatic learning and feature induction;
- (4) It is not sensitive to changes in object position in the image.

### 3.1 Convolutional Neural Network

- (1) Convolutional layer;
- (2) Pooling layer;
- (3) Fully connected layer.

The outline of a CNN structure is shown in Fig. 9.

### 3.2 Proposed Structure Framework

CNN is trained to learn the far and near information, which is the basis to classify the depth information of “near” or “far.” The depth estimation is then obtained through the feature comparison of two different blocks on the image.

The flow of the proposed structure framework is as follows:

- ① Apply the coarse-to-fine analysis to the blocks selected from the image to obtain different resolution pictures. In this study, four resolution levels are used for the coarse-to-fine process. Thus, there are four pictures of different resolutions for each block.
- ② In the texture cue path, apply PCA with the four-category process to Block A and Block B. The PCA results are then applied to the DCT (③-1) to get the feature map.
- ③ In the blur cue path, PCA is not applied in order to retain the blur features of Block A and Block B. The DCT (③-2) calculation is applied directly to obtain the feature map.
- ④ The DCT feature maps are concatenated and aggregated.
- ⑤ Through this concatenation, the output is obtained by the CNN deep learning process.

The details of this workflow are described in the following sections.

#### 3.2.1 Coarse-to-Fine Analysis

The image is preprocessed by the coarse-to-fine analysis with a multi-resolution image representation, where four resolution levels are used in this study. The coarse-to-fine representation can cover more information for image processing.
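A minimal sketch of such a multi-resolution representation is given below. Only the number of levels (four) is stated above; the factor-of-two reduction per level, the 2 × 2 averaging, and the function name `coarse_to_fine` are assumptions for illustration.

```python
import numpy as np

def coarse_to_fine(block, levels=4):
    """Return `levels` versions of `block`, finest first.

    Each coarser level halves the height and width by 2x2 averaging
    (an assumed downsampling scheme).
    """
    pyramid = [np.asarray(block, dtype=float)]
    for _ in range(levels - 1):
        img = pyramid[-1]
        # Crop to even dimensions so the image splits into 2x2 cells
        h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
        img = img[:h, :w]
        # Group pixels into 2x2 cells and average each cell
        pyramid.append(img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyramid

# Hypothetical 64 x 64 block
block = np.arange(64 * 64, dtype=float).reshape(64, 64)
pyr = coarse_to_fine(block)
shapes = [p.shape for p in pyr]
```

The coarsest level summarizes large-scale structure of the block, while the finest level keeps the detail needed for edge sharpness.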

#### 3.2.2 Texture Cue Path

After the PCA preprocessing, DCT is used to calculate the spatial frequency of the selected block. Low frequencies are used to evaluate the degree of blur cues, i.e., the degree of blur and the sharpness of the edge. Both low and high frequencies are used to evaluate the density of the edges inside the block, which is the texture cue.

#### 3.2.3 Blur Cue Path

#### 3.2.4 Data Concatenation

The DCT results must be concatenated for input to the CNN. The data concatenation results in an array of size *C* × *H* × *W* = 2 × 64 × 64.
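The channel-first layout can be sketched with NumPy as follows. Which feature maps make up the channels is an assumption here (one texture-path map and one blur-path map), and the function name is hypothetical.

```python
import numpy as np

def concatenate_feature_maps(texture_map, blur_map):
    """Stack two H x W DCT feature maps into one C x H x W array (C = 2)."""
    tex = np.asarray(texture_map, dtype=np.float32)
    blur = np.asarray(blur_map, dtype=np.float32)
    if tex.shape != blur.shape:
        raise ValueError("feature maps must share the same H x W shape")
    # axis=0 puts the channel dimension first: C x H x W
    return np.stack([tex, blur], axis=0)

# Hypothetical 64 x 64 feature maps
texture_map = np.ones((64, 64))
blur_map = np.zeros((64, 64))
cnn_input = concatenate_feature_maps(texture_map, blur_map)
```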

where \( 1 \le k \le P - 8 + 1 \) and \( 1 \le l \le P - 8 + 1 \); \( m_{ij}^{(n)} \) is the element of the filter mask at coordinates *i* and *j* in the block, where *n* is the order number of the matrix; \( C_{kl} \) is the convolution result at position (*k*, *l*); and \( x_{ij}^{(n)} \) is the element of the filter at coordinates *k* − *i* and *l* − *j*. A visual representation of the convolution is shown in Fig. 18.
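The output size \( P - 8 + 1 \) corresponds to sliding an 8 × 8 mask over a *P* × *P* input without padding. A minimal NumPy sketch under that assumption follows. It uses the cross-correlation convention common in CNN libraries (no kernel flip), which may differ from the index convention intended here, and the function name is hypothetical.

```python
import numpy as np

def conv_valid(x, masks):
    """Unpadded sliding-window filtering summed over channels.

    x:     (N, P, P) input, one P x P matrix per channel n
    masks: (N, F, F) filter masks, one per channel (F = 8 in the text)
    Returns a (P - F + 1, P - F + 1) map C[k, l].
    """
    n, p, _ = x.shape
    _, f, _ = masks.shape
    size = p - f + 1
    out = np.zeros((size, size))
    for k in range(size):
        for l in range(size):
            # Element-wise product of the window and the masks, summed over n, i, j
            out[k, l] = np.sum(x[:, k:k + f, l:l + f] * masks)
    return out

# Hypothetical input: two 64 x 64 channels and two 8 x 8 masks of ones
x = np.ones((2, 64, 64))
masks = np.ones((2, 8, 8))
c = conv_valid(x, masks)
```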

### 3.3 CNN Diagram and Parameters

The target is to evaluate the proposed approach of using texture and blur as cues for estimating the depth information. Thus, the training image data must contain blur and texture.

- (1) Image data of size 4 × 64 × 64 are acquired from the concatenation result.
- (2) Convolution layer: the input image is convolved with a set of filters, and each filter produces one feature map in the output. This CNN structure uses three layers.
- (3) Rectified linear units (ReLU) are applied to speed up the convergence of stochastic gradient descent (SGD) and to keep training from stalling in a dead zone without converging.
- (4) Pooling layer: reduces the dimension of the feature vector output by the convolution layer through subsampling.
- (5) Inner product: the fully connected layer.
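The roles of steps (3) to (5) can be sketched in NumPy as below. The layer sizes, the 2 × 2 max pooling, the random weights, and the two-way "near"/"far" output ordering are assumptions for illustration and do not reproduce the trained network.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: keep positive values, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool2x2(fmap):
    """Subsample each channel of a (C, H, W) map by taking 2x2 maxima."""
    c, h, w = fmap.shape
    h2, w2 = h // 2, w // 2
    cells = fmap[:, : h2 * 2, : w2 * 2].reshape(c, h2, 2, w2, 2)
    return cells.max(axis=(2, 4))

def inner_product(features, weights, bias):
    """Fully connected layer applied to the flattened feature maps."""
    return weights @ features.ravel() + bias

# Hypothetical feature maps from a convolution layer
rng = np.random.default_rng(0)
fmap = rng.normal(size=(3, 8, 8))
pooled = max_pool2x2(relu(fmap))

# Hypothetical two-class output layer
weights = rng.normal(size=(2, pooled.size))
scores = inner_product(pooled, weights, np.zeros(2))
label = ("near", "far")[int(np.argmax(scores))]
```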

In the next section, the CNN model and the experimental results are evaluated to verify the parameter settings.

## 4 Evaluation

### 4.1 Experiments for the Proposed Method

Test results

Images for the test | (a) | (b) | (c) | (d) | (e) | (f) | Average |
---|---|---|---|---|---|---|---|
Accuracy of the test | 93.13% | 95.25% | 93.63% | 92.38% | 97.75% | 97.63% | 94.96% |

### 4.2 Experiment for Vehicle Blind Spot Detection

As the focus point of the monocular camera is assumed to be located far beyond the vehicle side mirror view, the images taken by the rear side-view camera have the following features: The near area is the blur zone and the area far from the blind spot is the clear zone.

BSD results

Images for the test | (1) | (2) | (3) | (4) | Average |
---|---|---|---|---|---|
Accuracy of the test | 74.13% | 78.13% | 80.12% | 81.38% | 78.44% |
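The reported averages can be reproduced from the per-image accuracies. The short check below uses only the values listed in the two tables; the helper name `mean_accuracy` is illustrative.

```python
def mean_accuracy(values):
    """Arithmetic mean of per-image accuracies, rounded to two decimals."""
    return round(sum(values) / len(values), 2)

# Per-image accuracies (%) from the two result tables
ideal_images = [93.13, 95.25, 93.63, 92.38, 97.75, 97.63]
bsd_images = [74.13, 78.13, 80.12, 81.38]

ideal_average = mean_accuracy(ideal_images)
bsd_average = mean_accuracy(bsd_images)
```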

In summary, under relatively ideal test conditions, the CNN deep learning model can achieve high accuracy. In terms of vehicle blind area detection, the proposed method is also effective, with an accuracy rate of 78.44%, though the performance must be improved to handle more complex vehicle environments.

## 5 Conclusions and Future Work

### 5.1 Conclusions

- (1) The proposed approach combining coarse-to-fine analysis with the four-category PCA process configuration enhances the effectiveness of depth cue extraction.
- (2) The proposed method is robust, as the detection is not limited to vehicular shape profiles. Any obstacle approaching the ego vehicle can be detected, because the method uses local cues to derive the relative depth information.
- (3) The proposed method does not depend on the image color. However, the detection performance is not good enough for images taken at night.
- (4) Because a shifting window block is used to calculate the local cues across a single image, the total computational cost remains a concern given current microprocessor capacity. This may be overcome by using 7-nm chips and dedicated artificial intelligence chips.

### 5.2 Future Work

More parameter sets and training datasets need to be applied to the proposed method under vehicle driving conditions related to blind spot area detection. With large image datasets of real driving scenes, a more robust CNN model can be developed through deep learning, and higher accuracy could then be achieved. The output from our method can be used as a warning signal to drivers when changing lanes, alerting them to the presence of a vehicle in their blind spot. Real driving situations require online, real-time blind spot detection, so the computational capacity and efficiency of our algorithm must be improved. Future research efforts will focus on this considerable challenge.

## References

- 1. Li, J., Cheng, H., Guo, H., et al.: Survey on artificial intelligence for vehicles. Automot. Innov. **1**(1), 2–14 (2018)
- 2. Li, S.: A new vehicle detection method for blind spot detection system based on DSP. Int. J. Res. Eng. Sci. **4**(5), 27–29 (2016)
- 3. Wu, B.F., Kao, C.C., Li, Y.F., et al.: A real-time embedded blind spot safety assistance system. Int. J. Veh. Technol. (2012). https://doi.org/10.1155/2012/506235
- 4. Pinard, C., Chevalley, L., Manzanera, A., et al.: Learning structure-from-motion from motion. arXiv:1809.04471v1 [cs.CV]. Accessed 12 Sep 2018
- 5. Wu, T.Y., Liu, Y.: Position estimation of camera based on unsupervised learning. arXiv:1805.02020 [cs.CV]. Accessed 5 May 2018
- 6. Vijayanarasimhan, S., Ricco, S., Schmid, C., et al.: Learning of structure and motion from video. arXiv:1704.07804v1 [cs.CV]. Accessed 25 Apr 2017
- 7. Godard, C., Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. arXiv:1609.03677v3 [cs.CV]. Accessed 12 Apr 2017
- 8. Hazirbas, C., Leal-Taixé, L., Cremers, D., et al.: Deep depth from focus. Comput. Vis. Pattern Recognition. arXiv:1704.01085 (2017)
- 9. Carvalho, M., Saux, B.L., Trouvé-Peloux, P., et al.: Deep depth from defocus: how can defocus blur improve 3D estimation using dense neural networks? arXiv:1809.01567v2 [cs.CV]. Accessed 6 Sep 2018
- 10. Huang, X.J., Wang, L.H., Huang, J.J., et al.: A depth extraction method based on motion and geometry for 2D to 3D conversion. In: Third International Symposium on Intelligent Information Technology Application, pp. 294–298 (2009)
- 11. Tsai, T.H., Fan, C.S.: Monocular vision-based depth map extraction method for 2D to 3D video conversion. EURASIP J. Image Video Process. **2016**, 21 (2016)
- 12. Han, K., Hong, K.: Geometric and texture cue based depth-map estimation for 2D-3D image conversion. In: IEEE International Conference on Consumer Electronics (ICCE) (2011)
- 13. Moukari, M., Picard, S., Simon, L., et al.: Deep multi-scale architectures for monocular depth estimation. arXiv:1806.03051v1 [cs.CV]. Accessed 8 Jun 2018
- 14. Hua, J.Z., Wang, J.G., Peng, H.Q., et al.: A novel edge detection method based on PCA. Int. J. Adv. Comput. Technol. **3**(3), 228–238 (2011)
- 15. Guo, Y.X.: Investigation of monocular depth cues obtained through coarse-to-fine image analysis. Thesis of Master Degree, Department of Information Processing, Tokyo Institute of Technology (2017)
- 16. Oliveira, R.S., Cintra, R.J., Bayer, F.M., et al.: Low-complexity 8-point DCT approximation based on angle similarity for image and video coding. arXiv:1808.02950v1 [eess.IV]. Accessed 8 Aug 2018

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.