
1 Introduction

In the latter half of the 1990s, as general-purpose computers evolved and it became possible to process large amounts of data at high speed, methods that extract feature vectors called image local features from an image and feed them to machine learning became the mainstream of image recognition. Machine learning requires a large number of class-labeled samples, but because researchers do not need to design recognition rules by hand as in rule-based methods, highly versatile image recognition can be realized. In the 2000s, handcrafted features such as the Scale-Invariant Feature Transform (SIFT) and the Histogram of Oriented Gradients (HOG) were studied as image local features, and in the 2010s deep learning, which automatically acquires the feature extraction process through learning, attracted attention. A handcrafted feature is not necessarily optimal because it extracts and represents features based on human knowledge; deep learning is a new approach that can automatically extract features that are effective for recognition. Image recognition by deep learning achieved overwhelming results in object recognition challenges, and the application of deep learning to various fields has progressed since then. In this paper, we explain how deep learning is applied to computer vision and autonomous driving, how it solves these problems, and the latest trends in deep learning.

2 Problem Setting in Autonomous Driving

In conventional machine learning, it is difficult to solve general object recognition directly from an input image as shown in Fig. 1, so the problem has been solved by subdividing it into the tasks of image classification, object detection, scene understanding, and specific object recognition. The following describes the definition of each task and the approaches to it.

Fig. 1. Subdividing tasks in autonomous driving.

2.1 Image Classification

Image classification is the task of recognizing the class to which an image belongs. In conventional machine learning, an approach called Bag-of-Features (BoF) [4] has been used, which vector-quantizes image local features and represents the whole image as a histogram of visual words. Later, the Fisher vector, which expresses richer information, and the Vector of Locally Aggregated Descriptors (VLAD), which reduces memory usage, were proposed [5, 6]. Deep learning, on the other hand, has achieved accuracy exceeding human recognition performance on a 1,000-class image classification task.
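As a minimal illustration of the BoF representation, the following Python sketch (our own illustration, not code from [4]) vector-quantizes local descriptors against a codebook and builds the normalized histogram; in practice the codebook is learned by k-means over training descriptors.

```python
# Minimal Bag-of-Features sketch: assign each local descriptor to its
# nearest codeword and represent the image as a codeword histogram.
import numpy as np

def bof_histogram(descriptors, codebook):
    """descriptors: (n, d) local features; codebook: (k, d) visual words."""
    # Vector quantization: nearest codeword per descriptor.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    # Histogram of codeword frequencies, L1-normalized.
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy usage: 200 SIFT-like 128-D descriptors, a 64-word codebook.
rng = np.random.default_rng(0)
desc = rng.normal(size=(200, 128))
codebook = rng.normal(size=(64, 128))
print(bof_histogram(desc, codebook).shape)  # (64,)
```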

2.2 Object Detection

Object detection is the task of finding where objects of a certain class are located in the image; face detection and pedestrian detection belong to this task. The combination of Haar-like features and AdaBoost for face detection [2] and the combination of HOG features and a Support Vector Machine (SVM) for pedestrian detection [3] are widely used. In conventional machine learning, object detection is realized by training a two-class classifier for a given category and raster-scanning it over the image. Object detection by deep learning, in contrast, can realize multi-class detection over a plurality of categories with a single network.
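The conventional pipeline can be summarized by the following schematic raster scan; `extract_feature` and `classifier` are hypothetical stand-ins for a HOG extractor and a trained two-class SVM, and a full detector would additionally repeat the scan over an image pyramid to handle scale.

```python
# Schematic raster-scan (sliding-window) detection as used with HOG+SVM.
import numpy as np

def raster_scan(image, extract_feature, classifier, win=(128, 64), stride=8):
    """win is (height, width); classifier returns an SVM decision value."""
    h, w = image.shape[:2]
    detections = []
    for y in range(0, h - win[0] + 1, stride):
        for x in range(0, w - win[1] + 1, stride):
            patch = image[y:y + win[0], x:x + win[1]]
            score = classifier(extract_feature(patch))
            if score > 0:  # positive side of the SVM decision boundary
                detections.append((x, y, win[1], win[0], float(score)))
    return detections
```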

2.3 Semantic Segmentation

Scene understanding is the task of understanding the structure of the scene in an image. In particular, semantic segmentation, which recognizes the object class of every pixel, has been considered difficult to solve with conventional machine learning. Although it was therefore regarded as one of the final problems of computer vision, it is beginning to be shown that it is a task that can be solved by applying deep learning.

2.4 Specific Object Recognition

Specific object recognition is the task of recognizing a specific object class by assigning attributes to objects having proper nouns, and is defined as a subtask of the general object recognition problem. It is realized by detecting feature points in an image with a method such as SIFT [7] and matching them to the feature points of a registered pattern through distance calculation and a voting process. The Learned Invariant Feature Transform (LIFT) [8] replaces each step of the SIFT pipeline with a deep learning based counterpart and improves performance.
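The matching step can be illustrated with OpenCV as below; this is an assumed setup rather than code from [7] or [8], and the file names are placeholders.

```python
# SIFT keypoint matching between a query image and a registered pattern,
# keeping matches that pass Lowe's ratio test before any voting stage.
import cv2

query = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)      # placeholder paths
pattern = cv2.imread("pattern.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_q, des_q = sift.detectAndCompute(query, None)
kp_p, des_p = sift.detectAndCompute(pattern, None)

matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des_q, des_p, k=2)
        if m.distance < 0.75 * n.distance]  # ratio test
print(len(good), "tentative correspondences")
```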

In the following, we focus on the tasks of image classification, object detection, and scene understanding (semantic segmentation), and describe the application of deep learning and its trends.

3 Object Classification

In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a 1,000-class category classification task, deep learning based approaches have been winning since 2012. The Convolutional Neural Network (CNN) [9] is the basic method behind these overwhelming results in large-scale image recognition. AlexNet [10] is constructed from 5 convolution layers, 3 fully connected layers, and an output layer that outputs the probabilities of the 1,000 categories, as shown in Fig. 2. Figure 3 visualizes the first-layer convolution filters of AlexNet obtained by training on a large number of object images from the 1,000 classes; filters responding to texture, color, and oriented edges are acquired automatically by learning.
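The structure in Fig. 2 can be sketched compactly in PyTorch as follows; the channel counts follow [10], but details such as local response normalization and the original two-GPU split are omitted, so this is an illustrative approximation rather than the exact network.

```python
# AlexNet-style network: 5 convolution layers, 3 fully connected layers,
# 1,000-way output (softmax applied at inference time).
import torch
import torch.nn as nn

alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 1000),
)
print(alexnet_like(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```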

Simonyan and Zisserman proposed a deeper network called VGGNet and improved classification performance [11]. GoogLeNet [12] has 22 layers built from Inception modules. When a network becomes deeper, the gradient tends to vanish partway through the network, and backpropagation can no longer train the earlier layers. ResNet [13] proposed a structure with bypass routes through which gradients can backpropagate, realizing a very deep network of 152 layers. As a result, the error rate of ResNet improved to 3.56%. When the same task was performed by humans, the error rate was 5.1%, so the deep learning approach attained recognition performance equal to or better than human ability. This approach has also been applied to the traffic sign recognition task, where it performs comparably to humans. Figure 4 shows results of traffic sign recognition on cluttered backgrounds and dark images.
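The bypass route of ResNet can be illustrated with a minimal residual block; this is a sketch of the mechanism, not the full 152-layer network.

```python
# Residual block: the layers learn a residual F(x) and the input is added
# back through the shortcut, so gradients flow directly along the identity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # shortcut: identity bypass

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```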

Fig. 2. Structure of AlexNet.

Fig. 3. Convolution filters of the 1st layer of AlexNet [10].

Fig. 4. Result of traffic sign recognition.

4 Object Detection

Conventional object detection raster-scans a classifier over the image. Among deep learning methods, R-CNN [14] extracts object candidate regions by Selective Search [15] and extracts features from each region with AlexNet to perform multi-class classification. Selective Search detects object candidate regions by repeatedly grouping segmented regions with similar color and texture under various threshold values. The realization of CNN-based, multi-class object detection was epoch-making, but because Selective Search repeatedly merges regions to obtain the candidates, it is time consuming.
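The R-CNN pipeline can be summarized schematically as follows; `selective_search` and `cnn_classifier` are hypothetical stand-ins for the proposal method [15] and a fine-tuned AlexNet classifier.

```python
# Schematic R-CNN: crop each proposal, warp it to the CNN input size,
# and classify it; proposals assigned to the background class are dropped.
import cv2
import numpy as np

def rcnn_detect(image, selective_search, cnn_classifier, size=(224, 224)):
    """selective_search and cnn_classifier are hypothetical callables."""
    detections = []
    for (x, y, w, h) in selective_search(image):          # region proposals
        crop = cv2.resize(image[y:y + h, x:x + w], size)  # warp to CNN input
        scores = cnn_classifier(crop)                     # per-class scores
        cls = int(np.argmax(scores))
        if cls != 0:                                      # class 0 = background
            detections.append((x, y, w, h, cls, float(scores[cls])))
    return detections
```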

Fig. 5. Structure of Faster R-CNN.

Faster R-CNN [16] introduces the Region Proposal Network (RPN) shown in Fig. 5, and detects object candidate regions and recognizes object classes simultaneously. First, convolution processing is performed on the entire image to obtain a feature map. The RPN detects objects by sliding a detection window over the obtained feature map, and introduces a detection mechanism called anchors: k detection windows with different sizes and aspect ratios, as shown in Fig. 6. The output layer outputs an object score and coordinates for the region specified by each anchor. In addition, each region is input to separate fully connected layers to recognize the object class. By using this region proposal method, it is possible to detect objects of multiple classes with different aspect ratios.

Single-shot methods are attracting attention as a novel multi-class object detection approach: they detect multiple objects by giving the entire image to a CNN only once, without a sliding window. YOLO (You Only Look Once) [17] is a representative method; it detects multiple objects in the local regions defined by dividing the image into a \( 7 \times 7 \) grid, as shown in Fig. 7. First, a feature map is generated through convolution and pooling of the input image. Position (i, j) of each channel of the obtained feature map (\( 7 \times 7 \times 1024 \)) is a region feature corresponding to grid cell (i, j) of the input image, and this map is fed into fully connected layers. The output layer has units corresponding to the object score, the object coordinates and size, and the score of each category for every grid cell.
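The anchor mechanism can be illustrated by enumerating k = 9 windows per feature-map position; the scales and aspect ratios below are illustrative values in the spirit of [16].

```python
# Enumerate k anchors (3 scales x 3 aspect ratios) at every feature-map
# position, mapped back to image coordinates via the feature stride.
import numpy as np

def make_anchors(fm_h, fm_w, stride=16, scales=(128, 256, 512),
                 ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for i in range(fm_h):
        for j in range(fm_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride  # center in image
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, w, h))
    return np.array(anchors)  # (fm_h * fm_w * k, 4), k = 9 here

print(make_anchors(14, 14).shape)  # (1764, 4)
```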

Fig. 6. Scanning using anchors [16].

Fig. 7. Structure of YOLO.

Since YOLO detects objects along a coarsely defined grid, it is not robust to changes in scale, especially for small objects. The Single Shot MultiBox Detector (SSD) [18] outputs object coordinates and category scores from several convolution layers, as shown in Fig. 8. In SSD, small objects are detected in layers closer to the input, and large objects in layers closer to the output; feature maps closer to the input layer are less affected by the resolution reduction caused by pooling. Because SSD outputs object coordinates and categories directly for each position of the feature maps, there is no need to estimate the object category via another network, and fast object detection is possible. Figure 9 shows a result of pedestrian detection by SSD.
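The multi-scale prediction of SSD can be sketched as small convolutional heads attached to feature maps of decreasing resolution; the channel counts and number of default boxes below are illustrative.

```python
# SSD-style heads: each selected feature map predicts, per position and per
# default box, class scores and 4 box offsets via a 3x3 convolution.
import torch
import torch.nn as nn

num_classes, boxes_per_pos = 21, 4
feature_maps = [torch.randn(1, 512, 38, 38),   # early layer -> small objects
                torch.randn(1, 1024, 19, 19),
                torch.randn(1, 512, 10, 10)]   # late layer -> large objects

heads = nn.ModuleList([
    nn.Conv2d(fm.shape[1], boxes_per_pos * (num_classes + 4), 3, padding=1)
    for fm in feature_maps])

preds = [head(fm) for head, fm in zip(heads, feature_maps)]
for p in preds:
    print(tuple(p.shape))  # (1, 100, 38, 38) (1, 100, 19, 19) (1, 100, 10, 10)
```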

Fig. 8. Structure of SSD [18].

5 Semantic Segmentation

In the computer vision field, semantic segmentation is a task of high difficulty that has been studied for many years, and it was thought that it would take time to realize highly accurate semantic segmentation. However, as in other tasks, methods based on deep learning have been proposed and achieve performance that exceeds conventional methods.

Fig. 9. Result of pedestrian detection.

Fig. 10. Structure of Fully Convolutional Network.

The Fully Convolutional Network (FCN) [20] is a method capable of end-to-end learning and labeling using only a CNN. The structure of the FCN is shown in Fig. 10. The FCN is a network structure without fully connected layers. As the convolution and pooling layers are repeated, the feature map becomes smaller; to restore it to the size of the original image, the feature map is upsampled by a factor of 32 in the final layer and convolution processing is applied. This is called deconvolution. The final layer outputs a probability map for each class to be labeled. Upsampling the feature map in this way yields only coarse segmentation results. To obtain finer results, the FCN integrates intermediate feature maps. Generally, feature maps closer to the input layer retain detailed information that is lost by the pooling process; such detail is unnecessary for object recognition but important for semantic segmentation. Therefore, the FCN integrates intermediate feature maps before the output layer. There are several variants, FCN-32s, FCN-16s, and FCN-8s, depending on the feature maps used for this integration. FCN-8s integrates the feature maps after the third and fourth pooling stages with the input of the final layer; to match their sizes, the feature map of the fourth pooling stage is upsampled by a factor of 2 and the feature map before the last layer by a factor of 4. The segmentation result is obtained from these integrated feature maps.
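The FCN-8s fusion can be sketched as follows; bilinear upsampling stands in for the learned deconvolution of [20], and the class-score maps are random placeholders for the network outputs.

```python
# FCN-8s skip fusion: upsample and sum class-score maps from three depths,
# then upsample the fused map 8x to the input resolution.
import torch
import torch.nn.functional as F

n_cls = 21
score_final = torch.randn(1, n_cls, 7, 7)    # stride-32 predictions
score_pool4 = torch.randn(1, n_cls, 14, 14)  # stride-16 predictions
score_pool3 = torch.randn(1, n_cls, 28, 28)  # stride-8 predictions

up = lambda t, s: F.interpolate(t, scale_factor=s, mode="bilinear",
                                align_corners=False)
fused = score_pool4 + up(score_final, 2)
fused = score_pool3 + up(fused, 2)
out = up(fused, 8)
print(tuple(out.shape))  # (1, 21, 224, 224)
```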

The FCN needs to store intermediate feature maps, so its memory usage is large. SegNet [21, 22] employs an encoder-decoder architecture that does not need to store intermediate feature maps. The encoder of SegNet repeatedly performs convolution and pooling, as shown in Fig. 11(a). In the decoder, the feature maps are upsampled by deconvolution processing and a segmentation result of the original image size is output. During encoding, as shown in Fig. 11(b), the positions selected by max pooling are stored and referred to when upsampling the feature maps in the decoder. As a result, detailed information can be restored without keeping intermediate feature maps.
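The pooling-index mechanism of Fig. 11(b) can be illustrated with PyTorch's max pooling and unpooling:

```python
# SegNet-style index-based upsampling: the encoder's max pooling records the
# argmax positions, and the decoder's unpooling writes values back there.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 64, 224, 224)
pooled, indices = pool(x)            # encoder: keep selected positions
restored = unpool(pooled, indices)   # decoder: sparse upsampling via indices
print(tuple(pooled.shape), tuple(restored.shape))
# (1, 64, 112, 112) (1, 64, 224, 224)
```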

Fig. 11. Structure of SegNet.

PSPNet [23] obtains rich information at different scales with its Pyramid Pooling Module, which extracts features at multiple scales as shown in Fig. 12. The Pyramid Pooling Module downsamples the feature map to \(1\,\times \,1\), \(2\,\times \,2\), \(3\,\times \,3\), and \(6\,\times \,6\) and applies convolution to each. The pooled maps are then upsampled to the original size and concatenated with the input feature map, convolution is applied once more, and a probability map of each class is output.
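A sketch of the Pyramid Pooling Module is shown below; the channel dimensions are illustrative, and the 1x1 convolutions that reduce channels before upsampling follow the spirit of [23].

```python
# Pyramid Pooling Module: pool to 1x1, 2x2, 3x3, 6x6, reduce channels,
# upsample back, and concatenate with the input feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.bins = bins
        self.convs = nn.ModuleList([
            nn.Conv2d(in_ch, in_ch // len(bins), 1) for _ in bins])

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x]
        for bin_size, conv in zip(self.bins, self.convs):
            p = F.adaptive_avg_pool2d(x, bin_size)      # pool to bin x bin
            p = F.interpolate(conv(p), size=(h, w),
                              mode="bilinear", align_corners=False)
            feats.append(p)
        return torch.cat(feats, dim=1)                  # 2 * in_ch channels

ppm = PyramidPooling(2048)
print(tuple(ppm(torch.randn(1, 2048, 60, 60)).shape))  # (1, 4096, 60, 60)
```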

Fig. 12. Structure of PSPNet [23].

In addition, high accuracy has been achieved on the Cityscapes dataset [24], which was recorded with an in-vehicle camera. We also propose a scale-aware semantic segmentation method that targets small objects in particular, as shown in Fig. 13. The contributions of the method are (1) feeding the features of small regions forward through multiple skip connections and (2) extracting context from multiple receptive fields through multiple dilated convolution blocks. The proposed method achieves high accuracy on the Cityscapes dataset, as shown in Fig. 14; compared with state-of-the-art methods, it achieves comparable performance on the category IoU and iIoU metrics.
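The effect of dilated convolution, which underlies the multiple receptive fields mentioned above, can be illustrated as follows; this is a sketch of the general idea, not our exact block design.

```python
# Dilated convolutions enlarge the receptive field without downsampling:
# a 3x3 kernel with dilation r covers an effective (2r+1) x (2r+1) area.
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    def __init__(self, ch, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates])

    def forward(self, x):
        # Sum branches with 3x3, 5x5, and 9x9 effective receptive fields.
        return sum(branch(x) for branch in self.branches)

block = DilatedBlock(256)
print(tuple(block(torch.randn(1, 256, 64, 64)).shape))  # (1, 256, 64, 64)
```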

Fig. 13. Structure of our network.

Fig. 14. Result of semantic segmentation.

6 Conclusion

In this paper, we explained how deep learning is applied to image recognition tasks. Although the network model and the way it is applied differ by task, the problems solvable by deep learning are those where a mapping function can be found from a large amount of data with accurate teacher labels. In the future, learning from small amounts of data, that is, semi-supervised learning with a small amount of labeled data and a large amount of unlabeled data, remains a challenge for deep learning. Furthermore, we hope to realize end-to-end learning, including reinforcement learning, so that deep learning can simultaneously acquire the recognition processes required to generate better robot motion.