1 Introduction

With the development of modern traffic, license plate detection (LPD) and recognition technology has attracted increasing attention. It is commonly used in traffic monitoring, highway toll stations, parking lot entrance and exit management, and other practical monitoring systems. Although LPD has achieved great success in recent years, it remains a difficult task in unconstrained scenarios involving rotation, distortion, uneven illumination and blur. Most previous works [1, 14, 18, 26] usually achieve good performance on extremely limited datasets. In some extremely complex real scenes, the performance may be unsatisfactory because the manually collected LP data is insufficient in both quantity and diversity. One intuitive idea is to collect and annotate a massive amount of LP data to obtain better detection results. However, this procedure is time and energy consuming. In addition, human labelling may introduce bias or errors.

Recently, some works [6, 7] have applied synthetic training datasets to object detection. Inspired by the work of Peter Slosar et al. [24], which uses synthetic data for vehicle detection, we propose a novel synthetic data generation approach for license plate detection. Specifically, we use MATLAB (version R2016b) and Blender (version 2.79) with its bundled Python (version 3.6) scripting for the rendering. We simulate various factors affecting license plate acquisition in natural scenes, and render different types of license plate images from different regions. We extract the depth map of each rendered image for pixel-wise segmentation of the license plate in a given view, then compute an axis-aligned 2D bounding box, which realizes automatic labelling of the license plate bounding box. Our synthesized dataset contains LP images with various tilt angles, light intensities, and degrees of blur, which can cover a large diversity of the vehicles in real scenes.

The current state-of-the-art detection methods can be divided into two categories: the two-stage approach [5, 8, 9, 23] and the one-stage approach [17, 21]. However, these methods are not specifically designed for license plate detection, so their accuracy and efficiency on this task may not be optimal. In recent years, some works [6, 16, 25] have introduced dilated convolution and attention mechanisms into object detection. Dilated convolution helps the network expand the receptive field of convolutional kernels and obtain higher resolution features without increasing the number of parameters. Attention mechanisms help the network focus on the object area. Based on these observations, we propose to integrate dilated convolution and an attention model into a unified model, which helps the network detect small license plates and improves the final detection performance. We first propose a dilated convolutional attention enhancement block for license plate detection. It introduces dilated convolution into the Faster R-CNN [23] framework to increase the receptive field of the convolution kernels and obtain higher resolution feature maps. The attention mechanism is then introduced to weight the feature maps and help the neural network achieve better classification performance.

In summary, this paper makes the following contributions:

  • We propose a method to synthesize license plate images, which can not only generate license plates of different provinces, cities and types, but also accurately label the bounding box of the license plate area.

  • We propose a novel license plate detection method based on Faster R-CNN. Specifically, we introduce dilated convolution and an attention mechanism into the conventional convolutional network to generate more discriminative feature representations and achieve better license plate detection performance.

  • Evaluations of the proposed LPD model on the generated LP dataset demonstrate the validity of the license plate synthesis method and of the proposed license plate detection model.

2 Related Work

2.1 Dataset of Synthetic LP

Most license plate datasets [3] are collected from traffic monitoring systems, highway toll stations or parking lots. The collected images usually have shortcomings such as small tilt angles, a small number of samples, uneven distribution of license plate types, and reliance on manual annotation. Therefore, these datasets cannot evaluate LP detection algorithms well. At present, the largest public license plate dataset is CCPD [27], but its images come from a single city and cover limited types of license plate. For these reasons, we propose to train the detector on a synthetic license plate dataset that simulates real license plate data. Our method can not only vary the angle of the license plate and various environmental factors, but also generate annotated license plate data from different provinces and cities.

2.2 LP Detection Algorithms

With the rapid development of region-based convolutional neural networks [8], popular object detection models have been widely applied to LP detection [10, 14, 15]. Faster R-CNN [23] utilizes a region proposal network that generates high-quality region proposals, so as to detect objects more accurately and quickly. YOLO [21] and YOLO9000 [22] frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. SSD [17] combines the regression idea of YOLO with the anchor mechanism of Faster R-CNN, completely eliminates proposal generation and the subsequent pixel or feature resampling stage, and encapsulates all computation in a single network.

2.3 Attention Mechanism

The attention mechanism in deep learning is essentially similar to the human selective visual attention mechanism: its goal is to select, from a large amount of information, the information most critical to the current task. Attention models have been widely used in various deep learning tasks, such as image recognition [6], speech recognition [2], sequence learning [19], and image localization and understanding [4, 13]. Recently, the representative Squeeze-and-Excitation network [11] reweights feature channels using signals aggregated from entire feature maps, while BAM [20] and CBAM [25] refine convolutional features independently in the channel and spatial dimensions.

Fig. 1. The procedure of our license plate rendering framework. First we use MATLAB to generate a license plate map (a) and load it into a defined set of vehicles (b). Then we create a synthetic image of the vehicle and license plate from the defined collection (c) and render a depth map of the license plate (d). The depth map is used to obtain the precise segmentation of pixels in the license plate bounding box (e).

3 Overview of the Synthetic Dataset

3.1 LP Rendering Methodology

The main idea of our license plate rendering system is illustrated in Fig. 1. First, we generate different types of license plate maps (Fig. 1(a)) for different provinces, and load these maps onto the defined vehicle object model (Fig. 1(b)). The object is instantiated with a given set of material parameters (vehicle color, window material properties, etc.), and a synthetic image of the vehicle with its plate is rendered (Fig. 1(c)). For this image, a depth map of the LP is also rendered (Fig. 1(d)), which is used to determine the bounding box of the license plate in the image (Fig. 1(e)). Algorithm 1 summarizes the license plate synthesizing algorithm.
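
The step from depth map to bounding box can be illustrated with a minimal sketch. It assumes that the plate-only depth render stores a finite depth at every plate pixel and a sentinel background value elsewhere; the function name `bbox_from_depth` and the sentinel are hypothetical and are not taken from the rendering scripts.

```python
import math

def bbox_from_depth(depth, background=math.inf):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around all plate pixels.

    `depth` is a 2D list of per-pixel depths from a render that contains
    only the license plate. Pixels equal to `background` are treated as
    empty (an assumption of this sketch). Returns None if no plate pixel
    is visible.
    """
    xs, ys = [], []
    for y, row in enumerate(depth):
        for x, d in enumerate(row):
            if d != background:  # pixel belongs to the plate segmentation
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Toy 5x6 depth map: the plate covers rows 1-2 and columns 2-4.
INF = math.inf
toy_depth = [
    [INF, INF, INF, INF, INF, INF],
    [INF, INF, 4.2, 4.1, 4.0, INF],
    [INF, INF, 4.3, 4.2, 4.1, INF],
    [INF, INF, INF, INF, INF, INF],
    [INF, INF, INF, INF, INF, INF],
]
toy_box = bbox_from_depth(toy_depth)
```

Because the box is derived from the rendered segmentation rather than drawn by hand, a tilted plate still receives the tightest axis-aligned rectangle that contains all of its pixels.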

3.2 Viewpoint and Sunlight of LP

To obtain a realistic and useful dataset, it is very important to control the camera angle and the illumination. As shown in Algorithm 1, we simulate the camera angle, solar direction and illumination intensity of natural scenes by specifying function parameters. We use the variables cmax and cmay to control the horizontal and vertical view of the camera, sunx and suny to control the azimuth and height of the sun, and suni to control the intensity of the sun. In addition, we use the brightness and contrast variables to adjust the brightness and contrast of the image.
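
As an illustration of how these parameters could drive the renderer, the sketch below draws one random scene configuration and applies a simple brightness and contrast adjustment to a pixel value. The variable names follow the text, but the sampling ranges, the units (degrees) and the linear adjustment formula are assumptions made for this example; the actual values are not stated here.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneParams:
    cmax: float        # horizontal camera view angle (degrees, assumed unit)
    cmay: float        # vertical camera view angle (degrees, assumed unit)
    sunx: float        # sun azimuth (degrees, assumed unit)
    suny: float        # sun height above the horizon (degrees, assumed unit)
    suni: float        # sun intensity (arbitrary units)
    brightness: float  # additive brightness offset on 8-bit pixel values
    contrast: float    # multiplicative contrast gain around mid-grey

# Hypothetical sampling ranges, chosen only for illustration.
RANGES = {
    "cmax": (-60.0, 60.0),
    "cmay": (0.0, 45.0),
    "sunx": (0.0, 360.0),
    "suny": (10.0, 90.0),
    "suni": (0.5, 2.0),
    "brightness": (-30.0, 30.0),
    "contrast": (0.7, 1.3),
}

def sample_scene(rng):
    """Draw every parameter uniformly from its (hypothetical) range."""
    return SceneParams(**{k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()})

def adjust_pixel(value, brightness, contrast):
    """Linear brightness/contrast adjustment of one 8-bit value, clipped to [0, 255]."""
    out = contrast * (value - 128.0) + 128.0 + brightness
    return int(min(255, max(0, round(out))))

params = sample_scene(random.Random(0))
```

Sampling each factor independently is the simplest design choice; correlated factors (for example, low sun height with low intensity) would need a joint sampling scheme.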

4 Detection Approach

4.1 Model

The network architecture is shown in Fig. 2. We employ the Faster R-CNN [23] as the base network.

We propose a dilated convolutional attention block, called the DCA Block, based on RFNet [16] and the attention model proposed by Woo et al. [25]. The feature map FM of the base network is fed in parallel into the dilated convolution layer and the attention module. The two outputs are then merged, and the merged feature map is used as the input of the RPN and the detection network. Compared with the whole image, the license plate is small, so features from the convolution layers of the original model cannot accurately describe it. We therefore add the dilated convolution layer to obtain higher resolution features. Additionally, we apply channel attention and spatial attention to the feature map to enhance the features of the object area. The whole process is summarized as follows:

Fig. 2. The overall structure of our network. On the output feature map of the Faster R-CNN base network, we introduce the dilated convolution layer to obtain higher resolution features with an enlarged receptive field while keeping the number of parameters unchanged. In addition, we apply channel and spatial attention to the feature map to enhance the object features. Finally, we fuse the two outputs and feed them to the RPN and the subsequent detection network.

$$\begin{aligned} \begin{aligned}&FM^{'} = M_{dc}(FM)\\&FM^{''} = [M_{sp}(M_{ch}(FM) \otimes FM)] \otimes [M_{ch}(FM) \otimes FM]\\&FM^{'''} = FM^{'} + FM^{''} \end{aligned} \end{aligned}$$
(1)

where \(\otimes \) denotes element-wise multiplication, \(\mathbf{M_{dc}}\) denotes the dilated convolution module, \(\mathbf{M_{ch}}\) is a one-dimensional channel attention map, \(\mathbf{M_{sp}}\) is a two-dimensional spatial attention map, and \(\mathbf{FM^{'''}}\) is the final refined output.
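
The shape bookkeeping behind Eq. (1) can be sketched in NumPy as follows. The three modules are passed in as plain callables; the stand-ins used in the example (an identity for \(M_{dc}\) and constant attention maps) are placeholders chosen for illustration, not the trained modules. The sketch assumes that \(M_{dc}\) preserves the shape of FM so that the final sum is defined.

```python
import numpy as np

def fuse(fm, m_dc, m_ch, m_sp):
    """Eq. (1): FM''' = M_dc(FM) + M_sp(F_ch) * F_ch, with F_ch = M_ch(FM) * FM.

    fm   : feature map of shape (C, H, W)
    m_dc : callable returning a (C, H, W) array (dilated convolution branch)
    m_ch : callable returning a (C, 1, 1) channel attention map
    m_sp : callable returning a (1, H, W) spatial attention map
    """
    fm_dc = m_dc(fm)                # FM'
    fm_ch = m_ch(fm) * fm           # channel-refined map, broadcast over H and W
    fm_att = m_sp(fm_ch) * fm_ch    # FM'', broadcast over C
    return fm_dc + fm_att           # FM'''


# Placeholder modules: identity branch and constant attention weights of 0.5.
fm = np.arange(24, dtype=float).reshape(2, 3, 4)
out = fuse(
    fm,
    m_dc=lambda x: x,
    m_ch=lambda x: np.full((x.shape[0], 1, 1), 0.5),
    m_sp=lambda x: np.full((1, x.shape[1], x.shape[2]), 0.5),
)
```

With these placeholders the attention branch contributes 0.5 × 0.5 = 0.25 of FM, so the output equals 1.25 × FM, which makes the broadcasting easy to check by hand.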

4.2 Dilated Convolution

As shown in Fig. 2, we combine multi-branch convolution layers with dilated pooling or convolution layers. Specifically, each branch first uses a bottleneck structure consisting of a 1 \(\times \) 1 convolution layer plus an n \(\times \) n convolution layer, followed by a pooling or convolution layer with a corresponding dilation rate. The detailed process is as follows:

$$\begin{aligned} \begin{aligned} M_{dc}(FM) =&\varphi [f^{1 \times 1}(concat(f^{3\times 3}_{r=1}(f^{1\times 1}(f^{1\times 1}(FM))),\\ {}&f^{3\times 3}_{r=3}(f^{3\times 3}(f^{1\times 1}(FM))),f^{3\times 3}_{r=5}(f^{5\times 5}(f^{1\times 1}(FM)))))]\\ =&\varphi [f^{1 \times 1}(concat(FM_{r=1},FM_{r=3},FM_{r=5}))] \end{aligned} \end{aligned}$$
(2)

where \(\varphi \) denotes the ReLU activation function, \(f^{n \times n}\) represents a convolution operation with a filter size of n \(\times \) n, and \(f^{3\times 3}_{r}\) denotes a dilated convolution operation with a kernel of size 3 and a dilation rate of \(\varvec{r}\).
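
To make the effect of the dilation rate concrete, the short sketch below computes the spatial extent covered by a single dilated kernel using the standard relation k_eff = k + (k - 1)(r - 1), together with the zero padding that keeps the output the same size as the input at stride 1. The helper names are ours and serve only as an illustration.

```python
def effective_kernel_size(k, r):
    """Spatial extent covered by a k x k kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)

def same_padding(k, r):
    """Zero padding per side that preserves spatial size at stride 1 (odd k)."""
    return (effective_kernel_size(k, r) - 1) // 2

# The three dilated 3 x 3 branches of Eq. (2).
branch_extent = {r: effective_kernel_size(3, r) for r in (1, 3, 5)}
branch_padding = {r: same_padding(3, r) for r in (1, 3, 5)}
weights_per_filter = 3 * 3  # independent of r: dilation adds no parameters
```

For the three branches the extents are 3, 7 and 11 pixels, while each filter still has 9 weights per input channel. This is why dilation enlarges the receptive field without increasing the parameter count, and why the branch outputs can be concatenated when padded as above.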

4.3 Channel Attention

As shown in Fig. 2, we use average pooling and max pooling to aggregate the spatial information of the feature map, producing two different descriptors: the average-pooled features and the max-pooled features. Both are then fed to a shared network to generate the channel attention map. The shared network is a multilayer perceptron (MLP) with one hidden layer. After the shared network has been applied to each descriptor, the output feature vectors are merged by element-wise summation.

$$\begin{aligned} \begin{aligned} M_{ch}(FM)&= \sigma (MLP(AvgPool(FM))+MLP(MaxPool(FM))) \end{aligned} \end{aligned}$$
(3)
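
Eq. (3) can be sketched in NumPy as below, with \(\sigma\) taken to be the sigmoid function. The hidden-layer ReLU and the reduced hidden width follow the CBAM design [25] and are assumptions here, as are the weight shapes and the function names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(fm, w1, w2):
    """Channel attention map of Eq. (3), shape (C, 1, 1).

    fm : feature map of shape (C, H, W)
    w1 : hidden-layer weights of shape (C_hidden, C)  (assumed layout)
    w2 : output-layer weights of shape (C, C_hidden)  (assumed layout)
    The same w1 and w2 are applied to both pooled descriptors.
    """
    avg = fm.mean(axis=(1, 2))  # spatial average pooling -> (C,)
    mx = fm.max(axis=(1, 2))    # spatial max pooling     -> (C,)

    def mlp(v):
        # Shared one-hidden-layer MLP; the ReLU is an assumption of this sketch.
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx)).reshape(-1, 1, 1)


rng = np.random.default_rng(0)
fm = rng.normal(size=(8, 5, 6))
w1 = rng.normal(size=(2, 8))   # hidden width 2 is an arbitrary example value
w2 = rng.normal(size=(8, 2))
m_ch = channel_attention(fm, w1, w2)
refined = m_ch * fm            # M_ch(FM) ⊗ FM, broadcast over H and W
```

The output has one weight per channel, so multiplying it with FM rescales whole channels and leaves the spatial layout untouched.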

4.4 Spatial Attention

Following the idea of the CBAM [25] module, the feature map produced by the channel attention module is used as the input of the spatial attention module. First, we apply max pooling and average pooling along the channel axis and concatenate the two results. A convolution then reduces the result to a single channel, and the spatial attention map is obtained by applying the sigmoid function. Finally, this map is multiplied with the input feature of the module to obtain the final feature, as shown in Fig. 2. The specific calculation is as follows:

$$\begin{aligned} \begin{aligned} M_{sp}(FM)&=\sigma (f^{7 \times 7}([AvgPool(M_{ch}(FM)),MaxPool(M_{ch}(FM))])) \end{aligned} \end{aligned}$$
(4)

where \(\sigma \) denotes the sigmoid function, the MLP weights are shared between the two pooled inputs, and \(f^{7\times 7}\) represents a convolution operation with a filter size of 7 \(\times \) 7.
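
A minimal NumPy sketch of the spatial attention step is given below. The input is taken to be the channel-refined feature map, as described in the text. The zero padding, the absence of a bias term and the explicit loop-based convolution are simplifications assumed for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """Single-output-channel cross-correlation with zero padding.

    x      : array of shape (C_in, H, W)
    kernel : array of shape (C_in, k, k), k odd
    Returns an (H, W) array.
    """
    _, h, w = x.shape
    k = kernel.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))  # zero padding on H and W only
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * kernel)
    return out

def spatial_attention(fm, kernel):
    """Spatial attention map of shape (1, H, W) for a (C, H, W) input."""
    avg = fm.mean(axis=0, keepdims=True)          # average over channels
    mx = fm.max(axis=0, keepdims=True)            # max over channels
    stacked = np.concatenate([avg, mx], axis=0)   # (2, H, W)
    return sigmoid(conv2d_same(stacked, kernel))[None]


rng = np.random.default_rng(0)
fm_ch = rng.normal(size=(4, 6, 5))          # stands in for the channel-refined map
kernel = rng.normal(size=(2, 7, 7)) * 0.1   # 7 x 7 filter over the two pooled maps
m_sp = spatial_attention(fm_ch, kernel)
refined = m_sp * fm_ch                      # broadcast over the channel axis
```

The map has one weight per spatial location, so it complements the channel attention: one rescales channels, the other rescales positions.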

Training. We choose ResNet-101 as the backbone. Our model is pre-trained on ImageNet and then fine-tuned on the synthetic license plate dataset. The experiment is implemented in PyTorch and trained end-to-end on four Tesla P100 GPUs using Stochastic Gradient Descent (SGD) with a weight decay of 0.0001 and a momentum of 0.9. At the beginning of training, the learning rate is set to 0.001. After 20 epochs, the learning rate is decayed by a factor of 0.1 every 5 epochs.
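
One possible reading of this learning rate schedule is sketched below: the rate stays at 0.001 for the first 20 epochs and is then multiplied by 0.1 at epoch 20 and again every 5 epochs after that. The zero-based epoch index and the timing of the first decay are assumptions, since the text does not pin them down.

```python
def learning_rate(epoch, base_lr=0.001, hold_epochs=20, step=5, gamma=0.1):
    """Step schedule: constant for `hold_epochs`, then scaled by `gamma` every `step` epochs.

    `epoch` is zero-based (assumption). The first decay is applied at
    epoch == hold_epochs (assumption).
    """
    if epoch < hold_epochs:
        return base_lr
    n_decays = (epoch - hold_epochs) // step + 1
    return base_lr * gamma ** n_decays


schedule = [learning_rate(e) for e in range(30)]
```

Under this reading, epochs 0 to 19 use 0.001, epochs 20 to 24 use 0.0001, and epochs 25 to 29 use 0.00001.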

5 Experiments

In this section, we give a detailed description of our synthetic dataset. Furthermore, we evaluate the performance of different detectors on our synthetic dataset and on the CCPD dataset. We show that detectors trained with the synthetic dataset are comparable with those trained with the real license plate dataset. Finally, we synthesize a dataset of 20,000 yellow, blue and new energy license plates, and compare the performance of prevalent detection algorithms with that of our algorithm. We show that the proposed method improves the accuracy of license plate detection compared with the original method.

5.1 Data Preparation

As described in Sect. 3, we render and synthesize a large license plate dataset (SLPD100) containing only blue plates. It contains about 100k images with a resolution of 800 (Width) \(\times \) 1160 (Height) \(\times \) 3 (Channels). For each image, the bounding box label contains the (x, y) coordinates of the top-left and bottom-right corners, which locate the minimum bounding rectangle of the LP. The CCPD dataset is the largest public license plate dataset and contains about 250k images. We divide CCPD into two parts: the default training set containing about 100k images, and the default evaluation set containing about 20k images. The training and test sets of our experiment are shown in Table 1.

Table 1. The division of the training and test sets. The numbers in brackets give the size of each dataset.

To verify the validity of our proposed license plate detection method, we also synthesize another dataset of about 20k images (SLPD20), including yellow, blue and new energy license plates. Our test dataset for evaluating detector performance consists of about 3,000 license plate images (LPD3000) taken by surveillance cameras and hand-held cameras.

5.2 Experiment Analysis

Evaluation Criterion. We follow the standard protocol (Intersection-over-Union (IoU)) [12] of object detection. The bounding box is considered to be correct if and only if its IoU with the ground-truth bounding box is more than 70% (IoU > 0.7).
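
The criterion can be written out as a short sketch. Boxes are given as (x1, y1, x2, y2) with the top-left and bottom-right corners, and the area is computed from continuous coordinates without a +1 pixel offset; both conventions, and the helper names, are assumptions of this example.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, gt_box, threshold=0.7):
    """A detection counts as correct only if its IoU with the ground truth exceeds 0.7."""
    return iou(pred_box, gt_box) > threshold


gt = (0, 0, 10, 10)
shifted_by_one = (1, 0, 11, 10)   # IoU = 90 / 110, accepted
shifted_by_five = (5, 0, 15, 10)  # IoU = 50 / 150, rejected
```

For a 10 × 10 plate, a horizontal shift of one pixel still passes the threshold, whereas a shift of half the plate width does not, which shows how strict the 0.7 threshold is for small objects.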

Experimental Results on Different Datasets. We synthesize a dataset similar to the CCPD dataset and conduct experiments with the currently prevalent YOLO9000, SSD and Faster R-CNN detection algorithms. Table 2 shows the experimental results. In the experiment, we use the same parameter settings for the same dataset. As shown in Table 2, we train the SSD, Faster R-CNN and YOLO9000 detectors on the synthetic license plate dataset (SLPD100) and on the CCPD dataset, and use CCPD (20k) as the test set. The test accuracy on SSD and Faster R-CNN is about 2% to 3% lower than with the real dataset CCPD, and the performance on YOLO9000 is only 0.2% higher. The main reason may be that the data distribution of the synthetic dataset is not comprehensive enough: we do not consider factors such as license plate occlusion, rain, snow and fog, which may leave a gap between our synthetic dataset and license plate images collected in natural scenes. In future work, we will consider further increasing the data diversity of the synthetic dataset. Generally speaking, the detector trained with the synthetic license plate dataset is comparable to the detector trained with the real license plate dataset, which shows the effectiveness of the synthetic license plate dataset.

Table 2. Comparison of the license plate test accuracy of the same detector trained on different datasets.

Experimental Results on Different Detectors. We evaluate different detectors on the synthetic dataset. The experimental results are shown in Table 3. We evaluate the currently prevalent detectors SSD [17], YOLO9000 [22], R-FCN [5] and Faster R-CNN [23]. The results show that our detector, which is based on Faster R-CNN, improves accuracy by about \(2.2\%\) compared with the other detectors. In Sect. 5.3, we analyse the effectiveness of our proposed method in detail.

Table 3. Test accuracy of different detectors.

5.3 Ablation Studies

For the ablation study, we use SLPD20 and LPD3000 as the training and test datasets, respectively. In the experiment, we progressively add the channel attention, the spatial attention, and then the dilated convolution module to Faster R-CNN, and report the results in Table 3.

Fig. 3. Feature visualization. We compare the visualization results of our method (Faster R-CNN + DCA Block) with baseline (Faster R-CNN).

Dilated Convolution. As shown in Table 3, after introducing the dilated convolution module, the detection accuracy on the test set increases by 0.9% (from 88.9% to 89.8%) compared with the baseline.

Channel Attention. We choose Faster R-CNN as our baseline and introduce the channel attention module between the base network ResNet-101 and the RPN. As shown in Table 3, after introducing channel attention, the mAP increases by about 0.6% (from 88.9% to 89.5%), which demonstrates the effectiveness of the channel attention model.

Fig. 4. Comparison of results. The first row shows the detection results of the baseline method, in which the blue rectangles mark undetected objects. The second row shows the detection results of our method.

Dual Attention. As shown in Table 3, introducing the channel and spatial dual attention model improves the license plate detection accuracy by 1.1% (from 88.9% to 90.0%), which demonstrates the effectiveness of the dual attention model.

DCA Block. Based on the above analysis, we combine the dilated convolution and the dual attention module by fusing the feature maps that the two modules generate. The experimental results show that the accuracy of our method is about 2.2% higher than the baseline (from 88.9% to 91.1%). For qualitative analysis, we compare the visualization results of our method (Faster R-CNN + DCA Block) with the baseline (Faster R-CNN) in Fig. 3. We can see that our method pays more attention to the object area than the baseline. Meanwhile, as shown in Fig. 4, our method can detect almost all the objects in the image. These results show the effectiveness of the method.

6 Conclusions

In this paper, we present a method to synthesize license plate datasets and a dilated convolutional attention augmentation module for deep license plate detection. The proposed license plate synthesis method can not only simulate real scenes by controlling the illumination intensity and other environmental factors of the synthetic images, but also automatically label the license plate area as ground truth. It is very useful for addressing the limited number of license plates in training datasets and the high cost of manual labelling under some specific conditions. The proposed dilated convolutional attention augmentation module uses dilated convolution with different dilation rates to increase the receptive field of the convolution kernels and obtain higher resolution feature maps. In addition, the attention mechanism is added to learn a weight map for better classification. Extensive evaluations on two benchmarks demonstrate that our method improves the performance of license plate detection over the baseline methods.