1 Introduction

As an important part of intelligent transportation systems, vehicle license plate recognition (VLPR) has attracted considerable research interest. It is a useful technology for government agencies to track or detect stolen vehicles and to collect data for traffic management and improvement. Due to its close relationship to public security, VLPR requires good generalization and high accuracy in real applications.

While many approaches have been proposed for VLPR, the task remains challenging, not only because of environmental factors such as lighting, shadows, and occlusions, but also because of image acquisition factors such as motion and focus blur. For Chinese license plates, the situation is more complicated: they are composed of Chinese characters and alphanumeric characters, their colors and sizes may differ, their lengths are not necessarily fixed, and the characters may even be arranged in two lines.

Traditional image processing methods need a series of processing steps, including localization, segmentation, and recognition. Many of them depend on handcrafted features and only work well under controlled conditions. These handcrafted features are usually sensitive to image noise and may produce many false positives against complex backgrounds.

CNNs have achieved great success in various tasks, including image classification, object detection, and semantic segmentation [11]. With their deep layers, CNNs can learn efficient representations from large amounts of training data. For VLPR, extended CNNs transform the one-to-many problem into a single-label classification problem by classifying one character at a time, which requires characters to first be segmented and then recognized one by one. More recent work combines a CNN-BRNN [14] with CTC to achieve state-of-the-art results. The segmentation-free VLPR of [8] focuses on images from real-world traffic cameras and applies a ConvNet-RNN to the task. Unfortunately, that method does not consider Chinese license plates and requires specific optimization for higher accuracy.

In this paper, we propose an attention enhanced ConvNet-RNN (AC-RNN) for Chinese vehicle license plate recognition. It is a one-pass, end-to-end neural network with two improvements. First, the original ConvNet-RNN is not well suited to vehicle license plates, whose characters have only weak semantic connections; a novel semantic enhancement strategy is therefore introduced that inserts trivial null characters into the labeled strings. Second, an attention mechanism is added to learn a weight map, helping the neural network to perform classification better. The two techniques are not independent of each other; they work together to achieve higher accuracy. Furthermore, to avoid the overfitting caused by a lack of data, we also propose a data generation method and generate a dataset containing one million labeled images. In summary, this paper makes the following contributions to the community:

  • A novel semantic enhancement strategy for Chinese VLPR.

  • An attention enhanced ConvNet-RNN.

  • A data generation method and a new dataset of Chinese VLPs.

The remainder of this paper is organized as follows: Sect. 2 reviews related work, Sect. 3 describes the proposed neural network, Sect. 4 presents a series of experiments, and Sect. 5 concludes.

2 Related Works

2.1 Vehicle License Plate Recognition

Approaches to the VLPR problem contain two stages: localization and recognition. The main work of this paper focuses on the latter stage, recognition.

Plate localization aims to detect plates in images. Much work has been done on detection problems: for example, Faster RCNN [28] is known for its high detection precision, YOLO [27] is famous for its detection speed, and, like a combination of Faster RCNN and YOLO, SSD [20] performs well on both accuracy and speed. In [30], CTPN is designed for text detection in natural images. Since plate localization aims to detect plates rather than text, an SSD model is applied in our project to detect plates.

Previous works on plate recognition include traditional image processing methods [1, 2, 12, 13, 15, 25, 26, 31] and newer deep learning approaches [3, 16, 18, 19, 21, 23, 33]. Most of them need to detect and segment the characters out of a license plate image before recognition.

The recognition of plates typically contains two stages as well: segmentation extracts characters from the license plate image, and classification distinguishes the segmented characters one by one. For example, in [12], Gou et al. use Extremal Regions (ER) to segment characters from coarsely detected license plates and apply Restricted Boltzmann Machines to recognize them. In [33], the license plate is segmented into seven blocks using a projection method, after which two classifiers are designed to recognize Chinese characters, numbers, and letters. In [18, 23], two CNN models are used to recognize characters from the plate image: first, a binary deep network classifier is trained to confirm whether a character exists, and another deep CNN is then adopted for character recognition. In [21], Liu et al. implement a CNN model with shared hidden layers and two distinct softmax layers for the Chinese and the alphanumeric characters, respectively.

In fact, character segmentation is itself a challenging task, since it is easily influenced by uneven lighting, shadows, and noise in the images [18]. A plate cannot be recognized correctly if the segmentation is improper, even with a strong character classifier. Therefore, in this paper, VLPR is regarded as a sequence labeling problem, and our proposed method recognizes plates without segmentation.

2.2 ConvNet-RNN

One of the most popular methods for sequence-to-sequence problems is the ConvNet-RNN (CRNN) [14]. The ConvNet-RNN is proposed for image-based sequence recognition in [29], where Shi et al. integrate feature extraction, sequence modeling, and transcription into a unified framework. The CRNN is end-to-end trainable and can deal with sequences of various lengths, involving no character segmentation or horizontal scale normalization. The CRNN has been popular since it was proposed: in [10], it is adopted for offline handwriting recognition; in [7, 32], it provides a useful method for Optical Music Recognition (OMR); and in [6], it is used for script identification in natural scene images and video frames.

The CRNN has also been applied to VLPR. In [8], a CRNN with CTC is adopted for plate recognition. However, due to the weak correlation between the characters on a plate, the classic CRNN does not work well on VLPR. This paper therefore introduces a method of inserting interval characters to strengthen the correlation between characters, combined with a length-fixed CTC.

2.3 Attention Model

The attention mechanism has become increasingly popular since it was used for image classification by Mnih et al. [24]. In [4], Bahdanau et al. first introduce the attention mechanism to neural machine translation (NMT): by learning different weights from the source parts to different target words, the trained model can automatically search for the parts of a source sentence that are relevant to a target word. In [22], Luong et al. show how to extend RNNs with an attention mechanism, introducing global attention and local attention for natural language processing (NLP). After [4, 22], the attention mechanism has been widely used in NLP tasks, including not only sequence-to-sequence models but also various classification tasks. In [5], attention allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels in a Large Vocabulary Continuous Speech Recognition (LVCSR) system. In [9], Serdyuk et al. adopt the attention mechanism for speech recognition. Attention is also used for OCR in [17].

3 The Proposed AC-RNN Framework

3.1 Network Architecture

The network architecture of the AC-RNN is shown in Fig. 1. It contains three main parts: the ConvNet, the attention-based RNN, and the CTC. The AC-RNN takes plate images as input. Through deep convolutional layers, it learns a group of feature maps, splits them into sequential feature blocks, and feeds the blocks to a Bi-LSTM RNN. After the encoding and decoding processes, per-frame predictions are obtained, and a length-fixed CTC is then performed to decode the characters. The AC-RNN works in an end-to-end fashion: it takes a plate image as input and outputs the predicted label. A shape-level sketch of this data flow is given below.
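To make the data flow concrete, the following sketch traces the ConvNet and Bi-LSTM stages in PyTorch. It is purely illustrative: the paper's implementation uses Caffe, and the layer sizes, the 24-frame sequence width, and the pooling choices are our assumptions; the attention decoder of Sect. 3.3 and the CTC step are omitted.

```python
import torch
import torch.nn as nn

class ACRNNBackboneSketch(nn.Module):
    """Shape-level sketch: ConvNet features -> sequential blocks -> Bi-LSTM."""

    def __init__(self, num_classes, hidden=256):
        super().__init__()
        # Stand-in ConvNet: any backbone yielding an (N, C, 1, W') map works.
        self.convnet = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 24)),  # collapse height, keep 24 frames
        )
        self.rnn = nn.LSTM(128, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, images):                   # (N, 3, H, W)
        feats = self.convnet(images)             # (N, 128, 1, 24)
        seq = feats.squeeze(2).permute(0, 2, 1)  # (N, 24, 128): one block per frame
        out, _ = self.rnn(seq)                   # Bi-LSTM over the frame sequence
        return self.classifier(out)              # per-frame scores for the CTC
```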

Fig. 1. Attention enhanced ConvNet-RNN.

3.2 Interval for Semantic Enhancement

The LSTM is well known for its ability to capture long-range dependencies and is therefore widely used in speech recognition, Optical Character Recognition (OCR), text classification, and so on. As often observed, the characters of a vehicle license plate have semantic connections in context. Since the LSTM is directional, to make the best use of context information, the AC-RNN adopts a bidirectional LSTM. On the other hand, unlike language translation or OCR tasks, for which the LSTM is intuitively suited, the characters on a plate are only weakly related: following the numbering rules, some characters are fixed according to the vehicle's properties while the others may be generated randomly.

In order to strengthen the correlations between the characters on a plate, characters named empty of sequence (EOS) are inserted at the intervals of the sequential labels when training the model.

Fig. 2. Sample of interval characters in a sequence.

As shown in Fig. 2, the EOS participates in the training of the attention-based recurrent neural network. In the case of Chinese vehicle license plates, the AC-RNN adopts the rule of inserting an EOS at each interval between neighbouring pairs of characters, shown as the solid EOS-labeled rectangles. The interval characters help distinguish the gaps between characters on the plate and also strengthen the sequence correlation that the LSTM needs. The EOS tokens in the prediction sequence are removed before output. A minimal sketch of this label transformation follows.
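As a concrete illustration, the sketch below performs the interval insertion on a label sequence. The token name and the list representation are our own assumptions; the paper does not specify an implementation.

```python
# Hypothetical EOS token; any symbol outside the plate alphabet works.
EOS = "<EOS>"

def insert_intervals(label):
    """Insert an EOS token between every pair of neighbouring characters.

    A 7-character plate label [c1, ..., c7] becomes the 13-token sequence
    [c1, EOS, c2, EOS, ..., EOS, c7]; the EOS tokens are stripped again
    from the prediction before output.
    """
    out = []
    for i, ch in enumerate(label):
        if i > 0:
            out.append(EOS)
        out.append(ch)
    return out
```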

3.3 Attention Based RNN Decoder

Unlike the classic CRNN, the attention-based CRNN has an additional attention model that learns a weight map to help the neural network perform classification better. As shown in Fig. 3, the attention model tries to learn weights \(a_{ij}\) that indicate how relevant the source hidden state \(h_{j}\) is to a target frame. The RNN input \(c_{i}\) can then be calculated by the following equation.

$$\begin{aligned} c_{i}=\sum _{j=1}^{T_{X}}a_{ij}h_{j} \end{aligned}$$
(1)
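For completeness, the weights \(a_{ij}\) in Eq. (1) are normalized alignment scores. The formulation below is the standard one of [4], reproduced as a reference rather than as the paper's exact design, since the scoring function used in the AC-RNN is not spelled out: a softmax is taken over scores between the previous decoder state \(s_{i-1}\) and each source hidden state \(h_{j}\).

$$\begin{aligned} a_{ij}=\frac{\exp (e_{ij})}{\sum _{k=1}^{T_{X}}\exp (e_{ik})},\qquad e_{ij}=\mathrm {score}(s_{i-1},h_{j}) \end{aligned}$$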
Fig. 3. Attention mechanism in our AC-RNN.

3.4 Length-Fixed CTC Decoder for Vehicle License Plate

The RNN outputs a sequence of per-frame predictions. To get the final label sequence, the CTC takes two main steps: Merge Repeated and Remove Blank. As shown in Fig. 4, to get the correct output ‘HELLO’, there must be a blank token between the two ‘L’s; with this blank token, we obtain ‘HELLO’ rather than ‘HELO’.

Fig. 4. Illustration of the CTC process.

VLPs usually have a fixed length; in China, for example, the length of a VLP is seven. In the proposed method, the length of the final output is therefore checked in the CTC step. If the CTC generates a sequence of an uncommon length, the AC-RNN finds the longest continuous substring without a blank, as illustrated by label (2) in Fig. 4. In this situation, the AC-RNN forces a blank into that longest substring, which guarantees the right output, as illustrated by label (1) in Fig. 4. A sketch of this decoding logic follows.
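Below is a minimal sketch of this decoding logic. The blank symbol, the seven-character length constant, and the choice to split the longest blank-free run at its middle are our assumptions; the paper only states that a blank is forced into that run.

```python
BLANK = "-"      # assumed blank symbol
PLATE_LEN = 7    # Chinese plates carry seven characters

def collapse(frames):
    """Standard CTC collapse: merge repeats, then remove blanks."""
    merged = [c for i, c in enumerate(frames) if i == 0 or c != frames[i - 1]]
    return [c for c in merged if c != BLANK]

def length_fixed_decode(frames):
    """Decode per-frame symbols, forcing a blank if the length is uncommon.

    For the Fig. 4 example, the frames H E L L L O collapse to 'HELO';
    inserting a blank into the middle of the longest blank-free run
    recovers 'HELLO'.
    """
    out = collapse(frames)
    if len(out) == PLATE_LEN:
        return "".join(out)
    # Find the longest run of consecutive non-blank frames.
    best_start, best_len, start = 0, 0, None
    for i, c in enumerate(list(frames) + [BLANK]):
        if c != BLANK:
            start = i if start is None else start
            if i - start + 1 > best_len:
                best_start, best_len = start, i - start + 1
        else:
            start = None
    # Force a blank into the middle of that run and collapse again.
    fixed = list(frames)
    fixed.insert(best_start + best_len // 2, BLANK)
    return "".join(collapse(fixed))
```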

4 Experiments

4.1 Experimental Environments

Our experiments are carried out on an 8-way GPU cluster, whose configuration is shown in Table 1. We implement the experiments with Caffe and compare the general ConvNet-RNN against our AC-RNN with the length-fixed CTC.

Table 1. Experimental environments.

4.2 Generated Datasets

For various reasons, it is always difficult to obtain VLP datasets, not to mention datasets balanced across the country. To avoid the overfitting caused by a lack of data and to enhance the robustness of our model, a data generation method is proposed. With this method, plenty of plate images can be generated, which can be used both in VLP detection and in VLP recognition.

Fig. 5. The method proposed to generate VLP data.

Fig. 6. Examples of images generated by our method.

As shown in Fig. 5, our data generation method proceeds as follows.

Step 1: A dataset from surveillance cameras is required; some VLPs will be detected in it by a detector, and the remaining part of each image, without the plate, will be used as background.

Step 2: Lots of VLPs with random character distributions will be plotted according to the templates from the transportation department.

Step 3: The plates detected in Step 1 will be replaced by the plates from Step 2; thus, new images will be generated with natural backgrounds and synthetic plates.

Step 4: A series of transformations will be applied to the generated data, such as random scaling, random blurring, random rotation, random sharpening, etc. (a sketch of these transformations follows the steps).

Step 5: After the generated images are passed through a detector once more, our new VLP dataset will be ready.
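The following is a minimal sketch of the Step 4 transformations, assuming OpenCV; all parameter ranges and probabilities are illustrative rather than taken from the paper.

```python
import random
import cv2

def augment(img):
    # Random scale.
    s = random.uniform(0.8, 1.2)
    img = cv2.resize(img, None, fx=s, fy=s)
    # Random rotation about the image centre.
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), random.uniform(-5, 5), 1.0)
    img = cv2.warpAffine(img, M, (w, h))
    # Random Gaussian blur.
    if random.random() < 0.5:
        k = random.choice([3, 5])
        img = cv2.GaussianBlur(img, (k, k), 0)
    # Random sharpen via unsharp masking.
    if random.random() < 0.5:
        soft = cv2.GaussianBlur(img, (0, 0), 3)
        img = cv2.addWeighted(img, 1.5, soft, -0.5, 0)
    return img
```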

With this method, VLP datasets containing millions of images can be generated for detection and recognition. Some examples of our generated images are shown in Fig. 6.

4.3 Experiments and Results

In our experiments, the dataset for training and validation contains 400 thousand images collected from natural scenes and 800 thousand images generated by our data generation method. A dataset from EasyPR containing 260 images, all collected from natural scenes, is used for testing. Some instances of the test dataset are shown in Fig. 7.

Fig. 7. Some instances in the test dataset.

The main work of this paper focuses on the recognition stage. As described in Sect. 3, the contributions of this paper to the network are the semantic enhancement of plates, the length-fixed CTC decoder, and the attention mechanism; our contrast experiments are therefore carried out on two models: the classic ConvNet-RNN with no enhancement and our AC-RNN. We test these models and summarize the results in Table 2. As shown in Table 2, compared with the classic ConvNet-RNN, our work makes a clear improvement in accuracy.

Table 2. Evaluation results.

A classic way to evaluate sequence labeling performance is to measure the percentage of perfectly predicted images in the test dataset, as shown in Table 2. Considering that vehicle license plates in China contain one Chinese character and that repeated characters can only appear among letters and numbers, the accuracy over the last six characters is calculated in Table 3. The accuracy on the Chinese characters is also listed in the results, and some instances are given in Fig. 8. A sketch of these metrics follows.
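As a clarifying sketch, the three accuracies could be computed as below; the assumption that the Chinese character occupies the first of the seven positions follows the standard plate layout rather than an explicit statement in the paper.

```python
def full_accuracy(preds, labels):
    """Fraction of plates whose entire 7-character string is predicted."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def last_six_accuracy(preds, labels):
    """Fraction of plates whose six alphanumeric characters are all correct."""
    return sum(p[1:] == l[1:] for p, l in zip(preds, labels)) / len(labels)

def chinese_accuracy(preds, labels):
    """Accuracy on the leading Chinese character."""
    return sum(p[0] == l[0] for p, l in zip(preds, labels)) / len(labels)
```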

Table 3. Evaluation results on our VLP dataset.
Fig. 8. Some examples from the experiments.

As shown in Table 3 and Fig. 8, the classic ConvNet-RNN performs poorly when the text on a plate contains several identical consecutive characters. In comparison, the AC-RNN proposed in this paper does not have this problem, and with the attention mechanism, it performs better on recognition.

5 Conclusion

In this paper, an attention enhanced ConvNet-RNN for Chinese vehicle license plate recognition, the AC-RNN, is proposed. Compared with the classic ConvNet-RNN, the AC-RNN contains two improvements. First, intervals for semantic enhancement and a length-fixed CTC decoder are introduced to the VLPR problem for the first time: the intervals strengthen the correlations between the characters on a plate, and the length-fixed CTC decoder performs better when a plate contains several identical consecutive characters. Second, an attention mechanism is applied to learn a weight map, helping the neural network to perform classification better. Besides, a new method to generate VLP datasets is proposed, which alleviates the overfitting caused by a lack of data.