1 Introduction

Deep Convolutional Neural Networks (DCNNs) are a common approach to recognizing handwritten characters and characters in scene images. Character recognition in scene images consists of two parts: character detection and character recognition [4]. This technique can be applied to item images from e-commerce websites to collect information about the items. DCNNs require a huge amount of training data to obtain high accuracy [3, 5]. Although public datasets are available, the range of fonts they cover is too small. To address this problem, a character synthesis method was proposed to reduce the image collection cost [1]; it generates synthetic images from font data and background images. However, it assumes English characters on a simple background. In this paper, we propose a method that generates synthetic character and word images from Japanese font data and complex background images taken from item images on e-commerce websites. In addition, we introduce an approach that emphasizes the outlines of characters. Training on synthesized images that include both outline-emphasized and plain characters improves recognition accuracy.

2 Proposed Method

The proposed method consists of three steps: generation of character images, addition of margins and emphasis of outlines, and synthesis of the characters with a complex background. Figure 1 shows the flow of character image generation. First, character images are generated from character lists using font data; these images are then synthesized with a background image. We prepared 22 fonts commonly used on e-commerce websites, such as MS Mincho and Yu Gothic.
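The character-generation step can be sketched with Pillow as follows. This is a minimal illustration, not the authors' implementation: the function name and the font path are hypothetical, and the default bitmap font is used as a fallback when no font file is given.

```python
from PIL import Image, ImageDraw, ImageFont

def render_character(char, font_path=None, size=64):
    """Render a single character from a character list as a grayscale image.

    font_path is a hypothetical path to one of the prepared fonts
    (e.g. MS Mincho); when it is None, Pillow's default bitmap font
    is used so the sketch stays self-contained.
    """
    if font_path:
        font = ImageFont.truetype(font_path, size)
    else:
        font = ImageFont.load_default()
    img = Image.new("L", (size, size), color=255)  # white canvas
    draw = ImageDraw.Draw(img)
    draw.text((4, 4), char, fill=0, font=font)     # black glyph
    return img
```

In the full pipeline this would be repeated for every character in the list and every one of the 22 fonts.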

Fig. 1.

The flow of character image generation

Character images on the complex backgrounds of e-commerce images are sometimes decorated, for example with borders. To improve recognition accuracy for such characters, outline emphasis is introduced during character generation. First, a margin is added to the generated image. Then, an outline is added to emphasize the characters in the image. At this step, two variants are generated with different outline thicknesses. Figure 2 shows the flow of margin addition followed by outline emphasis.
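The margin and outline step above could be approximated with Pillow's stroke options. This is a sketch under assumptions: the function name, colors, and sizes are illustrative, the default font stands in for the prepared Japanese fonts, and calling it twice with different `stroke_width` values yields the two thickness variants.

```python
from PIL import Image, ImageDraw, ImageFont, ImageOps

def emphasized_character(char, stroke_width=2, size=64, margin=8):
    """Render a character with an emphasized outline and an added margin.

    Colors follow Fig. 3 of the paper: a green background with
    distinct character and outline colors. stroke_width controls
    the outline thickness (two variants are generated in the paper).
    """
    font = ImageFont.load_default()
    img = Image.new("RGB", (size, size), (0, 255, 0))  # green background
    draw = ImageDraw.Draw(img)
    draw.text((margin, margin), char, font=font,
              fill=(0, 0, 0),                 # character color
              stroke_width=stroke_width,
              stroke_fill=(255, 255, 255))    # outline color
    # add the margin around the rendered character, keeping it green
    return ImageOps.expand(img, border=margin, fill=(0, 255, 0))
```

With a TrueType font, `stroke_width`/`stroke_fill` draw the visible outline; the green margin is what the synthesis step later replaces with a complex background.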

Fig. 2.

Addition of margins and emphasis of outline

Fig. 3.

Synthesis of background image (Color figure online)

The background region is then replaced with a complex background, such as an item image. As shown in Fig. 3, the background of the generated image is green, and the colors of the character and the outline differ from it. The background image is cropped randomly from a banner image of an e-commerce store.
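Because the generated background is a uniform green that differs from the character and outline colors, the replacement amounts to chroma keying. A minimal NumPy sketch (function names are illustrative, not from the paper):

```python
import numpy as np

def random_crop(banner, h, w, rng=None):
    """Crop an h x w patch at a random position from a banner image."""
    rng = rng or np.random.default_rng()
    y = int(rng.integers(0, banner.shape[0] - h + 1))
    x = int(rng.integers(0, banner.shape[1] - w + 1))
    return banner[y:y + h, x:x + w]

def composite_on_background(char_img, background, green=(0, 255, 0)):
    """Replace the pure-green background pixels of a rendered character
    image with the corresponding pixels of a complex background crop.

    char_img and background are H x W x 3 uint8 arrays of equal shape.
    """
    mask = np.all(char_img == np.array(green, dtype=np.uint8), axis=-1)
    out = char_img.copy()
    out[mask] = background[mask]
    return out
```

The exact-match mask is sufficient here because the green background is generated, not photographed, so it contains no noise.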

The DCNN is trained on the synthesized character images. To test the generality of the method, we also applied it to word synthesis: word images are generated from a word list and synthesized with complex backgrounds.

Fig. 4.

The structure of DCNN

2.1 Structure of DCNN

Figure 4 shows the DCNN network structure. The network consists of 4 layers: 3 convolution layers and 1 fully connected layer. The filter size of each convolution layer is \(5 \times 5\). Max pooling is employed for the pooling layers. The fully connected layer has 4,096 units and employs Dropout [7] during training. The activation function of each layer is ReLU [6]. The output layer has 1,253 classes for character recognition and 241 classes for word recognition. The input image size is \(32 \times 32\) for character recognition and \(96 \times 96\) for word recognition, respectively. AdaGrad [8] is used as the optimization method. The mini-batch size is 32 and the number of epochs is 50.
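The architecture described above can be sketched in PyTorch. This is an approximation, not the authors' code: the channel counts (32, 64, 128), the padding, and the dropout rate are assumptions not stated in the paper; the layer count, filter size, pooling, 4,096-unit fully connected layer, output sizes, and AdaGrad optimizer follow the text.

```python
import torch
import torch.nn as nn

class CharDCNN(nn.Module):
    """Three 5x5 convolution layers with ReLU and max pooling, one
    4,096-unit fully connected layer with Dropout, and a linear
    output over the character (or word) classes."""

    def __init__(self, num_classes=1253, in_size=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),   # 16x16 -> 8x8
            nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),   # 8x8 -> 4x4
        )
        side = in_size // 8    # spatial size after three poolings
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * side * side, 4096), nn.ReLU(),
            nn.Dropout(0.5),   # assumed rate; Dropout [7] as in the paper
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = CharDCNN()                                   # character recognition
optimizer = torch.optim.Adagrad(model.parameters())  # AdaGrad [8]
```

For word recognition the same class would be instantiated as `CharDCNN(num_classes=241, in_size=96)`.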

Fig. 5.

Example of synthesis images

Table 1. Comparison of character recognition on synthesized images [%]

3 Evaluation

First, we evaluate the effectiveness of outline emphasis and background synthesis using top-1 and top-5 accuracy. We trained on 1,253 character classes with 145 images each (Fig. 5). For evaluation, 227 images collected from e-commerce websites were used. Table 1 shows the results on synthesized images. The method with both emphasis and synthesis achieves the best top-1 accuracy.

The evaluation results of character recognition and word recognition on real e-commerce images are shown in Tables 2 and 3, respectively. From Table 2, the method with emphasis performs 12.4% better in top-1 accuracy than the baseline, which uses neither emphasis nor synthesis. The method with synthesis improves on the baseline by 13.8% in top-1 accuracy. The combination of emphasis and synthesis achieves the best performance, with an improvement of 14.0%. The results of word recognition on real images are shown in Table 3. The method with emphasis improves on the baseline by about 20.4% and 19.8% in top-1 and top-5 accuracy, respectively. Synthesis is also effective on real images, improving accuracy over the baseline by 24.4% and 22.3% in top-1 and top-5, respectively. The combination of emphasis and synthesis again achieves the best performance. Figure 6 shows example recognition results. The network recognizes words correctly even when the number of characters differs. However, recognition fails when the characters are blurred or rotated.

Fig. 6.

Examples of recognition results

Table 2. Comparison of character recognition on real images [%]
Table 3. Comparison of word recognition on real images [%]

4 Conclusion

In this paper, we proposed a method that generates outline-emphasized character and word images and synthesizes them with complex background images. A DCNN trained on the generated images achieved high accuracy on both synthesized images and real images from e-commerce websites.