1 Introduction

In recent years, scene text detection and recognition have attracted growing research interest from the computer vision community, especially after the revival of neural networks and the growth of image datasets. Scene text detection and recognition provide an automatic, rapid way to access the textual information embodied in natural scenes, benefiting a variety of real-world applications, such as geo-location [58], instant translation, and assistance for the blind.

Scene text spotting, which aims at concurrently localizing and recognizing text in natural scenes, has been studied in numerous previous works [21, 49]. However, in most works, except [3, 27], text detection and subsequent recognition are handled separately: text regions are first extracted from the original image by a trained detector and then fed into a recognition module. This procedure seems simple and natural, but may lead to sub-optimal performance for both detection and recognition, since these two tasks are highly correlated and complementary. On one hand, the quality of detections largely determines the accuracy of recognition; on the other hand, the results of recognition can provide feedback to help reject false positives in the detection phase.

Recently, two methods [3, 27] that devise end-to-end trainable frameworks for scene text spotting have been proposed. Benefiting from the complementarity between detection and recognition, these unified models significantly outperform previous competitors. However, there are two major drawbacks in [3, 27]. First, neither of them can be trained in a completely end-to-end manner. [27] applies a curriculum learning paradigm [1] during training, in which the sub-network for text recognition is locked in the early iterations and the training data for each period is carefully selected. Busta et al. [3] first pre-train the networks for detection and recognition separately and then jointly train them until convergence. Two main reasons prevent [3, 27] from training the models in a smooth, end-to-end fashion. One is that the text recognition part requires accurate locations for training, while the locations in the early iterations are usually inaccurate. The other is that the adopted LSTM [17] or CTC loss [11] is more difficult to optimize than general CNNs. The second limitation of [3, 27] is that these methods focus only on reading horizontal or oriented text. However, the shapes of text instances in real-world scenarios may vary significantly, from horizontal or oriented to curved forms.

Fig. 1. Illustrations of different text spotting methods. Left: horizontal text spotting methods [27, 30]; middle: oriented text spotting methods [3]; right: our proposed method. Green bounding box: detection result; red text on green background: recognition result. (Color figure online)

In this paper, we propose a text spotter named Mask TextSpotter, which can detect and recognize text instances of arbitrary shapes. Here, arbitrary shapes refer to the various forms of text instances found in the real world. Inspired by Mask R-CNN [13], which can generate shape masks of objects, we detect text by segmenting the text instance regions, so our detector is able to handle text of arbitrary shapes. Besides, different from previous sequence-based recognition methods [26, 44, 45], which are designed for 1-D sequences, we recognize text via semantic segmentation in 2-D space to address the difficulty of reading irregular text instances. Another advantage is that recognition does not require accurate locations. Therefore, the detection task and the recognition task can be trained completely end-to-end, benefiting from feature sharing and joint optimization.

We validate the effectiveness of our model on datasets that include horizontal, oriented and curved text. The results demonstrate the advantages of the proposed algorithm in both text detection and end-to-end text recognition. Specifically, on ICDAR2015, evaluated at a single scale, our method achieves an F-Measure of 0.86 on the detection task and outperforms the previous top performers by \(13.2\%\) to \(25.3\%\) on the end-to-end recognition task.

The main contributions of this paper are four-fold. (1) We propose an end-to-end trainable model for text spotting, which enjoys a simple, smooth training scheme. (2) The proposed method can detect and recognize text of various shapes, including horizontal, oriented, and curved text. (3) In contrast to previous methods, precise text detection and recognition in our method are accomplished via semantic segmentation. (4) Our method achieves state-of-the-art performances in both text detection and text spotting on various benchmarks.

2 Related Work

2.1 Scene Text Detection

In scene text recognition systems, text detection plays an important role [59]. A large number of methods have been proposed to detect scene text [7, 15, 16, 19, 21, 23, 30, 31, 34–37, 43, 47, 48, 50, 52, 54–57]. In [21], Jaderberg et al. use Edge Boxes [60] to generate proposals and refine candidate boxes by regression. Zhang et al. [54] detect scene text by exploiting the symmetry property of text. Adapted from Faster R-CNN [40] and SSD [33] with well-designed modifications, [30, 56] are proposed to detect horizontal words.

Multi-oriented scene text detection has become a hot topic recently. Yao et al. [52] and Zhang et al. [55] detect multi-oriented scene text by semantic segmentation. Tian et al. [48] and Shi et al. [43] propose methods which first detect text segments and then link them into text instances by spatial relationship or link predictions. Zhou et al. [57] and He et al. [16] regress text boxes directly from dense segmentation maps. Lyu et al. [35] propose to detect and group the corner points of the text to generate text boxes. Rotation-sensitive regression for oriented scene text detection is proposed by Liao et al. [31].

Compared to the popularity of horizontal or multi-oriented scene text detection, few works focus on text instances of arbitrary shapes. Recently, detection of text with arbitrary shapes has gradually drawn the attention of researchers due to application requirements in real-life scenarios. In [41], Risnumawan et al. propose a system for arbitrary text detection based on text symmetry properties. In [4], a dataset focusing on curved text is proposed. Different from most of the above-mentioned methods, we propose to detect scene text by instance segmentation, which can handle text of arbitrary shapes.

2.2 Scene Text Recognition

Scene text recognition [46, 53] aims at decoding the detected or cropped image regions into character sequences. Previous scene text recognition approaches can be roughly divided into three branches: character-based methods, word-based methods, and sequence-based methods. Character-based recognition methods [2, 22] mostly first localize individual characters and then recognize and group them into words. In [20], Jaderberg et al. propose a word-based method which treats text recognition as a classification problem over a large set of common English words (90k). Sequence-based methods cast text recognition as a sequence labeling problem. In [44], Shi et al. use a CNN and an RNN to model image features and output the recognized sequences with CTC [11]. In [26, 45], Lee et al. and Shi et al. recognize scene text via attention-based sequence-to-sequence models.

The text recognition component in our framework can be classified as a character-based method. However, in contrast to previous character-based approaches, we use an FCN [42] to localize and classify characters simultaneously. Besides, compared with sequence-based methods, which are designed for 1-D sequences, our method is more suitable for handling irregular text (e.g., multi-oriented or curved text).

2.3 Scene Text Spotting

Most of the previous text spotting methods [12, 21, 29, 30] split the spotting process into two stages. They first use a scene text detector [21, 29, 30] to localize text instances and then use a text recognizer [20, 44] to obtain the recognized text. In [3, 27], Li et al. and Busta et al. propose end-to-end methods to localize and recognize text in a unified network, but require relatively complex training procedures. Compared with these methods, our proposed text spotter can not only be trained end-to-end completely, but also has the ability to detect and recognize arbitrary-shape (horizontal, oriented, and curved) scene text.

2.4 General Object Detection and Semantic Segmentation

With the rise of deep learning, general object detection and semantic segmentation have made great progress. A large number of object detection and segmentation methods [5, 6, 8, 9, 13, 28, 32, 33, 39, 40, 42] have been proposed. Benefiting from these methods, scene text detection and recognition have also achieved notable progress in the past few years. Our method is inspired by them as well. Specifically, our method is adapted from a general object instance segmentation model, Mask R-CNN [13]. However, there are key differences between the mask branch of our method and that in Mask R-CNN. Our mask branch can not only segment text regions but also predict character probability maps, which means that our method recognizes the character sequence within the maps rather than predicting an object mask only.

Fig. 2. Illustration of the architecture of our method.

3 Methodology

The proposed method is an end-to-end trainable text spotter, which can handle various shapes of text. It consists of an instance-segmentation based text detector and a character-segmentation based text recognizer.

3.1 Framework

The overall architecture of our proposed method is presented in Fig. 2. Functionally, the framework consists of four components: a feature pyramid network (FPN) [32] as the backbone, a region proposal network (RPN) [40] for generating text proposals, a Fast R-CNN [40] branch for bounding box regression, and a mask branch for text instance segmentation and character segmentation. In the training phase, text proposals are first generated by the RPN, and then the RoI features of the proposals are fed into the Fast R-CNN branch and the mask branch to generate the accurate text candidate boxes, the text instance segmentation maps, and the character segmentation maps.
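To make the data flow concrete, the following minimal Python sketch traces one forward pass through the four components; the module names are hypothetical placeholders, not the actual implementation.

```python
# A minimal sketch of the data flow through the four components; the module
# objects (backbone, rpn, fast_rcnn, mask_branch) are hypothetical callables.
def spot_text(image, backbone, rpn, fast_rcnn, mask_branch):
    features = backbone(image)                             # FPN feature maps P2..P6
    proposals = rpn(features)                              # text proposals
    boxes, scores = fast_rcnn(features, proposals)         # refined candidate boxes
    global_maps, char_maps = mask_branch(features, boxes)  # segmentation maps
    return boxes, scores, global_maps, char_maps
```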

Backbone. Text in natural images varies greatly in size. In order to build high-level semantic feature maps at all scales, we adopt a feature pyramid structure [32] backbone with a ResNet [14] of depth 50. FPN uses a top-down architecture to fuse features of different resolutions from a single-scale input, which improves accuracy with marginal cost.

RPN. The RPN is used to generate text proposals for the subsequent Fast R-CNN and mask branch. Following [32], we assign anchors to different stages depending on the anchor size. Specifically, the areas of the anchors are set to {\(32^2, 64^2, 128^2, 256^2, 512^2\)} pixels on the five stages {\(P_2, P_3, P_4, P_5, P_6\)} respectively. Different aspect ratios {0.5, 1, 2} are also adopted at each stage as in [40]. In this way, the RPN can handle text of various sizes and aspect ratios. RoI Align [13] is adopted to extract the region features of the proposals. Compared to RoI Pooling [8], RoI Align preserves more accurate location information, which is quite beneficial to the segmentation task in the mask branch. Note that no text-specific design, such as special aspect ratios or orientations of anchors, is adopted, unlike previous works [15, 30, 34].
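As an illustration of this anchor configuration, the sketch below enumerates the anchor shapes per pyramid stage; the height/width convention of the aspect ratio is an assumption.

```python
# Anchor areas per pyramid stage and the shared aspect ratios described above.
ANCHOR_AREAS = {"P2": 32**2, "P3": 64**2, "P4": 128**2, "P5": 256**2, "P6": 512**2}
ASPECT_RATIOS = (0.5, 1.0, 2.0)

def anchor_shapes(stage):
    """Return the (height, width) of the three anchors attached to a stage.
    Here ratio = height / width, an assumption about the convention."""
    area = ANCHOR_AREAS[stage]
    shapes = []
    for ratio in ASPECT_RATIOS:
        width = (area / ratio) ** 0.5
        height = ratio * width
        shapes.append((round(height), round(width)))
    return shapes

print(anchor_shapes("P4"))  # [(91, 181), (128, 128), (181, 91)]
```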

Fast R-CNN. The Fast R-CNN branch includes a classification task and a regression task. The main function of this branch is to provide more accurate bounding boxes for detection. The inputs of Fast R-CNN are in \(7 \times 7\) resolution, which are generated by RoI Align from the proposals produced by RPN.

Mask Branch. There are two tasks in the mask branch: a global text instance segmentation task and a character segmentation task. As shown in Fig. 3, given an input RoI, whose size is fixed to \(16 \times 64\), the mask branch passes it through four convolutional layers and a de-convolutional layer and predicts 38 maps (of size \(32 \times 128\)): a global text instance map, 36 character maps, and a background map of characters. The global text instance map gives an accurate localization of a text region, regardless of the shape of the text instance. The 36 character maps correspond to 26 letters and 10 Arabic numerals. The background map of characters, which excludes the character regions, is also needed for post-processing.
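A possible PyTorch realization of this head is sketched below; only the input/output sizes (a \(16 \times 64\) RoI mapped to 38 maps of size \(32 \times 128\)) follow the description above, while the channel widths and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class MaskBranchSketch(nn.Module):
    """Sketch of the mask branch: four conv layers, one de-conv layer, and a
    predictor producing 38 maps. Channel widths and kernel sizes are assumptions."""
    def __init__(self, in_channels=256, mid_channels=256):
        super().__init__()
        layers = []
        for _ in range(4):                                   # four 3x3 conv layers
            layers += [nn.Conv2d(in_channels, mid_channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = mid_channels
        self.convs = nn.Sequential(*layers)
        self.deconv = nn.ConvTranspose2d(mid_channels, mid_channels, 2, stride=2)
        # 38 maps: 1 global text instance + 36 characters + 1 character background
        self.predict = nn.Conv2d(mid_channels, 38, 1)

    def forward(self, roi_features):                         # (N, C, 16, 64)
        x = self.convs(roi_features)
        x = torch.relu(self.deconv(x))                       # upsample to 32x128
        return self.predict(x)                               # (N, 38, 32, 128)
```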

Fig. 3. Illustration of the mask branch. There are four convolutional layers, one de-convolutional layer, and a final convolutional layer which predicts maps of 38 channels (1 for the global text instance map; 36 for the character maps; 1 for the background map of characters).

3.2 Label Generation

For a training sample with the input image I and the corresponding ground truth, we generate targets for the RPN, Fast R-CNN and mask branch. Generally, the ground truth contains \(P=\{p_{1}, p_{2}, \ldots, p_{m}\}\) and \(C=\{c_{1}=(cc_{1},cl_{1}), c_{2}=(cc_{2},cl_{2}), \ldots, c_{n}=(cc_{n},cl_{n})\}\), where \(p_{i}\) is a polygon which represents the localization of a text region, and \(cc_{j}\) and \(cl_{j}\) are the category and location of a character respectively. Note that, in our method, C is not required for all training samples.

Fig. 4. (a) Label generation of the mask branch. Left: the blue box is a proposal yielded by the RPN, the red polygon and yellow boxes are the ground truth polygon and character boxes, and the green box is the horizontal rectangle which covers the polygon with minimal area. Right: the global map (top) and the character map (bottom). (b) Overview of the pixel voting algorithm. Left: the predicted character maps; right: for each connected region, we calculate the score for each character class by averaging the probability values in the corresponding region. (Color figure online)

We first transform the polygons into horizontal rectangles which cover the polygons with minimal areas. Then we generate targets for the RPN and Fast R-CNN following [8, 32, 40]. Two types of target maps are generated for the mask branch from the ground truth P, C (which may not exist) and the proposals yielded by the RPN: a global map for text instance segmentation and a character map for character semantic segmentation. Given a positive proposal r, we first use the matching mechanism of [8, 32, 40] to obtain the best matched horizontal rectangle. The corresponding polygon as well as the characters (if any) can then be obtained. Next, the matched polygon and character boxes are shifted and resized to align with the proposal and the target map of size \(H\times W\), according to the following formulas:

$$\begin{aligned} B_{x}=(B_{x_0}-min(r_{x})) \times W / (max(r_{x})-min(r_{x})) \end{aligned}$$
(1)
$$\begin{aligned} B_{y}=(B_{y_0}-min(r_{y})) \times H / (max(r_{y})-min(r_{y})) \end{aligned}$$
(2)

where \((B_{x},B_{y})\) and \((B_{x_0},B_{y_0})\) are the updated and original vertices of the polygon and all character boxes; \((r_{x}, r_{y})\) are the vertices of the proposal r.

After that, the target global map can be generated by simply drawing the normalized polygon on a zero-initialized mask and filling the polygon region with the value 1. The character map generation is visualized in Fig. 4a. We first shrink all character bounding boxes by fixing their center points and shortening the sides to a fourth of their original length. Then, the values of the pixels inside the shrunk character bounding boxes are set to their corresponding category indices and those outside are set to 0. If there are no character bounding box annotations, all values are set to \(-1\).
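The two steps above, normalizing vertices with Eqs. (1)-(2) and shrinking character boxes, can be sketched as follows; the array layout and helper names are assumptions.

```python
import numpy as np

def normalize_vertices(points, proposal, out_h, out_w):
    """Map polygon / character-box vertices into the H x W target map of a
    proposal, following Eqs. (1)-(2). `points` is a (K, 2) array of (x, y);
    `proposal` is (xmin, ymin, xmax, ymax)."""
    x0, y0, x1, y1 = proposal
    pts = np.asarray(points, dtype=np.float32)
    xs = (pts[:, 0] - x0) * out_w / (x1 - x0)
    ys = (pts[:, 1] - y0) * out_h / (y1 - y0)
    return np.stack([xs, ys], axis=1)

def shrink_box(box, factor=0.25):
    """Shrink a character box around its center so that each side becomes a
    fourth of its original length, as done for the character map targets."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = (x1 - x0) * factor, (y1 - y0) * factor
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```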

3.3 Optimization

As discussed in Sect. 3.1, our model includes multiple tasks. We naturally define a multi-task loss function:

$$\begin{aligned} L = L_{rpn} + \alpha _1 L_{rcnn} + \alpha _2 L_{mask}, \end{aligned}$$
(3)

where \(L_{rpn}\) and \(L_{rcnn}\) are the loss functions of RPN and Fast R-CNN, which are identical to those in [8, 40]. The mask loss \(L_{mask}\) consists of a global text instance segmentation loss \(L_{global}\) and a character segmentation loss \(L_{char}\):

$$\begin{aligned} L_{mask} = L_{global} + \beta L_{char}, \end{aligned}$$
(4)

where \(L_{global}\) is an average binary cross-entropy loss and \(L_{char}\) is a weighted spatial soft-max loss. In this work, \(\alpha _1\), \(\alpha _2\), and \(\beta \) are all empirically set to 1.0.

Text Instance Segmentation Loss. The output of the text instance segmentation task is a single map. Let N be the number of pixels in the global map, \(y_n\) be the pixel label (\(y_n \in \{0,1\}\)), and \(x_n\) be the output pixel value; we define \(L_{global}\) as follows:

$$\begin{aligned} L_{global} = -\frac{1}{N}\sum _{n=1}^{N}\left[ y_n \times log(S(x_n)) + (1-y_n) \times log(1- S(x_n)) \right] \end{aligned}$$
(5)

where S(x) is a sigmoid function.
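Under the assumption that \(x_n\) are raw logits, Eq. (5) is simply the mean sigmoid binary cross-entropy, e.g.:

```python
import torch.nn.functional as F

def global_text_loss(logits, target):
    """Eq. (5): average binary cross-entropy over all pixels of the global
    map. `logits` are the raw outputs x_n; the sigmoid is applied internally."""
    return F.binary_cross_entropy_with_logits(logits, target.float(), reduction="mean")
```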

Character Segmentation Loss. The output of the character segmentation consists of 37 maps, which correspond to 37 classes (36 classes of characters and the background class). Let T be the number of classes, N be the number of pixels in each map. The output maps X can be viewed as an \(N \times T\) matrix. In this way, the weighted spatial soft-max loss can be defined as follows:

$$\begin{aligned} L_{char} = -\frac{1}{N}\sum _{n=1}^{N}W_n\sum _{t=0}^{T-1} Y_{n,t} log(\frac{e^{X_{n,t}}}{\sum _{k=0}^{T-1} e^{X_{n,k}}}), \end{aligned}$$
(6)

where Y is the corresponding ground truth of X. The weight W is used to balance the losses of the positives (character classes) and the background class. Let \(N_{neg}\) be the number of background pixels and 0 be the background class index; the weights can be calculated as:

$$\begin{aligned} W_i = {\left\{ \begin{array}{ll} 1&{} \text {if } Y_{i,0}=1, \\ N_{neg} / (N - N_{neg})&{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(7)
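A possible PyTorch sketch of Eqs. (6)-(7) for a single RoI is given below; the handling of unannotated RoIs (target filled with \(-1\)) follows Sect. 3.2, and the tensor layout is an assumption.

```python
import torch
import torch.nn.functional as F

def char_seg_loss(logits, target):
    """Weighted spatial soft-max loss, Eqs. (6)-(7), for one RoI.
    logits: (T, H, W) with T = 37 classes; target: (H, W) with values in
    {0..36}, or -1 everywhere when no character annotation exists."""
    if (target < 0).all():                      # unannotated RoI: contributes no loss
        return logits.sum() * 0.0
    T = logits.shape[0]
    logits = logits.reshape(T, -1).t()          # (N, T), N = number of pixels
    target = target.reshape(-1).long()
    n = target.numel()
    n_neg = (target == 0).sum().float()         # background pixels
    n_pos = (target > 0).sum().clamp(min=1).float()
    # Eq. (7): weight 1 for background pixels, N_neg / (N - N_neg) for characters
    weights = torch.where(target == 0, torch.ones_like(n_neg), n_neg / n_pos)
    per_pixel = F.cross_entropy(logits, target, reduction="none")
    return (weights * per_pixel).sum() / n
```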

Note that in inference, a sigmoid function and a soft-max function are applied to generate the global map and the character segmentation maps respectively.

3.4 Inference

Different from the training process, where the input RoIs of the mask branch come from the RPN, in the inference phase we use the outputs of Fast R-CNN as proposals to generate the predicted global maps and character maps, since the Fast R-CNN outputs are more accurate.

Specifically, the inference process is as follows: first, given a test image, we obtain the outputs of Fast R-CNN as in [40] and filter out the redundant candidate boxes by NMS; then, the kept candidate boxes are fed into the mask branch to generate the global maps and the character maps; finally, the predicted polygons are obtained directly by calculating the contours of text regions on the global maps, and the character sequences are generated by our proposed pixel voting algorithm on the character maps.
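As a sketch of the last step, the contours of a predicted global map can be extracted with OpenCV and mapped back into image coordinates; the binarization threshold and the OpenCV (>= 4) API are assumptions here, not details from the paper.

```python
import cv2
import numpy as np

def global_map_to_polygon(global_map, box, threshold=0.5):
    """Extract a text polygon from the predicted global map of one candidate
    box; `threshold` is an assumed binarization value. Requires OpenCV >= 4."""
    mask = (global_map > threshold).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(np.float32)
    # map mask coordinates back to the candidate box in the original image
    x0, y0, x1, y1 = box
    contour[:, 0] = contour[:, 0] * (x1 - x0) / mask.shape[1] + x0
    contour[:, 1] = contour[:, 1] * (y1 - y0) / mask.shape[0] + y0
    return contour
```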

Pixel Voting. We decode the predicted character maps into character sequences with our proposed pixel voting algorithm. We first binarize the background map, whose values range from 0 to 255, with a threshold of 192. We then obtain all character regions from the connected regions in the binarized map. For each region, we calculate its mean value in every character map; these values can be regarded as the character class probabilities of the region, and the character class with the largest mean value is assigned to the region. Finally, we group all the characters from left to right, following English writing order.
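The following sketch illustrates the pixel voting step; the character ordering in CHARSET and the polarity of the background binarization are assumptions.

```python
import cv2
import numpy as np

CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz"     # class order is an assumption

def pixel_voting(char_maps, background_map):
    """Decode (36, H, W) character probability maps into a string.
    `background_map` is in [0, 255]; pixels below 192 are assumed to belong
    to character regions."""
    binary = (background_map < 192).astype(np.uint8)
    num_labels, labels = cv2.connectedComponents(binary)
    regions = []
    for r in range(1, num_labels):                    # label 0 is the background
        ys, xs = np.where(labels == r)
        scores = char_maps[:, ys, xs].mean(axis=1)    # mean probability per class
        regions.append((xs.mean(), CHARSET[int(scores.argmax())]))
    # group the characters from left to right
    return "".join(ch for _, ch in sorted(regions))
```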

Weighted Edit Distance. Edit distance can be used to find the best-matched word for a predicted sequence given a lexicon. However, multiple words may be matched with the same minimal edit distance, and the algorithm cannot decide which one is the best. The main reason for this issue is that all operations (delete, insert, replace) in the original edit distance algorithm have the same cost, which is often unreasonable in practice.

Fig. 5. Illustration of the edit distance and our proposed weighted edit distance. The red characters are the characters to be deleted, inserted, and replaced. Green characters are the candidate characters. \(p_{index}^c\) is the character probability, where index is the character index and c is the current character. (Color figure online)

Inspired by [51], we propose a weighted edit distance algorithm. As shown in Fig. 5, different from the standard edit distance, which assigns the same cost to all operations, the costs in our weighted edit distance depend on the character probability \(p_{index}^c\) yielded by pixel voting. Mathematically, the weighted edit distance between two strings a and b, whose lengths are |a| and |b| respectively, can be described as \(D_{a,b}(|a|,|b|)\), where

$$\begin{aligned} D_{a,b}(i,j)={\left\{ \begin{array}{ll} \max (i,j) &{} \text { if } \min (i,j)=0,\\ \min {\left\{ \begin{array}{l} D_{a,b}(i-1,j)+C_d\\ D_{a,b}(i,j-1)+C_i\\ D_{a,b}(i-1,j-1)+C_r \times 1_{(a_{i}\ne b_{j})}\end{array}\right. } &{} \text { otherwise,}\end{array}\right. } \end{aligned}$$
(8)

where \(1_{(a_{i}\ne b_{j})}\) is the indicator function, equal to 0 when \(a_{i}=b_{j}\) and equal to 1 otherwise; \(D_{a,b}(i,j)\) is the distance between the first i characters of a and the first j characters of b; \(C_d\), \(C_i\), and \(C_r\) are the deletion, insertion, and replacement costs respectively. In contrast, these costs are all set to 1 in the standard edit distance.
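The recurrence in Eq. (8) can be evaluated with standard dynamic programming; below is a sketch in which the three costs are supplied as callbacks (this is where the character probabilities from pixel voting would enter), and with all costs fixed to 1 it reduces to the standard edit distance.

```python
def weighted_edit_distance(a, b, cost_delete, cost_insert, cost_replace):
    """Dynamic-programming evaluation of Eq. (8). The cost callbacks return
    C_d, C_i and C_r for a given position/character; how they are derived
    from the pixel-voting probabilities is left as an assumption."""
    la, lb = len(a), len(b)
    D = [[0.0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        D[i][0] = float(i)                    # min(i, j) = 0 base cases
    for j in range(lb + 1):
        D[0][j] = float(j)
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            replace = 0.0 if a[i - 1] == b[j - 1] else cost_replace(i - 1, a[i - 1], b[j - 1])
            D[i][j] = min(D[i - 1][j] + cost_delete(i - 1, a[i - 1]),
                          D[i][j - 1] + cost_insert(j - 1, b[j - 1]),
                          D[i - 1][j - 1] + replace)
    return D[la][lb]

# With unit costs this reduces to the standard edit distance:
print(weighted_edit_distance("text", "test",
                             lambda i, c: 1.0, lambda j, c: 1.0,
                             lambda i, c, d: 1.0))   # 1.0
```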

4 Experiments

To validate the effectiveness of the proposed method, we conduct experiments and compare with other state-of-the-art methods on three public datasets: a horizontal text set ICDAR2013 [25], an oriented text set ICDAR2015 [24] and a curved text set Total-Text [4].

4.1 Datasets

SynthText. is a synthetic dataset proposed by [12], containing about 800,000 images. Most of the text instances in this dataset are multi-oriented and annotated with word-level and character-level rotated bounding boxes, as well as text sequences.

ICDAR2013. is a dataset proposed in Challenge 2 of the ICDAR 2013 Robust Reading Competition [25] which focuses on the horizontal text detection and recognition in natural images. There are 229 images in the training set and 233 images in the test set. Besides, the bounding box and the transcription are also provided for each word-level and character-level text instance.

ICDAR2015. is proposed in Challenge 4 of the ICDAR 2015 Robust Reading Competition [24]. Compared to ICDAR2013, which focuses on "focused text" in particular scenarios, ICDAR2015 is more concerned with incidental scene text detection and recognition. It contains 1000 training images and 500 test images. All training images are annotated with word-level quadrangles as well as corresponding transcriptions. Note that only localization annotations of words are used in our training stage.

Total-Text. is a comprehensive scene text dataset proposed by [4]. In addition to horizontal and oriented text, Total-Text also contains a large amount of curved text. It consists of 1255 training images and 300 test images. All images are annotated with word-level polygons and transcriptions. Note that we only use the localization annotations in the training phase.

4.2 Implementation Details

Training. Different from previous text spotting methods which use two independent models [22, 30] (the detector and the recognizer) or an alternating training strategy [27], all subnets of our model can be trained synchronously and end-to-end. The whole training process contains two stages: pre-training on SynthText and fine-tuning on real-world data.

In the pre-training stage, we set the mini-batch size to 8, and the shorter edges of the input images are resized to 800 pixels while keeping the aspect ratio. The batch sizes of RPN and Fast R-CNN are set to 256 and 512 per image with a 1:3 sample ratio of positives to negatives. The batch size of the mask branch is 16. In the fine-tuning stage, data augmentation and multi-scale training are applied due to the lack of real samples. Specifically, for data augmentation, we randomly rotate the input images within an angle range of \([-15^\circ , 15^\circ ]\). Some other augmentation tricks, such as randomly modifying the hue, brightness, and contrast, are also used following [33]. For multi-scale training, the shorter sides of the input images are randomly resized to one of three scales (600, 800, 1000). Besides, following [27], an extra 1162 images for character detection from [56] are also used as training samples. The mini-batch size is kept at 8, and in each mini-batch the sample ratio of different datasets is set to 4:1:1:1:1 for SynthText, ICDAR2013, ICDAR2015, Total-Text and the extra images respectively. The batch sizes of RPN and Fast R-CNN are kept as in the pre-training stage, and that of the mask branch is set to 64 when fine-tuning.
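A small helper for the shorter-side resizing used in multi-scale training might look like the sketch below; the interpolation behavior of cv2.resize defaults is an assumption about implementation details not given in the paper.

```python
import random
import cv2

TRAIN_SCALES = (600, 800, 1000)   # shorter-side lengths for multi-scale training

def resize_shorter_side(image, scale=None):
    """Resize `image` so its shorter side equals `scale` (randomly chosen from
    TRAIN_SCALES if not given) while keeping the aspect ratio."""
    if scale is None:
        scale = random.choice(TRAIN_SCALES)
    h, w = image.shape[:2]
    ratio = scale / float(min(h, w))
    return cv2.resize(image, (round(w * ratio), round(h * ratio)))
```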

We optimize our model using SGD with a weight decay of 0.0001 and momentum of 0.9. In the pre-training stage, we train our model for 170k iterations, with an initial learning rate of 0.005; the learning rate is decayed by a factor of 10 at the 120k-th iteration. In the fine-tuning stage, the initial learning rate is set to 0.001 and then decreased to 0.0001 at the 40k-th iteration. The fine-tuning process is terminated at the 80k-th iteration.

Inference. In the inference stage, the scales of the input images depend on different datasets. After NMS, 1000 proposals are fed into Fast R-CNN. False alarms and redundant candidate boxes are filtered out by Fast R-CNN and NMS respectively. The kept candidate boxes are input to the mask branch to generate the global text instance maps and the character maps. Finally, the text instance bounding boxes and sequences are generated from the predicted maps.

We implement our method in Caffe2 and conduct all experiments on a regular workstation with Nvidia Titan Xp GPUs. The model is trained in parallel and evaluated on a single GPU.

4.3 Horizontal Text

We evaluate our model on ICDAR2013 dataset to verify its effectiveness in detecting and recognizing horizontal text. We resize the shorter sides of all input images to 1000 and evaluate the results on-line.

The results of our model are listed and compared with other state-of-the-art methods in Tables 1 and 3. As shown, our method achieves state-of-the-art results in detection, word spotting and end-to-end recognition. Specifically, for detection, though evaluated at a single scale, our method outperforms some previous methods which are evaluated in a multi-scale setting [16, 18] (F-Measure: \(91.7\%\) vs. \(90.3\%\)); for word spotting, our method is comparable to the previous best method; for end-to-end recognition, although impressive results have been achieved by [27, 30], our method still surpasses them by \(1.1\%\) to \(1.9\%\).

4.4 Oriented Text

We verify the superiority of our method in detecting and recognizing oriented text by conducting experiments on ICDAR2015. We input the images at three different scales: the original scale (\(720 \times 1280\)) and two larger scales where the shorter sides of the input images are 1000 and 1600, since ICDAR2015 contains many small text instances. We evaluate our method on-line and compare it with other methods in Tables 2 and 3. Our method outperforms previous methods by a large margin in both detection and recognition. For detection, when evaluated at the original scale, our method achieves an F-Measure of \(84\%\), \(3.0\%\) higher than the current best one [16], which is evaluated at multiple scales. When evaluated at a larger scale, a more impressive result can be achieved (F-Measure: \(86.0\%\)), outperforming the competitors by at least \(5.0\%\). Besides, our method also achieves remarkable results on word spotting and end-to-end recognition. Compared with the state of the art, our method achieves significant improvements of \(13.2\%\) to \(25.3\%\) across all evaluation settings.

Table 1. Results on ICDAR2013. “S”, “W” and “G” mean recognition with strong, weak and generic lexicon respectively.
Table 2. Results on ICDAR2015. “S”, “W” and “G” mean recognition with strong, weak and generic lexicon respectively.
Fig. 6. Visualization results on ICDAR 2013 (left), ICDAR 2015 (middle) and Total-Text (right).

Table 3. The detection results on ICDAR2013 and ICDAR2015. For ICDAR2013, all methods are evaluated under the “DetEval” evaluation protocol. The short sides of the input image in “Ours (det only)” and “Ours” are set to 1000.
Fig. 7. Qualitative comparisons on Total-Text without lexicon. Top: results of TextBoxes [30]; bottom: results of our method.

4.5 Curved Text

Detecting and recognizing text of arbitrary shapes (e.g. curved text) is a significant advantage of our method over other methods. We conduct experiments on Total-Text to verify the robustness of our method in detecting and recognizing curved text. Similarly, we input the test images with the shorter edges resized to 1000. The evaluation protocol of detection is provided by [4]. The evaluation protocol of end-to-end recognition follows that of ICDAR 2015, while changing the representation of polygons from four vertices to an arbitrary number of vertices in order to handle polygons of arbitrary shapes.

Table 4. Results on Total-Text. “None” means recognition without any lexicon. “Full” lexicon contains all words in the test set.

To compare with other methods, we also train a model using the released code of [30] with the same training data. As shown in Fig. 7, our method has a large superiority in both detection and recognition for curved text. The results in Table 4 show that our method exceeds [30] by 8.8 points in detection and by at least \(16.6\%\) in end-to-end recognition. The significant improvements in detection mainly come from the more accurate localization outputs, which encircle the text regions with polygons rather than horizontal rectangles. Besides, our method is more suitable for handling sequences in 2-D space (such as curves), while the sequence recognition networks used in [3, 27, 30] are designed for 1-D sequences.

4.6 Speed

Compared to previous methods, our proposed method exhibits a good speed-accuracy trade-off. It can run at 6.9 FPS with an input scale of \(720 \times 1280\). Although a bit slower than the fastest method [3], it exceeds [3] by a large margin in accuracy. Moreover, our speed is about 4.4 times that of [27], which is the current state of the art on ICDAR2013.

4.7 Ablation Experiments

Some ablation experiments, including “With or without character maps”, “With or without character annotation”, and “With or without weighted edit distance”, are discussed in the Supplementary.

5 Conclusion

In this paper, we propose a text spotter which detects and recognizes scene text in a unified network and can be trained completely end-to-end. Compared with previous methods, our proposed network is easy to train and has the ability to detect and recognize irregular text (e.g. curved text). The impressive performance on all the datasets, which include horizontal, oriented and curved text, demonstrates the effectiveness and robustness of our method for text detection and end-to-end text recognition.