1 Introduction

In recent years, scene text detection and recognition have attracted growing research interest from the computer vision community, especially after the revival of neural networks and the growth of image datasets. Scene text detection and recognition provide an automatic, rapid way to access the textual information embodied in natural scenes, benefiting a variety of real-world applications, such as geo-location [58], instant translation, and assistance for the blind.

Scene text spotting, which aims at concurrently localizing and recognizing text in natural scenes, has been studied in numerous previous works [21, 49]. However, in most works, except [3, 27], text detection and subsequent recognition are handled separately: text regions are first extracted from the original image by a trained detector and then fed into a recognition module. This procedure seems simple and natural, but may lead to sub-optimal performance for both detection and recognition, since these two tasks are highly correlated and complementary. On one hand, the quality of detections largely determines the accuracy of recognition; on the other hand, the results of recognition can provide feedback to help reject false positives in the detection phase.

Recently, two methods [3, 27] that devise end-to-end trainable frameworks for scene text spotting have been proposed. Benefiting from the complementarity between detection and recognition, these unified models significantly outperform previous competitors. However, there are two major drawbacks in [3, 27]. First, neither of them can be trained in a completely end-to-end manner. [27] applies a curriculum learning paradigm [1] during training, in which the sub-network for text recognition is locked in the early iterations and the training data for each period is carefully selected. Busta et al. [3] first pre-train the networks for detection and recognition separately and then jointly train them until convergence. Two main reasons prevent [3, 27] from training the models in a smooth, end-to-end fashion. One is that the text recognition part requires accurate locations for training, while the locations in the early iterations are usually inaccurate. The other is that the adopted LSTM [17] or CTC loss [11] is more difficult to optimize than general CNNs. The second limitation of [3, 27] is that these methods focus only on reading horizontal or oriented text. However, the shapes of text instances in real-world scenarios may vary significantly, from horizontal or oriented to curved forms.

Fig. 1. Illustrations of different text spotting methods. Left: horizontal text spotting methods [27, 30]; middle: oriented text spotting methods [3]; right: our proposed method. Green bounding box: detection result; red text on green background: recognition result. (Color figure online)

In this paper, we propose a text spotter named Mask TextSpotter, which can detect and recognize text instances of arbitrary shapes. Here, arbitrary shapes refer to the various forms of text instances found in the real world. Inspired by Mask R-CNN [13], which can generate shape masks of objects, we detect text by segmenting the text instance regions, so our detector is able to handle text of arbitrary shapes. Besides, different from previous sequence-based recognition methods [26, 44, 45], which are designed for 1-D sequences, we recognize text via semantic segmentation in 2-D space to address the difficulty of reading irregular text instances. Another advantage is that recognition does not require accurate locations. Therefore, the detection task and the recognition task can be trained completely end-to-end, benefiting from feature sharing and joint optimization.

We validate the effectiveness of our model on datasets that include horizontal, oriented and curved text. The results demonstrate the advantages of the proposed algorithm in both text detection and end-to-end text recognition. Specifically, on ICDAR2015, evaluated at a single scale, our method achieves an F-Measure of 0.86 on the detection task and outperforms the previous top performers by \(13.2\%\) to \(25.3\%\) on the end-to-end recognition task.

The main contributions of this paper are four-fold. (1) We propose an end-to-end trainable model for text spotting, which enjoys a simple, smooth training scheme. (2) The proposed method can detect and recognize text of various shapes, including horizontal, oriented, and curved text. (3) In contrast to previous methods, precise text detection and recognition in our method are accomplished via semantic segmentation. (4) Our method achieves state-of-the-art performances in both text detection and text spotting on various benchmarks.

2 Related Work

2.1 Scene Text Detection

In scene text recognition systems, text detection plays an important role [59]. A large number of methods have been proposed to detect scene text [7, 15, 16, 19, 21, 23, 30, 31, 34–37, 43, 47, 48, 50, 52, 54–57]. In [21], Jaderberg et al. use Edge Boxes [60] to generate proposals and refine candidate boxes by regression. Zhang et al. [54] detect scene text by exploiting the symmetry property of text. Adapted from Faster R-CNN [40] and SSD [33] with well-designed modifications, [30, 56] are proposed to detect horizontal words.

Multi-oriented scene text detection has become a hot topic recently. Yao et al. [52] and Zhang et al. [55] detect multi-oriented scene text by semantic segmentation. Tian et al. [48] and Shi et al. [43] propose methods which first detect text segments and then link them into text instances by spatial relationship or link predictions. Zhou et al. [57] and He et al. [16] regress text boxes directly from dense segmentation maps. Lyu et al. [35] propose to detect and group the corner points of the text to generate text boxes. Rotation-sensitive regression for oriented scene text detection is proposed by Liao et al. [31].

Compared to the popularity of horizontal or multi-oriented scene text detection, few works focus on text instances of arbitrary shapes. Recently, detection of text with arbitrary shapes has gradually drawn the attention of researchers due to application requirements in real-life scenarios. In [41], Risnumawan et al. propose a system for arbitrary text detection based on text symmetry properties. In [4], a dataset focusing on curved text is proposed. Different from most of the above-mentioned methods, we propose to detect scene text by instance segmentation, which can handle text of arbitrary shapes.

2.2 Scene Text Recognition

Scene text recognition [46, 53] aims at decoding the detected or cropped image regions into character sequences. Previous scene text recognition approaches can be roughly divided into three branches: character-based methods, word-based methods, and sequence-based methods. Character-based recognition methods [2, 22] mostly first localize individual characters and then recognize and group them into words. In [20], Jaderberg et al. propose a word-based method which treats text recognition as a classification problem over a large set of common English words (90k). Sequence-based methods cast text recognition as a sequence labeling problem. In [44], Shi et al. use a CNN and an RNN to model image features and output the recognized sequences with CTC [11]. In [26, 45], Lee et al. and Shi et al. recognize scene text via attention-based sequence-to-sequence models.

The text recognition component in our framework can be classified as a character-based method. However, in contrast to previous character-based approaches, we use an FCN [42] to localize and classify characters simultaneously. Besides, compared with sequence-based methods, which are designed for 1-D sequences, our method is more suitable for handling irregular text (e.g., multi-oriented or curved text).

2.3 Scene Text Spotting

Most of the previous text spotting methods [12, 21, 29, 30] split the spotting process into two stages. They first use a scene text detector [21, 29, 30] to localize text instances and then use a text recognizer [20, 44] to obtain the recognized text. In [3, 27], Li et al. and Busta et al. propose end-to-end methods to localize and recognize text in a unified network, but require relatively complex training procedures. Compared with these methods, our proposed text spotter can not only be trained end-to-end completely, but also has the ability to detect and recognize arbitrary-shape (horizontal, oriented, and curved) scene text.

2.4 General Object Detection and Semantic Segmentation

With the rise of deep learning, general object detection and semantic segmentation have made great progress. A large number of object detection and segmentation methods [5, 6, 8, 9, 13, 28, 32, 33, 39, 40, 42] have been proposed. Benefiting from these methods, scene text detection and recognition have also achieved notable progress in the past few years. Our method is inspired by them as well. Specifically, our method is adapted from a general object instance segmentation model, Mask R-CNN [13]. However, there are key differences between the mask branch of our method and that in Mask R-CNN. Our mask branch can not only segment text regions but also predict character probability maps, which means that our method recognizes the character sequence within the maps rather than predicting an object mask only.

Fig. 2. Illustration of the architecture of our method.

3 Methodology

The proposed method is an end-to-end trainable text spotter, which can handle various shapes of text. It consists of an instance-segmentation based text detector and a character-segmentation based text recognizer.

3.1 Framework

The overall architecture of our proposed method is presented in Fig. 2. Functionally, the framework consists of four components: a feature pyramid network (FPN) [32] as the backbone, a region proposal network (RPN) [40] for generating text proposals, a Fast R-CNN [40] branch for bounding box regression, and a mask branch for text instance segmentation and character segmentation. In the training phase, text proposals are first generated by the RPN, and then the RoI features of the proposals are fed into the Fast R-CNN branch and the mask branch to generate the accurate text candidate boxes, the text instance segmentation maps, and the character segmentation maps.
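To make the data flow concrete, the following minimal Python sketch traces one forward pass through the four components; the module names are hypothetical placeholders, not the actual implementation.

```python
# A minimal sketch of the data flow through the four components; the module
# objects (backbone, rpn, fast_rcnn, mask_branch) are hypothetical callables.
def spot_text(image, backbone, rpn, fast_rcnn, mask_branch):
    features = backbone(image)                             # FPN feature maps P2..P6
    proposals = rpn(features)                              # text proposals
    boxes, scores = fast_rcnn(features, proposals)         # refined candidate boxes
    global_maps, char_maps = mask_branch(features, boxes)  # segmentation maps
    return boxes, scores, global_maps, char_maps
```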

Backbone. Text in natural images varies greatly in size. In order to build high-level semantic feature maps at all scales, we adopt a feature pyramid structure [32] backbone with a ResNet [14] of depth 50. FPN uses a top-down architecture to fuse features of different resolutions from a single-scale input, which improves accuracy with marginal cost.

RPN. The RPN is used to generate text proposals for the subsequent Fast R-CNN and mask branch. Following [32], we assign anchors to different stages depending on the anchor size. Specifically, the areas of the anchors are set to {\(32^2, 64^2, 128^2, 256^2, 512^2\)} pixels on the five stages {\(P_2, P_3, P_4, P_5, P_6\)} respectively. Different aspect ratios {0.5, 1, 2} are also adopted at each stage as in [40]. In this way, the RPN can handle text of various sizes and aspect ratios. RoI Align [13] is adopted to extract the region features of the proposals. Compared to RoI Pooling [8], RoI Align preserves more accurate location information, which is quite beneficial to the segmentation task in the mask branch. Note that no text-specific design, such as special aspect ratios or orientations of anchors, is adopted, unlike previous works [15, 30, 34].
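As an illustration of this anchor configuration, the sketch below enumerates the anchor shapes per pyramid stage; the height/width convention of the aspect ratio is an assumption.

```python
# Anchor areas per pyramid stage and the shared aspect ratios described above.
ANCHOR_AREAS = {"P2": 32**2, "P3": 64**2, "P4": 128**2, "P5": 256**2, "P6": 512**2}
ASPECT_RATIOS = (0.5, 1.0, 2.0)

def anchor_shapes(stage):
    """Return the (height, width) of the three anchors attached to a stage.
    Here ratio = height / width, an assumption about the convention."""
    area = ANCHOR_AREAS[stage]
    shapes = []
    for ratio in ASPECT_RATIOS:
        width = (area / ratio) ** 0.5
        height = ratio * width
        shapes.append((round(height), round(width)))
    return shapes

print(anchor_shapes("P4"))  # [(91, 181), (128, 128), (181, 91)]
```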

Fast R-CNN. The Fast R-CNN branch includes a classification task and a regression task. The main function of this branch is to provide more accurate bounding boxes for detection. The inputs of Fast R-CNN are in \(7 \times 7\) resolution, which are generated by RoI Align from the proposals produced by RPN.

Mask Branch. There are two tasks in the mask branch: a global text instance segmentation task and a character segmentation task. As shown in Fig. 3, given an input RoI, whose size is fixed to \(16 \times 64\), the mask branch passes it through four convolutional layers and a de-convolutional layer and predicts 38 maps (of size \(32 \times 128\)): a global text instance map, 36 character maps, and a background map of characters. The global text instance map gives an accurate localization of a text region, regardless of the shape of the text instance. The 36 character maps correspond to 26 letters and 10 Arabic numerals. The background map of characters, which excludes the character regions, is also needed for post-processing.
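A possible PyTorch realization of this head is sketched below; only the input/output sizes (a \(16 \times 64\) RoI mapped to 38 maps of size \(32 \times 128\)) follow the description above, while the channel widths and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class MaskBranchSketch(nn.Module):
    """Sketch of the mask branch: four conv layers, one de-conv layer, and a
    predictor producing 38 maps. Channel widths and kernel sizes are assumptions."""
    def __init__(self, in_channels=256, mid_channels=256):
        super().__init__()
        layers = []
        for _ in range(4):                                   # four 3x3 conv layers
            layers += [nn.Conv2d(in_channels, mid_channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = mid_channels
        self.convs = nn.Sequential(*layers)
        self.deconv = nn.ConvTranspose2d(mid_channels, mid_channels, 2, stride=2)
        # 38 maps: 1 global text instance + 36 characters + 1 character background
        self.predict = nn.Conv2d(mid_channels, 38, 1)

    def forward(self, roi_features):                         # (N, C, 16, 64)
        x = self.convs(roi_features)
        x = torch.relu(self.deconv(x))                       # upsample to 32x128
        return self.predict(x)                               # (N, 38, 32, 128)
```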

Fig. 3. Illustration of the mask branch. There are four convolutional layers, one de-convolutional layer, and a final convolutional layer which predicts maps of 38 channels (1 for the global text instance map; 36 for the character maps; 1 for the background map of characters).

3.2 Label Generation

For a training sample with the input image I and the corresponding ground truth, we generate targets for the RPN, Fast R-CNN and mask branch. Generally, the ground truth contains \(P=\{p_{1}, p_{2}, \ldots, p_{m}\}\) and \(C=\{c_{1}=(cc_{1},cl_{1}), c_{2}=(cc_{2},cl_{2}), \ldots, c_{n}=(cc_{n},cl_{n})\}\), where \(p_{i}\) is a polygon which represents the localization of a text region, and \(cc_{j}\) and \(cl_{j}\) are the category and location of a character respectively. Note that, in our method, C is not required for all training samples.

Fig. 4. (a) Label generation of the mask branch. Left: the blue box is a proposal yielded by the RPN, the red polygon and yellow boxes are the ground truth polygon and character boxes, and the green box is the horizontal rectangle which covers the polygon with minimal area. Right: the global map (top) and the character map (bottom). (b) Overview of the pixel voting algorithm. Left: the predicted character maps; right: for each connected region, we calculate the score for each character class by averaging the probability values in the corresponding region. (Color figure online)

We first transform the polygons into horizontal rectangles which cover the polygons with minimal areas. Then we generate targets for the RPN and Fast R-CNN following [8, 32, 40]. Two types of target maps are generated for the mask branch from the ground truth P, C (which may not exist) and the proposals yielded by the RPN: a global map for text instance segmentation and a character map for character semantic segmentation. Given a positive proposal r, we first use the matching mechanism of [8, 32, 40] to obtain the best matched horizontal rectangle. The corresponding polygon as well as the characters (if any) can then be obtained. Next, the matched polygon and character boxes are shifted and resized to align with the proposal and the target map of size \(H\times W\), according to the following formulas:

$$\begin{aligned} B_{x}=(B_{x_0}-min(r_{x})) \times W / (max(r_{x})-min(r_{x})) \end{aligned}$$
(1)
$$\begin{aligned} B_{y}=(B_{y_0}-min(r_{y})) \times H / (max(r_{y})-min(r_{y})) \end{aligned}$$
(2)

where \((B_{x},B_{y})\) and \((B_{x_0},B_{y_0})\) are the updated and original vertices of the polygon and all character boxes; \((r_{x}, r_{y})\) are the vertices of the proposal r.

After that, the target global map can be generated by simply drawing the normalized polygon on a zero-initialized mask and filling the polygon region with the value 1. The character map generation is visualized in Fig. 4a. We first shrink all character bounding boxes by fixing their center points and shortening the sides to a fourth of their original length. Then, the values of the pixels inside the shrunk character bounding boxes are set to their corresponding category indices and those outside are set to 0. If there are no character bounding box annotations, all values are set to \(-1\).
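The two steps above, normalizing vertices with Eqs. (1)-(2) and shrinking character boxes, can be sketched as follows; the array layout and helper names are assumptions.

```python
import numpy as np

def normalize_vertices(points, proposal, out_h, out_w):
    """Map polygon / character-box vertices into the H x W target map of a
    proposal, following Eqs. (1)-(2). `points` is a (K, 2) array of (x, y);
    `proposal` is (xmin, ymin, xmax, ymax)."""
    x0, y0, x1, y1 = proposal
    pts = np.asarray(points, dtype=np.float32)
    xs = (pts[:, 0] - x0) * out_w / (x1 - x0)
    ys = (pts[:, 1] - y0) * out_h / (y1 - y0)
    return np.stack([xs, ys], axis=1)

def shrink_box(box, factor=0.25):
    """Shrink a character box around its center so that each side becomes a
    fourth of its original length, as done for the character map targets."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = (x1 - x0) * factor, (y1 - y0) * factor
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```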

3.3 Optimization

As discussed in Sect. 3.1, our model includes multiple tasks. We naturally define a multi-task loss function:

$$\begin{aligned} L = L_{rpn} + \alpha _1 L_{rcnn} + \alpha _2 L_{mask}, \end{aligned}$$
(3)

where \(L_{rpn}\) and \(L_{rcnn}\) are the loss functions of RPN and Fast R-CNN, which are identical to those in [8, 40]. The mask loss \(L_{mask}\) consists of a global text instance segmentation loss \(L_{global}\) and a character segmentation loss \(L_{char}\):

$$\begin{aligned} L_{mask} = L_{global} + \beta L_{char}, \end{aligned}$$
(4)

where \(L_{global}\) is an average binary cross-entropy loss and \(L_{char}\) is a weighted spatial soft-max loss. In this work, \(\alpha _1\), \(\alpha _2\), and \(\beta \) are all empirically set to 1.0.

Text Instance Segmentation Loss. The output of the text instance segmentation task is a single map. Let N be the number of pixels in the global map, \(y_n\) be the pixel label (\(y_n \in \{0,1\}\)), and \(x_n\) be the output pixel value; we define \(L_{global}\) as follows:

$$\begin{aligned} L_{global} = -\frac{1}{N}\sum _{n=1}^{N}\left[ y_n \times log(S(x_n)) + (1-y_n) \times log(1- S(x_n)) \right] \end{aligned}$$
(5)

where S(x) is a sigmoid function.
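Under the assumption that \(x_n\) are raw logits, Eq. (5) is simply the mean sigmoid binary cross-entropy, e.g.:

```python
import torch.nn.functional as F

def global_text_loss(logits, target):
    """Eq. (5): average binary cross-entropy over all pixels of the global
    map. `logits` are the raw outputs x_n; the sigmoid is applied internally."""
    return F.binary_cross_entropy_with_logits(logits, target.float(), reduction="mean")
```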

Character Segmentation Loss. The output of the character segmentation consists of 37 maps, which correspond to 37 classes (36 classes of characters and the background class). Let T be the number of classes, N be the number of pixels in each map. The output maps X can be viewed as an \(N \times T\) matrix. In this way, the weighted spatial soft-max loss can be defined as follows:

$$\begin{aligned} L_{char} = -\frac{1}{N}\sum _{n=1}^{N}W_n\sum _{t=0}^{T-1} Y_{n,t} log(\frac{e^{X_{n,t}}}{\sum _{k=0}^{T-1} e^{X_{n,k}}}), \end{aligned}$$
(6)

where Y is the corresponding ground truth of X. The weight W is used to balance the losses of the positives (character classes) and the background class. Let \(N_{neg}\) be the number of background pixels and 0 be the background class index; the weights can be calculated as:

$$\begin{aligned} W_i = {\left\{ \begin{array}{ll} 1&{} \text {if } Y_{i,0}=1, \\ N_{neg} / (N - N_{neg})&{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(7)
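A possible PyTorch sketch of Eqs. (6)-(7) for a single RoI is given below; the handling of unannotated RoIs (target filled with \(-1\)) follows Sect. 3.2, and the tensor layout is an assumption.

```python
import torch
import torch.nn.functional as F

def char_seg_loss(logits, target):
    """Weighted spatial soft-max loss, Eqs. (6)-(7), for one RoI.
    logits: (T, H, W) with T = 37 classes; target: (H, W) with values in
    {0..36}, or -1 everywhere when no character annotation exists."""
    if (target < 0).all():                      # unannotated RoI: contributes no loss
        return logits.sum() * 0.0
    T = logits.shape[0]
    logits = logits.reshape(T, -1).t()          # (N, T), N = number of pixels
    target = target.reshape(-1).long()
    n = target.numel()
    n_neg = (target == 0).sum().float()         # background pixels
    n_pos = (target > 0).sum().clamp(min=1).float()
    # Eq. (7): weight 1 for background pixels, N_neg / (N - N_neg) for characters
    weights = torch.where(target == 0, torch.ones_like(n_neg), n_neg / n_pos)
    per_pixel = F.cross_entropy(logits, target, reduction="none")
    return (weights * per_pixel).sum() / n
```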

Note that in inference, a sigmoid function and a soft-max function are applied to generate the global map and the character segmentation maps respectively.

3.4 Inference

Different from the training process, where the input RoIs of the mask branch come from the RPN, in the inference phase we use the outputs of Fast R-CNN as proposals to generate the predicted global maps and character maps, since the Fast R-CNN outputs are more accurate.

Specifically, the inference process is as follows: first, given a test image, we obtain the outputs of Fast R-CNN as in [40] and filter out the redundant candidate boxes by NMS; then, the kept candidate boxes are fed into the mask branch to generate the global maps and the character maps; finally, the predicted polygons are obtained directly by calculating the contours of text regions on the global maps, and the character sequences are generated by our proposed pixel voting algorithm on the character maps.
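As a sketch of the last step, the contours of a predicted global map can be extracted with OpenCV and mapped back into image coordinates; the binarization threshold and the OpenCV (>= 4) API are assumptions here, not details from the paper.

```python
import cv2
import numpy as np

def global_map_to_polygon(global_map, box, threshold=0.5):
    """Extract a text polygon from the predicted global map of one candidate
    box; `threshold` is an assumed binarization value. Requires OpenCV >= 4."""
    mask = (global_map > threshold).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(np.float32)
    # map mask coordinates back to the candidate box in the original image
    x0, y0, x1, y1 = box
    contour[:, 0] = contour[:, 0] * (x1 - x0) / mask.shape[1] + x0
    contour[:, 1] = contour[:, 1] * (y1 - y0) / mask.shape[0] + y0
    return contour
```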

Pixel Voting. We decode the predicted character maps into character sequences with our proposed pixel voting algorithm. We first binarize the background map, whose values range from 0 to 255, with a threshold of 192. We then obtain all character regions from the connected regions in the binarized map. For each region, we calculate its mean value in every character map; these values can be regarded as the character class probabilities of the region, and the character class with the largest mean value is assigned to the region. Finally, we group all the characters from left to right, following English writing order.
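The following sketch illustrates the pixel voting step; the character ordering in CHARSET and the polarity of the background binarization are assumptions.

```python
import cv2
import numpy as np

CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz"     # class order is an assumption

def pixel_voting(char_maps, background_map):
    """Decode (36, H, W) character probability maps into a string.
    `background_map` is in [0, 255]; pixels below 192 are assumed to belong
    to character regions."""
    binary = (background_map < 192).astype(np.uint8)
    num_labels, labels = cv2.connectedComponents(binary)
    regions = []
    for r in range(1, num_labels):                    # label 0 is the background
        ys, xs = np.where(labels == r)
        scores = char_maps[:, ys, xs].mean(axis=1)    # mean probability per class
        regions.append((xs.mean(), CHARSET[int(scores.argmax())]))
    # group the characters from left to right
    return "".join(ch for _, ch in sorted(regions))
```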

Weighted Edit Distance. Edit distance can be used to find the best-matched word for a predicted sequence given a lexicon. However, multiple words may be matched with the same minimal edit distance, and the algorithm cannot decide which one is the best. The main reason for this issue is that all operations (delete, insert, replace) in the original edit distance algorithm have the same cost, which is often unreasonable in practice.

Fig. 5. Illustration of the edit distance and our proposed weighted edit distance. The red characters are the characters to be deleted, inserted, and replaced. Green characters are the candidate characters. \(p_{index}^c\) is the character probability, where index is the character index and c is the current character. (Color figure online)

Inspired by [51], we propose a weighted edit distance algorithm. As shown in Fig. 5, different from the standard edit distance, which assigns the same cost to all operations, the costs in our weighted edit distance depend on the character probability \(p_{index}^c\) yielded by pixel voting. Mathematically, the weighted edit distance between two strings a and b, whose lengths are |a| and |b| respectively, can be described as \(D_{a,b}(|a|,|b|)\), where

$$\begin{aligned} D_{a,b}(i,j)={\left\{ \begin{array}{ll} \max (i,j) &{} \text { if } \min (i,j)=0,\\ \min {\left\{ \begin{array}{l} D_{a,b}(i-1,j)+C_d\\ D_{a,b}(i,j-1)+C_i\\ D_{a,b}(i-1,j-1)+C_r \times 1_{(a_{i}\ne b_{j})}\end{array}\right. } &{} \text { otherwise,}\end{array}\right. } \end{aligned}$$
(8)

where \(1_{(a_{i}\ne b_{j})}\) is the indicator function, equal to 0 when \(a_{i}=b_{j}\) and equal to 1 otherwise; \(D_{a,b}(i,j)\) is the distance between the first i characters of a and the first j characters of b; \(C_d\), \(C_i\), and \(C_r\) are the deletion, insertion, and replacement costs respectively. In contrast, these costs are all set to 1 in the standard edit distance.
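The recurrence in Eq. (8) can be evaluated with standard dynamic programming; below is a sketch in which the three costs are supplied as callbacks (this is where the character probabilities from pixel voting would enter), and with all costs fixed to 1 it reduces to the standard edit distance.

```python
def weighted_edit_distance(a, b, cost_delete, cost_insert, cost_replace):
    """Dynamic-programming evaluation of Eq. (8). The cost callbacks return
    C_d, C_i and C_r for a given position/character; how they are derived
    from the pixel-voting probabilities is left as an assumption."""
    la, lb = len(a), len(b)
    D = [[0.0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        D[i][0] = float(i)                    # min(i, j) = 0 base cases
    for j in range(lb + 1):
        D[0][j] = float(j)
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            replace = 0.0 if a[i - 1] == b[j - 1] else cost_replace(i - 1, a[i - 1], b[j - 1])
            D[i][j] = min(D[i - 1][j] + cost_delete(i - 1, a[i - 1]),
                          D[i][j - 1] + cost_insert(j - 1, b[j - 1]),
                          D[i - 1][j - 1] + replace)
    return D[la][lb]

# With unit costs this reduces to the standard edit distance:
print(weighted_edit_distance("text", "test",
                             lambda i, c: 1.0, lambda j, c: 1.0,
                             lambda i, c, d: 1.0))   # 1.0
```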

4 Experiments

To validate the effectiveness of the proposed method, we conduct experiments and compare with other state-of-the-art methods on three public datasets: a horizontal text set ICDAR2013 [25], an oriented text set ICDAR2015 [24] and a curved text set Total-Text [4].

4.1 Datasets

SynthText. is a synthetic dataset proposed by [12], containing about 800,000 images. Most of the text instances in this dataset are multi-oriented and annotated with word-level and character-level rotated bounding boxes, as well as text sequences.

ICDAR2013. is a dataset proposed in Challenge 2 of the ICDAR 2013 Robust Reading Competition [25] which focuses on the horizontal text detection and recognition in natural images. There are 229 images in the training set and 233 images in the test set. Besides, the bounding box and the transcription are also provided for each word-level and character-level text instance.

ICDAR2015. is proposed in Challenge 4 of the ICDAR 2015 Robust Reading Competition [24]. Compared to ICDAR2013, which focuses on "focused text" in particular scenarios, ICDAR2015 is more concerned with incidental scene text detection and recognition. It contains 1000 training images and 500 test images. All training images are annotated with word-level quadrangles as well as corresponding transcriptions. Note that only localization annotations of words are used in our training stage.

Total-Text. is a comprehensive scene text dataset proposed by [4]. In addition to horizontal and oriented text, Total-Text also contains a large amount of curved text. It consists of 1255 training images and 300 test images. All images are annotated with word-level polygons and transcriptions. Note that we only use the localization annotations in the training phase.

4.2 Implementation Details

Training. Different from previous text spotting methods which use two independent models [22, 30] (the detector and the recognizer) or an alternating training strategy [27], all subnets of our model can be trained synchronously and end-to-end. The whole training process contains two stages: pre-training on SynthText and fine-tuning on real-world data.

In the pre-training stage, we set the mini-batch size to 8, and the shorter edges of the input images are resized to 800 pixels while keeping the aspect ratio. The batch sizes of RPN and Fast R-CNN are set to 256 and 512 per image with a 1:3 sample ratio of positives to negatives. The batch size of the mask branch is 16. In the fine-tuning stage, data augmentation and multi-scale training are applied due to the lack of real samples. Specifically, for data augmentation, we randomly rotate the input images within an angle range of \([-15^\circ , 15^\circ ]\). Some other augmentation tricks, such as randomly modifying the hue, brightness, and contrast, are also used following [33]. For multi-scale training, the shorter sides of the input images are randomly resized to one of three scales (600, 800, 1000). Besides, following [27], an extra 1162 images for character detection from [56] are also used as training samples. The mini-batch size is kept at 8, and in each mini-batch the sample ratio of different datasets is set to 4:1:1:1:1 for SynthText, ICDAR2013, ICDAR2015, Total-Text and the extra images respectively. The batch sizes of RPN and Fast R-CNN are kept as in the pre-training stage, and that of the mask branch is set to 64 when fine-tuning.
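A small helper for the shorter-side resizing used in multi-scale training might look like the sketch below; the interpolation behavior of cv2.resize defaults is an assumption about implementation details not given in the paper.

```python
import random
import cv2

TRAIN_SCALES = (600, 800, 1000)   # shorter-side lengths for multi-scale training

def resize_shorter_side(image, scale=None):
    """Resize `image` so its shorter side equals `scale` (randomly chosen from
    TRAIN_SCALES if not given) while keeping the aspect ratio."""
    if scale is None:
        scale = random.choice(TRAIN_SCALES)
    h, w = image.shape[:2]
    ratio = scale / float(min(h, w))
    return cv2.resize(image, (round(w * ratio), round(h * ratio)))
```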

We optimize our model using SGD with a weight decay of 0.0001 and momentum of 0.9. In the pre-training stage, we train our model for 170k iterations, with an initial learning rate of 0.005; the learning rate is decayed by a factor of 10 at the 120k-th iteration. In the fine-tuning stage, the initial learning rate is set to 0.001 and then decreased to 0.0001 at the 40k-th iteration. The fine-tuning process is terminated at the 80k-th iteration.

Inference. In the inference stage, the scales of the input images depend on different datasets. After NMS, 1000 proposals are fed into Fast R-CNN. False alarms and redundant candidate boxes are filtered out by Fast R-CNN and NMS respectively. The kept candidate boxes are input to the mask branch to generate the global text instance maps and the character maps. Finally, the text instance bounding boxes and sequences are generated from the predicted maps.

We implement our method in Caffe2 and conduct all experiments on a regular workstation with Nvidia Titan Xp GPUs. The model is trained in parallel and evaluated on a single GPU.

4.3 Horizontal Text

We evaluate our model on ICDAR2013 dataset to verify its effectiveness in detecting and recognizing horizontal text. We resize the shorter sides of all input images to 1000 and evaluate the results on-line.

The results of our model are listed and compared with other state-of-the-art methods in Tables 1 and 3. As shown, our method achieves state-of-the-art results in detection, word spotting and end-to-end recognition. Specifically, for detection, though evaluated at a single scale, our method outperforms some previous methods which are evaluated in a multi-scale setting [16, 18] (F-Measure: \(91.7\%\) vs. \(90.3\%\)); for word spotting, our method is comparable to the previous best method; for end-to-end recognition, although impressive results have been achieved by [27, 30], our method still surpasses them by \(1.1\%\) to \(1.9\%\).

4.4 Oriented Text

We verify the superiority of our method in detecting and recognizing oriented text by conducting experiments on ICDAR2015. We input the images at three different scales: the original scale (\(720 \times 1280\)) and two larger scales where the shorter sides of the input images are 1000 and 1600, since ICDAR2015 contains many small text instances. We evaluate our method on-line and compare it with other methods in Tables 2 and 3. Our method outperforms previous methods by a large margin in both detection and recognition. For detection, when evaluated at the original scale, our method achieves an F-Measure of \(84\%\), \(3.0\%\) higher than the current best one [16], which is evaluated at multiple scales. When evaluated at a larger scale, a more impressive result can be achieved (F-Measure: \(86.0\%\)), outperforming the competitors by at least \(5.0\%\). Besides, our method also achieves remarkable results on word spotting and end-to-end recognition. Compared with the state of the art, our method achieves significant improvements of \(13.2\%\) to \(25.3\%\) across all evaluation settings.

Table 1. Results on ICDAR2013. “S”, “W” and “G” mean recognition with strong, weak and generic lexicon respectively.
Table 2. Results on ICDAR2015. “S”, “W” and “G” mean recognition with strong, weak and generic lexicon respectively.
Fig. 6. Visualization results on ICDAR 2013 (left), ICDAR 2015 (middle) and Total-Text (right).

Table 3. The detection results on ICDAR2013 and ICDAR2015. For ICDAR2013, all methods are evaluated under the “DetEval” evaluation protocol. The short sides of the input image in “Ours (det only)” and “Ours” are set to 1000.
Fig. 7. Qualitative comparisons on Total-Text without lexicon. Top: results of TextBoxes [30]; bottom: results of our method.

4.5 Curved Text

Detecting and recognizing text of arbitrary shapes (e.g. curved text) is a significant advantage of our method over other methods. We conduct experiments on Total-Text to verify the robustness of our method in detecting and recognizing curved text. Similarly, we input the test images with the shorter edges resized to 1000. The evaluation protocol of detection is provided by [4]. The evaluation protocol of end-to-end recognition follows that of ICDAR 2015, while changing the representation of polygons from four vertices to an arbitrary number of vertices in order to handle polygons of arbitrary shapes.

Table 4. Results on Total-Text. “None” means recognition without any lexicon. “Full” lexicon contains all words in the test set.

To compare with other methods, we also train a model using the released code of [30] with the same training data. As shown in Fig. 7, our method has a large superiority in both detection and recognition for curved text. The results in Table 4 show that our method exceeds [30] by 8.8 points in detection and by at least \(16.6\%\) in end-to-end recognition. The significant improvements in detection mainly come from the more accurate localization outputs, which encircle the text regions with polygons rather than horizontal rectangles. Besides, our method is more suitable for handling sequences in 2-D space (such as curves), while the sequence recognition networks used in [3, 27, 30] are designed for 1-D sequences.

4.6 Speed

Compared to previous methods, our proposed method exhibits a good speed-accuracy trade-off. It can run at 6.9 FPS with an input scale of \(720 \times 1280\). Although a bit slower than the fastest method [3], it exceeds [3] by a large margin in accuracy. Moreover, our speed is about 4.4 times that of [27], which is the current state of the art on ICDAR2013.

4.7 Ablation Experiments

Some ablation experiments, including “With or without character maps”, “With or without character annotation”, and “With or without weighted edit distance”, are discussed in the Supplementary.

5 Conclusion

In this paper, we propose a text spotter which detects and recognizes scene text in a unified network and can be trained completely end-to-end. Compared with previous methods, our proposed network is easy to train and has the ability to detect and recognize irregular text (e.g. curved text). The impressive performance on all the datasets, which include horizontal, oriented and curved text, demonstrates the effectiveness and robustness of our method for text detection and end-to-end text recognition.