1 Introduction

Scene text recognition has attracted renewed interest in recent years due to the wide use of intelligent devices such as smart phones. In a typical application, for example, one needs to recognize text in an image captured by a mobile phone and translate the text into other languages. As a result, numerous researchers have devoted themselves to the research of scene text recognition [1].

However, scene text recognition is a challenging problem due to the complexity of scene images, such as background clutter, illumination changes, and variations in text position, size, font, color, and line orientation. Figure 1 shows some scene images that contain text. We can see that, unlike the recognition of text in printed documents (also called Optical Character Recognition, OCR), which contain clean and well-formatted text, scene text detection and recognition is a more challenging problem.

Fig. 1. Some scene images that contain text.

In scene text recognition, there are four main problems: (1) text detection, (2) text recognition, (3) full image word recognition, and (4) isolated scene character recognition. Text detection is to locate text regions in an image, while text recognition, given text regions, usually refers to cropped word recognition. Full image word recognition usually includes both text detection and text recognition in an end-to-end scene text recognition system. Isolated character recognition is usually a basic component of a scene text recognition system. Hence the problem of isolated scene character recognition is also a fundamental problem and has attracted increasing attention recently.

Numerous methods have been proposed for these problems, and great progress has been achieved in the past decade. Ten years ago, two papers [2] and [3] addressed these problems and provided comprehensive surveys on text information extraction and camera-based document analysis. Recently, Zhang et al. [4] reviewed the text detection problem. More recently, Ye et al. [1] presented more general and extensive discussions on all the problems in scene text recognition.

However, we notice that some issues remain to be addressed. First, although [1] surveys the progress of scene word recognition and end-to-end recognition, the state-of-the-art of scene character recognition is ignored. Second, papers published in some recent venues such as ECCV 2014, ACCV 2014, ICIP 2014, and ICPR 2014 are not included in [1]. In fact, some papers in these conferences have renewed the state-of-the-art. Third, [1] describes the problems of scene word recognition and end-to-end recognition only briefly, and the methods proposed for these problems are not categorized in detail.

In this paper, we focus on the problems of scene character recognition and scene text recognition (mainly in the context of cropped word recognition), and present the state-of-the-art of these problems. We do not review the text detection problem or the end-to-end recognition problem due to space limitations. We review the most recently published papers on scene text recognition in ECCV 2014, ACCV 2014, ICIP 2014, and ICPR 2014. Specifically, we review the papers in two aspects: character feature representation and word recognition models, which are two important issues in scene text recognition.

The rest of the paper is organized as follows. In Sect. 2, we first review some public databases used for scene character and text recognition. In Sect. 3, we then introduce scene character recognition methods, focusing on the issue of character feature representation, and provide the state-of-the-art performance achieved so far. In Sect. 4, we review the problems of scene word recognition, focusing on the multiple information integration methods, and provide the state-of-the-art performance achieved so far. In Sect. 5, we conclude the paper with some discussion.

2 The Databases for Scene Character and Text Recognition

In this section, we summarize some publicly available datasets that are commonly used for scene character and text recognition. The most widely used datasets for scene character and/or text recognition include the ICDAR2003 dataset [5], the ICDAR2011 dataset [6], the Chars74K dataset [7], the Street View Text (SVT) dataset [8], and the IIIT5K-Word dataset [9]. Among them, the ICDAR2003 and ICDAR2011 datasets are used for the “Robust OCR”, “Robust Reading and Text Locating”, and “Robust Word Recognition” competitions organized jointly with the ICDAR main conference. Hence the ICDAR2003 and ICDAR2011 datasets contain natural scene images, from which word and character samples can be cropped. The original SVT dataset contains natural scene images and cropped words only. Later, Mishra et al. provided character level annotations for the test set of the SVT dataset [9]. The IIIT5K-Word dataset is composed of word images, and character level annotations are provided by the authors. The Chars74K dataset is composed of isolated scene character samples.

Besides the datasets mentioned above, there are also some other datasets for research on scene text detection and recognition (see [1] for a brief review). Since this paper focuses on the state-of-the-art of scene character and text recognition, we only introduce the datasets commonly used for evaluating scene character and text recognition.

The ICDAR2003 dataset contains 507 natural scene images in total, including 258 training images and 249 test images. There are 1,156 words (including 6,185 character samples) cropped from the training set of the ICDAR2003 dataset, and 1,107 words (including 5,379 character samples) cropped from its test set. The ICDAR2011 dataset contains 229 images for training (including 846 words) and 255 images for testing (including 1,189 words); for this dataset, only the words in the images can be cropped because the images are annotated at the word level only. The SVT dataset is composed of 100 training images (including 258 words) and 249 test images (including 647 words that contain 3,796 character samples). The IIIT5K-Word dataset is the largest and most challenging dataset for word recognition so far. This dataset includes 5,000 word images, where 2,000 images are used for training and 3,000 images for testing. The Chars74K dataset contains nearly 74 thousand scene character samples.

In summary, for evaluating scene character recognition methods, the ICDAR2003 and Chars74K datasets are commonly used. For evaluating scene text recognition methods, the ICDAR2003, ICDAR2011, SVT, and IIIT5K-Word datasets are usually used.

3 The State-of-the-Art Scene Character Recognition Methods

For scene character recognition, two important issues may affect performance: character feature representation methods and character classification methods. However, much more attention has been paid to feature representation. For character classification, the support vector machine (SVM) classifier (with a linear, RBF, or chi-square kernel) is one of the most popular choices. Some other classifiers, such as the random ferns classifier (FERNS) [8], the nearest neighbor (NN) classifier [8], random forests [10], and the convolutional neural network (CNN) classifier [11], have also been adopted.

Since much more attention has been paid to character feature representation methods, in the following we mainly review the papers related to feature representation. We categorize the existing methods into three main kinds: HOG and its variants, mid-level character structural features, and deep learning based methods. Table 1 shows the state-of-the-art performance achieved so far. From the results, we can see that the deep learning based feature learning methods achieve the highest performance.

Table 1. The state-of-the-art methods and their results for scene character recognition (%).

3.1 HOG and Its Variants

The Histograms of Oriented Gradients (HOG) features have been shown to be effective and have been used in object detection [24], and for scene character feature representation [8, 25]. Although HOG is very simple and is effective in describing local features (such as edges), HOG ignores the spatial and structural information. Hence some methods are proposed to improve HOG. For example, Yi et al. [22] improve the HOG features by global sampling (called GHOG) or local sampling (called LHOG) to better model character structures. Tian et al. [20] propose the Co-occurrence of Histogram of Oriented Gradients (called CoHOG) features, which capture the spatial distribution of neighboring orientation pairs instead of only a single gradient orientation, for scene text recognition. The CoHOG method improves HOG significantly in scene character recognition. Later, the authors of [20] propose the pyramid of HOG (called PHOG) [17] to encode the relative spatial layout of the character parts, and propose the convolutional CoHOG (called ConvCoHOG) to extract richer character features. These methods effectively improve the performance of scene character recognition.
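To make the basic HOG idea concrete, the following is a minimal sketch, not the exact descriptor of [24] or the variants above: gradients are computed over the image, and each cell accumulates a magnitude-weighted histogram of unsigned gradient orientations. The cell size and bin count here are illustrative choices only.

```python
import numpy as np

def hog_features(img, cell=8, bins=9):
    """Minimal HOG sketch: per-cell histograms of gradient orientations.
    `img` is a 2-D float array; cell size and bin count are illustrative."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx) % np.pi  # unsigned orientation in [0, pi)
    h, w = img.shape
    feats = []
    for y in range(0, h - cell + 1, cell):
        for x in range(0, w - cell + 1, cell):
            m = mag[y:y + cell, x:x + cell].ravel()
            o = ori[y:y + cell, x:x + cell].ravel()
            hist, _ = np.histogram(o, bins=bins, range=(0, np.pi), weights=m)
            feats.append(hist / (np.linalg.norm(hist) + 1e-6))  # L2 normalize
    return np.concatenate(feats)

char_img = np.random.rand(32, 32)          # toy character patch
f = hog_features(char_img)                 # 4x4 cells * 9 bins -> shape (144,)
```

Variants such as CoHOG replace the single-orientation histogram with counts of co-occurring orientation pairs at fixed spatial offsets, which is how the spatial distribution of gradients is captured.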

3.2 Mid-level Character Structural Features

Character structure information is important to character representation and has been exploited in [10, 19, 26–28, 30]. In [26–28], the authors propose to use part-based tree-structured features, which were originally designed for face detection [29], for representing character features. The part-based tree-structured features are designed directly according to the shape and structure of each character class. Yao et al. [10] propose to use a set of mid-level detectable primitives (called strokelets), which capture substructures of characters, for character representation. The strokelets are used in conjunction with the HOG features, as supplementary features, for character description. In [19], a discriminative feature pooling method that automatically learns the most informative sub-regions of each scene character is proposed for character feature representation. Zhang et al. [30] propose to use sparse coding based features for capturing character structures. The basic idea of [30] is to learn common structures with sparse coding and to capture character structures using histograms of sparse codes.

Recently, Gao et al. [18] propose a stroke bank based character representation method. The basic idea is to design a stroke detector for scene character recognition. In [16], Gao et al. propose to learn co-occurrence of local strokes by using a spatiality embedded dictionary, which is used to introduce more precise spatial information for character recognition. The results demonstrate the effectiveness of the two methods.

It is interesting to find that several of the character feature representation methods mentioned above explore mid-level features to describe character structures: the strokelets extracted by Yao et al. [10], the sub-regions learned in [19], the stroke bank designed in [18], and the sub-structures learned in [30] are all mid-level features. These learned mid-level features have shown their effectiveness in scene character/text recognition.

3.3 Deep Learning Based Methods

Deep learning methods have also been adopted for feature learning of scene characters. Coates et al. [14] propose an unsupervised feature learning method using convolutional neural networks (CNN) for scene character recognition. Recently, in ECCV 2014, Jaderberg et al. [12] developed a CNN classifier that can be used for both text detection and recognition. The CNN classifier has a novel architecture that enables efficient feature sharing by using a number of layers in common for character recognition. The performance achieved by Jaderberg et al. [12] on both scene character recognition and text recognition is the best among the existing methods so far (see Sect. 4 for the text recognition performance achieved by [12]).

4 The State-of-the-Art Scene Text Recognition Methods

Since the scene text recognition methods used in end-to-end recognition systems are similar to those used in cropped word recognition, in this paper we mainly focus on the state-of-the-art of cropped word recognition. In the following, we review the methods for cropped word recognition.

4.1 A Typical Word Recognition Procedure

In a typical word recognition system, there are mainly two steps. The first step is character detection, which aims to simultaneously detect and recognize characters. In this step, a 63-class (10 digits, 52 English letters, and the outlier/background class) classifier is used to obtain character candidates and classify them. For generating character candidates, two strategies have been used: one is the sliding window strategy (such as the work in [8, 11, 30], etc.), and the other is to detect character candidates using the character detector/classifier directly (such as the work in [10, 26], etc.). In this step, the character feature representation methods play an important role in scene text recognition, and the performance of character classification strongly affects the performance of scene text recognition.
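The sliding window strategy described above can be sketched as follows. The `classify_window` function is a hypothetical stand-in for a trained 63-class classifier (a real system would score HOG or CNN features); the window size, step, and threshold are illustrative.

```python
import numpy as np

NUM_CLASSES = 63  # 10 digits + 52 English letters + 1 outlier/background class

def classify_window(window):
    """Hypothetical stand-in for a trained 63-class character classifier:
    returns a score per class. A real system would use learned features."""
    rng = np.random.default_rng(int(window.sum() * 1000) % (2**32))
    return rng.random(NUM_CLASSES)

def sliding_window_candidates(img, win=32, step=8, bg_class=62, thresh=0.9):
    """Slide a fixed-size window across the word image; keep windows whose
    best non-background score exceeds a threshold."""
    h, w = img.shape
    candidates = []
    for x in range(0, w - win + 1, step):
        scores = classify_window(img[:, x:x + win])
        cls = int(np.argmax(scores[:bg_class]))  # best character class
        if scores[cls] > thresh:
            candidates.append((x, cls, float(scores[cls])))
    return candidates

word_img = np.random.rand(32, 128)  # toy cropped-word image
cands = sliding_window_candidates(word_img)
```

Each surviving window becomes a character candidate (position, class, score) that is passed on to the word formation step.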

The second step is the word formation step, which aims to combine character candidates to yield the word recognition result. In this step, multiple sources of information can be integrated to help improve the performance of scene text recognition. An information integration model, or word recognition model, can be used to integrate this information, which raises another important issue in scene text recognition. In the next subsection, we briefly review the word recognition models (or score functions/objective functions) in the literature. Figure 2 shows the scene text recognition procedure presented in [26], illustrating the results of the two steps.

Fig. 2. A typical scene text recognition procedure (images from [26]). In [26], a CRF model is used as the information integration model.

4.2 Information Integration Model/Word Recognition Model

Regarding the word recognition model for yielding word recognition results, Wang et al. [8] apply a lexicon-driven pictorial structures model to combine character detection scores and geometric constraints. Mishra et al. [25] build a conditional random field (CRF) model to integrate bottom-up cues (character detection scores) with top-down cues (lexicon prior). Similarly, Shi et al. [26] use a CRF model to obtain the final word recognition results. In [11, 12, 31–33], heuristic integration models (summation of character detection scores) are used to integrate character detection results. In [28], a probabilistic model is proposed to combine the character detection scores and a language model from the Bayesian decision view. In these works, the parameters of the word recognition model are set empirically.
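As a concrete illustration of the heuristic integration models mentioned above (summation of character detection scores, possibly combined with a language model term), the following sketch scores one candidate word interpretation. The weight `alpha` and the bigram table are hypothetical values, set empirically as in the cited works.

```python
import math

def word_score(char_scores, bigram_logp, alpha=0.5):
    """Heuristic integration sketch: sum the per-character detection scores
    and add a weighted bigram language-model term. `alpha` is illustrative."""
    chars = [c for c, _ in char_scores]
    detection = sum(s for _, s in char_scores)
    language = sum(bigram_logp.get((a, b), math.log(1e-4))  # unseen-pair floor
                   for a, b in zip(chars, chars[1:]))
    return detection + alpha * language

# Toy bigram log-probabilities (hypothetical values)
lm = {('h', 'e'): math.log(0.3), ('e', 'y'): math.log(0.1)}
s = word_score([('h', 0.9), ('e', 0.8), ('y', 0.7)], lm)
```

The word interpretation with the highest score is taken as the recognition result; CRF-based models replace this simple sum with unary and pairwise potentials.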

In [30], Zhang et al. apply a lexicon-driven pictorial structures model similar to that in [8] for word recognition. However, they improve it by taking into account the influence of the word length (i.e., the number of characters in the word) on word recognition results. Moreover, they propose to learn the parameters using the Minimum Classification Error (MCE) training method [34] to optimize scene text recognition. For searching for the optimal word as the recognition result, the dynamic programming algorithm is commonly used, such as in the work of [8, 12, 30], etc.
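The dynamic programming search can be sketched as follows: given left-to-right ordered candidate detections with per-class scores, it finds the best monotone assignment of a lexicon word's characters to candidates. This is a simplified version of the lexicon-driven matching in [8, 30]; real systems also include geometric terms between adjacent characters.

```python
def dp_word_match(word, cand_scores):
    """Match each character of `word` to a distinct, left-to-right ordered
    candidate detection, maximizing the summed scores (simplified DP).
    cand_scores[i][c] is the score of candidate i for character c."""
    n, m = len(word), len(cand_scores)
    NEG = float('-inf')
    # dp[j][i]: best score using the first j characters and first i candidates
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    dp[0] = [0.0] * (m + 1)  # zero characters matched: score 0
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            skip = dp[j][i - 1]          # candidate i-1 left unused
            take = dp[j - 1][i - 1]      # assign char j-1 to candidate i-1
            if take > NEG:
                take += cand_scores[i - 1].get(word[j - 1], NEG)
            dp[j][i] = max(skip, take)
    return dp[n][m]

# Toy candidates with per-class scores (hypothetical values)
cands = [{'c': 0.9, 'o': 0.2}, {'a': 0.8, 'o': 0.3}, {'t': 0.7}]
best = dp_word_match('cat', cands)  # 0.9 + 0.8 + 0.7
```

Running this over every lexicon word and keeping the maximum-scoring one implements the lexicon-driven search; the table can also be back-traced to recover which candidate was assigned to each character.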

4.3 Word Spotting Based Methods Versus Open Vocabulary Based Methods

For cropped word recognition, the existing scene text recognition methods can be categorized into two kinds: word spotting based methods [8, 11, 12, 30, 32, 35] and open vocabulary based methods [9, 10, 19, 25, 26, 31, 36–39]. For word spotting based methods, a lexicon is provided for each cropped word image, and the optimal word is the one yielding the maximum matching score. This is similar to a word spotting procedure. For open vocabulary based methods, a language prior or language model is estimated from a general, larger corpus.

Since the work of Wang et al. [8], most papers on scene text recognition report results using lexicons consisting of a list of words (either 50 words containing the ground truth word, or lexicons created from all the words in the test set, called Full lexicons). Accordingly, for open vocabulary based methods, one retrieves the word in the lexicon with the smallest edit distance to the raw recognition result, such as in [25, 26], etc.
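The edit distance based retrieval step can be sketched as follows, using the classic Levenshtein distance; the lexicon and raw recognition result here are toy examples.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (single-row version)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def retrieve(raw_result, lexicon):
    """Return the lexicon word closest to the raw recognition output."""
    return min(lexicon, key=lambda w: edit_distance(raw_result, w))

word = retrieve('hovse', ['house', 'horse', 'mouse'])  # 'house' (distance 1)
```

Ties are broken here by lexicon order; published systems may instead combine edit distance with recognition scores or a language prior.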

Table 2. The state-of-the-art performance of scene text recognition (%).

4.4 The State-of-the-art Performance of Scene Text Recognition

We show the state-of-the-art performance of scene text recognition in Table 2. In the table, SVT, I03, I11, and IIIT5K denote the SVT, ICDAR2003, ICDAR2011, and IIIT5K-Word datasets, respectively. At the end of each dataset name, the number “50” means using the lexicon consisting of 50 words; the word “Full” means using the lexicons created from the words of the test set; and the word “Med” means using the Medium lexicon provided by the authors of [9].

From the table, we can see that the PhotoOCR method presented in [33] reports the highest performance on SVT-50, achieving an accuracy of 90.3 % on SVT-50. On I03-50 and I03-Full, the method proposed in [12] performs the best, achieving accuracies of 96.2 % and 91.5 %, respectively. It is worth noting that both [33] and [12] adopt deep learning based methods, which demonstrates the potential advantages of such methods. Only a few works report performance on I11-50, I11-Full, IIIT5K-50, and IIIT5K-Med. On I11-50 and I11-Full, Shi et al. [28] report promising performance, achieving accuracies of 87.22 % and 83.21 %, respectively. On IIIT5K-50 and IIIT5K-Med, Yao et al. [10] report promising results, achieving accuracies of 80.2 % and 69.3 %, respectively.

5 Conclusions

This paper reviews the state-of-the-art of scene character and text recognition, with emphasis on character feature representation and word recognition models. The performance obtained by recently proposed methods on both scene character recognition and text recognition is reviewed, including the most recent papers in ECCV 2014, ACCV 2014, ICIP 2014, and ICPR 2014. From the reported results, we can see that the deep learning based methods achieve the highest performance, indicating that this type of method opens a new direction for scene character and text recognition. Character feature representation, as a basic component of scene character and text recognition systems, will also remain an important research direction in the future.