1 Introduction

The detection and recognition of text in outdoor images is of increasing research interest to the fields of computer vision, machine learning and optical character recognition. The combination of perspective distortion, uncontrolled source text quality, and lack of significant structure in the text layout adds extra challenge to the still incompletely solved problem of accurately recognizing text from all the world’s languages. Demonstrating this interest, several datasets related to the problem have become available, including ICDAR 2003 Robust Reading [11], SVHN [13] and, more recently, COCO-Text [16]; details of these and others are shown in Table 1.

While these datasets each make a useful contribution to the field, the majority are very small compared to the size of a typical deep neural network. As dataset size increases, it becomes increasingly difficult to maintain the accuracy of the ground truth, since the task of annotation must be delegated to an increasingly large pool of workers less involved with the project. In the COCO-Text [16] dataset for instance, the authors themselves audited the accuracy of the ground truth and found that the annotators had located legible text regions with a recall of 84 %, and transcribed the text content with an accuracy of 87.5 %. Even at an edit distance of 1, the text content accuracy was still only 92.5 %, with missing punctuation being the largest remaining category of error.

Synthetic data has been shown [8] to be a good solution to this problem and can work well, provided the synthetic data generator includes the formatting/distortions that will be present in the target problem. Some real-world data, however, by its very nature can be hard to predict, so real data remains the first choice in many cases where it is available.

The difficulty therefore remains of generating a sufficiently large and accurately annotated dataset of real images to satisfy the needs of modern data-hungry deep network-based systems, which can absorb as large a dataset as we can provide without necessarily giving back the generalization that we would like. To this end, and to make OCR more like image captioning, we present the French Street Name Signs (FSNS) dataset, which we believe to be the first to offer multiple views of the same physical object, and thus the chance for a learning system to compensate for degradation in any individual view.

Table 1. Datasets of outdoor images containing text, with ground truth at larger than single-character level. Information obtained mostly from the iapr-tc11.org website

2 Basics of the FSNS Dataset

As its name suggests, the FSNS dataset is a set of signs, from the streets of France, that bear street names. Some example images are shown in Fig. 1. Each image carries four tiles of \(150 \times 150\) pixels laid out horizontally, each of which contains a pre-detected street name sign, or random noise where fewer than four independent views of the same physical sign are available. The text detection problem is thus largely eliminated, although the signs are still of variable size and orientation within each tile image. Each sign also carries multiple text lines: a maximum of three lines of significant text, possibly accompanied by additional lines of irrelevant text. Each of the tiles within an image is intended to be a different view of the same physical sign, taken from a different position and/or at a different time. Different physical signs bearing the same street name, from elsewhere on the same street, are included as separate images. There are over 1 million different physical signs.

Fig. 1. Some examples of FSNS images

The different views are of different quality, possibly taken from an acute angle, or blurred by motion, distance from the camera, or by unintentional privacy filtering. Occasionally some of the tiles may be views of a different sign altogether, which can happen when two signs are attached to the same post. Some examples of these problems are shown in Fig. 2. The multiple views can reduce some of the usual problems of outdoor images, such as occlusion by foreground objects, image truncation caused by the target object being at the edge of the frame, and varied lighting. Other problems cannot be solved by multiple views, such as bent, corroded or faded signs.

The task of the system then is to obtain the best possible canonical text result by combining information from the multiple views, either by processing each tile independently and combining the results, or by combining information deep within the recognition system (most likely a deep network).

Fig. 2. Examples of blurring, obstruction, and incorrect spatial clustering

3 How the FSNS Dataset Was Created

The following process was used to create the FSNS dataset:

  1. A street-name-sign detector was applied to all Google Street View images from France. The detector returns an image rectangle around each street name sign, together with its geographic location (latitude and longitude).

  2. Multiple images of the same geographic location were gathered together (spatially clustered).

  3. Text from the signs was transcribed using a combination of reCAPTCHA [3], OCR and human operators.

  4. Transcribed text was presented to human operators to verify the accuracy of the transcription. Incorrect samples were re-routed for human transcription (back to step 3) or discarded if they were already the result of a human transcription.

  5. Images were bucketized geographically (by latitude/longitude) so that the train, validation, test and private test sets come from disjoint geographic locations, with 100 m wide strips of “wall” in between that are not used, to ensure that the same physical sign cannot be viewed from different sets.

  6. Since roads are long entities that may pass between the disjoint geographic sections, there may be multiple signs bearing the same street name at multiple locations in different subsets. Therefore, as each subset is generated, any images with truth strings that match a truth string in any previously generated subset are discarded, so each subset has a disjoint set of truth strings (a sketch of this step follows the list).

  7. All images for which the truth string included a character outside of the chosen encoding set, or for which the encoded label length exceeded the maximum of 37, were discarded. The character set to be handled is thus carefully controlled.
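As an illustration of step 6, the following is a minimal sketch of the truth-string deduplication across subsets (plain Python, written for this description; it is not the pipeline code used to build the dataset):

```python
# Illustrative sketch of step 6: as each subset is generated, drop any image
# whose truth string already appears in a previously generated subset.
# Duplicates *within* a subset are allowed (same street name, different sign).

def deduplicate_subsets(subsets):
    """subsets: list of (name, samples) in generation order, where samples is a
    list of (image, truth) pairs. Returns the filtered subsets."""
    seen = set()                      # truth strings used by earlier subsets
    result = []
    for name, samples in subsets:
        kept = [(image, truth) for image, truth in samples if truth not in seen]
        seen.update(truth for _, truth in kept)   # block these truths from later subsets
        result.append((name, kept))
    return result
```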

Note that the transcription was systematically folded to Title Case from the original transcription, in order to represent the way that the street name would appear on a map. This process also includes removal of irrelevant text, such as the district or building numbers.

4 Normalized Truth Text

The FSNS dataset is made more interesting by the fact that the truth text is a normalized representation of the name of the street, as it should be written on the map, instead of a simple direct transcription of the text on the sign. The main normalization is Title Case transformation of the text, which is often written on the sign in all upper case. Title Case is specified as follows:

The words: au, aux, de, des, du, et, la, le, les, sous, sur always appear in lower-case. The prefixes: d’, l’ always appear in lower-case. All other words, including suffixes after d’ and l’, always appear with the initial letter capitalized and the rest in lower-case.
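The rule above can be made concrete with a short sketch (plain Python, illustrative only; it is not the normalization code used to build the dataset, and edge cases such as hyphenated words are ignored):

```python
# A sketch of the Title Case rule described above (illustrative only).

LOWER_WORDS = {"au", "aux", "de", "des", "du", "et", "la", "le", "les", "sous", "sur"}
LOWER_PREFIXES = ("d’", "l’")   # apostrophe as it appears in the truth text

def title_case_word(word):
    lower = word.lower()
    if lower in LOWER_WORDS:
        return lower
    for prefix in LOWER_PREFIXES:
        if lower.startswith(prefix):
            rest = lower[len(prefix):]
            return prefix + rest[:1].upper() + rest[1:]
    return lower[:1].upper() + lower[1:]

def title_case(text):
    return " ".join(title_case_word(w) for w in text.split())

# e.g. title_case("RUE DE LA GARE") == "Rue de la Gare"
```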

The other main normalization is that some text on the sign, which is not part of the name of the street, is discarded. Although this seems a rather vague instruction, it becomes easy for a human, even one without knowledge of French, after reading a few signs: the actual street names fit a reasonably obvious pattern, and the extraneous text is usually in a smaller size.

Some examples of these normalizations between the sign and the truth text are shown in Fig. 3. The task of transcribing the signs is thus not a basic OCR problem, but perhaps somewhat more like image captioning [17], in that it requires an interpretation of what the sign means, not just its literal content. A researcher working with the FSNS dataset is thereby provided with a variety of design options, ranging from adding text post-processing to the output of an OCR engine to training a single network to learn the entire problem “end-to-end”.

Fig. 3. Examples of images with their normalized truth text

5 Details of the FSNS Dataset

The location of the FSNS dataset is documented in the README.md file. There are three disjoint subsets: Train, Validation and Test. Each contains images of fixed size, \(600\times 150\) pixels, containing 4 tiles of \(150\times 150\) laid out horizontally, and padded with random noise where fewer than 4 views are available.

The size and location of each subset are shown in Table 2, and some basic analysis of the word content of each subset is shown in Table 3. The analysis in Table 3 excludes frequent words (those with a frequency >100 in the Train set) and the words listed in Sect. 4 as lower-case. As might be expected, given the process by which the subsets have been made disjoint, the fraction of words in each subset that are out of vocabulary with respect to the Train subset is reasonably high, at around 30 %. Such a rate of out-of-vocabulary words will also make it difficult for a system to learn the full vocabulary from the Train set.

Table 2. Location and size of each subset of the FSNS dataset
Table 3. Word counts in each subset, excluding ‘stop’ words (prefixes with a frequency >100 and the lower-cased words), and the number out of vocabulary (OOV) with respect to (wrt) the words in the Train subset

Each subset is stored as multiple TFRecords files of tf.train.Example protocol buffers, which makes them ready-made for input to TensorFlow [1, 4]. The Example protocol buffer is very flexible, so the full details of the content of each example are laid out in Table 4.

Note that the ultimate goal of a machine learning system is to produce the UTF-8 string in “image/text”. That may be achieved either by learning the byte sequences of the text field directly, or by using the pre-encoded mapping to integer class-ids provided in “image/class” and “image/unpadded_class”. The mapping between these class-ids and the UTF-8 text is provided in a separate file, charset_size=134.txt. Each line in that file lists a class-id, a tab character, and the UTF-8 string that the class-id represents. Class-id 0 represents a space, and the last class-id, 133, represents the “null” character, as used by the Connectionist Temporal Classification (CTC) alignment algorithm [5] typically used with an LSTM network. Note that some class-ids map to multiple UTF-8 strings, as some normalization has been applied, such as folding all the different shapes of double quote to the same class.
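As an illustration, a minimal TensorFlow 2.x sketch of parsing one example follows. Only “image/text”, “image/class”, “image/unpadded_class” and “image/orig_width” are named in this section; the image field name, its encoding and the shapes used here are assumptions, and Table 4 is the authoritative reference:

```python
import tensorflow as tf

# Field names/shapes marked "assumed" are illustrative; see Table 4 for the
# actual contents of each example proto.
feature_spec = {
    "image/encoded": tf.io.FixedLenFeature([], tf.string),   # assumed name/encoding
    "image/text": tf.io.FixedLenFeature([], tf.string),      # normalized truth text
    "image/class": tf.io.FixedLenFeature([37], tf.int64),    # class-ids, assumed padded to 37
    "image/orig_width": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    f = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_png(f["image/encoded"], channels=3)  # 600x150, format assumed
    num_views = f["image/orig_width"] // 150                  # real (non-noise) tiles
    return image, f["image/text"], f["image/class"], num_views

dataset = tf.data.TFRecordDataset(["train-00000-of-00512"])   # illustrative file name
dataset = dataset.map(parse_example)
```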

The ground truth text in the FSNS dataset uses a subset of these characters. In addition to all digits and upper- and lower-case A–Z, there are the following accented characters: à À â Â ä ç Ç é É è È ê Ê ë Ë î Î ï ô Ô œ ù Ù û Û ü ÿ and these punctuation symbols: < = _ - , ; ! ? / . ’ " ( ) ] & +, a total of 109 characters, including space.

For systems that process the multiple views separately, it is possible to avoid processing the noise padding. The number of real, non-noise views of a sign is given by the value of the field “image/orig_width” divided by 150.

No sample in any of the subsets has a text field that encodes to more than 37 class-ids. The value 37 is not a completely arbitrary choice: when padded with nulls between each label for CTC \((2\times 37+1=75)\), the label sequences are no longer than half the width \((150/2=75)\) of a single input view, which allows for some shrinkage of the data width in the network.

Table 4. The content of each example proto in the TFRecords files

6 The Challenge

The FSNS dataset provides a rich and interesting challenge in machine learning, due to the variety of tasks that are required. Here is a summary of the different processes that a model needs to learn to discover the right solution:

  • Locating the lines of text on the sign within each image.

  • Recognizing the text content within each line.

  • Discarding irrelevant text.

  • Title Case normalization.

  • Combining data from multiple signs, ignoring data from blurred or inconsistent signs.

None of the above is an explicit goal of the challenge. The current trend in machine learning is to build and train a single large/deep network to solve all of a problem, rather than bolting additional algorithmic pieces onto one end or the other, or gluing trained components together [6, 17]. We believe that the FSNS dataset is large enough to train a single deep network to learn all of the above tasks, and we provide an example in Sect. 7. We therefore propose that a competition based on the FSNS dataset should measure:

  • Word recall: Fraction of space-delimited words in the truth that are present in the OCR output.

  • Word precision: Fraction of space-delimited words in the OCR output that are present in the truth.

  • Sequence error: Fraction of truth text strings that are not produced exactly by the network, after folding multiple spaces to single space.

Word recall and precision are almost universally used, and need no introduction. We add sequence error here because the strings are short enough that we can expect a significant number of them to be completely correct. Using only these metrics allows for end-to-end systems to compete directly against systems built from smaller components that are designed for specific sub-problems.
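A minimal sketch of the three measures follows (plain Python), assuming simple whitespace tokenization and counting words with multiplicity; the competition's exact matching rules could differ:

```python
import re
from collections import Counter

def _words(s):
    return re.sub(r"\s+", " ", s.strip()).split(" ") if s.strip() else []

def word_recall(truth, ocr):
    t, o = Counter(_words(truth)), Counter(_words(ocr))
    return sum(min(t[w], o[w]) for w in t) / max(1, sum(t.values()))

def word_precision(truth, ocr):
    t, o = Counter(_words(truth)), Counter(_words(ocr))
    return sum(min(t[w], o[w]) for w in o) / max(1, sum(o.values()))

def sequence_error(truths, outputs):
    fold = lambda s: re.sub(r" +", " ", s).strip()   # fold multiple spaces
    wrong = sum(fold(t) != fold(o) for t, o in zip(truths, outputs))
    return wrong / max(1, len(truths))
```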

7 An End-to-End Solution

We now describe a TensorFlow graph that has been designed specifically to address the Challenge end-to-end, using just the graph, with no algorithmic components. This means that the text line finding and the handling of multiple views, including cases where there are fewer than four, are entirely learned and dealt with inside the network. Instead of using the orig_width field in the dataset, the images are input at a fixed size and the random padding informs the network of the lack of useful content. The network is based on the design that has been shown to work well for many languages in Tesseract [14], with some extensions to handle the multi-line, multi-tile FSNS dataset. The design is named Street-name Tensor-flow Recurrent End-to-End Transcriber (STREET). To perform the tasks listed above, the graph has the purposeful high-level structure shown in Fig. 4.

Fig. 4. High-level structure of the network graph

Conventional convolutional layers process the images to extract features. Since each view may contain up to three lines of text, the next step is intended to allow the network to find up to three text lines and recognize the text in each separately. The text may appear in different positions within each image, so some character position normalization is also required. Only then can the individual outputs be combined to produce a single target string. These components of the end-to-end system are described in detail below. TensorFlow code for the STREET model described in this paper is available in the TensorFlow GitHub repository.

7.1 Convolutional Feature Extraction

The input image, being \(600\,\times \,150\), is de-tiled to make the input a batch of 4 images of size \(150\times 150\). This is achieved by a generic reshape, which is a combination of TensorFlow reshape and transpose operations that split one dimension of the input tensor and map the split parts to other dimensions. Two convolutional layers are then used with max pooling, with the expectation that they will find edges, and combine them into features, as well as reduce the size of the image down to \(25\times 25\). Figure 5 shows the detail of the convolutions.

Fig. 5. Convolutional feature extraction and size reduction
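A minimal TensorFlow 2.x sketch of this stage follows; the filter counts and pooling factors are assumptions chosen only to reach the \(25\times 25\) feature map stated above, and Fig. 5, Table 5 and the released STREET code give the actual sizes:

```python
import tensorflow as tf

def detile(images):
    # images: [batch, 150, 600, 3] -> [batch * 4, 150, 150, 3]
    b = tf.shape(images)[0]
    x = tf.reshape(images, [b, 150, 4, 150, 3])   # split the width into 4 tiles
    x = tf.transpose(x, [0, 2, 1, 3, 4])          # bring the tile index next to batch
    return tf.reshape(x, [b * 4, 150, 150, 3])

# Filter counts and pooling factors are illustrative assumptions only.
conv_stack = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=3, strides=3),   # 150 -> 50
    tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=2, strides=2),   # 50 -> 25
])

features = conv_stack(detile(tf.zeros([8, 150, 600, 3])))  # -> [32, 25, 25, 32]
```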

7.2 Textline Finding and Reading

Vertically summarizing Long Short-Term Memory (LSTM) [7] cells are used to find text lines. Summarizing with an LSTM, inspired by the LSTM used for sequence-to-sequence translation [15], involves ignoring the outputs of all timesteps except the last. A vertically summarizing LSTM is a summarizing LSTM that scans the input vertically; it is thus expected to compute a vertical summary of its input, which is taken from the last vertical timestep. Each x-position is treated independently. Three different vertical summarizations are used:

  1. Upward, to find the top textline.

  2. Separate upward and downward LSTMs, with depth-concatenated outputs, to find the middle textline.

  3. Downward, to find the bottom textline.

Although each vertically summarizing LSTM sees the same input, and could theoretically summarize the entirety of what it sees, they are organized this way so that each only has to produce a summary of the most recently seen information. Since the middle line is harder to find, it gets two LSTMs working in opposite directions. Each receives a copy of the output from the convolutional layers and passes its output to a separate bi-directional horizontal LSTM to recognize the text. Bidirectional LSTMs have been shown to be able to read text with high accuracy [2]. The outputs of the bi-directional LSTMs are concatenated in the x-dimension, to string the text lines out in reading order. Figure 6 shows the details.

Fig. 6. Text line finding and reading
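A minimal TensorFlow 2.x sketch of one vertically summarizing LSTM, and of the middle-line reader built from two of them, follows. Unit counts are assumptions (see Table 5), and it assumes y index 0 is the top row of the feature map:

```python
import tensorflow as tf

def vertical_summary(features, units, upward):
    # features: [n, height, width, depth] -> [n, width, units]
    # Scans each x-position along y and keeps only the final timestep's output.
    n = tf.shape(features)[0]
    h, w, d = features.shape[1], features.shape[2], features.shape[3]
    x = tf.transpose(features, [0, 2, 1, 3])       # [n, width, height, depth]
    x = tf.reshape(x, [n * w, h, d])               # one y-sequence per x-position
    # go_backwards scans bottom-up, i.e. the "upward" direction (y=0 is the top row)
    summary = tf.keras.layers.LSTM(units, go_backwards=upward)(x)
    return tf.reshape(summary, [n, w, units])

def middle_line_reader(features, units):
    # Depth-concatenate an upward and a downward summary, then read the line
    # with a bidirectional horizontal LSTM over the x-dimension.
    up = vertical_summary(features, units, upward=True)
    down = vertical_summary(features, units, upward=False)
    line = tf.concat([up, down], axis=-1)          # [n, width, 2 * units]
    reader = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(units, return_sequences=True))
    return reader(line)                            # [n, width, 2 * units]
```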

7.3 Character Position Normalization

Assuming that each network component so far has achieved what it was designed to do, we now have a batch of four sets of one to three lines of text, spread spatially across the x-dimension. Each of the four sign images in a batch may have the text positioned differently, due to different perspective within each sign image. It is therefore useful to give the network some ability to reshuffle the data along the x-dimension. To that end we provide two more LSTM layers, one scanning left-to-right across the x-dimension and the other right-to-left, as shown in Fig. 7. Instead of a bidirectional configuration, they operate as two distinct layers. This allows state information to be passed to the right or left along the x-dimension, so that the characters in each of the four views can be aligned.

Fig. 7. Character position normalization
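A minimal TensorFlow 2.x sketch of these two stacked (not bidirectional) layers, with the unit count an assumption (see Table 5):

```python
import tensorflow as tf

def position_normalize(seq, units=96):
    # seq: [n, width, depth] - the x-concatenated text-line outputs
    ltr = tf.keras.layers.LSTM(units, return_sequences=True)   # left-to-right pass
    rtl = tf.keras.layers.LSTM(units, return_sequences=True)   # right-to-left pass
    x = ltr(seq)
    x = tf.reverse(x, axis=[1])      # flip x so the second layer scans right-to-left
    x = rtl(x)
    return tf.reverse(x, axis=[1])   # restore left-to-right order
```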

7.4 Combination of Individual View Outputs

After giving the STREET network a chance to normalize the position of the characters along the x-dimension, a generic reshape is used to move the batch of 4 views into the depth dimension, which then becomes the input to a single unidirectional LSTM and the final softmax layer, as shown in Fig. 8. The main purpose of this last LSTM is to combine the four views for each sign to produce the most accurate result. If none of the preceding layers has done anything towards the Title Case normalization, this final LSTM layer is perfectly capable of learning to do that well.

Fig. 8. Combination of individual view outputs

The only regularization used is a 50 % dropout layer between the reshape that combines the four signs and the last LSTM layer. Details of each component of the STREET graph can be found in Table 5.
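A minimal TensorFlow 2.x sketch of this final stage follows, with the unit count an assumption (see Table 5); it assumes the four tiles of each sign are adjacent in the batch dimension, as produced by the de-tiling reshape sketched earlier:

```python
import tensorflow as tf

def combine_views(seq, num_classes=134, units=256):
    # seq: [batch * 4, width, depth] -> [batch, width, num_classes]
    n = tf.shape(seq)[0]
    w, d = seq.shape[1], seq.shape[2]
    x = tf.reshape(seq, [n // 4, 4, w, d])         # split the views out of the batch dim
    x = tf.transpose(x, [0, 2, 1, 3])              # [batch, width, 4, depth]
    x = tf.reshape(x, [n // 4, w, 4 * d])          # stack the 4 views along depth
    x = tf.keras.layers.Dropout(0.5)(x)            # the only regularization used
    x = tf.keras.layers.LSTM(units, return_sequences=True)(x)
    # Final classification layer over the 134 character classes; the softmax
    # itself is applied at inference (or folded into the CTC loss in training).
    return tf.keras.layers.Dense(num_classes)(x)
```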

Table 5. Size and computational complexity of the layers in the graph

8 Experiments and Results

As a baseline, Tesseract [14] was tested, but the FSNS dataset is extremely difficult for it. The best results were obtained from the LSTM-based engine in version 4.00, with the addition of pre-processing to locate the rectangle of the sign and invert the projective transformation, plus post-processing to Title Case the output to match the truth text, as well as combination of the highest-confidence results from the four views. Even with this help, Tesseract only achieves a word recall of 20–25 %. See Table 6. The majority of failure cases revolve around the textline finder, which includes noise connected components in lines, drops characters, or merges textlines. The main cause of these difficulties appears to be the tight line spacing, compressed characters, and tight border that appears on most signs.

The STREET model was trained using the CTC [5] loss function, with the Adam optimizer [10] in TensorFlow, a learning rate of \(2\times 10^{-5}\), and 40 parallel training workers. The error metrics outlined in Sect. 6 were used. The results are also shown in Table 6. They show that the model is somewhat over-trained, yet the results for validation, test and private test are very close, which suggests that these subsets are large enough to be a good reflection of the model’s true performance.
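A minimal TensorFlow 2.x sketch of the corresponding training step follows (single worker; the distributed 40-worker setup is omitted, and the model and label tensors are the hypothetical ones from the earlier sketches):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)

def train_step(model, images, labels, label_lengths):
    # labels: [batch, max_len] class-ids; label_lengths: [batch] true lengths.
    with tf.GradientTape() as tape:
        logits = model(images, training=True)                 # [batch, width, 134]
        logit_lengths = tf.fill([tf.shape(logits)[0]], tf.shape(logits)[1])
        loss = tf.reduce_mean(tf.nn.ctc_loss(
            labels=labels, logits=logits,
            label_length=label_lengths, logit_length=logit_lengths,
            logits_time_major=False, blank_index=133))        # class-id 133 is "null"
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```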

Table 6. Error rate results

Some examples of error cases are shown in Fig. 9. In the first example, the model can be confused by obstructions. On the second line, the model drops a small word, perhaps as not relevant. On the third line, a less frequent prefix is replaced by a more frequent one. In the final example, an accent is dropped.

Fig. 9. Some examples of error cases

9 Conclusion

The FSNS dataset provides an interesting machine learning challenge. We have shown that it is possible to obtain reasonable results for the entire task with a single end-to-end network, and the STREET network could easily be improved by applying common regularization approaches and/or changing the network structure. Alternatively, there are many other possible approaches that involve applying algorithmic or learned solutions to parts of the problem. Here are a few:

  • Detecting the position/orientation of the sign by image processing or even structure from motion methods, correcting the perspective, and applying a simple OCR engine.

  • Text line finding followed by OCR on individual text lines.

  • Detecting the worst sign(s) and discarding them, by blur detection, obstruction detection, contrast, or even determining that there is more than one physical sign in the image.

A comparison of these approaches against the end-to-end approach would be very interesting and provide useful information for the direction of future research.