1 Introduction

Recognizing noisy text is difficult for most OCR algorithms [16]. Many documents that need to be digitized contain spots, curled page corners, and wrinkles: the so-called noise. This often leads to recognition errors, whereas on a cleaned image the recognition accuracy can increase, in some cases up to 100%. The quality of character recognition also varies greatly depending on the text recognition, filtering, and image segmentation algorithms used [13].

Currently, there are many different solutions to this problem, but most of them either do not provide satisfactory results, or demand significant hardware resources and are highly time-consuming [3].

This article proposes an effective method for cleaning printed and handwritten texts of noise [2], based on the sequential combination of a convolutional neural network with the U-Net architecture and a multilayer perceptron.


2 The Concept of a Neural Network

Neural networks belong to the field of artificial intelligence and are based on attempts to reproduce the human nervous system. Their key capability is the ability to learn and to correct errors.

A mathematical neuron is a computational unit that receives input data, performs a calculation, and transmits the result further along the network.

Each neuron has inputs (x1, x2, x3, …, xn) through which it receives data, and stores the weights of its connections. When a neuron is activated, a nonlinear transformation (the activation function) of the weighted sum of its input values is computed. In other words, the output value of the neuron is calculated as:

$$ O = f_a \left( x_1 w_1 + x_2 w_2 + \cdots + x_n w_n \right) $$
(1)

where $x_1, \ldots, x_n$ are the input values of the neuron, $w_1, \ldots, w_n$ are its weight coefficients, and $f_a$ is the activation function. Training a neuron means adjusting its weights by some particular method.
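
As a minimal illustration, equation (1) can be written in a few lines of NumPy; the sigmoid activation below is only an assumed choice of $f_a$, since the text does not fix a particular activation function:

```python
import numpy as np

def sigmoid(z):
    # An assumed choice of activation function f_a.
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w):
    # Equation (1): activation of the weighted sum of the inputs.
    return sigmoid(np.dot(x, w))

x = np.array([0.5, -1.2, 3.0])   # input values x_1 ... x_n
w = np.array([0.8, 0.1, -0.4])   # weight coefficients w_1 ... w_n
print(neuron_output(x, w))       # output value O of the neuron
```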

We considered feedforward network architectures [11]: sequentially connected layers of mathematical neurons, where the input values of each layer are the output values of the neurons of the previous layer. When the network is activated, values are fed to the input layer and each layer is activated in turn; the result of network activation is the set of output values of the last layer.

Feedforward networks are usually trained with the backpropagation method [12] and its modifications. This method belongs to the supervised training methods and is itself a form of gradient descent. The output values of the network are subtracted from the desired values; the resulting error is propagated through the network in the opposite direction, and the weights are adjusted to bring the output of the network as close as possible to the desired values.
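
As a sketch of this training scheme, the following toy example trains a small feedforward network in Keras with plain stochastic gradient descent; the XOR data, layer sizes, and learning rate are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from tensorflow import keras

# A small feedforward network: two sequentially connected dense layers.
model = keras.Sequential([
    keras.layers.Dense(4, activation="sigmoid", input_shape=(2,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
# Backpropagation with gradient descent minimizes the error between the
# network outputs and the desired values.
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.5), loss="mse")

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype="float32")
y = np.array([[0], [1], [1], [0]], dtype="float32")  # desired XOR outputs
model.fit(x, y, epochs=2000, verbose=0)
print(model.predict(x, verbose=0).round(2))
```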

3 Overview of Existing Methods

To date, classical computer vision algorithms remain the most popular choice for cleaning images of noise.

One way to eliminate dark spots and other similar artifacts in an image is adaptive thresholding [8]. This operation does not binarize the image with a constant threshold but takes the values of neighboring pixels into account, so the spot areas are removed. However, after the threshold transformation some noise remains in the image as a scattering of small dots where the spots were (some parts of a spot exceed the threshold). The result of adaptive thresholding is shown in Fig. 1.

Fig. 1. Result of adaptive threshold
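
A minimal OpenCV sketch of this step; the file name and the neighborhood parameters are illustrative assumptions:

```python
import cv2

img = cv2.imread("noisy_page.png", cv2.IMREAD_GRAYSCALE)
binary = cv2.adaptiveThreshold(
    img, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,  # threshold from a Gaussian-weighted neighborhood
    cv2.THRESH_BINARY,
    11,  # blockSize: size of the pixel neighborhood
    8,   # C: constant subtracted from the weighted mean
)
cv2.imwrite("adaptive_threshold.png", binary)
```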

Applying erosion and dilation filters in succession [9] helps to remove this effect, but such operations can damage the letters of the text. This is shown in Fig. 2.

Fig. 2. Erosion and dilation filters applied
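
A sketch of this morphological clean-up, again with assumed kernel size and iteration counts:

```python
import cv2
import numpy as np

binary = cv2.imread("adaptive_threshold.png", cv2.IMREAD_GRAYSCALE)
kernel = np.ones((2, 2), np.uint8)
# On a page with dark text on a white background, dilating the white areas
# erases small dark dots; the subsequent erosion restores stroke thickness,
# though thin letter parts may still be damaged.
cleaned = cv2.dilate(binary, kernel, iterations=1)
cleaned = cv2.erode(cleaned, kernel, iterations=1)
cv2.imwrite("morphology.png", cleaned)
```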

This result is not sufficient for recognition, so it can be improved with the non-local means method [10]. This method is applied before the adaptive threshold transformation and reduces the values of the points where small noise occurs. The result shown in Fig. 3 is much better, but small artifacts such as lines and dots still remain.

Fig. 3. Non-local means algorithm applied first
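
A sketch of the combined pipeline, applying non-local means before the adaptive threshold; the filter strength h and the window sizes are assumed values:

```python
import cv2

img = cv2.imread("noisy_page.png", cv2.IMREAD_GRAYSCALE)
# Non-local means averages similar patches, suppressing small-scale noise.
denoised = cv2.fastNlMeansDenoising(img, None, h=30,
                                    templateWindowSize=7, searchWindowSize=21)
binary = cv2.adaptiveThreshold(denoised, 255,
                               cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 11, 8)
cv2.imwrite("nl_means_then_threshold.png", binary)
```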

Analysis of the existing methods has shown that classical computer vision algorithms do not always produce a good result and need to be improved.

4 Description of the Method Developed for Cleaning Printed and Handwritten Texts of Noise

4.1 Task Setting

The task of cleaning text of noise, recognizing the text in the image, and converting it into text format consists of a number of subtasks [4, 6]:

  1. Selecting a test image;

  2. Converting the color image to grayscale;

  3. Scaling and cutting the image to a fixed size;

  4. Cleaning the text of noise with a convolutional neural network and a multilayer perceptron.

4.2 Preparation of Images Before Training

After reading, the input image is converted to a single channel (grayscale), where the value of each pixel is calculated as:

$$ Y' = 0.299R + 0.587G + 0.114B $$
(2)

This equation is used in the OpenCV library and is justified by the characteristics of human color perception [14, 15].
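
In OpenCV this is a single call; the manual computation below merely illustrates equation (2) (note that OpenCV stores channels in B, G, R order):

```python
import cv2
import numpy as np

img = cv2.imread("input.png")                 # 3-channel BGR image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # applies Y' = 0.299R + 0.587G + 0.114B

# Equivalent manual computation of equation (2):
b, g, r = img[..., 0], img[..., 1], img[..., 2]
gray_manual = (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)
```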

The image size can be arbitrary, but excessively large images are undesirable.

For training, 144 images are used. Since the available training sample was too small to train the convolutional network, it was decided to divide the images into parts. Because the training sample consists of images of different sizes, each image was first scaled to 448 × 448 pixels using linear interpolation and then cut into non-overlapping windows of 112 × 112 pixels. All windows were additionally rotated by 90, 180, and 270°, so about 9216 images were obtained for the training sample. As a result, an array with the dimensions (16, 112, 112, 1) is fed to the network input. The test sample was processed in the same way; it consisted of similar images that differed only in the noise texture and the text. The process of slicing and resizing an image is shown in Fig. 4.

Fig. 4. Process of slicing and resizing of an image
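
A minimal sketch of this preparation step for a single grayscale image, assuming NumPy and OpenCV:

```python
import cv2
import numpy as np

def prepare_tiles(image):
    # Scale to 448 x 448 with linear interpolation.
    resized = cv2.resize(image, (448, 448), interpolation=cv2.INTER_LINEAR)
    tiles = []
    # Cut into 16 non-overlapping 112 x 112 windows.
    for y in range(0, 448, 112):
        for x in range(0, 448, 112):
            tile = resized[y:y + 112, x:x + 112]
            # Augment with rotations by 90, 180, and 270 degrees.
            for k in range(4):
                tiles.append(np.rot90(tile, k))
    # 16 windows x 4 orientations, single channel.
    return np.array(tiles)[..., np.newaxis]   # shape (64, 112, 112, 1)

# 144 images x 16 windows x 4 orientations = 9216 training tiles
```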

The training sample of the multilayer perceptron is formed as follows [7] (see the sketch after this list):

  1. The images of the training set are passed through the pre-trained network of the U-Net architecture. Of the 144 images, only 36 were processed;

  2. 28 different filters are applied to each resulting image, so together with the initial one, 29 variants of each image are obtained;

  3. Pairs (input vector; target vector) are then formed. The input vector consists of the pixels located at the same position in the 29 resulting images; the target vector consists of the one corresponding pixel from the cleaned image;

  4. Step 3 is performed for each of the 36 images. As a result, the training sample contains 36 × 448 × 448 elements.

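A sketch of how such pairs can be assembled; the paper does not list the 28 filters, so the two shown here are illustrative placeholders:

```python
import cv2
import numpy as np

def make_filtered_stack(unet_output):
    variants = [unet_output]                                   # the initial image
    variants.append(cv2.GaussianBlur(unet_output, (3, 3), 0))  # filter 1 (assumed)
    variants.append(cv2.medianBlur(unet_output, 3))            # filter 2 (assumed)
    # ... 26 further filters, for 29 variants in total
    return np.stack(variants, axis=-1)                         # (448, 448, n_variants)

def make_training_pairs(unet_output, clean_image):
    stack = make_filtered_stack(unet_output)
    # One input vector per pixel position (one value from each variant),
    # paired with the corresponding pixel of the cleaned reference image.
    inputs = stack.reshape(-1, stack.shape[-1])                # (448*448, n_variants)
    targets = clean_image.reshape(-1, 1)                       # (448*448, 1)
    return inputs, targets
```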

4.3 Artificial Neural Network Training

In the proposed method, a convolutional neural network with the U-Net architecture [17] and a multilayer perceptron are used in sequence to clean printed and handwritten texts of noise. An array of non-overlapping 112 × 112 areas of the original image is fed to the network input, and the output is a similar array of processed areas.

A reduced version of the U-Net architecture was selected, consisting of only two blocks instead of the original four (Fig. 5).

Fig. 5. Abbreviated architecture of U-Net

The advantage of this architecture is that it requires only a small amount of training data. At the same time, the network has a relatively small number of weights because of its convolutional structure.

The architecture is a sequence of convolution and pooling layers [18] that first reduce the spatial resolution of the image and then increase it again, combining the upsampled image with the corresponding encoder feature maps and passing it through further convolution layers.
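
A minimal Keras sketch of such a two-block U-Net; the filter counts, activations, and loss are assumptions, since the paper does not list its hyperparameters:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(112, 112, 1))

# Contracting path: convolution + pooling reduce spatial resolution.
c1 = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
p1 = layers.MaxPooling2D(2)(c1)                            # 112 -> 56
c2 = layers.Conv2D(64, 3, activation="relu", padding="same")(p1)
p2 = layers.MaxPooling2D(2)(c2)                            # 56 -> 28

b = layers.Conv2D(128, 3, activation="relu", padding="same")(p2)

# Expanding path: upsample and concatenate with the matching encoder output.
u2 = layers.concatenate([layers.UpSampling2D(2)(b), c2])   # 28 -> 56
c3 = layers.Conv2D(64, 3, activation="relu", padding="same")(u2)
u1 = layers.concatenate([layers.UpSampling2D(2)(c3), c1])  # 56 -> 112
c4 = layers.Conv2D(32, 3, activation="relu", padding="same")(u1)

outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```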

Although the convolutional network coped with the majority of the noise, the image became less sharp and retained artifacts. To improve the quality of the text in the image, another artificial neural network is used: the multilayer perceptron.

The output array of the convolutional network is stitched together into a single image of 448 × 448 pixels, after which it is fed to the input of the multilayer perceptron. The format of its training set was described in Sect. 4.2.

The multilayer perceptron consists of 3 layers: 29 input neurons, 500 neurons in the hidden layer, and one output neuron [1].
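
In Keras this perceptron can be sketched as follows; the activation functions, optimizer, and loss are assumed, as the paper does not specify them:

```python
from tensorflow import keras
from tensorflow.keras import layers

# 29 inputs (one per image variant of a pixel), 500 hidden neurons,
# one output: the cleaned value of that pixel.
mlp = keras.Sequential([
    layers.Dense(500, activation="sigmoid", input_shape=(29,)),
    layers.Dense(1, activation="sigmoid"),
])
mlp.compile(optimizer="adam", loss="mse")
```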

4.4 Testing of an Artificial Neural Network

The results of processing the original image using the reduced U-Net architecture are shown in Fig. 6.

Fig. 6. Comparison of the original image and the image processed with the convolutional neural network

As a result of the subsequent processing of the obtained image, its sharpness and contrast increased significantly, and small artifacts were removed. An example is shown in Fig. 7.

Fig. 7. Cleared image

5 Developed Solution

In the course of the study, a software module was developed for digitizing damaged documents with the proposed method. Python 3 was chosen as the development language, together with the Keras neural network library, the NumPy library, and the OpenCV computer vision library.

The module also has the ability to recognize text from the processed image using the Tesseract OCR engine [5].
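
One possible wiring for this step, assuming the pytesseract bindings (the paper does not specify which Tesseract interface the module uses):

```python
import cv2
import pytesseract

cleaned = cv2.imread("cleared_image.png", cv2.IMREAD_GRAYSCALE)  # illustrative file name
text = pytesseract.image_to_string(cleaned)  # OCR on the cleaned image
print(text)
```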

When a noisy image with text is processed by this module, the output is the text in a format suitable for further processing.

6 Conclusion

We reviewed several existing methods for cleaning noisy printed documents, identified their shortcomings, and proposed a method with higher efficiency. The method described in this work requires a small training sample, works quickly, and removes noise with an average accuracy of 93% [20].

Thus, an image processed by the method described in this article is quite clean, has no significant distortions, and is easily recognized by most OCR engines and applications.

In the future, these methods can be applied in libraries, hospitals [19], and news companies, where people work with non-digitized papers that need to be digitized.