Improving OCR for Historical Documents by Modeling Image Distortion

  • Keiya Maekawa
  • Yoichi Tomiura
  • Satoshi Fukuda
  • Emi Ishita
  • Hideaki Uchiyama
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11853)

Abstract

Archives hold many printed historical documents, a large portion of which have deteriorated. Extracting text from images of such documents with optical character recognition (OCR) is error-prone, which in turn reduces the accuracy of information retrieval. It is therefore necessary to improve OCR performance on images of deteriorated documents. One approach is to convert images of deteriorated documents into clear images that an OCR system can recognize more easily. Training a neural network to perform this conversion requires suitable data. Pairs of a deteriorated image and the same image with the deterioration removed are hard to prepare; pairs of a clean image and a copy of it with synthetic noise added, however, are easy to prepare. In this study, PDFs of historical documents were collected and converted to text and JPEG images. Noise was added to the JPEG images to create a dataset whose images exhibit noise similar to that of the actual printed documents. U-Net, a type of convolutional neural network, was trained on this dataset. For noisy images in the test data, OCR performance on the original noisy image was compared with OCR performance on the image restored from it by the trained U-Net, and an improvement in the OCR recognition rate was confirmed.
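The training setup the abstract describes can be sketched as follows: synthetic degradation is applied to clean page crops on the fly, and a U-Net-style encoder-decoder learns to map the degraded image back to the clean one. This is a minimal illustrative sketch in PyTorch; the noise model (Gaussian noise plus sparse dark specks), the tiny two-level network, and all hyperparameters are assumptions for illustration, not the authors' actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def degrade(clean: torch.Tensor) -> torch.Tensor:
    """Synthesize a 'deteriorated' page from a clean one (assumed noise model)."""
    noisy = clean + 0.2 * torch.randn_like(clean)       # scan/sensor noise
    specks = (torch.rand_like(clean) < 0.02).float()    # sparse dark blotches
    noisy = noisy * (1.0 - specks)                      # stamp blotches to black
    return noisy.clamp(0.0, 1.0)

class TinyUNet(nn.Module):
    """Two-level U-Net: one downsampling stage with a skip connection."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(16, 32, 3, stride=2, padding=1)
        self.mid = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(F.relu(self.down(e)))
        u = self.up(m)
        return self.dec(torch.cat([u, e], dim=1))  # skip connection

# Train on (noisy, clean) pairs built on the fly from clean page crops.
model = TinyUNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clean_batch = torch.rand(4, 1, 64, 64)  # stand-in for real grayscale page crops
for step in range(100):
    noisy_batch = degrade(clean_batch)
    loss = F.mse_loss(model(noisy_batch), clean_batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At inference time, the trained network would be applied to a real deteriorated scan to produce a cleaner image, which is then passed to the OCR system instead of the original.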

Keywords

OCR error · Information retrieval · Historical document image

References

  1. Ghosh, K., Chakraborty, A., Parui, S.K., Majumder, P.: Improving information retrieval performance on OCRed text in the absence of clean text ground truth. Inf. Process. Manag. 52(5), 873–884 (2016)
  2. Chen, Y., Wang, L.: Broken and degraded document images binarization. Neurocomputing 237, 272–280 (2017)
  3. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
  4. Zhu, J.-Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV, pp. 2223–2232 (2017)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Keiya Maekawa (1), corresponding author
  • Yoichi Tomiura (1)
  • Satoshi Fukuda (1)
  • Emi Ishita (1)
  • Hideaki Uchiyama (1)

  1. Kyushu University, Nishi-ku, Japan
