Adaptive Edit-Distance and Regression Approach for Post-OCR Text Correction

  • Thi-Tuyet-Hai Nguyen
  • Mickael Coustaty
  • Antoine Doucet
  • Adam Jatowt
  • Nhu-Van Nguyen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11279)


Post-processing is a crucial step in improving the performance of the OCR process. In this paper, we present a novel approach that explores a modified method of candidate generation and candidate scoring at both the character level and the word level. These features are combined with important features suggested by related work to rank candidates in a regression model. Experimental results show that our approach achieves results comparable to those of the top-performing approaches in the ICDAR 2017 competition on post-OCR text correction.
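To make the generate-then-rank idea concrete, here is a minimal Python sketch. It is not the method of the paper: the fixed distance threshold `max_dist`, the toy `lexicon`, and the hand-set weights `w_sim` and `w_freq` are all assumptions chosen for illustration, and the weighted sum merely stands in for a trained regression model.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def generate_candidates(token: str, lexicon: Dict[str, int],
                        max_dist: int = 2) -> List[Tuple[str, int]]:
    """Return lexicon words within max_dist edits of the OCR token.

    A fixed threshold is an assumption of this sketch.
    """
    candidates = []
    for word in lexicon:
        dist = edit_distance(token, word)
        if dist <= max_dist:
            candidates.append((word, dist))
    return candidates


def rank_candidates(token: str, lexicon: Dict[str, int], max_dist: int = 2,
                    w_sim: float = 1.0, w_freq: float = 0.5) -> List[Tuple[str, float]]:
    """Score each candidate from two features and sort best first.

    The hand-set weights are hypothetical; a regression model would learn
    how to combine such features from training data.
    """
    total = sum(lexicon.values())
    scored = []
    for word, dist in generate_candidates(token, lexicon, max_dist):
        similarity = 1.0 - dist / max(len(token), len(word))  # character-level feature
        rel_freq = lexicon[word] / total                      # word-level feature
        scored.append((word, w_sim * similarity + w_freq * rel_freq))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical word counts, not taken from any dataset in the paper.
lexicon = {"house": 50, "horse": 30, "mouse": 20}
ranking = rank_candidates("hcuse", lexicon)
```

In this toy run, `ranking[0][0]` is `"house"`, since it is the only candidate one edit away from the noisy token and also the most frequent word in the lexicon.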


Post-OCR processing, Noisy channel, Language model, Regression model



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Thi-Tuyet-Hai Nguyen (1)
  • Mickael Coustaty (1)
  • Antoine Doucet (1)
  • Adam Jatowt (2)
  • Nhu-Van Nguyen (1)

  1. L3i, University of La Rochelle, La Rochelle, France
  2. Department of Social Informatics, Kyoto University, Kyoto, Japan
