A novel Arabic OCR post-processing using rule-based and word context techniques

  • Iyad Abu Doush
  • Faisal Alkhateeb
  • Anwaar Hamdi Gharaibeh
Original Paper

Abstract

Optical character recognition (OCR) is the process of automatically recognizing characters in scanned documents so that the text can be edited, indexed, and searched, and so that storage space is reduced. The text produced by OCR usually does not match the text in the original document exactly. OCR post-processing approaches can be used to minimize the number of incorrect words in the output. Correcting OCR errors is more complicated for Arabic because of the complexity of its script: letters are connected, different letters may have the same shape, and the same letter may take different forms. This paper provides a statistical Arabic language model and post-processing techniques that hybridize the error model approach with the context approach. The proposed model is language independent and is not constrained by string length. To the best of our knowledge, this is the first end-to-end OCR post-processing model applied to the Arabic language. To train the proposed model, we built an Arabic OCR context database that contains 9000 images of Arabic text. The evaluation of the OCR post-processing results is also automated using our novel alignment technique, called fast automatic hashing text alignment. Our experimental results show that, with a training data set of 1000 images, the rule-based system reduces the word error rate from 24.02% to 20.26%. When the trained rule-based system is applied to a testing data set of 500 images, the word error rate is reduced from 14.95% to 14.53%. With the same 1000 training images, the proposed hybrid OCR post-processing system reduces the word error rate from 24.02% to 18.96%; on the 500 testing images, it reduces the word error rate from 14.95% to 14.42%. These results show that the proposed hybrid system outperforms the rule-based system.
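The word error rate figures above can be made concrete with a small sketch. The abstract does not define how the rate is computed, and the paper's own evaluation relies on its fast automatic hashing text alignment, which is not reproduced here. The code below instead assumes the common definition: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: this is the conventional WER definition, not necessarily
    the exact measure used in the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate removed by post-processing."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy example: one wrong word out of four.
    print(word_error_rate("the quick brown fox", "the quick brawn fox"))

    # Relative reductions implied by the rates reported in the abstract.
    print(relative_reduction(24.02, 20.26))  # rule-based, training images
    print(relative_reduction(24.02, 18.96))  # hybrid, training images
    print(relative_reduction(14.95, 14.53))  # rule-based, testing images
    print(relative_reduction(14.95, 14.42))  # hybrid, testing images
```

Read this way, the absolute drops on the testing images (0.42 and 0.53 percentage points) are much smaller than on the training images (3.76 and 5.06 percentage points), although the hybrid system stays ahead of the rule-based system in both cases.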

Keywords

Automatic post-processing; Arabic OCR post-processing; Language model; Alignment technique; Error model


Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Iyad Abu Doush (1, 2)
  • Faisal Alkhateeb (2)
  • Anwaar Hamdi Gharaibeh (2)
  1. Computer Science and Information Systems Department, American University of Kuwait, Salmiya, Kuwait
  2. Computer Sciences Department, Yarmouk University, Irbid, Jordan
