Text recognition in document images obtained by a smartphone based on deep convolutional and recurrent neural network

  • Hassan El Bahi
  • Abdelkarim Zatni


Automatic text recognition in document images is an important task in many real-world applications. Several systems have been proposed to accomplish this task. However, little attention has been given to document images obtained by mobile phones. To meet this need, we propose a new system that integrates preprocessing, feature extraction and classification in order to recognize the text contained in document images acquired by a smartphone. The preprocessing phase locates the text region and then segments that region into text-line images. In the second phase, a sliding window divides each text-line image into a sequence of frames, and a deep convolutional neural network (CNN) model extracts features from each frame. Finally, an architecture that combines a bidirectional recurrent neural network (RNN), gated recurrent unit (GRU) blocks and a connectionist temporal classification (CTC) layer performs the classification phase. The proposed system has been tested on the ICDAR2015 Smartphone document OCR dataset, and the experimental results show that it is capable of achieving promising recognition rates.
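The two ends of the recognition stage described above, framing a text-line image and reading a character string off per-frame CTC outputs, can be illustrated with a minimal sketch. It is not the authors' implementation: the window `width`, the `stride`, the convention that label 0 is the CTC blank, and the greedy (best-path) decoding rule are all assumptions made for illustration, and the CNN feature extractor and bidirectional GRU layers that sit between the two steps are omitted.

```python
from typing import List, Sequence

Image = List[List[int]]  # rows of grayscale pixel values (assumed representation)

BLANK = 0  # assumption: label index 0 is the CTC blank symbol


def sliding_window_frames(line_image: Image, width: int, stride: int) -> List[Image]:
    """Cut a text-line image into vertical frames, left to right.

    Trailing columns that cannot fill a whole frame are dropped.
    """
    if width <= 0 or stride <= 0:
        raise ValueError("width and stride must be positive")
    n_cols = len(line_image[0]) if line_image else 0
    frames = []
    for start in range(0, n_cols - width + 1, stride):
        # Each frame keeps the full height and `width` consecutive columns.
        frames.append([row[start:start + width] for row in line_image])
    return frames


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks.

    Label k > 0 maps to alphabet[k - 1]; label 0 is the blank.
    """
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    chars = []
    prev = BLANK
    for label in best:
        # Emit a character only when it is not blank and differs from the previous label.
        if label != BLANK and label != prev:
            chars.append(alphabet[label - 1])
        prev = label
    return "".join(chars)


if __name__ == "__main__":
    toy_line = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]
    print(len(sliding_window_frames(toy_line, width=4, stride=2)))
```

In this sketch a blank between two identical labels is what allows a doubled letter to survive decoding, which is why the blank symbol is part of the CTC output alphabet.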


Text recognition, Document image, Smartphone, Convolutional neural network, Recurrent neural network




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Laboratory of Metrology and Information Processing, Ibnou Zohr University, Agadir, Morocco
  2. Laboratory of Applied Mathematics and Computer Science, Cadi Ayyad University, Marrakech, Morocco
