Text Retrieval from Scanned Forms Using Optical Character Recognition

  • Conference paper
Sensors and Image Processing

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 651)

Abstract

This paper investigates the use of image processing techniques and logistic regression, a machine learning algorithm, to extract text from scanned forms. Converting printed or handwritten documents into editable digital text is tedious and requires considerable human effort. To automate this task, we apply logistic regression. The system has two main components: (i) detecting text in the scanned document and (ii) recognizing the individual characters in the detected text. To accomplish these tasks, we first use image processing techniques to perform line segmentation and character segmentation, and then perform character recognition. Character recognition is done by a one-vs-all classifier whose parameters are learned from a training data set. Once trained, the classifier can identify a total of 39 characters, comprising uppercase English letters, numerals, and a few symbols.
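The pipeline described in the abstract (line segmentation, character segmentation, then one-vs-all logistic regression) can be illustrated with a minimal sketch. The abstract does not say how segmentation is done or how the classifier is trained, so everything below is an assumption made for illustration: segmentation uses horizontal and vertical projection profiles on a binary image, the classifier is fitted with plain batch gradient descent, and the names `find_runs`, `segment`, `train_one_vs_all`, and `predict` are hypothetical. For the setting in the abstract, `num_classes` would be 39.

```python
import numpy as np

def find_runs(mask):
    """Return (start, end) pairs, end exclusive, for each run of True in a 1-D mask."""
    padded = np.concatenate(([False], mask, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[0::2], changes[1::2])]

def segment(binary):
    """Assumed segmentation: projection profiles on a binary image (1 = ink).

    Rows containing ink are grouped into text lines; within each line,
    columns containing ink are grouped into character boxes.
    """
    boxes = []
    for top, bottom in find_runs(binary.sum(axis=1) > 0):
        line = binary[top:bottom]
        for left, right in find_runs(line.sum(axis=0) > 0):
            boxes.append((top, bottom, left, right))
    return boxes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, lr=1.0, epochs=3000):
    """Fit one binary logistic regression per class (class k versus the rest).

    X holds one feature vector per row (for example, flattened character
    pixels scaled to [0, 1]); y holds integer class labels.
    """
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])        # prepend a bias column
    theta = np.zeros((num_classes, n + 1))
    for k in range(num_classes):
        target = (y == k).astype(float)
        for _ in range(epochs):
            # Gradient of the mean logistic loss for classifier k
            grad = Xb.T @ (sigmoid(Xb @ theta[k]) - target) / m
            theta[k] -= lr * grad
    return theta

def predict(theta, X):
    """Assign each row of X to the class whose classifier is most confident."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(sigmoid(Xb @ theta.T), axis=1)
```

In use, each box returned by `segment` would be cropped, resized to a fixed shape, flattened into a feature vector, and passed to `predict`. The learning rate and epoch count are placeholders suited to features scaled to [0, 1], not values taken from the paper.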

Author information

Corresponding author

Correspondence to Vaishali Aggarwal.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Aggarwal, V., Jajoria, S., Sood, A. (2018). Text Retrieval from Scanned Forms Using Optical Character Recognition. In: Urooj, S., Virmani, J. (eds) Sensors and Image Processing. Advances in Intelligent Systems and Computing, vol 651. Springer, Singapore. https://doi.org/10.1007/978-981-10-6614-6_21

  • DOI: https://doi.org/10.1007/978-981-10-6614-6_21

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-6613-9

  • Online ISBN: 978-981-10-6614-6

  • eBook Packages: Engineering, Engineering (R0)
