Text Retrieval from Scanned Forms Using Optical Character Recognition

Aggarwal, Vaishali; Jajoria, Sourabh; Sood, Apoorvi

doi:10.1007/978-981-10-6614-6_21

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 651)

Abstract

This paper investigates the use of image processing techniques and the logistic regression machine learning algorithm to extract text from scanned forms. Converting printed or handwritten documents into editable digital text is tedious and requires considerable human effort. To automate this task, we apply logistic regression. The system has two main components: (i) text detection in the scanned document and (ii) recognition of the individual characters in the detected text. To complete these tasks, we first use image processing techniques to perform line segmentation and character segmentation, and then perform character recognition. Character recognition is done by a one-vs-all classifier whose parameters are learned from a training data set. Once trained, the classifier can identify a total of 39 characters, comprising uppercase English letters, numerals, and a few symbols.
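
The one-vs-all scheme named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 3x3 toy glyphs, the three-label set, the function names, and the learning rate and epoch count are all assumptions chosen to keep the example small, whereas the paper's classifier covers 39 characters. One binary logistic regression model is trained per label, and a segmented character is assigned to the label whose model reports the highest probability.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probability(w, x):
    """Probability of the positive class; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def train_binary(X, y, lr=0.5, epochs=500):
    """Fit one logistic regression model by batch gradient descent.

    X: feature vectors (here, flattened binary pixels of one character).
    y: 0/1 targets. lr and epochs are illustrative values, not from the paper.
    """
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = probability(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def train_one_vs_all(X, labels):
    """Train one binary model per label: that label against all others."""
    return {
        c: train_binary(X, [1 if lab == c else 0 for lab in labels])
        for c in sorted(set(labels))
    }


def predict(models, x):
    """Return the label whose model gives the highest probability."""
    return max(models, key=lambda c: probability(models[c], x))


# Hypothetical 3x3 binary glyphs standing in for segmented characters.
GLYPHS = [
    [0, 1, 0, 0, 1, 0, 0, 1, 0],  # "I"
    [1, 0, 0, 1, 0, 0, 1, 1, 1],  # "L"
    [1, 1, 1, 0, 1, 0, 0, 1, 0],  # "T"
]
LABELS = ["I", "L", "T"]

models = train_one_vs_all(GLYPHS, LABELS)
print([predict(models, g) for g in GLYPHS])
```

In a full pipeline, the feature vectors would come from the line and character segmentation stages rather than from hand-written lists, and each label would need many training samples instead of one.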

Author information

Authors and Affiliations

Netaji Subhas Institute of Technology, Sector 3, Dwarka, New Delhi, India
Vaishali Aggarwal, Sourabh Jajoria & Apoorvi Sood

Corresponding author

Correspondence to Vaishali Aggarwal.

Editor information

Editors and Affiliations

Gautam Buddha University, Greater Noida, Uttar Pradesh, India
Shabana Urooj
Department of Electrical and Instrumentation Engineering, Thapar University, Patiala, Punjab, India
Jitendra Virmani

About this paper

Cite this paper

Aggarwal, V., Jajoria, S., Sood, A. (2018). Text Retrieval from Scanned Forms Using Optical Character Recognition. In: Urooj, S., Virmani, J. (eds) Sensors and Image Processing. Advances in Intelligent Systems and Computing, vol 651. Springer, Singapore. https://doi.org/10.1007/978-981-10-6614-6_21

DOI: https://doi.org/10.1007/978-981-10-6614-6_21
Published: 04 October 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6613-9
Online ISBN: 978-981-10-6614-6
eBook Packages: Engineering, Engineering (R0)
