Abstract
This paper investigates the use of image processing techniques and logistic regression, a machine learning algorithm, to extract text from scanned forms. Converting printed or handwritten documents into editable digital text is tedious and requires considerable human effort. To automate this task, we apply logistic regression. The system has two main components: (i) detection of text in the scanned document and (ii) recognition of the individual characters in the detected text. To accomplish these tasks, we first use image processing techniques to perform line segmentation and character segmentation, and then perform character recognition. Character recognition is done by a one-vs-all classifier whose parameters are learned from a training data set. Once trained, the classifier can identify a total of 39 characters, comprising uppercase English letters, numerals, and a few symbols.
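The pipeline summarised above (line segmentation, character segmentation, one-vs-all recognition) can be illustrated with a minimal sketch. The abstract does not state how segmentation is performed or how the classifier is optimised, so several choices here are assumptions made for illustration only: segmentation by projection profiles on an already binarised image, raw pixels as features, and plain batch gradient descent. All function names, the toy `PAGE` image, and the 3x3 `GLYPHS` are hypothetical and are not taken from the paper.

```python
import math


def runs(profile):
    """Return (start, end) pairs for each stretch of non-zero profile values."""
    found, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            found.append((start, i))
            start = None
    if start is not None:
        found.append((start, len(profile)))
    return found


def segment_lines(img):
    """Assumed method: a text line is a run of rows that contain ink."""
    return runs([sum(row) for row in img])


def segment_chars(img, top, bottom):
    """Assumed method: a character is a run of columns with ink in a line."""
    width = len(img[0])
    return runs([sum(img[r][c] for r in range(top, bottom)) for c in range(width)])


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, x):
    """Probability under one binary model; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def train_one_vs_all(X, y, lr=0.5, epochs=2000):
    """Fit one binary logistic regression per class (that class vs. the rest)."""
    models = {}
    for label in sorted(set(y)):
        w = [0.0] * (len(X[0]) + 1)
        for _ in range(epochs):
            grad = [0.0] * len(w)
            for x, target in zip(X, y):
                err = score(w, x) - (1.0 if target == label else 0.0)
                grad[0] += err
                for j, xj in enumerate(x):
                    grad[j + 1] += err * xj
            # Batch gradient descent step on the mean log-loss gradient.
            w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
        models[label] = w
    return models


def predict(models, x):
    """Pick the class whose binary model reports the highest probability."""
    return max(models, key=lambda label: score(models[label], x))


# Hypothetical binarised page: 1 = ink, 0 = background.
PAGE = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
]

# Hypothetical 3x3 glyphs, flattened row by row, standing in for real samples.
GLYPHS = {
    "I": [0, 1, 0, 0, 1, 0, 0, 1, 0],
    "L": [1, 0, 0, 1, 0, 0, 1, 1, 1],
    "T": [1, 1, 1, 0, 1, 0, 0, 1, 0],
}

lines = segment_lines(PAGE)                  # [(1, 3), (4, 5)]
chars = segment_chars(PAGE, *lines[0])       # [(0, 2), (3, 4)]
models = train_one_vs_all(list(GLYPHS.values()), list(GLYPHS.keys()))
```

In this sketch each of the classes gets its own weight vector, so a system covering the 39 characters mentioned above would train 39 binary models and label a segmented character with the highest-scoring one. A real system would also need binarisation, size normalisation of each character crop, and a far larger training set.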
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Aggarwal, V., Jajoria, S., Sood, A. (2018). Text Retrieval from Scanned Forms Using Optical Character Recognition. In: Urooj, S., Virmani, J. (eds) Sensors and Image Processing. Advances in Intelligent Systems and Computing, vol 651. Springer, Singapore. https://doi.org/10.1007/978-981-10-6614-6_21
DOI: https://doi.org/10.1007/978-981-10-6614-6_21
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6613-9
Online ISBN: 978-981-10-6614-6
eBook Packages: Engineering, Engineering (R0)