Extraction of type style-based meta-information from imaged documents

  • B.B. Chaudhuri
  • U. Garain
Original papers

Abstract.

This paper considers the extraction of meta-information from printed documents without carrying out optical character recognition (OCR). It can be statistically verified that important terms in technical articles are mainly printed in italic, bold, and all-capital styles. A fast approach to detecting these styles is proposed, based on global shape heuristics that hold for each style in any font. Important words in a document are sometimes also printed in a larger size, so an approach for determining font size is presented as well. Detecting type styles helps improve OCR performance, especially for italicized text. A further benefit of identifying word type styles and font size is discussed in the context of extracting (i) different logical labels and (ii) important terms from the document. Experimental results are presented for a large number of good-quality as well as degraded document images.

Key words: OCR – Meta-information – Type style – Font size – Information retrieval 

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • B.B. Chaudhuri (1)
  • U. Garain (1)
  1. Computer Vision & Pattern Recognition Unit, Indian Statistical Institute, 203 B.T. Road, Calcutta 700 035, India; e-mail: {bbc,utpal}@isical.ac.in
