Extraction of type style-based meta-information from imaged documents

Chaudhuri, B.B.; Garain, U.

doi:10.1007/PL00013557

Original papers
Published: March 2001

Volume 3, pages 138–149, (2001)

International Journal on Document Analysis and Recognition

Abstract.

Extraction of some meta-information from printed documents without carrying out optical character recognition (OCR) is considered. It can be statistically verified that important terms in technical articles are mainly printed in italic, bold, and all-capital style. A quick approach to detecting them is proposed here. This approach is based on the global shape heuristics of these styles of any font. Important words in a document are sometimes printed in larger size as well. A smart approach for the determination of font size is also presented. Detection of type styles helps in improving OCR performance, especially for reading italicized text. Another advantage to identifying word type styles and font size has been discussed in the context of extracting: (i) different logical labels; and (ii) important terms from the document. Experimental results on the performance of the approach on a large number of good quality, as well as degraded, document images are presented.

Author information

Authors and Affiliations

Computer Vision & Pattern Recognition Unit, Indian Statistical Institute, 203 B.T. Road, Calcutta 700 035, India; e-mail: {bbc,utpal}@isical.ac.in
B.B. Chaudhuri & U. Garain

Additional information

Received July 12, 2000 / Revised October 1, 2000

About this article

Cite this article

Chaudhuri, B., Garain, U. Extraction of type style-based meta-information from imaged documents. IJDAR 3, 138–149 (2001). https://doi.org/10.1007/PL00013557

Issue Date: March 2001
DOI: https://doi.org/10.1007/PL00013557

Key words: OCR – Meta-information – Type style – Font size – Information retrieval
