Script and language identification from document images

Peake, G. S.; Tan, T. N.

doi:10.1007/3-540-63931-4_203

Poster Session II
Conference paper
First Online: 01 January 2005

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1352)

Abstract

In this paper we present a review of current script and language identification techniques. The main criticism of the existing techniques is that most of them rely on either connected component analysis or character segmentation. We then present a new method for script identification, based on texture analysis, that does not require character segmentation. Simple processing of a document image produces a uniform text block on which texture analysis can be performed. Multi-channel (Gabor) filters and grey level co-occurrence matrices are used in independent experiments to extract texture features. Test documents are classified with the K-NN classifier, using the features of training documents. Initial results are very promising: over 95% accuracy on the classification of 105 test documents from 7 scripts. The method is robust to noise and to the presence of foreign characters or numerals, and can be applied to very small amounts of text.
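
As a rough illustration of the co-occurrence branch of the pipeline described in the abstract, the sketch below computes a few grey level co-occurrence statistics from a small image block and classifies a query block with a K-NN vote. It is a minimal sketch and not the authors' implementation: the abstract does not state the number of grey levels, the pixel offsets, the feature set, or the value of K, so all of those are assumptions, as are the function names, the labels `script_A` and `script_B`, and the striped toy images that stand in for normalised text blocks. The Gabor filter branch is not shown.

```python
import math
from collections import Counter


def glcm(image, dx, dy, levels):
    """Normalised grey level co-occurrence matrix for pixel offset (dx, dy)."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    if total == 0:
        return counts
    return [[v / total for v in row] for row in counts]


def texture_features(image, levels=2,
                     offsets=((1, 0), (0, 1), (1, 1), (1, -1))):
    """Energy, contrast and homogeneity per offset (an assumed feature set)."""
    features = []
    for dx, dy in offsets:
        p = glcm(image, dx, dy, levels)
        cells = [(i, j) for i in range(levels) for j in range(levels)]
        energy = sum(p[i][j] ** 2 for i, j in cells)
        contrast = sum((i - j) ** 2 * p[i][j] for i, j in cells)
        homogeneity = sum(p[i][j] / (1 + abs(i - j)) for i, j in cells)
        features.extend([energy, contrast, homogeneity])
    return features


def knn_classify(training, query, k=3):
    """Majority vote among the k training vectors nearest to the query."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def stripes(horizontal, size=8, phase=0):
    """Toy binary texture standing in for a uniform text block."""
    return [[((r if horizontal else c) + phase) % 2 for c in range(size)]
            for r in range(size)]


# Hypothetical training set: two toy "scripts" with different textures.
training = [
    (texture_features(stripes(True, phase=0)), "script_A"),
    (texture_features(stripes(True, phase=1)), "script_A"),
    (texture_features(stripes(False, phase=0)), "script_B"),
    (texture_features(stripes(False, phase=1)), "script_B"),
]

# A query block of a different size but the same texture as script_A.
query = texture_features(stripes(True, size=10))
print(knn_classify(training, query, k=3))  # prints: script_A
```

Because the features are computed from normalised co-occurrence frequencies rather than from segmented characters, the query block can differ in size from the training blocks, which is in the spirit of the segmentation-free approach the abstract describes.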

Author information

Authors and Affiliations

Department of Computer Science, University of Reading, RG6 6AY, England
G. S. Peake & T. N. Tan

Editor information

Roland Chin, Ting-Chuen Pong

Cite this paper

Peake, G.S., Tan, T.N. (1997). Script and language identification from document images. In: Chin, R., Pong, TC. (eds) Computer Vision — ACCV'98. ACCV 1998. Lecture Notes in Computer Science, vol 1352. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-63931-4_203

DOI: https://doi.org/10.1007/3-540-63931-4_203
Published: 29 July 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63931-2
Online ISBN: 978-3-540-69670-4
eBook Packages: Springer Book Archive
