Abstract
This paper reviews current script and language identification techniques. Their main weakness is that most rely on either connected component analysis or character segmentation. We then present a new script identification method based on texture analysis, which does not require character segmentation. Simple processing of the document image produces a uniform text block on which texture analysis can be performed. Multi-channel (Gabor) filters and grey level co-occurrence matrices are used in independent experiments to extract texture features. Test documents are classified against the features of training documents using the K-NN classifier. Initial results are very promising: over 95% accuracy in classifying 105 test documents from 7 scripts. The method is robust to noise and to the presence of foreign characters or numerals, and can be applied to very small amounts of text.
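As a rough illustration of the kind of pipeline the abstract describes (grey level co-occurrence features followed by K-NN classification), the sketch below may help. It is not the authors' implementation: the function names, the choice of energy, contrast and entropy as features, the four displacements, k = 3, and the synthetic striped images standing in for uniform text blocks are all assumptions made for the example. The Gabor filter variant is not shown.

```python
import numpy as np

def glcm(img, dx, dy, levels):
    """Normalised, symmetric grey level co-occurrence matrix for displacement (dx, dy)."""
    h, w = img.shape
    # Pair every pixel with its neighbour at the given displacement.
    a = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    b = img[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    m = np.zeros((levels, levels), dtype=float)
    np.add.at(m, (a.ravel(), b.ravel()), 1.0)  # count co-occurring grey level pairs
    m = m + m.T                                # count each pair in both directions
    return m / m.sum()

def glcm_features(img, levels=2, offsets=((1, 0), (0, 1), (1, 1), (-1, 1))):
    """Energy, contrast and entropy for each displacement (assumed feature set)."""
    i, j = np.indices((levels, levels))
    feats = []
    for dx, dy in offsets:
        p = glcm(img, dx, dy, levels)
        nz = p[p > 0]
        feats.extend([np.sum(p ** 2),            # energy
                      np.sum(p * (i - j) ** 2),  # contrast
                      -np.sum(nz * np.log(nz))]) # entropy
    return np.array(feats)

def knn_classify(x, train_x, train_y, k=3):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    nearest = np.argsort(np.linalg.norm(train_x - x, axis=1))[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def synthetic_block(kind, rng, size=32, noise=0.05):
    """Hypothetical stand-in for a text block: noisy binary stripes."""
    rows, cols = np.indices((size, size))
    base = (rows // 2) % 2 if kind == "horizontal" else (cols // 2) % 2
    flip = rng.random((size, size)) < noise      # salt-and-pepper style noise
    return np.where(flip, 1 - base, base).astype(int)

rng = np.random.default_rng(0)
kinds = ["horizontal", "vertical"]  # two toy "scripts"
train_x = np.array([glcm_features(synthetic_block(k, rng)) for k in kinds for _ in range(5)])
train_y = np.array([k for k in kinds for _ in range(5)])
test = [(k, glcm_features(synthetic_block(k, rng))) for k in kinds for _ in range(5)]
accuracy = np.mean([knn_classify(x, train_x, train_y) == k for k, x in test])
print(f"accuracy on synthetic blocks: {accuracy:.2f}")
```

Real document images would need the preprocessing the abstract mentions to produce a uniform text block first, and would typically use more grey levels than the two assumed here.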
Copyright information
© 1997 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Peake, G.S., Tan, T.N. (1997). Script and language identification from document images. In: Chin, R., Pong, TC. (eds) Computer Vision — ACCV'98. ACCV 1998. Lecture Notes in Computer Science, vol 1352. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-63931-4_203
DOI: https://doi.org/10.1007/3-540-63931-4_203
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63931-2
Online ISBN: 978-3-540-69670-4
eBook Packages: Springer Book Archive