Abstract
This paper introduces an image texture analysis method for minority language identification. In the first stage, each letter is assigned to a script type according to its energy status in the text-line area. The mapping is carried out by extracting the Unicode text and transforming it into coded text. There are four script types, which correspond to four grey levels of an image. The resulting image is then subjected to feature extraction by texture analysis: the grey level co-occurrence matrix and its derived features are calculated. The extracted features are compared and classified using the K-Nearest Neighbors and Naive Bayes methods to establish the differences that identify a minority language, such as Serbian, among other world languages in the text. The high accuracy obtained demonstrates the efficiency of the proposed approach compared with other state-of-the-art methods.
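The coding and texture steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The letter classes (base, ascender, descender, full), their grey-level values 0 to 3, the treatment of the coded text as a one-dimensional image with a horizontal neighbour offset of 1, and all function names are assumptions made for illustration; the abstract states only that four script types map to four grey levels and that co-occurrence features are computed.

```python
import math

# Assumed letter classes for lowercase Latin letters (illustrative only;
# the abstract does not list which letters belong to which script type).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")
FULL = set("j")


def script_type(ch):
    """Map a letter to an assumed grey level 0..3; return None for non-letters."""
    if not ch.isalpha():
        return None
    if ch in FULL:
        return 3  # occupies the whole text-line height
    if ch in DESCENDERS:
        return 2  # extends below the baseline
    if ch in ASCENDERS or ch.isupper():
        return 1  # extends above the x-height
    return 0      # base letter (simplification: diacritics are ignored)


def encode(text):
    """Turn text into a sequence of grey levels, skipping non-letters."""
    return [g for g in map(script_type, text) if g is not None]


def glcm(codes, levels=4):
    """Normalized co-occurrence matrix of horizontally adjacent grey levels."""
    counts = [[0] * levels for _ in range(levels)]
    for a, b in zip(codes, codes[1:]):
        counts[a][b] += 1
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("need at least two coded letters")
    return [[c / total for c in row] for row in counts]


def texture_features(p):
    """A few standard co-occurrence features of a normalized matrix p."""
    cells = [(i, j, v) for i, row in enumerate(p) for j, v in enumerate(row)]
    return {
        "energy": sum(v * v for _, _, v in cells),
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
        "entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0),
    }


if __name__ == "__main__":
    features = texture_features(glcm(encode("An example text line")))
    print(features)
```

Feature vectors of this kind, computed per text, would then be passed to a classifier such as K-Nearest Neighbors or Naive Bayes, as the abstract describes.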
Acknowledgments
This work was partially supported by a grant from the Ministry of Education, Science and Technological Development of the Republic of Serbia within the project TR33037.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Brodić, D., Amelio, A., Milivojević, Z.N. (2017). An Image Texture Analysis Method for Minority Language Identification. In: Brimkov, V., Barneva, R. (eds) Combinatorial Image Analysis. IWCIA 2017. Lecture Notes in Computer Science, vol 10256. Springer, Cham. https://doi.org/10.1007/978-3-319-59108-7_22
DOI: https://doi.org/10.1007/978-3-319-59108-7_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59107-0
Online ISBN: 978-3-319-59108-7
eBook Packages: Computer Science, Computer Science (R0)