Letter Based Text Scoring Method for Language Identification
In recent years, the volume of text documents on the internet, intranets, digital libraries, and newsgroups has grown at an unexpected rate. Extracting useful information and meaningful patterns from these documents is an important issue. Identifying the language of these text documents is an important problem that many researchers have studied. These studies have generally used words (terms) for language identification, and researchers have explored different approaches, such as linguistic and statistical ones. In this work, a Letter Based Text Scoring Method is proposed for language identification. The method is based on the letter distributions of texts. Text scoring is performed to identify the language of each text document: text scores are calculated from the letter distribution of the new text document. Besides offering acceptable accuracy, the proposed method is easier and faster than the short terms and n-gram methods.
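The idea of scoring a text against per-language letter distributions can be illustrated with a minimal Python sketch. The abstract does not give the scoring formula, so the rule used here (the average profile frequency of the letters in the new text) is an assumption, as are the function names and the tiny sample sentences, which stand in for real training corpora.

```python
from collections import Counter


def letter_distribution(text):
    """Return the relative frequency of each alphabetic character in text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    return {ch: n / len(letters) for ch, n in counts.items()}


def score(text, profile):
    """Assumed scoring rule: mean profile frequency over the letters of text.

    Letters that never occur in the language profile contribute zero.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(profile.get(ch, 0.0) for ch in letters) / len(letters)


def identify(text, profiles):
    """Return the language whose letter profile gives text the highest score."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))


# Hypothetical, very small training samples; a real system would build
# each profile from a large corpus per language.
SAMPLES = {
    "english": "the weather is very nice today and the students walk to school",
    "turkish": "bugün hava çok güzel ve öğrenciler okula yürüyerek gidiyorlar",
}
PROFILES = {lang: letter_distribution(t) for lang, t in SAMPLES.items()}

if __name__ == "__main__":
    print(identify("the children are walking to the park", PROFILES))
```

Because only single-letter counts are needed, building a profile and scoring a document each take one pass over the text, which is consistent with the abstract's claim that the approach is simpler than word-based or n-gram methods. The accuracy of this particular scoring rule is not established by the source.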