
Letter Based Text Scoring Method for Language Identification

  • Hidayet Takcı
  • İbrahim Soğukpınar
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3261)

Abstract

In recent years, the volume of text documents on the internet, intranets, digital libraries, and newsgroups has grown at an unexpected rate. Extracting useful information and meaningful patterns from these documents is an important task. Identifying the language of these documents is an important problem that many researchers have studied. Most of these studies use words (terms) for language identification, and they follow different approaches, such as linguistic and statistical ones. In this work, a Letter Based Text Scoring Method is proposed for language identification. The method is based on the letter distributions of texts. Text scoring is performed to identify the language of each text document: the scores are calculated from the letter distribution of the new text document. In addition to its acceptable accuracy, the proposed method is simpler and faster than methods based on short terms or n-grams.
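The abstract does not state the scoring formula, so the following Python sketch only illustrates the general idea of scoring a text against per-language letter distributions. It assumes cosine similarity as a stand-in score. The function names (`letter_distribution`, `cosine_score`, `identify_language`) and the toy training sentences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def letter_distribution(text):
    """Return the relative frequency of each letter in text (case-folded)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    total = len(letters)
    return {ch: n / total for ch, n in counts.items()}


def cosine_score(dist, profile):
    """Assumed score: cosine similarity between two letter distributions."""
    dot = sum(freq * profile.get(ch, 0.0) for ch, freq in dist.items())
    norm = math.sqrt(sum(f * f for f in dist.values())) * math.sqrt(
        sum(f * f for f in profile.values())
    )
    return dot / norm if norm else 0.0


def build_profiles(samples):
    """Build one letter-distribution profile per language from sample text."""
    return {lang: letter_distribution(text) for lang, text in samples.items()}


def identify_language(text, profiles):
    """Score text against every profile; return the best language and all scores."""
    dist = letter_distribution(text)
    scores = {lang: cosine_score(dist, prof) for lang, prof in profiles.items()}
    return max(scores, key=scores.get), scores


# Toy training text, for illustration only (not data from the paper).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the weather is rather nice in the north",
    "tr": "bugün hava çok güzel ve ben arkadaşlarımla birlikte "
          "şehirde yürüyüşe çıkacağım çünkü güneş açtı",
}

if __name__ == "__main__":
    profiles = build_profiles(SAMPLES)
    for sentence in ("where is the other brother",
                     "güneşli günlerde yürüyüş çok güzel"):
        best, scores = identify_language(sentence, profiles)
        print(sentence, "->", best, scores)
```

Because each profile is a single small vector with one entry per letter, scoring a new document takes one pass over its characters. That is consistent with the abstract's claim that a letter-based approach is simpler and faster than word or n-gram features, although the accuracy of this particular sketch has not been evaluated.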

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Hidayet Takcı (1)
  • İbrahim Soğukpınar (1)
  1. Gebze Institute of Technology, Gebze/Kocaeli
