
Language Detection and Tracking in Multilingual Documents Using Weak Estimators

  • Aleksander Stensby
  • B. John Oommen
  • Ole-Christoffer Granmo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6218)

Abstract

This paper deals with the extremely complicated problem of language detection and tracking in real-life electronic applications (for example, Word-of-Mouth (WoM) applications), where various segments of the text are written in different languages. The difficulties in solving the problem are manifold. First of all, the analyst has no knowledge of when one language stops and when the next starts. Further, the features which one uses for any one language (for example, the n-grams) will not be valid to recognize another. Finally, and most importantly, in most real-life applications, such as WoM, the fragments of text available before the switching are so small that classification using traditional estimation methods is rendered almost meaningless. Earlier, the authors of [10] had argued that, for a variety of problems, the use of strong estimators (i.e., estimators that converge with probability 1) is sub-optimal. In this vein, we propose to solve the current problem using novel estimators that are pertinent for non-stationary environments. The classification results, which involve as many as 8 languages, demonstrate that our proposed methodology is both powerful and efficient.
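The contrast between strong and weak estimators can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the method of the paper: the multiplicative update below is modelled on the kind of stochastic-learning estimator discussed in [10], the function names are hypothetical, the forgetting parameter `lam = 0.9` is an arbitrary choice, and the two symbols stand in for whatever features (for example, n-grams) a real system would track.

```python
def weak_update(probs, observed, lam=0.9):
    """One weak-estimator step (assumed form): scale every estimate by lam
    and give the freed mass (1 - lam) to the symbol that was just observed."""
    return {s: lam * p + ((1.0 - lam) if s == observed else 0.0)
            for s, p in probs.items()}


def weak_estimate(stream, symbols, lam=0.9):
    """Run the weak update over a stream, starting from a uniform estimate."""
    probs = {s: 1.0 / len(symbols) for s in symbols}
    for x in stream:
        probs = weak_update(probs, x, lam)
    return probs


def ml_estimate(stream, symbols):
    """Relative-frequency (maximum likelihood) estimate, a strong estimator."""
    counts = {s: 0 for s in symbols}
    for x in stream:
        counts[x] += 1
    return {s: counts[s] / len(stream) for s in symbols}


# Toy non-stationary stream: the source switches from "a" to "b" halfway.
stream = ["a"] * 50 + ["b"] * 50
weak = weak_estimate(stream, ["a", "b"])
ml = ml_estimate(stream, ["a", "b"])
print("weak estimate:", weak)
print("ML estimate:  ", ml)
```

In this toy stream the relative-frequency estimate averages over both regimes, while the weak estimate is dominated by the most recent observations, which is the behaviour one wants when the language can switch after a short fragment.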

Keywords

Multilingual language detection; Weak estimators

References

  1. Cavnar, W.B., Trenkle, J.M.: N-gram-based text categorization. In: Proceedings of SDAIR 1994, 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, US, pp. 161–175 (1994)
  2. Creutz, M.: Unsupervised segmentation of words using prior distributions of morph length and frequency. In: ACL 2003: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, Morristown, NJ, USA, pp. 280–287. Association for Computational Linguistics (2003)
  3. Creutz, M., Lagus, K.: Unsupervised discovery of morphemes (2002)
  4. Dunning, T.: Statistical Identification of Language. Technical report MCCS 94-273. New Mexico State University (1994)
  5. Ingle, N.C.: A language identification table. The Incorporated Linguist 15(4), 98–101 (1976)
  6. Jang, Y.M.: Estimation and Prediction-Based Connection Admission Control in Broadband Satellite Systems. ETRI Journal 22(4), 40–50 (2000)
  7. Ludovik, Y., Zacharski, R.: Multilingual document language recognition for creating corpora. Technical report, New Mexico State University (1999)
  8. Mandl, T., Shramko, M., Tartakovski, O., Womser-Hacker, C.: Language identification in multi-lingual web-documents. In: Kop, C., Fliedl, G., Mayr, H.C., Métais, E. (eds.) NLDB 2006. LNCS, vol. 3999, pp. 153–163. Springer, Heidelberg (2006)
  9. Oommen, B.J., Rueda, L.: Stochastic Learning-based Weak Estimation of Multinomial Random Variables and Its Applications to Non-stationary Environments. Pattern Recognition (2006) (in press)
  10. Oommen, B.J., Rueda, L.: Stochastic Learning-based Weak Estimation of Multinomial Random Variables and Its Applications to Non-stationary Environments. Pattern Recognition 39(1), 328–341 (2006)
  11. Ozbek, G., Rosenn, I., Yeh, E.: Language classification in multilingual documents. Technical report, Stanford University (2006)
  12. Souter, C., Churcher, G., Hayes, J., Hughes, J., Johnson, S.: Natural language identification using corpus-based models. Hermes Journal of Linguistics 13, 183–203 (1994)
  13. Ziegler, D.: The automatic identification of languages using linguistic recognition signals. PhD thesis, Buffalo, NY, USA (1991)

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Aleksander Stensby (1)
  • B. John Oommen (1, 2)
  • Ole-Christoffer Granmo (1)
  1. Dept. of ICT, University of Agder, Grimstad, Norway
  2. School of Computer Science, Carleton University, Ottawa, Canada
