Language Identification on the Web: Extending the Dictionary Method

Řehůřek, Radim; Kolkus, Milan

doi:10.1007/978-3-642-00382-0_29

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5449)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character n-grams are in use, mainly with identification based on Markov models or on character n-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include identifying the language of very short texts, correctly handling texts of unknown language, and handling texts composed of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to efficiently and automatically segment the input text into blocks of individual languages in the case of multilingual documents.
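
For orientation, the character n-gram profile approach that the abstract names as an existing baseline can be sketched roughly as follows. This is a minimal illustration in the spirit of rank-ordered n-gram profiles with an out-of-place distance; it is not the word-relevance method proposed in the paper, whose details are not given here. The function names (`ngram_profile`, `out_of_place`, `identify`), the underscore padding, and the default values of `n_max` and `top_k` are assumptions made for this example.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..n_max),
    ordered from most to least frequent."""
    # Lowercase, and mark word boundaries with underscores (an assumed convention).
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles; n-grams missing from
    the language profile receive a fixed maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Toy training data, far too small for real use; for illustration only.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "de": ngram_profile("der schnelle braune fuchs springt über den faulen hund"),
}
guess = identify("the quick brown fox jumps over the lazy dog", profiles)
```

A classifier of this kind always returns one of the known languages. It has no built-in notion of "unknown language" or of a document mixing several languages, and on very short inputs the profile is built from only a handful of n-grams. These are the three limitations the abstract lists.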

Author information

Authors and Affiliations

Masaryk University in Brno, Czech Republic
Radim Řehůřek
Seznam.cz, a.s., Czech Republic
Milan Kolkus

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Řehůřek, R., Kolkus, M. (2009). Language Identification on the Web: Extending the Dictionary Method. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_29

DOI: https://doi.org/10.1007/978-3-642-00382-0_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00381-3
Online ISBN: 978-3-642-00382-0
eBook Packages: Computer Science, Computer Science (R0)