Abstract
While many effective and efficient language identification algorithms exist, most of them based on supervised n-gram or word dictionary methods, we propose a semi-supervised approach to language identification based on prototype semantics.
Our method is aimed primarily at noise-rich environments in which only very small text fragments are available for analysis and no training data exists; it extends even to analyzing the probable language affiliations of single words.
We have integrated our prototype system into a larger web crawling and information management architecture and evaluated the prototype in an experimental setup comprising datasets in 11 European languages.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Winnemöller, R. (2010). Drive-by Language Identification. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2010. Lecture Notes in Computer Science, vol 6008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12116-6_42
DOI: https://doi.org/10.1007/978-3-642-12116-6_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12115-9
Online ISBN: 978-3-642-12116-6
eBook Packages: Computer Science