Abstract
This paper presents a classification system that automatically identifies the language of individual tokens in Dutch-English bilingual Tweets. A dictionary-based approach forms the basis of the system, and additional features are introduced to address the challenges of identifying closely related languages. Crucially, a separate system aimed specifically at differentiating between code-switching (CS) and borrowing is designed and then implemented as a classification step within the language identification (LID) system. This classification step is based on a linguistic framework for distinguishing between borrowing and CS. To test the effectiveness of the rules in the LID system, they are used to create feature vectors for training and testing machine learning systems. The discussion centres on a Decision Tree Classifier (DTC) and Support Vector Machines (SVM). The results show that there is only a small difference between the rule-based LID system (micro F1 = .95) and the DTC (micro F1 = .96).
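As a rough illustration of the dictionary-based idea and of the micro F1 metric reported above, the following is a minimal sketch only. The word lists, the label names (`nl`, `en`, `ambiguous`, `other`) and the function names are hypothetical and are not taken from the paper; the step that separates borrowing from CS is not modelled here.

```python
# Hypothetical toy lexicons; the dictionaries used in the paper are not reproduced here.
DUTCH = {"ik", "heb", "een", "nieuwe", "gekocht", "is", "in", "de", "het"}
ENGLISH = {"i", "have", "a", "new", "laptop", "is", "in", "the", "it", "awesome"}


def tag_token(token):
    """Label one token by dictionary lookup (assumed label set)."""
    word = token.lower()
    in_nl, in_en = word in DUTCH, word in ENGLISH
    if in_nl and in_en:
        return "ambiguous"  # shared word forms need extra features to resolve
    if in_nl:
        return "nl"
    if in_en:
        return "en"
    return "other"


def tag_tweet(text):
    """Split on whitespace and label every token."""
    return [(tok, tag_token(tok)) for tok in text.split()]


def micro_f1(gold, predicted):
    """Micro-averaged F1: pool TP, FP and FN over all labels before computing F1."""
    tp = fp = fn = 0
    for label in set(gold) | set(predicted):
        for g, p in zip(gold, predicted):
            if p == label and g == label:
                tp += 1
            elif p == label:
                fp += 1
            elif g == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    print(tag_tweet("Ik heb een nieuwe laptop gekocht"))
    print(micro_f1(["nl", "en", "nl", "en"], ["nl", "en", "en", "en"]))
```

Words that occur in both lexicons, such as "is" and "in", come out as `ambiguous` in this sketch, which is the kind of overlap between closely related languages that motivates the additional features mentioned above.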
Notes
- 1. Woordenlijst Nederlandse Taal is a word list that contains the correct spelling of current Dutch words. It is maintained by de Taalunie: http://woordenlijst.org/.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Kent, S., Claeser, D. (2019). Incorporating Code-Switching and Borrowing in Dutch-English Automatic Language Detection on Twitter. In: Arai, K., Bhatia, R., Kapoor, S. (eds) Proceedings of the Future Technologies Conference (FTC) 2018. FTC 2018. Advances in Intelligent Systems and Computing, vol 880. Springer, Cham. https://doi.org/10.1007/978-3-030-02686-8_32
DOI: https://doi.org/10.1007/978-3-030-02686-8_32
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02685-1
Online ISBN: 978-3-030-02686-8
eBook Packages: Intelligent Technologies and Robotics (R0)