Incorporating Code-Switching and Borrowing in Dutch-English Automatic Language Detection on Twitter

Kent, Samantha; Claeser, Daniel

doi:10.1007/978-3-030-02686-8_32

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 880)

Included in the following conference series:

Proceedings of the Future Technologies Conference

Abstract

This paper presents a classification system that automatically identifies the language of individual tokens in Dutch-English bilingual Tweets. A dictionary-based approach forms the basis of the system, and additional features are introduced to address the challenges of identifying closely related languages. Crucially, a separate system aimed specifically at differentiating between code-switching (CS) and borrowing is designed and then implemented as a classification step within the language identification (LID) system. This separate classification step is based on a linguistic framework for distinguishing between borrowing and CS. To test the effectiveness of the rules in the LID system, they are used to create feature vectors for training and testing machine learning systems. The discussion centres on a Decision Tree Classifier (DTC) and Support Vector Machines (SVM). The results show that there is only a small difference between the rule-based LID system (micro F1 = .95) and the DTC (micro F1 = .96).
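
To make the dictionary-based idea and the micro F1 metric concrete, the following is a minimal sketch, not the authors' implementation. The word lists, the tie-breaking rule for words found in both dictionaries, and the function names `tag_token`, `tag_tweet`, and `micro_f1` are all assumptions introduced for illustration. The sketch does not attempt the code-switching versus borrowing step described in the abstract.

```python
# Hypothetical toy word lists; a real system would rely on far larger lexical resources.
DUTCH = {"ik", "heb", "een", "vandaag", "gehad", "echt", "was", "het", "in"}
ENGLISH = {"i", "have", "a", "great", "day", "was", "it", "in", "meeting"}


def tag_token(token, prev_tag=None):
    """Label one token as 'nl', 'en', or 'other' by dictionary lookup."""
    word = token.lower()
    in_nl = word in DUTCH
    in_en = word in ENGLISH
    if in_nl and not in_en:
        return "nl"
    if in_en and not in_nl:
        return "en"
    if in_nl and in_en:
        # Assumed heuristic: an ambiguous word inherits the language of the
        # previous tagged token, defaulting to Dutch at the start of a tweet.
        return prev_tag if prev_tag in ("nl", "en") else "nl"
    return "other"


def tag_tweet(tokens):
    """Tag each token in order, carrying the last known language forward."""
    tags = []
    prev = None
    for token in tokens:
        tag = tag_token(token, prev)
        tags.append(tag)
        if tag != "other":
            prev = tag
    return tags


def micro_f1(gold, pred):
    """Micro-averaged F1 over all labels.

    With exactly one label per token, every error is one false positive
    (for the predicted label) and one false negative (for the gold label),
    so the micro-averaged score equals token-level accuracy.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g == p for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    fp = fn = len(gold) - tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    tweet = ["Ik", "heb", "een", "great", "day", "gehad"]
    print(list(zip(tweet, tag_tweet(tweet))))
```

Words such as "was" and "in" appear in both toy lists, which illustrates why closely related languages need more than a plain lookup. Note that a reported micro F1 may be computed over a different label set than this sketch assumes.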

Notes

1. Woordenlijst Nederlandse Taal is a word list that contains the correct spelling of current Dutch words. It is maintained by de Taalunie: http://woordenlijst.org/.
2. http://data.opentaal.org/opentaalbank/woordrelaties/.
3. https://nl.wiktionary.org/wiki/Hoofdpagina.

Author information

Authors and Affiliations

Fraunhofer Institut FKIE, Fraunhoferstrasse 20, 53343, Wachtberg, Germany
Samantha Kent & Daniel Claeser

Corresponding author

Correspondence to Samantha Kent.

Editor information

Editors and Affiliations

Saga University, Saga, Japan
Kohei Arai
The Science and Information (SAI) Organization, Bradford, West Yorkshire, UK
Rahul Bhatia
The Science and Information (SAI) Organization, Bradford, UK
Supriya Kapoor

About this paper

Cite this paper

Kent, S., Claeser, D. (2019). Incorporating Code-Switching and Borrowing in Dutch-English Automatic Language Detection on Twitter. In: Arai, K., Bhatia, R., Kapoor, S. (eds) Proceedings of the Future Technologies Conference (FTC) 2018. FTC 2018. Advances in Intelligent Systems and Computing, vol 880. Springer, Cham. https://doi.org/10.1007/978-3-030-02686-8_32

DOI: https://doi.org/10.1007/978-3-030-02686-8_32
Published: 18 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02685-1
Online ISBN: 978-3-030-02686-8
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
