Incorporating Code-Switching and Borrowing in Dutch-English Automatic Language Detection on Twitter

  • Conference paper
Proceedings of the Future Technologies Conference (FTC) 2018 (FTC 2018)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 880)

Abstract

This paper presents a classification system to automatically identify the language of individual tokens in Dutch-English bilingual Tweets. A dictionary-based approach is used as the basis of the system, and additional features are introduced to address the challenges associated with identifying closely related languages. Crucially, a separate system aimed specifically at differentiating between code-switching (CS) and borrowing is designed and then implemented as a classification step within the language identification (LID) system. This classification step is based on a linguistic framework for distinguishing between borrowing and CS. To test the effectiveness of the rules in the LID system, they are used to create feature vectors for training and testing machine learning systems. The discussion centres on a Decision Tree Classifier (DTC) and Support Vector Machines (SVM). The results show that there is only a small difference between the rule-based LID system (micro F1 = .95) and the DTC (micro F1 = .96).
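
As a rough illustration of the dictionary-based idea and of the micro F1 score reported above, the following minimal Python sketch tags each token by looking it up in two word lists and scores the predictions against gold labels. It is not the authors' system: the toy lexicons, the label names (`nl`, `en`, `ambiguous`, `unknown`) and the functions `tag_token`, `tag_tweet` and `micro_f1` are all hypothetical, and the additional features and the code-switching versus borrowing step described in the abstract are left out.

```python
from typing import List, Set, Tuple

# Hypothetical toy lexicons (assumption); a real system would load
# full Dutch and English word lists instead.
DUTCH_WORDS: Set[str] = {"ik", "heb", "een", "leuke", "dag", "gehad", "het", "is"}
ENGLISH_WORDS: Set[str] = {"i", "had", "a", "nice", "day", "is", "the", "weekend"}


def tag_token(token: str) -> str:
    """Label one token by dictionary lookup alone."""
    word = token.lower()
    in_dutch = word in DUTCH_WORDS
    in_english = word in ENGLISH_WORDS
    if in_dutch and in_english:
        # Closely related languages share many word forms, so a lookup
        # cannot decide these cases without further features.
        return "ambiguous"
    if in_dutch:
        return "nl"
    if in_english:
        return "en"
    return "unknown"


def tag_tweet(text: str) -> List[Tuple[str, str]]:
    """Split on whitespace and tag every token."""
    return [(token, tag_token(token)) for token in text.split()]


def micro_f1(gold: List[str], predicted: List[str]) -> float:
    """Micro-averaged F1: pool TP, FP and FN over all labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if g == p:
            tp += 1
        else:
            fp += 1  # the predicted label gains a false positive
            fn += 1  # the gold label gains a false negative
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for token, label in tag_tweet("Ik heb een nice day gehad"):
        print(token, label)
```

Because every token receives exactly one label in this sketch, each error counts once as a false positive and once as a false negative, so the micro F1 computed here equals plain token accuracy.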

Notes

  1. Woordenlijst Nederlandse Taal is a word list that contains the correct spelling of current Dutch words. It is maintained by de Taalunie: http://woordenlijst.org/.

  2. http://data.opentaal.org/opentaalbank/woordrelaties/.

  3. https://nl.wiktionary.org/wiki/Hoofdpagina.

Author information

Corresponding author

Correspondence to Samantha Kent.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Kent, S., Claeser, D. (2019). Incorporating Code-Switching and Borrowing in Dutch-English Automatic Language Detection on Twitter. In: Arai, K., Bhatia, R., Kapoor, S. (eds) Proceedings of the Future Technologies Conference (FTC) 2018. FTC 2018. Advances in Intelligent Systems and Computing, vol 880. Springer, Cham. https://doi.org/10.1007/978-3-030-02686-8_32
