Collecting and Annotating Indian Social Media Code-Mixed Corpora

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9624)

Abstract

The pervasiveness of social media in the present digital era has empowered ‘netizens’ to be more creative and interactive, and to generate content using free language forms that are often closer to spoken language and hence show phenomena previously analysed mainly in speech. One such phenomenon is code-mixing, which occurs when multilingual persons switch freely between the languages they have in common. Code-mixing presents many new challenges for language processing, and this paper discusses some of them, taking as a starting point the problems of collecting and annotating three corpora of code-mixed Indian social media text: one corpus of English-Bengali Twitter messages and two corpora of English-Hindi Twitter and Facebook messages, respectively. We present statistics of these corpora, discuss part-of-speech tagging of the corpora using both a coarse-grained and a fine-grained tag set, and compare their complexity to that of several other code-mixed corpora based on a Code-Mixing Index.
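The abstract does not define the Code-Mixing Index. As a rough illustration only, the sketch below follows the utterance-level form given in the authors' earlier work on measuring code-mixing, namely 100 × (1 − max(w_i) / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and max(w_i) the token count of the dominant language. The tag names in `NON_LANGUAGE_TAGS` and the function name are hypothetical, and the index actually used in the paper may differ (for example by also accounting for switch points).

```python
from collections import Counter

# Hypothetical labels for tokens that belong to no language
# (symbols, named entities, acronyms, undefined tokens).
NON_LANGUAGE_TAGS = {"univ", "ne", "acro", "undef"}


def code_mixing_index(tags):
    """Return an utterance-level mixing score in [0, 100).

    `tags` holds one language tag per token. The formula is an
    assumption: 100 * (1 - max(w_i) / (n - u)), and 0 when the
    utterance has no language-tagged tokens.
    """
    lang_counts = Counter(t for t in tags if t not in NON_LANGUAGE_TAGS)
    n_lang = sum(lang_counts.values())  # this is n - u
    if n_lang == 0:
        return 0.0
    dominant = max(lang_counts.values())  # this is max(w_i)
    return 100.0 * (1 - dominant / n_lang)


if __name__ == "__main__":
    # Two English tokens, one Hindi token, one language-independent token.
    print(code_mixing_index(["en", "en", "hi", "univ"]))
```

A monolingual utterance scores 0 under this form, and an utterance split evenly between two languages scores 50, so higher values indicate heavier mixing.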

Notes

  1. http://twitter4j.org/

  2. www.facebook.com/Confessions.IITB

  3. www.tdil-dc.in/tdildcMain/articles/780732DraftPOSTagstandard.pdf

  4. www.ldcil.org/Download/Tagset/LDCIL/6Hindi.pdf

Acknowledgements

Thanks to the different researchers who have made their datasets available: the organisers of the shared tasks on code-switching at EMNLP 2014 and in transliteration at FIRE 2014 and FIRE 2015, as well as Dong Nguyen and Seza Doğruöz (respectively University of Twente and Tilburg University, The Netherlands), and Monojit Choudhury and Kalika Bali (both at Microsoft Research India). Thanks also to an anonymous reviewer for extensive and useful comments.

Corresponding author

Correspondence to Anupam Jamatia.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Jamatia, A., Gambäck, B., Das, A. (2018). Collecting and Annotating Indian Social Media Code-Mixed Corpora. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_32

  • DOI: https://doi.org/10.1007/978-3-319-75487-1_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-75486-4

  • Online ISBN: 978-3-319-75487-1

  • eBook Packages: Computer Science (R0)
