Text Processing of Telugu–English Code Mixed Languages

  • S. Padmaja
  • Sasidhar Bandu
  • S. Sameen Fatima
Conference paper
Part of the Learning and Analytics in Intelligent Systems book series (LAIS, volume 3)


Code-mixed data on social media has grown rapidly, producing a large volume of noisy and incomplete multilingual content. Automated processing of noisy social media text is an active research area. This work focuses on extracting sentiment from movie-related, code-mixed Telugu–English bilingual data written in Roman script. A raw dataset of 11,250 tweets was extracted using the Twitter API. The data was first cleaned, and the annotated data was then processed for sentiment extraction using two approaches: lexicon based and machine learning based. In the lexicon-based approach, the language of each word was identified so that the word could be back-transliterated and its sentiment extracted. In the machine-learning-based approach, sentiment classification was performed with a support vector machine classifier using uni-gram, bi-gram and skip-gram features. The machine learning approach with skip-gram features achieved an accuracy of 76.33%, outperforming the lexicon-based approach, which achieved 66.82%.
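The three feature types named in the abstract can be illustrated with a minimal sketch. It assumes that "skip-gram" here means k-skip-bigrams (ordered word pairs with up to k words skipped between them); the abstract does not define the term, so this reading is an assumption. The function names and the sample tweet are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Contiguous n-grams: n=1 gives uni-grams, n=2 gives bi-grams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens: List[str], k: int) -> List[Tuple[str, str]]:
    """Ordered token pairs with at most k tokens skipped between them."""
    pairs = []
    for i, left in enumerate(tokens):
        # The partner token may sit 1 to k+1 positions to the right.
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


# Hypothetical Roman-script code-mixed tweet, invented for illustration only.
tweet = "movie chala bagundi super acting".split()

unigrams = ngrams(tweet, 1)
bigrams = ngrams(tweet, 2)
skips = skip_bigrams(tweet, k=1)

print(unigrams)
print(bigrams)
print(skips)
```

With k=0 the skip-bigrams reduce to ordinary bi-grams; a larger k adds pairs such as ("movie", "bagundi") that span intervening words, which may help when code-mixed text places modifiers between related words. Feature lists like these would then be turned into count vectors and passed to a classifier such as a support vector machine.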


Natural language processing; Sentiment extraction; Language identification; Twitter code mixed data



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Keshav Memorial Institute of Technology, Hyderabad, India
  2. Prince Sattam Bin Abdul Aziz University, Al-Kharj, Saudi Arabia
  3. Osmania University, Hyderabad, India
