Abstract
Code-mixed data on social media has increased, producing a large volume of noisy and inadequate multilingual content. Automated processing of noisy social media text is an active research area. This work focuses on extracting sentiment from movie-related, code-mixed Telugu–English bilingual data written in Roman script. A raw dataset of 11,250 tweets was collected using the Twitter API. The data was first cleaned, and the annotated data was then used for sentiment extraction through two approaches: lexicon-based and machine-learning-based. In the lexicon-based approach, the language of each word was identified so that the word could be back-transliterated and its sentiment extracted. In the machine-learning-based approach, sentiment classification was performed with uni-gram, bi-gram, and skip-gram features using a support vector machine classifier. The machine learning approach performed best with skip-gram features, reaching an accuracy of 76.33%, compared with 66.82% for the lexicon-based approach.
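The three feature types named in the abstract can be illustrated with a minimal sketch. The abstract does not define its skip-gram variant, so this sketch assumes k-skip-bigrams (ordered token pairs with at most k tokens skipped between them). The function names, the whitespace tokenizer, and the Romanized Telugu–English example sentence are hypothetical and are not taken from the paper. The resulting counts could then be passed to any support vector machine implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return contiguous n-grams as tuples (n=1 uni-grams, n=2 bi-grams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, k):
    """Return ordered token pairs with at most k tokens skipped between them.

    Assumption: this is the k-skip-bigram definition; with k=0 it reduces
    to ordinary bi-grams.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # j ranges over the next k+1 positions, clipped to the sentence end
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


def featurize(text, kind="skip", k=1):
    """Count features of one kind for a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    if kind == "uni":
        grams = ngrams(tokens, 1)
    elif kind == "bi":
        grams = ngrams(tokens, 2)
    elif kind == "skip":
        grams = skip_bigrams(tokens, k)
    else:
        raise ValueError(f"unknown feature kind: {kind}")
    return Counter(grams)


# Hypothetical code-mixed example, invented for illustration only
features = featurize("Movie chala bagundi super", kind="skip", k=1)
```

Skip-gram features capture pairs such as ("movie", "bagundi") that contiguous bi-grams miss, which may help when filler words separate the sentiment-bearing tokens.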
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Padmaja, S., Bandu, S., Fatima, S.S. (2020). Text Processing of Telugu–English Code Mixed Languages. In: Satapathy, S.C., Raju, K.S., Shyamala, K., Krishna, D.R., Favorskaya, M.N. (eds) Advances in Decision Sciences, Image Processing, Security and Computer Vision. ICETE 2019. Learning and Analytics in Intelligent Systems, vol 3. Springer, Cham. https://doi.org/10.1007/978-3-030-24322-7_19
DOI: https://doi.org/10.1007/978-3-030-24322-7_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-24321-0
Online ISBN: 978-3-030-24322-7
eBook Packages: Intelligent Technologies and Robotics (R0)