A Hybrid Approach of Machine Learning and Lexicons to Sentiment Analysis: Enhanced Insights from Twitter Data of Natural Disasters

Abstract

The success of sentiment analysis lies in identifying the most frequent and relevant opinions among users on a particular topic. In this paper, we develop a framework to analyze users’ sentiments on Twitter about natural disasters using data pre-processing techniques and a hybrid of machine learning, statistical modeling, and lexicon-based approaches. We choose TF-IDF with K-means for sentiment classification over affinity propagation and hierarchical clustering. Latent Dirichlet Allocation and a pipeline of Doc2Vec and K-means are used to capture themes, after which we perform a multi-level classification of polarity indices and a time series analysis of those indices. In our study, we draw insights from 243,746 tweets about the 2018 natural disasters in Kerala, India. The key findings are the classification of sentiments based on similarity and polarity indices and the identification of themes among the topics discussed on Twitter. We also observe different sets of emotions and influencers, among others. The case of the Kerala floods shows how the government and other organizations could track positive and negative sentiments over time and by location, gain a better understanding of the topics trending among the public, and collaborate with key Twitter users and influencers to spread information and identify gaps in the design and execution of schemes. The uniqueness of this research is the streamlined and efficient combination of algorithms and techniques embedded in the framework, which can be integrated into a platform with a GUI for further automation.
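
The lexicon-based part of such a framework, multi-level polarity classification followed by a time series of polarity indices, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy lexicon and its scores, the five polarity levels and their thresholds, the function names, and the three made-up example tweets.

```python
from collections import defaultdict
from datetime import date

# Toy lexicon with hypothetical word scores (NOT the lexicon used in the paper).
LEXICON = {"rescued": 2, "safe": 1, "help": 1, "thanks": 2,
           "flood": -1, "dead": -2, "stranded": -2, "lost": -1}


def polarity_index(text):
    """Mean lexicon score over the tokens found in the lexicon (0.0 if none)."""
    tokens = [t.strip(".,!?#@").lower() for t in text.split()]
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def polarity_level(index):
    """Map a polarity index to one of five levels (thresholds are assumed)."""
    if index <= -1.5:
        return "strongly negative"
    if index < 0:
        return "negative"
    if index == 0:
        return "neutral"
    if index < 1.5:
        return "positive"
    return "strongly positive"


def daily_mean_polarity(tweets):
    """Average polarity index per calendar day, ordered by date."""
    by_day = defaultdict(list)
    for day, text in tweets:
        by_day[day].append(polarity_index(text))
    return {day: sum(v) / len(v) for day, v in sorted(by_day.items())}


# Made-up example tweets, for illustration only.
tweets = [
    (date(2018, 8, 16), "Families stranded, many dead"),
    (date(2018, 8, 16), "Flood water rising"),
    (date(2018, 8, 18), "Everyone rescued and safe, thanks!"),
]
series = daily_mean_polarity(tweets)
for day, value in series.items():
    print(day.isoformat(), round(value, 2), polarity_level(value))
```

Averaging per day is only one possible aggregation; a real pipeline would likely also group by location and use a much larger, domain-adapted lexicon.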

Author information

Corresponding author

Correspondence to Pankaj Dutta.

Cite this article

Mendon, S., Dutta, P., Behl, A. et al. A Hybrid Approach of Machine Learning and Lexicons to Sentiment Analysis: Enhanced Insights from Twitter Data of Natural Disasters. Inf Syst Front (2021). https://doi.org/10.1007/s10796-021-10107-x

Keywords

  • Sentimental analysis
  • K-means clustering
  • Latent Dirichlet allocation
  • Machine learning
  • Twitter
  • Natural disasters