Abstract
Twitter generates an enormous amount of data daily. Various studies over the years have concluded that tweets have a significant impact on predicting and understanding stock price movement. Designing a system to store relevant tweets and extract information for specific stocks and industries is a relevant and unattempted problem for the Indian stock market, which is the eighth largest in terms of market capitalization. Because people with diverse backgrounds tweet about many topics simultaneously, it is nontrivial to identify the tweets that are relevant to the stock market. Such a system therefore needs one module for extracting and storing tweets and another for text classification. In the current study, we propose a hybrid approach to text classification that combines lexicon-based and machine learning-based techniques. The proposed scheme handles class imbalance effectively and is adaptive: it automatically grows the lexicon both through WordNet and by using a machine learning technique. The system achieves an F1-score of over 98% on the relevant class, compared with 60% for the baseline method, on a corpus of 10,000 tweets. The coverage of tweets by the lexicon also improves by 8%.
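The hybrid idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the seed lexicon, the toy tweets, the class name `HybridClassifier`, and the choice of a smoothed Naive Bayes fallback are all assumptions made for illustration, and the WordNet expansion and class-imbalance handling mentioned in the abstract are left out. A tweet is marked relevant when it contains a lexicon term; otherwise a learned model decides, and terms seen only in relevant training tweets are later added to the lexicon.

```python
import math
import re
from collections import Counter

# Hypothetical seed lexicon; the paper's actual lexicon is not given in the abstract.
SEED_LEXICON = {"nifty", "sensex", "stock", "shares"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class HybridClassifier:
    """Lexicon lookup first, smoothed Naive Bayes as the fallback (illustrative only)."""

    def __init__(self, lexicon):
        self.lexicon = set(lexicon)
        self.counts = {True: Counter(), False: Counter()}  # token counts per class
        self.docs = {True: 0, False: 0}                    # tweet counts per class

    def fit(self, tweets, labels):
        for text, label in zip(tweets, labels):
            self.counts[label].update(tokenize(text))
            self.docs[label] += 1

    def _log_odds(self, tokens):
        # Log-odds of "relevant" versus "irrelevant" with add-one smoothing.
        vocab = set(self.counts[True]) | set(self.counts[False])
        total = {c: sum(self.counts[c].values()) for c in (True, False)}
        score = math.log((self.docs[True] + 1) / (self.docs[False] + 1))
        for tok in tokens:
            p_rel = (self.counts[True][tok] + 1) / (total[True] + len(vocab))
            p_irr = (self.counts[False][tok] + 1) / (total[False] + len(vocab))
            score += math.log(p_rel / p_irr)
        return score

    def predict(self, text):
        tokens = tokenize(text)
        if self.lexicon.intersection(tokens):  # lexicon-based step
            return True
        return self._log_odds(tokens) > 0      # machine learning fallback

    def grow_lexicon(self, min_count=2):
        # Adaptive step: add tokens seen repeatedly in relevant tweets and never in irrelevant ones.
        for tok, n in self.counts[True].items():
            if n >= min_count and self.counts[False][tok] == 0:
                self.lexicon.add(tok)


# Toy training data, invented for this example.
tweets = [
    "Sensex rallies as bank shares gain",
    "Nifty ends higher, dividend announced",
    "Infosys dividend lifts stock",
    "Great cricket match today",
    "Loved the new movie today",
]
labels = [True, True, True, False, False]

clf = HybridClassifier(SEED_LEXICON)
clf.fit(tweets, labels)
lexicon_before = set(clf.lexicon)
clf.grow_lexicon()
```

After `grow_lexicon()` runs on this toy data, the term `dividend` joins the lexicon because it appears twice in relevant tweets and never in irrelevant ones, so later tweets mentioning it are caught by the lexicon step alone.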
Notes
1. Natalie Hockham makes this point in her talk "Machine learning with imbalanced data sets", which focuses on imbalance in the context of credit card fraud detection.
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Malakar, S., Goswami, S., Chakrabarti, A., Chakraborty, B. (2020). A Hybrid and Adaptive Approach for Classification of Indian Stock Market-Related Tweets. In: Sharma, N., Chakrabarti, A., Balas, V. (eds) Data Management, Analytics and Innovation. Advances in Intelligent Systems and Computing, vol 1016. Springer, Singapore. https://doi.org/10.1007/978-981-13-9364-8_24
DOI: https://doi.org/10.1007/978-981-13-9364-8_24
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9363-1
Online ISBN: 978-981-13-9364-8
eBook Packages: Engineering, Engineering (R0)