Classification of Short Text Using Various Preprocessing Techniques: An Empirical Evaluation

Keerthi Kumar, H. M.; Harish, B. S.

doi:10.1007/978-981-10-8633-5_3


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 709)

Abstract

In recent decades, microblogs have generated large volumes of data in the form of short text. Twitter has been one of the most widely used microblogging sites. Because tweets are short, Twitter data are noisy and need to be preprocessed before the sentiment expressed by the user can be determined accurately. The major challenges in short texts are noisy elements such as URLs, misspellings, slang words, repeated characters, and punctuation. To handle these challenges, this paper proposes combining various preprocessing techniques with different classification methods as a tool for Twitter sentiment analysis. We evaluate the effect of noisy elements such as URLs, hashtags, negations, repeated characters, punctuation, stopwords, and stemming. We use an n-gram representation model to find the bindings and then apply support vector machine (SVM) and K-nearest neighbors (KNN) multi-class classifiers for sentiment classification. Experiments are conducted on the Stanford Twitter Sentiment Dataset to observe the effect of the various preprocessing techniques. Extensive experimental results are presented to show the effect of these techniques on short-text classification.
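The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stopword list, the crude suffix-stripping stemmer (a stand-in for a real stemmer such as Porter's), the cosine similarity used for nearest-neighbor voting, and the toy training tweets are all assumptions made for illustration. The SVM step is omitted to keep the example free of external libraries.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny stopword list; negation words are kept on purpose.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "it", "i", "this"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
NT_RE = re.compile(r"(\w+)n't")
REPEAT_RE = re.compile(r"([a-z])\1{2,}")  # a letter repeated three or more times
PUNCT_RE = re.compile(r"[^\w\s]")


def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Apply the preprocessing steps named in the abstract, in one fixed order."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)                      # drop URLs
    text = HASHTAG_RE.sub(r"\1", text)                # keep the hashtag word, drop '#'
    text = text.replace("can't", "can not").replace("won't", "will not")
    text = NT_RE.sub(r"\1 not", text)                 # expand remaining n't negations
    text = REPEAT_RE.sub(r"\1\1", text)               # 'loooove' -> 'loove'
    text = PUNCT_RE.sub(" ", text)                    # drop punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]


def ngrams(tokens, n):
    """Return contiguous word n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, max_n=2):
    """Bag of unigrams up to max_n-grams."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def cosine(a, b):
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, tweet, k=3):
    """Majority vote among the k training examples most similar to the tweet."""
    query = ngram_counts(preprocess(tweet))
    ranked = sorted(train, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy labeled tweets (invented for illustration, not from the dataset).
train = [
    (ngram_counts(preprocess("I love this phone")), "positive"),
    (ngram_counts(preprocess("I hate this phone")), "negative"),
]
print(preprocess("I loooove this phone!!! Can't wait http://t.co/abc #happy"))
print(knn_predict(train, "love it sooo much", k=1))
```

Each preprocessing step is a separate substitution, so a step can be switched off individually to measure its effect on classification accuracy, which is the kind of comparison the chapter reports.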

Author information

Authors and Affiliations

JSS Research Foundation, Mysuru, Karnataka, India
H. M. Keerthi Kumar
Sri Jayachamarajendra College of Engineering, Mysuru, Karnataka, India
B. S. Harish

Corresponding author

Correspondence to H. M. Keerthi Kumar.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, National Institute of Technology, Rourkela, Rourkela, Odisha, India
Pankaj Kumar Sa
Department of Computer Science and Engineering, National Institute of Technology, Rourkela, Rourkela, Odisha, India
Sambit Bakshi
Department of Computer Engineering and Informatics, University of Patras, Patras, Greece
Ioannis K. Hatzilygeroudis
Department of Computer Science and Engineering, National Institute of Technology, Rourkela, Rourkela, Odisha, India
Manmath Narayan Sahoo

About this paper

Cite this paper

Keerthi Kumar, H.M., Harish, B.S. (2018). Classification of Short Text Using Various Preprocessing Techniques: An Empirical Evaluation. In: Sa, P., Bakshi, S., Hatzilygeroudis, I., Sahoo, M. (eds) Recent Findings in Intelligent Computing Techniques. Advances in Intelligent Systems and Computing, vol 709. Springer, Singapore. https://doi.org/10.1007/978-981-10-8633-5_3

DOI: https://doi.org/10.1007/978-981-10-8633-5_3
Published: 04 November 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8632-8
Online ISBN: 978-981-10-8633-5
eBook Packages: Engineering, Engineering (R0)
