
Classification of Short Text Using Various Preprocessing Techniques: An Empirical Evaluation

  • Conference paper
Recent Findings in Intelligent Computing Techniques

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 709)

Abstract

In recent decades, microblogs have generated large volumes of data in the form of short text. Twitter is one of the most widely used microblogging sites. Because tweets are short, Twitter data are noisy and need to be preprocessed before the sentiment expressed by the user can be identified accurately. The major challenges in short texts are noisy elements such as URLs, misspellings, slang words, repeated characters, and punctuation. To handle these challenges, this paper combines various preprocessing techniques with different classification methods as a tool for Twitter sentiment analysis. We evaluate the effect of handling URLs, hashtags, negations, repeated characters, punctuation, stopwords, and stemming. We use an n-gram representation model to find the bindings and then apply support vector machine (SVM) and K-nearest neighbors (KNN) multi-class classifiers for sentiment classification. Experiments are conducted on the Stanford Twitter Sentiment Dataset to observe the effect of the various preprocessing techniques. Extensive experimental results are presented to show the effect of these techniques on short-text classification.
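The preprocessing steps named in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library. The regular expressions, the tiny stopword list, and the function names `preprocess` and `ngrams` are assumptions made for illustration, not the authors' implementation. Stemming and the SVM/KNN classifiers are omitted; the n-gram lists produced here are the kind of feature such classifiers would consume.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; the paper's list is not given here).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
             "in", "on", "at", "it", "this", "that", "i"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # URLs starting with http(s):// or www.
HASHTAG_RE = re.compile(r"#(\w+)")               # keep the hashtag word, drop the '#'
NEGATION_RE = re.compile(r"\b(\w+)n't\b")        # crude: "isn't" -> "is not"
REPEAT_RE = re.compile(r"(.)\1{2,}")             # three or more of the same character


def preprocess(tweet):
    """Return a list of cleaned tokens for one tweet (hypothetical helper)."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)              # remove URLs
    text = HASHTAG_RE.sub(r"\1", text)        # "#happy" -> "happy"
    text = NEGATION_RE.sub(r"\1 not", text)   # expand contracted negations
    text = REPEAT_RE.sub(r"\1\1", text)       # "looooove" -> "loove"
    # Strip ASCII punctuation, then drop stopwords.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = preprocess("I looooove this phone!!! http://t.co/abc #happy")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
```

Each step is a separate substitution, so individual techniques can be switched on or off to compare their effect, which is the kind of comparison the abstract describes.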



Author information

Corresponding author

Correspondence to H. M. Keerthi Kumar.


Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Keerthi Kumar, H.M., Harish, B.S. (2018). Classification of Short Text Using Various Preprocessing Techniques: An Empirical Evaluation. In: Sa, P., Bakshi, S., Hatzilygeroudis, I., Sahoo, M. (eds) Recent Findings in Intelligent Computing Techniques. Advances in Intelligent Systems and Computing, vol 709. Springer, Singapore. https://doi.org/10.1007/978-981-10-8633-5_3


  • DOI: https://doi.org/10.1007/978-981-10-8633-5_3


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-8632-8

  • Online ISBN: 978-981-10-8633-5

  • eBook Packages: Engineering, Engineering (R0)
