Data Representation in Machine Learning-Based Sentiment Analysis of Customer Reviews

  • Ivan Shamshurin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6744)


In this paper, we consider the problem of extracting opinions from natural language texts, which is one of the tasks of sentiment analysis. We provide an overview of existing approaches to sentiment analysis, including supervised methods (Naive Bayes, maximum entropy, and SVM) and unsupervised machine learning methods. We apply three supervised learning methods (Naive Bayes, KNN, and a method based on the Jaccard index) to a dataset of Internet user reviews about cars and report the results. When learning a user's opinion on a specific feature of a car, such as speed or comfort, training on full unprocessed reviews turns out to decrease classification accuracy. We experiment with different approaches to preprocessing reviews in order to obtain representations that are relevant to the feature one wants to learn, and we show the effect of each representation on classification accuracy.
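
To make the idea of a feature-relevant representation concrete, the following is a minimal Python sketch, not the implementation used in the paper. It assumes that a review is represented as a set of lowercase word tokens, that preprocessing keeps only the sentences mentioning a feature-related word, and that the Jaccard-based method is a nearest-neighbor vote over Jaccard similarity. The function names, the feature word list, and the toy reviews are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def keep_relevant_sentences(review, feature_words):
    """Keep only the sentences that mention at least one feature word.

    This is one assumed preprocessing step, not the paper's procedure.
    """
    sentences = re.split(r"(?<=[.!?])\s+", review.strip())
    kept = [s for s in sentences if tokenize(s) & feature_words]
    return " ".join(kept)


def classify(review, training, k=3):
    """Return the majority label among the k most Jaccard-similar examples."""
    tokens = tokenize(review)
    ranked = sorted(
        training,
        key=lambda example: jaccard(tokens, tokenize(example[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data for the "comfort" feature of a car.
comfort_words = {"seats", "comfort", "comfortable", "uncomfortable", "ride"}
training = [
    ("comfortable seats and a smooth ride", "positive"),
    ("seats are soft and comfortable", "positive"),
    ("awful seats, very uncomfortable", "negative"),
    ("the ride is harsh and uncomfortable", "negative"),
]

full_review = "The engine is great. The seats are awful and uncomfortable."
reduced = keep_relevant_sentences(full_review, comfort_words)
label = classify(reduced, training, k=3)
print(reduced)
print(label)
```

In this sketch the sentence about the engine is dropped before classification, so words that describe other features of the car do not influence the similarity score for comfort.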


Keywords: Supervised Learning, Unsupervised Learning, Sentiment Analysis, K-nearest Neighbor, Naive Bayes method, Jaccard index



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Ivan Shamshurin (1)
  1. School of Applied Mathematics and Informatics, National Research University – Higher School of Economics, Moscow, Russia
