
An Empirical Study of Skip-Gram Features and Regularization for Learning on Sentiment Analysis

Conference paper
Advances in Information Retrieval (ECIR 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9626)


Abstract

The problem of deciding the overall sentiment of a user review is usually treated as a text classification problem. The simplest machine learning setup for text classification uses a unigram bag-of-words feature representation of documents, and this has been shown to work well for a number of tasks such as spam detection and topic classification. However, the problem of sentiment analysis is more complex and not as easily captured with unigram (single-word) features. Bigram and trigram features capture some local context and short-distance negations, and thus outperform unigram bag-of-words features for sentiment analysis. But higher-order n-gram features are often overly specific and sparse, so they increase model complexity and do not generalize well.
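As a concrete illustration of the difference between unigram and higher-order n-gram features (a minimal sketch, not from the paper; the toy reviews and the use of scikit-learn's CountVectorizer are assumptions), a bigram vocabulary keeps a short negation such as "not good" as a single feature, while a unigram vocabulary splits it apart:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was not good", "the movie was good"]

# Unigram bag-of-words: "not" and "good" become independent features,
# so both reviews share the feature "good" despite opposite sentiment.
unigrams = CountVectorizer(ngram_range=(1, 1))
print(unigrams.fit(docs).get_feature_names_out())

# Unigrams plus bigrams: "not good" survives as a single feature that
# only the negative review activates.
uni_bi = CountVectorizer(ngram_range=(1, 2))
print(uni_bi.fit(docs).get_feature_names_out())
```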

In this paper, we perform an empirical study of skip-gram features for large-scale sentiment analysis. We demonstrate that skip-grams can be used to improve sentiment analysis performance in a model-efficient and scalable manner via regularized logistic regression. The feature sparsity problem associated with higher-order n-grams can be alleviated by grouping similar n-grams into a single skip-gram: for example, “waste time” could match the n-gram variants “waste of time”, “waste my time”, “waste more time”, “waste too much time”, “waste a lot of time”, and so on. To promote model efficiency and prevent overfitting, we demonstrate the utility of logistic regression incorporating both L1 regularization (for feature selection) and L2 regularization (for weight distribution).
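The two ingredients above, skip-gram features that allow gaps between the words of an n-gram and logistic regression with both L1 and L2 penalties, can be sketched as follows. This is a minimal illustration under assumed details, not the authors' pyramid implementation: the skipgrams helper, the max_skip window, the toy data, and scikit-learn's elastic-net solver are all stand-ins.

```python
from itertools import combinations
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def skipgrams(tokens, n=2, max_skip=3):
    """All n-token subsequences whose positions lie within a small window,
    so "waste of time" and "waste my time" both yield the feature "waste time"."""
    feats = []
    for idx in combinations(range(len(tokens)), n):
        if idx[-1] - idx[0] <= max_skip:
            feats.append(" ".join(tokens[i] for i in idx))
    return feats

docs = ["waste of time", "waste my time", "well worth my time", "great movie"]
labels = [0, 0, 1, 1]  # toy sentiment labels

# Map each document to its bag of skip-gram features.
vec = CountVectorizer(analyzer=lambda d: skipgrams(d.split()))
X = vec.fit_transform(docs)

# Logistic regression with a mix of L1 (feature selection) and
# L2 (weight shrinkage) penalties, i.e. an elastic-net penalty.
# Note: liblinear-style solvers parameterize strength as C = 1/lambda.
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=1.0, max_iter=5000)
clf.fit(X, labels)
print(dict(zip(vec.get_feature_names_out(), clf.coef_[0].round(2))))
```

With this grouping, “waste of time” and “waste my time” both activate the same “waste time” feature, which is exactly the reduction in feature sparsity the abstract describes.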


Notes

  1. On the IMDB dataset, skip-grams perform worse than word vectors on the predefined test set, but better on randomly sampled test sets, as discussed in Sect. 3.

  2. The LibLinear package that we use employs a different notation; there \(C = 1/\lambda\).

  3. Our code is publicly available at https://github.com/cheng-li/pyramid.

  4. The paragraph vector implementation is from https://github.com/klb3713/sentence2vec/. The parameters we use are size=400, alpha=0.025, window=10, min_count=5, sample=0, seed=1, min_alpha=0.0001, sg=1, hs=1, negative=0, cbow_mean=0.

  5. After producing paragraph vectors, we run LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) with c=32, g=0.0078; an RBF kernel performs better than a linear kernel (a rough sketch of this baseline pipeline appears after these notes).

  6. The training parameters are the same as for IMDB.
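Footnotes 4 and 5 describe the paragraph-vector baseline: train paragraph (document) vectors with the listed parameters, then classify them with an RBF-kernel SVM (c=32, g=0.0078). The sketch below is an assumed reconstruction of that pipeline, substituting gensim's Doc2Vec for the sentence2vec fork and scikit-learn's SVC (a LIBSVM wrapper) for the LIBSVM command-line tools; the class names and the dm/epochs/min_count settings are assumptions, not the authors' exact configuration.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import SVC

# Toy reviews standing in for the IMDB / Amazon corpora.
train_texts = ["total waste of time", "great movie loved it"]
train_labels = [0, 1]

corpus = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(train_texts)]

# Parameters mirroring footnote 4 (size=400, alpha=0.025, window=10,
# sample=0, seed=1, min_alpha=0.0001, hs=1, negative=0); min_count is
# lowered from 5 to 1 so this tiny corpus yields vectors at all, and
# dm=0 (PV-DBOW) is an assumed analogue of the fork's sg=1 setting.
model = Doc2Vec(corpus, vector_size=400, alpha=0.025, window=10,
                min_count=1, sample=0, seed=1, min_alpha=0.0001,
                hs=1, negative=0, dm=0, epochs=20)
X = [model.dv[i] for i in range(len(train_texts))]

# Footnote 5: RBF-kernel SVM with c=32, g=0.0078.
clf = SVC(kernel="rbf", C=32, gamma=0.0078)
clf.fit(X, train_labels)
print(clf.predict([model.infer_vector("waste of my time".split())]))
```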


Acknowledgments

The research is supported by NSF grant IIS-1421399.

Author information

Corresponding author: Cheng Li.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Li, C., Wang, B., Pavlu, V., Aslam, J.A. (2016). An Empirical Study of Skip-Gram Features and Regularization for Learning on Sentiment Analysis. In: Ferro, N., et al. Advances in Information Retrieval. ECIR 2016. Lecture Notes in Computer Science, vol. 9626. Springer, Cham. https://doi.org/10.1007/978-3-319-30671-1_6

  • DOI: https://doi.org/10.1007/978-3-319-30671-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30670-4

  • Online ISBN: 978-3-319-30671-1
