Abstract
Deciding the overall sentiment of a user review is usually treated as a text classification problem. The simplest machine learning setup for text classification uses a unigram bag-of-words representation of documents, which has been shown to work well for tasks such as spam detection and topic classification. Sentiment analysis, however, is more complex and not as easily captured with unigram (single-word) features. Bigram and trigram features capture some local context and short-distance negations, and thus outperform unigram bag-of-words features for sentiment analysis. But higher-order n-gram features are often overly specific and sparse, so they increase model complexity and do not generalize well.
In this paper, we perform an empirical study of skip-gram features for large-scale sentiment analysis. We demonstrate that skip-grams can improve sentiment analysis performance in a model-efficient and scalable manner via regularized logistic regression. The feature sparsity problem associated with higher-order n-grams can be alleviated by grouping similar n-grams into a single skip-gram: for example, “waste time” could match the n-gram variants “waste of time”, “waste my time”, “waste more time”, “waste too much time”, “waste a lot of time”, and so on. To promote model efficiency and prevent overfitting, we demonstrate the utility of logistic regression incorporating both L1 regularization (for feature selection) and L2 regularization (for weight distribution).
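To make these two ideas concrete, here is a minimal sketch of how skip-gram features and elastic-net (L1 + L2) regularized logistic regression could be combined. It is ours, not the authors' pyramid implementation linked in the notes below; the skip_bigrams helper, the use of scikit-learn, and all parameter values are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation): extract 2-word
# skip-gram features with up to `max_skip` intervening words, then fit an
# elastic-net (L1 + L2) regularized logistic regression with scikit-learn.
from itertools import combinations

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def skip_bigrams(text, max_skip=3):
    """Yield word pairs with at most `max_skip` words between them, so that
    "waste of time", "waste my time", "waste a lot of time" all produce the
    single feature "waste time"."""
    tokens = text.lower().split()
    for i, j in combinations(range(len(tokens)), 2):
        if j - i - 1 <= max_skip:
            yield f"{tokens[i]} {tokens[j]}"


# A callable analyzer lets CountVectorizer delegate feature extraction to the
# skip-gram generator; binary=True records presence/absence of each feature.
vectorizer = CountVectorizer(analyzer=lambda doc: list(skip_bigrams(doc)), binary=True)

# l1_ratio trades off L1 (feature selection) against L2 (weight distribution);
# C = 1/lambda controls the overall regularization strength, as in LIBLINEAR.
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=1.0, max_iter=1000)

model = make_pipeline(vectorizer, clf)

# Toy data for illustration only.
docs = ["a total waste of my time", "worth every minute of my time"]
labels = [0, 1]  # 0 = negative, 1 = positive
model.fit(docs, labels)
print(model.predict(["do not waste any more time on this"]))
```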
Notes
- 1. On the IMDB dataset, skip-grams perform worse than word vectors on the predefined test set, but better on randomly sampled test sets, as discussed in Sect. 3.
- 2. The LIBLINEAR package that we use adopts a different notation; there \(C=1/\lambda \).
- 3. Our code is publicly available at https://github.com/cheng-li/pyramid.
- 4. The paragraph vector implementation is from https://github.com/klb3713/sentence2vec/. The parameters we use are size=400, alpha=0.025, window=10, min_count=5, sample=0, seed=1, min_alpha=0.0001, sg=1, hs=1, negative=0, cbow_mean=0.
- 5. After producing paragraph vectors, we run LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) with c=32, g=0.0078. An RBF kernel performs better than a linear kernel (a usage sketch follows these notes).
- 6. The training parameters are the same as in IMDB.
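The sketch below is a rough re-creation of the pipeline in notes 4 and 5, written with gensim's Doc2Vec and scikit-learn's SVC rather than the original sentence2vec fork and LIBSVM binaries. The mapping of the fork's word2vec-style options onto Doc2Vec arguments (e.g. sg=1 to dm=0, size to vector_size), the gensim 4.x API, and the toy corpus are our assumptions.

```python
# Rough re-creation (assumptions, not the fork used in the paper): train
# paragraph vectors with gensim's Doc2Vec, then classify them with an
# RBF-kernel SVM using the C and gamma values reported in notes 4-5.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import SVC

train_docs = [("a total waste of my time", 0), ("worth every minute", 1)]  # toy data

tagged = [TaggedDocument(words=text.lower().split(), tags=[i])
          for i, (text, _) in enumerate(train_docs)]

model = Doc2Vec(tagged,
                vector_size=400,    # note 4: size=400
                window=10,
                min_count=1,        # paper uses 5; lowered so the toy corpus is not pruned away
                alpha=0.025,
                min_alpha=0.0001,
                seed=1,
                hs=1, negative=0,   # hierarchical softmax, no negative sampling
                dm=0)               # assumed counterpart of sg=1 (PV-DBOW)

# One learned paragraph vector per training document (gensim 4.x API).
X = [model.dv[i] for i in range(len(tagged))]
y = [label for _, label in train_docs]

clf = SVC(kernel="rbf", C=32, gamma=0.0078)  # note 5: c=32, g=0.0078
clf.fit(X, y)
```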
References
Dahl, G.E., Adams, R.P., Larochelle, H.: Training restricted Boltzmann machines on word observations. arXiv preprint (2012). arxiv:1202.5695
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., Lin, C.-J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
Fernández, J., Gutiérrez, Y., Gómez, J.M., Martínez-Barco, P.: GPLSI: supervised sentiment analysis in Twitter using skipgrams. In: SemEval 2014, pp. 294–299 (2014)
Friedman, J., Hastie, T., Tibshirani, R.: glmnet: Lasso and elastic-net regularized generalized linear models. R package version, 1 (2009)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)
Guthrie, D., Allison, B., Liu, W., Guthrie, L., Wilks, Y.: A closer look at skip-gram modelling. In: LREC-2006, pp. 1–4 (2006)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning, vol. 2. Springer, New York (2009)
König, A.C., Brill, E.: Reducing the human overhead in text categorization. In: KDD, pp. 598–603. ACM (2006)
Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. arXiv preprint (2014). arxiv:1405.4053
Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: ACL 2011, pp. 142–150. Association for Computational Linguistics (2011)
Massung, S., Zhai, C., Hockenmaier, J.: Structural parse tree features for text representation. In: ICSC, pp. 9–16. IEEE (2013)
McAuley, J., Leskovec, J.: Hidden factors and hidden topics: understanding rating dimensions with review text. In: Proceedings of the 7th ACM Conference on Recommender Systems, pp. 165–172. ACM (2013)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint (2013). arxiv:1301.3781
Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, vol. 10, pp. 79–86. Association for Computational Linguistics (2002)
Paskov, H.S., West, R., Mitchell, J.C., Hastie, T.: Compressive feature learning. In: NIPS, pp. 2931–2939 (2013)
Wager, S., Wang, S., Liang, P.S.: Dropout training as adaptive regularization. In: NIPS, pp. 351–359 (2013)
Wang, L., Zhu, J., Zou, H.: The doubly regularized support vector machine. Statistica Sinica 16(2), 589 (2006)
Wang, S.I., Manning, C.D.: Baselines and bigrams: simple, good sentiment and topic classification. In: Proceedings of the ACL, pp. 90–94 (2012)
Wiegand, M., Balahur, A., Roth, B., Klakow, D., Montoyo, A.: A survey on the role of negation in sentiment analysis. In: Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, pp. 60–68. Association for Computational Linguistics (2010)
Acknowledgments
This research is supported by NSF grant IIS-1421399.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Li, C., Wang, B., Pavlu, V., Aslam, J.A. (2016). An Empirical Study of Skip-Gram Features and Regularization for Learning on Sentiment Analysis. In: Ferro, N., et al. Advances in Information Retrieval. ECIR 2016. Lecture Notes in Computer Science, vol 9626. Springer, Cham. https://doi.org/10.1007/978-3-319-30671-1_6
DOI: https://doi.org/10.1007/978-3-319-30671-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30670-4
Online ISBN: 978-3-319-30671-1
eBook Packages: Computer Science