Abstract
Opinion mining on microblogs is of significance because microblogging websites have attracted many users to share their experiences and express their opinions on a variety of topics. However, conventional opinion mining methods focus mainly on the sentiment of texts and ignore opinion targets. This paper focuses on a fine-grained opinion mining task that jointly extracts opinion targets and the corresponding sentiments by sequence labeling. We propose a convolutional neural network (CNN)-based sequence labeling method and apply it to fine-grained opinion mining of microblogs. We empirically evaluate neural networks with different filter lengths and depths and analyze the boundary of contextual feature extraction for opinion mining of microblogs. The experimental results demonstrate that the proposed CNN-based methods outperform RNN-based methods in both effectiveness and efficiency.
1 Introduction
Microblogging websites, such as Twitter and Sina Weibo, have attracted a large number of users to express their opinions on a variety of topics, making them invaluable sources of public opinion. Many researchers have investigated how to capture microblog users’ opinions on products, services, and public figures.
Conventional opinion mining methods mainly focus on sentiment classification of microblogs [1,2,3], which assigns a sentiment score or sentiment polarity to represent the opinion expressed in a microblog. However, sentiment classification-based opinion mining may not meet the demands of fine-grained opinion mining because it ignores opinion targets. Moreover, it may encounter problems when a message expresses different opinions toward different targets, or when the sentiment toward a target differs from the overall sentiment of the message. Therefore, this paper focuses on a fine-grained opinion mining task, i.e., sentiment parsing, which aims to jointly extract opinion targets and the corresponding sentiments [4].
Sentiment parsing aims to extract all \(\langle T,S \rangle \) tuples from microblogging messages, where T denotes a target and S the sentiment toward target T. Sentiment parsing needs to find all targets and sentiments as well as determine their relationships. It has been tackled as a sequence labeling problem in previous work [4]. This approach views a microblog sentence as a sequence of tokens labeled with the “PNO” tagging scheme: P denotes that the token is inside an opinion target toward which the sentiment is positive; N indicates a token inside an opinion target toward which the sentiment is negative; and O is used for all other tokens in the sentence. An example sentence and the corresponding labels are shown in Table 1; the labels indicate that the sentence expresses positive sentiment toward “Russell Westbrook” and negative sentiment toward “Steph Curry”.
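The “PNO” scheme and the recovery of \(\langle T,S \rangle \) tuples from a label sequence can be illustrated with a short sketch. The tokenized English sentence and the helper function below are hypothetical illustrations, not taken from the paper:

```python
# Hypothetical illustration of the "PNO" tagging scheme: each token is
# labeled P (inside a positively-evaluated target), N (inside a
# negatively-evaluated target), or O (any other token).
sentence = ["Russell", "Westbrook", "played", "great", ",", "but",
            "Steph", "Curry", "was", "awful", "."]
labels   = ["P", "P", "O", "O", "O", "O",
            "N", "N", "O", "O", "O"]

def extract_tuples(tokens, tags):
    """Recover <target, sentiment> tuples from a PNO label sequence."""
    tuples, current, sentiment = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag in ("P", "N"):
            if current and tag != sentiment:   # boundary when the tag flips
                tuples.append((" ".join(current), sentiment))
                current = []
            current.append(tok)
            sentiment = tag
        else:
            if current:
                tuples.append((" ".join(current), sentiment))
                current, sentiment = [], None
    if current:
        tuples.append((" ".join(current), sentiment))
    return tuples
```

Running `extract_tuples(sentence, labels)` yields the two opinion tuples described above.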
The convolutional neural network (CNN) is well known for its capability of capturing contextual information and has been successfully applied to a variety of natural language processing tasks, such as character-level word embedding [5,6,7], text classification [8,9,10], sentiment analysis [11, 12], machine translation [13], and Web search [14]. However, because the sequence length always decreases through CNN layers, CNNs are rarely used in word-level sequence labeling tasks. This motivates us to propose a CNN-based sequence labeling method and explore the application of CNNs to the task of sentiment parsing. To evaluate the proposed method, we compare it with RNN-based sequence labeling methods, and experimental results show that the proposed method is better in both effectiveness and efficiency.
2 Related Work
Opinion mining of microblogs. Microblogging websites are invaluable sources of public opinion, and many studies have addressed opinion mining of microblogs. Early studies usually build a sentiment lexicon and calculate a sentiment score for each microblog message. O’Connor et al. [15] calculate a sentiment score for each tweet and summarize the scores of tweets mentioning the candidates to predict approval ratings in elections. Bollen et al. [1] use a sentiment lexicon to determine the ratio of positive versus negative tweets on a given day and apply it to stock market prediction. Learning-based approaches have also been applied to opinion mining of microblogs. Kumaresan [2] proposes a hybrid architecture for Twitter sentiment classification that combines random forest, SVM, and naive Bayes classifiers. Hu et al. [3] take social relations into consideration and determine the sentiment of tweets from both the text and the social relations of the user. Bravo-Marquez et al. [16] combine strengths, emotions, and polarities for sentiment analysis of Twitter. However, these studies focus mainly on the sentiment of microblogs and ignore opinion targets. Therefore, in previous work [4] we defined a fine-grained opinion mining task of microblogs, i.e., sentiment parsing, and applied RNNs to the task. This paper focuses on the sentiment parsing task as well and attempts to improve the performance of sequence labeling.
CNN in NLP. Owing to their capability of capturing local correlations of spatial or temporal structures, CNNs have been successfully applied to many NLP tasks. Some studies show that CNNs are effective at grasping morphological information and apply them to generate character-level word embeddings [5,6,7]. These studies combine conventional word-level embeddings, character-level embeddings generated by CNNs, and additional word-level features to construct the features of each word, which serve as input to higher-level neural networks for different NLP tasks. At the word level, many studies employ CNNs for text modeling and further exploit the text features in document-level and sentence-level NLP tasks, such as text classification [8,9,10], sentiment analysis [11, 12], machine translation [13], and Web search [14]. However, because the dimension of the features always decreases sharply through CNN layers, few works utilize CNNs for sequence labeling in NLP. Xu et al. [17] use a CNN layer to learn word features in a window context and employ a TriCRF layer for slot filling and intent detection, achieving state-of-the-art results in both tasks. Therefore, this paper attempts to apply CNNs to the sequence labeling problem for sentiment parsing of microblogs.
3 Methodology
This section introduces the neural network architecture and explains how to jointly extract opinion targets and sentiment polarities with CNN-based sequence labeling. As shown in Fig. 1, the architecture contains three kinds of layers: the embedding layer projects words into fixed-length vectors; the convolution layers extract features of each word in the sentence; and the labeling layer predicts the label of each word from the features produced by the convolution layers.
3.1 Word Embedding
The word embedding layer represents each word with a vector so that words can be processed by higher-level layers. Bengio et al. [18] suggest that jointly learning the representation (word embedding) and the language model is very useful. Collobert et al. [19] point out that word embeddings pre-trained on large unlabeled datasets are useful for different tasks, and they released their word embeddings trained on Wikipedia. In recent years, word embeddings have been commonly used in most neural network-based natural language processing tasks, and different training models for word embeddings, such as Word2Vec [20] and GloVe [21], have been proposed. Specifically, let D represent the token dictionary of a dataset; a word embedding \(E\in {{R}^{d\times |D|}}\) represents each token \(t\in D\) with a fixed-length vector \({{e}_{t}}\in {{R}^{d}}\). The embedding matrix E is usually pre-trained on a large unsupervised dataset and treated as a group of parameters in the task-specific training process. Lai et al. [22] analyzed different corpora and embedding models, pointing out that corpus domain is more important than corpus size. Therefore, this paper trains word embeddings on a microblog corpus with Word2Vec [20].
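As a minimal sketch of the embedding lookup, with a toy dictionary and random vectors standing in for the Word2Vec-pretrained matrix E (all names and sizes here are illustrative):

```python
import numpy as np

# Minimal sketch of the embedding layer. E has one column of dimension d
# per token in the dictionary D, matching the paper's notation E in R^{d x |D|}.
d = 4                                   # embedding dimension (illustrative)
D = ["<pad>", "Westbrook", "great", "plays"]
token_index = {t: i for i, t in enumerate(D)}
rng = np.random.default_rng(0)
E = rng.standard_normal((d, len(D)))    # random stand-in for pretrained vectors

def embed(tokens):
    """Project a token sequence to a sequence of d-dimensional vectors."""
    return np.stack([E[:, token_index[t]] for t in tokens], axis=0)

X = embed(["Westbrook", "plays", "great"])   # shape (sentence length, d)
```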
3.2 Convolution
In the sentiment parsing task, the label of a word in a sentence is determined by the meaning of the word as well as its contextual information. The word embedding layer expresses the general meaning of each word with a vector, and the convolution layers aim to extract the contextual information of each word.
Generally, a CNN has two kinds of operations, i.e., convolution and pooling. The convolution operation extracts stable contextual features by sliding fixed-length windows over the sequence. Each window, usually called a filter, extracts one type of contextual feature at different locations with the same weights. The pooling operation aggregates features over a region by calculating the maximum or mean value of the features in the region. Multiple layers of alternating convolution and pooling operations can extract features at different scales. For sequence labeling tasks, however, the length of the output sequence usually needs to be the same as that of the input sequence, while both convolution and pooling reduce the sequence length; this is why CNNs are rarely used in sequence labeling tasks. To keep the sequence length unchanged, this paper discards the pooling operation, because it reduces the sequence length sharply, and adds paddings to the beginning and end of each sentence according to the filter size and the number of layers.
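The padding strategy can be sketched as follows; this is a NumPy illustration with made-up shapes, where \(e_p\) denotes the embedding of the padding token:

```python
import numpy as np

# Sketch of "same-length" convolution via padding: pooling is discarded
# and (m - 1) / 2 padding vectors are prepended and appended per layer,
# so the output sequence keeps the input length.
def pad_sequence(X, m, e_pad):
    """Pad an (s, d) embedding sequence for a filter of odd length m."""
    p = (m - 1) // 2
    pad = np.tile(e_pad, (p, 1))
    return np.vstack([pad, X, pad])

s, d, m = 6, 4, 5
X = np.ones((s, d))                    # dummy sentence of s embeddings
e_pad = np.zeros((1, d))               # embedding of the padding token
X_padded = pad_sequence(X, m, e_pad)
# A length-m sliding window over X_padded yields exactly s positions:
n_windows = X_padded.shape[0] - m + 1
```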
For a sentence \(S=\{{{t}_{1}},{{t}_{2}},...,{{t}_{s}}\}\), each token \({{t}_{i}}\) has been projected to a vector \({{e}_{{{t}_{i}}}}\). In a convolution layer, suppose m is the filter length and n is the number of filters; then the filter input of token \({{t}_{i}}\) is

\(X_{i} = [e_{t_{i-(m-1)/2}}; \ldots; e_{t_{i}}; \ldots; e_{t_{i+(m-1)/2}}] \in {R}^{m\times d},\)

where m is set to be odd to avoid bias. In \({X}_{i}\), when \(k<1\) or \(k>s\), \(e_{t_{k}}=e_{p}\), where \(e_{p}\) is the embedding of the padding token. Let \({{W}_{j}}\in {{R}^{m\times d}}\) be the weight matrix of the jth filter and \({{b}_{j}}\) the bias of the jth filter; the feature of token \({{t}_{i}}\) under the jth filter is

\(c_{i,j} = f\left(\sum (W_{j}\circ X_{i}) + b_{j}\right),\)

where \(\circ \) denotes element-wise multiplication (the sum runs over all \(m\times d\) entries) and f is a nonlinear activation function. The feature of \({{t}_{i}}\) after the convolution layer is then the concatenation over all n filters, \(c_{i} = (c_{i,1}, c_{i,2}, \ldots, c_{i,n})\).
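A NumPy sketch of one filter bank applied to a single window may make the shapes concrete. The parameters are random and tanh is an assumed choice of activation, since the paper does not name f:

```python
import numpy as np

# One convolution step: c_{i,j} = f(sum(W_j o X_i) + b_j), with tanh as
# the (assumed) nonlinearity. X_i and each W_j have shape (m, d).
m, d, n = 3, 4, 8                       # filter length, embedding dim, #filters
rng = np.random.default_rng(1)
X_i = rng.standard_normal((m, d))       # window of embeddings around token t_i
W = rng.standard_normal((n, m, d))      # one (m, d) weight matrix per filter
b = rng.standard_normal(n)              # one bias per filter

# Feature of token t_i under every filter; the n values concatenated give
# the token's n-dimensional representation after this layer.
c_i = np.tanh((W * X_i).sum(axis=(1, 2)) + b)
```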
For multiple convolution layers, we use the same filter length for all layers in one model. We define the covered window size (CWS) as the number of tokens covered by the CNN layers when determining the label of a token. For example, in one convolution layer with filter length m, the covered window of token \(t_{i}\) is \([t_{i-(m-1)/2},...,t_{i},...,t_{i+(m-1)/2}]\) and the covered window size is m, which means that the label of \(t_{i}\) is determined only by the m tokens around it (including itself). As the depth of the convolution layers increases, each additional layer adds \((m-1)\) tokens to the covered window. Therefore, the covered window size is determined by the filter length m and the depth dep of the convolution layers:

\(CWS = dep \times (m-1) + 1.\)

The covered window size determines the boundary of contextual feature extraction and is thus a significant indicator of the capability of a model.
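The relation above is straightforward to compute; as a check, the best configuration reported in Sect. 4 (filter length 7, depth 2) covers 13 tokens:

```python
# The covered window size (CWS): each extra convolution layer widens the
# receptive field around a token by (m - 1) tokens.
def covered_window_size(m, dep):
    """Number of tokens visible to the network when labeling one token."""
    return dep * (m - 1) + 1

best = covered_window_size(7, 2)   # best model in Sect. 4: 13 tokens
```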
3.3 Labeling
As mentioned before, we use the “PNO” tagging scheme to formulate sentiment parsing as a sequence labeling problem. Let \(L=\{{{l}_{1}},{{l}_{2}},...,{{l}_{s}}\}\) denote the label sequence of sentence \(S=\{{{t}_{1}},{{t}_{2}},...,{{t}_{s}}\}\), where \({{l}_{i}} \in \{P,N,O\}\) is the label of \({{t}_{i}}\). In the labeling layer, we represent each label with a one-hot 3-dimensional vector: the label vector \(\hat{y}_{i}\) of token \({{t}_{i}}\) is \((1,0,0)\), \((0,1,0)\), or \((0,0,1)\) when \({{l}_{i}}\) is P, N, or O, respectively.
The labeling layer translates the output \(c_{i}\) of the convolution layers at each step into a three-dimensional vector and normalizes it with a softmax function:

\(y_{i} = \mathrm{softmax}(Wc_{i}+b),\)

where W is a weight matrix and b is a bias vector. The elements of \({y}_{i}\) sum to 1, and each element of \({y}_{i}\) can be seen as the probability of its corresponding label. For instance, the vector (0.6, 0.3, 0.1) denotes that the label of the corresponding token is P with probability 0.6, N with probability 0.3, and O with probability 0.1.
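A sketch of the labeling layer with illustrative random parameters (in training, W and b are learned):

```python
import numpy as np

# Labeling layer: project an n-dimensional convolution feature to three
# logits (one per label P, N, O) and normalize with softmax.
n = 8
rng = np.random.default_rng(2)
W = rng.standard_normal((3, n))
b = rng.standard_normal(3)

def label_probs(c_i):
    z = W @ c_i + b
    z = z - z.max()                     # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()                  # probabilities of P, N, O

y_i = label_probs(rng.standard_normal(n))
```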
3.4 Training and Prediction
In the training process, \(\theta =\{E, W_{*},b_{*}\}\) is the set of model parameters. The neural network predicts a label vector \({y}_{i}\) for each token \(t_{i}\) in each sentence, and we take the cross-entropy error of \({{y}_{i}}\) and \({{\hat{y}}_{i}}\) as the loss of token \({{t}_{i}}\):

\(loss(t_{i}) = -\sum_{k=1}^{3}{\hat{y}_{i,k}\log y_{i,k}}.\)
The loss value on the training dataset is the mean loss value over all tokens in the training dataset:

\(Loss = \frac{1}{N}\sum_{j=1}^{N}{\frac{1}{s_{j}}\sum_{i=1}^{s_{j}}{loss(t_{i}^{(j)})}},\)

where N is the number of sentences in the training dataset and \(s_{j}\) is the length of the jth sentence.
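The per-token cross-entropy can be checked numerically; this minimal sketch uses the example probability vector from Sect. 3.3:

```python
import numpy as np

# Cross-entropy between the predicted distribution y and the one-hot
# label vector y_hat, as in the per-token loss above.
def token_loss(y_hat, y):
    return -np.sum(y_hat * np.log(y))

y_hat = np.array([1.0, 0.0, 0.0])       # true label P
y = np.array([0.6, 0.3, 0.1])           # predicted distribution
loss = token_loss(y_hat, y)             # equals -log(0.6)
```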
In the prediction process, each token in a sentence receives a label vector through the neural network, and the index of the largest element of the output vector determines the predicted label of the token.
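The prediction step then reduces to an argmax over the three label probabilities (a minimal sketch):

```python
import numpy as np

# Prediction: the index of the largest probability selects the PNO label.
LABELS = ("P", "N", "O")

def predict(y_sequence):
    """Map a sequence of 3-dim probability vectors to PNO labels."""
    return [LABELS[int(np.argmax(y))] for y in y_sequence]

tags = predict([np.array([0.6, 0.3, 0.1]),
                np.array([0.1, 0.2, 0.7])])
```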
4 Experiment
4.1 Experiment Setting
We evaluate the proposed CNN-based method on a Chinese microblog dataset [4], which was collected from Sina Weibo and contains messages and replies on 5 controversial hot topics. The dataset has 67,033 unlabeled messages and 5,000 labeled sentences. Each labeled message has been annotated with the mentioned targets and the corresponding sentiments. We train word embeddings on the 67,033 unlabeled messages with Word2Vec [20] and use them as the initial word embeddings. We use the F-score of opinion tuples in labeled messages and the training time to evaluate the effectiveness and efficiency of different models, respectively.
RNN-based sequence labeling methods are taken as the baseline. Specifically, following the experiments of previous work [4], we compare the proposed method with bidirectional simple RNN (SRNN), long short-term memory (LSTM), and gated recurrent unit (GRU) networks at depths from 1 to 5. We explore four filter lengths (3, 5, 7, and 9) with seven depths of convolution layers (from 1 to 7) and compare the F-scores and training times with those of the RNN-based methods. All models are trained with the Adam optimizer [23]. We implement the neural networks using the Keras library, a highly modular neural network library, and run the models on an NVIDIA GeForce GTX 1080 GPU.
4.2 Result and Discussion
Effectiveness. The F-scores of different models are displayed in Fig. 2; different lines represent different models, and each line shows the F-scores at different depths. The solid lines are CNN-based methods and the dashed lines are RNN-based methods. As the depth increases, the F-scores of all models increase at first and tend to decrease after a certain depth. However, the CNN-based methods are more stable than the RNN-based methods: their F-scores do not decrease sharply after the best depth is reached. Most CNN-based models achieve better results than the RNN-based methods, and CNN-based models with filter lengths 5 and 7 are better than those with filter lengths 3 and 9. The best F-score is 0.631, achieved by the CNN-based method with two convolution layers and filter length 7; it exceeds the best F-score of the RNN-based methods, 0.622.
Efficiency. The training times of different models are shown in Fig. 3, where the line styles are the same as in Fig. 2. The time cost of the CNN-based methods is much less than that of the RNN-based methods: as the number of layers increases, the time cost of the CNN-based methods increases linearly, while that of the RNN-based methods increases exponentially. Moreover, the detailed view of the CNN-based methods in the upper-left corner shows that their training time is mainly determined by the number of CNN layers; when the filter length increases, the training time increases only slightly.
Covered window size. To relate performance to covered window size, we calculate the covered window size of each neural network and compare it with the length distribution of sentences and sub-sentences in the dataset. As shown in Fig. 4, the sub-figure at the top plots the F-scores of the neural networks against their covered window sizes. As the covered window extends, the performance first improves and then levels off: once the covered window size exceeds 10, the performance tends to stay steady; it peaks when the covered window size is near 20 and begins to degrade thereafter. The sub-figure at the bottom shows the cumulative probability distribution of the lengths of sentences and sub-sentences in the dataset: 85% of sub-sentences are shorter than 10 tokens and 98% are shorter than 20 tokens. Therefore, neural networks with a covered window size of 10 cover most sub-sentences when determining the label of a token, and those with a covered window size of 20 cover most sub-sentences together with their neighboring sub-sentences. This indicates that many opinions are not expressed within a single sub-sentence; to extract opinions accurately, we should consider more than one sub-sentence per opinion.
5 Conclusion
In this paper, we focus on a fine-grained sentiment analysis task of microblogs, i.e., sentiment parsing. We propose a CNN-based sequence labeling method and apply it to the sentiment parsing task. We empirically evaluate neural networks with different filter lengths and depths and analyze the influence of the covered window size of CNNs on opinion mining of microblogs. Experiments show that the proposed CNN-based methods perform better than RNN-based sequence labeling methods in both effectiveness and efficiency.
References
Bollen, J., Mao, H., Zeng, X.: Twitter mood predicts the stock market. J. Comput. Sci. 2(1), 1–8 (2011)
Kumaresan, R.: A hybrid approach for supervised twitter sentiment classification. Int. J. Comput. Sci. Bus. Inf. 7(1), 35 (2013)
Hu, X., Tang, L., Tang, J., Liu, H.: Exploiting social relations for sentiment analysis in microblogging. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pp. 537–546. ACM (2013)
Cheng, J., Zhang, X., Li, P., Zhang, S., Ding, Z., Wang, H.: Exploring sentiment parsing of microblogging texts for opinion polling on Chinese public figures. Appl. Intell. 1–14 (2016)
dos Santos, C.N., Gatti, M.: Deep convolutional neural networks for sentiment analysis of short texts. In: Proceedings of the 25th International Conference on Computational Linguistics (COLING), Dublin, Ireland (2014)
Chiu, J.P.C., Nichols, E.: Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308 (2015)
Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354 (2016)
Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
Johnson, R., Zhang, T.: Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058 (2014)
Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
Wu, H., Gu, Y., Sun, S., Gu, X.: Aspect-based opinion summarization with convolutional neural networks. arXiv preprint arXiv:1511.09128 (2015)
Tang, D., Qin, B., Liu, T.: Document modeling with gated recurrent neural network for sentiment classification. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1422–1432 (2015)
Meng, F., Lu, Z., Wang, M., Li, H., Jiang, W., Liu, Q.: Encoding source language with convolutional neural network for machine translation. arXiv preprint arXiv:1503.01838 (2015)
Shen, Y., He, X., Gao, J., Deng, L., Mesnil, G.: Learning semantic representations using convolutional neural networks for web search. In: Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion, pp. 373–374. International World Wide Web Conferences Steering Committee (2014)
O’Connor, B., Balasubramanyan, R., Routledge, B.R., Smith, N.A.: From tweets to polls: linking text sentiment to public opinion time series. In: Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM), pp. 122–129 (2010)
Bravo-Marquez, F., Mendoza, M., Poblete, B.: Combining strengths, emotions and polarities for boosting twitter sentiment analysis. In: Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining, p. 2. ACM (2013)
Xu, P., Sarikaya, R.: Joint intent detection and slot filling using convolutional neural networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (2014)
Bengio, Y., Schwenk, H., Senécal, J.S., Morin, F., Gauvain, J.L.: Neural probabilistic language models. In: Holmes, D.E., Jain, L.C. (eds.) Innovations in Machine Learning. Studies in Fuzziness and Soft Computing, vol. 194. Springer, Berlin (2006). doi:10.1007/3-540-33486-6_6
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546 (2013)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), vol. 12 (2014)
Lai, S., Liu, K., Xu, L., Zhao, J.: How to generate a good word embedding? arXiv preprint arXiv:1507.05523 (2015)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Acknowledgments
This research is supported by the National Natural Science Foundation of China (No. 71331008).
© 2017 Springer International Publishing AG
Cheng, J., Li, P., Zhang, X., Ding, Z., Wang, H. (2017). CNN-Based Sequence Labeling for Fine-Grained Opinion Mining of Microblogs. In: Kang, U., Lim, EP., Yu, J., Moon, YS. (eds) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2017. Lecture Notes in Computer Science(), vol 10526. Springer, Cham. https://doi.org/10.1007/978-3-319-67274-8_9
Print ISBN: 978-3-319-67273-1
Online ISBN: 978-3-319-67274-8