1 Introduction

Since the creation of Web 2.0 technology, information exchange through the internet has increased rapidly. This new technology gave the power of sharing information not only to the data manager, as its predecessor did, but also to the ordinary user of the web, which in turn led to the social media revolution. Social media gives people the opportunity to interact with each other directly and freely, allowing them to share news or information, express their feelings or opinions, comment on events or articles, or even form new relationships, both personal and professional. This flood of data in social media requires time and effort to read, evaluate, and analyze manually, pressing the need for an automated system that can extract valuable insights efficiently. Accordingly, this has led to the emergence of the research field of Sentiment Analysis (SA).

SA is concerned with classifying text according to the sentiment polarity it holds, i.e., positive, negative, or neutral. SA has many beneficial applications. For example, companies can use it to analyze customer comments and evaluate their satisfaction with the company’s products. This feedback provides valuable information that can inform their marketing strategies [1]. SA can also be used to determine the user’s desires and thus select appropriate advertisements based on the type of product the user has commented upon.

One approach to SA is based on using sentiment lexicons. Sentiment lexicons are compiled lists of words with their polarity (positive, negative) [2]. Sentiment intensity may also be provided; it indicates the strength with which the sentiment is conveyed. In previous work, AlTwairesh et al. [3] generated tweet-specific Arabic sentiment lexicons using two approaches, one of which utilizes the statistical measure Pointwise Mutual Information (PMI). In this paper, we use the same datasets used in [3], but propose two new statistical approaches that exploit the Entropy and Chi-Square measures. We then test and evaluate these lexicons and compare their results with the results of the PMI lexicons published in [3].

This paper is organized as follows: Sect. 2 reviews the related work on sentiment lexicon generation. Section 3 presents the details of the datasets used to generate the lexicons. Section 4 describes the new approaches used to generate the new lexicons. Section 5 details the intrinsic and extrinsic evaluation of the new lexicons, while Sect. 6 presents and discusses the results. Finally, we conclude the paper in Sect. 7.

2 Related Work

A sentiment lexicon contains words that are classified as positive, negative, and sometimes neutral. In addition to the polarity of each word, the lexicon may contain a score that indicates the sentiment intensity. There are three approaches to generating sentiment lexicons [2]: the manual approach, the dictionary-based approach, and the corpus-based approach. The manual approach, as the name implies, is done by hand, but is usually used in conjunction with automated approaches as a correction step. The dictionary-based approach exploits relations found in a dictionary, such as synonyms and antonyms, to derive the polarity of words. Most works under this approach utilize WordNet, e.g. [4,5,6,7]. Arabic sentiment lexicons have also been generated using this approach, e.g. [8, 9].

The corpus-based approach utilizes a corpus and a set of sentiment-bearing words. Words are extracted from the corpus and compared to the set of sentiment words using statistical methods that measure semantic similarity. Commonly used statistical measures include PMI and Chi-Square [2]. PMI measures the strength of association between two words in a corpus, i.e. the probability of the two words co-occurring [10]. It has been adapted in sentiment analysis as a measure of the frequency of a word occurring in positive text relative to the frequency of the same word occurring in negative text. Turney [11] and Turney and Littman [12] were the first to propose using this measure in sentiment analysis. Other works that used it include [13] for English and [14] for Arabic. As for the Chi-Square measure, [15] used it to build a sentiment lexicon, and their work is also adopted in this paper.

3 Dataset

Since we continue the work of [3], we use the same dataset and present here an overview of the dataset and how it was collected. Using the Twitter API, a large dataset of Arabic tweets was collected in two phases. In the first phase, tweets that contained the emoticons “:)” (considered positive) or “:(” (considered negative) and whose “lang” field was set to Arabic were collected over a period of two months. In the second phase, a seed list of 10 Arabic positive words and 10 Arabic negative words was used as search keywords to collect tweets. Accordingly, tweets that contained the positive emoticon or positive keywords were grouped into a set designated as positive tweets, and tweets that contained the negative emoticon or negative keywords were grouped into a set designated as negative tweets.

The number of collected tweets was around 6.3 million. However, due to the informal nature of Twitter data, preprocessing and cleaning were conducted on the tweets, and the result after filtering and cleaning was 2.2 million Arabic tweets. Statistics of the dataset are shown in Table 1.

Table 1. Dataset statistics

4 Lexicon Generation

In this paper, we build on the previous work [3] to explore other approaches to scoring Arabic sentiment lexicons, utilizing the entropy and chi-square measures.

These approaches are used to determine the intensity of the polarity of each word in the lexicon, using the frequencies of each word in the positive and negative datasets; they are further detailed in the following subsections. However, they do not tell us whether the word is positive or negative. The sign, or direction of polarity, is determined in a uniform way, by comparing the conditional probabilities of the word given each polarity. Concretely:

$$ Sign = \begin{cases} 1 & \text{if } P(c \mid neg) < P(c \mid pos) \\ -1 & \text{otherwise} \end{cases} $$
(1)

where:

$$ P(c \mid i) = \frac{freq(c, i)}{freq(c)} $$
(2)

where

  • c: the word,
  • i: the polarity (positive or negative),
  • freq(c, i): the frequency of word c in dataset i (the positive or the negative dataset),
  • freq(c): the frequency of word c in the whole dataset.

Next, the sign is multiplied by the word score computed by each of the following formulas (Eqs. 3 and 5) to determine the word’s intensity.
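To make the sign computation concrete, the following is a minimal sketch in Python; the function name and the use of raw per-class counts are illustrative assumptions, not the authors’ implementation. Since both probabilities in Eq. 2 share the denominator freq(c), comparing the raw frequencies is equivalent to comparing the conditional probabilities.

```python
def sign(freq_pos: int, freq_neg: int) -> int:
    """Direction of polarity for a word c (Eq. 1).

    freq_pos = freq(c, pos) and freq_neg = freq(c, neg); both terms of
    Eq. 2 divide by the same freq(c), so comparing raw frequencies is
    equivalent to comparing P(c|pos) and P(c|neg).
    """
    return 1 if freq_neg < freq_pos else -1
```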

4.1 AraSenTi-Entropy

Entropy [16] is often used in Information Theory to measure expected information content; in the case of two labels, entropy is highest when the data is evenly distributed and lowest when all the data falls under one label. In our context, a word can be either positive or negative, so entropy can be used to measure the intensity of a word’s polarity. If the entropy is high, the word occurs with comparable frequency in both positive and negative text, which means the word has weak polarity. On the other hand, if the entropy is low, the word has strong polarity, as it occurs under one sentiment significantly more often than the other.

Given that entropy has an inverse relationship with the strength of a word’s polarity, and given the frequencies of words in the positive and negative datasets, we compute the AraSenTi-Entropy lexicon scores with the following equation:

$$ Score(c) = sign \times \frac{1}{- \sum_{i \in \{pos, neg\}} p_{i} \log_{2} p_{i}} $$
(3)

where:

$$ p_{i} = \frac{freq(c, i)}{freq(c)} $$
(4)

In the case where the word appears under one polarity only, the score is set to sign × 1, as Eq. 3 would be undefined with the denominator being zero.
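A minimal sketch of the score computation follows, assuming per-class frequency counts as input; all names are illustrative.

```python
import math

def entropy_score(freq_pos: int, freq_neg: int) -> float:
    """AraSenTi-Entropy score of a word (Eqs. 3 and 4)."""
    s = 1 if freq_neg < freq_pos else -1      # sign, Eq. 1
    # Single-polarity case: entropy is zero, so Eq. 3 would divide by
    # zero; the score is set to sign * 1 as described above.
    if freq_pos == 0 or freq_neg == 0:
        return float(s)
    total = freq_pos + freq_neg               # freq(c)
    entropy = 0.0
    for f in (freq_pos, freq_neg):
        p = f / total                         # p_i, Eq. 4
        entropy -= p * math.log2(p)
    return s / entropy                        # Eq. 3
```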

4.2 AraSenTi-ChiSq

A chi-square test checks the validity of a null hypothesis by evaluating the statistical significance of the difference between observed and expected values.

In the context of sentiment analysis, the intensity of a word’s polarity is determined by evaluating the null hypothesis: “The frequency of occurrence of a word is the same in positive and negative text”. As in AraSenTi-Entropy, the frequencies of words in positive and negative text are the sole determinants of the scores.

The exact formula for the AraSenTi-ChiSq lexicon is based on the work of [15] and is detailed below:

$$ Score(c) = X^{2}(c) = sign \times \sum_{y \in \{pos, neg\}} \frac{\left( freq(c, y) - \overline{freq}(c, y) \right)^{2}}{\overline{freq}(c, y)} $$
(5)

where $\overline{freq}(c, y)$ is the expected frequency and $X^{2}(c) \ge 0$.

Essentially, the score is the sum of squared differences between observed and expected frequencies, normalized by the expected frequency under each polarity. If the null hypothesis holds, the expected frequency (i.e. the frequency under the other polarity) will be equivalent to the observed one, and the score will be zero (the intensity of polarity is low). In the case where a word appears under one polarity only, the denominator is set to 1 instead of 0, and the score is at its most extreme.
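The following sketch renders this computation in Python, reading the expected frequency under one polarity as the observed frequency under the other, per the parenthetical above; this reading and all names are assumptions for illustration.

```python
def chisq_score(freq_pos: int, freq_neg: int) -> float:
    """AraSenTi-ChiSq score of a word (Eq. 5)."""
    s = 1 if freq_neg < freq_pos else -1      # sign, Eq. 1
    score = 0.0
    # Expected frequency under each polarity is taken to be the
    # observed frequency under the other polarity.
    for observed, expected in ((freq_pos, freq_neg), (freq_neg, freq_pos)):
        expected = expected if expected > 0 else 1  # single-polarity case
        score += (observed - expected) ** 2 / expected
    return s * score
```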

5 Evaluation

To evaluate the performance of the generated lexicons, two evaluation methods were performed: intrinsic and extrinsic. In the intrinsic evaluation, the AraSenTi-Entropy, AraSenTi-ChiSq, and AraSenTi-PMI [3] lexicons were compared with each other. In the extrinsic evaluation, the lexicons were evaluated for their utility in classifying the sentiment of three different datasets of Arabic tweets.

5.1 Intrinsic Evaluation

In this evaluation method, the three lexicons were compared to each other to determine the percentage of agreement, i.e. for how many words the lexicons assign the same polarity. Table 2 shows the number of positive and negative words used in this evaluation for each lexicon, with a total of 93,295 words.

Table 2. The number of positive and negative words in the lexicons
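As a small illustration of the agreement computation, the sketch below assumes each lexicon is a mapping from words to signed scores, with the sign giving the polarity; this representation is a hypothetical choice.

```python
def agreement(lex_a: dict, lex_b: dict) -> float:
    """Percentage of shared words assigned the same polarity."""
    shared = set(lex_a) & set(lex_b)
    same = sum(1 for w in shared if (lex_a[w] > 0) == (lex_b[w] > 0))
    return 100.0 * same / len(shared)
```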

Table 3 illustrates the result of this evaluation, which shows that the highest agreement percentage was between AraSenTi-PMI [3] and AraSenTi-Entropy. In general, the agreement between the lexicons was very high.

Table 3. The percentage of agreement for the lexicons

5.2 Extrinsic Evaluation

We conducted an extrinsic evaluation of the three lexicons to observe their performance on different datasets. We evaluated the lexicons using the same datasets as the previous work: the AraSenTi-Tweet dataset [3] and two external datasets, ASTD [17] and RR [18]. Details of these datasets are given in Table 4.

Table 4. Datasets used in the extrinsic evaluation.

In addition, we computed the precision (P), recall (R), and balanced F-score (Favg) to measure the performance of the lexicons on the positive and negative categories, using the following formulas:

$$ P = \frac{TP}{TP + FP} $$
(6)
$$ R = \frac{TP}{TP + FN} $$
(7)
$$ F = \frac{2 \times P \times R}{P + R} $$
(8)

where TP is the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives. We then calculated the averaged F-score as follows:

$$ F_{\text{avg}} = \frac{F_{\text{pos}} + F_{\text{neg}}}{2} $$
(9)
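A direct rendering of Eqs. 6 to 9 in Python, with the per-class confusion counts passed in explicitly:

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """F-score of one class from its confusion counts (Eqs. 6-8)."""
    p = tp / (tp + fp)            # precision, Eq. 6
    r = tp / (tp + fn)            # recall, Eq. 7
    return 2 * p * r / (p + r)    # Eq. 8

def f_avg(f_pos: float, f_neg: float) -> float:
    """Balanced F-score averaged over the two classes (Eq. 9)."""
    return (f_pos + f_neg) / 2
```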

For the AraSenTi-Entropy and AraSenTi-ChiSq lexicons, we followed the same approach used with the AraSenTi-PMI lexicon [3] in the previous work: we classified tweets as positive or negative according to the sum of the sentiment scores of the words in each tweet. The threshold used to classify the data was initially zero; if the sum of the sentiment scores of the words in a tweet is greater than zero, the tweet is considered positive, otherwise it is considered negative. Additionally, we experimented with other threshold values, 0, 0.5, and 1, to find the best results.
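The classification rule amounts to a thresholded sum of lexicon scores; a minimal sketch follows, where the tokenizer and the handling of out-of-lexicon words (scored as zero) are assumptions.

```python
def classify(tokens: list, lexicon: dict, theta: float = 0.0) -> str:
    """Label a tweet by the sum of its words' lexicon scores."""
    total = sum(lexicon.get(t, 0.0) for t in tokens)  # unknown words score 0
    return "positive" if total > theta else "negative"

# e.g. classify(tweet.split(), arasenti_entropy, theta=0.5)
```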

6 Results and Discussion

First, it is worth mentioning that the scores of the AraSenTi-ChiSq lexicon were clipped to remain between −10 and 10, as a few outliers of very large magnitude were affecting its performance. Figure 1 shows the distribution of scores for the different lexicons before and after clipping. In Fig. 1(a), we observe that the outliers in ChiSq are large in magnitude, reaching a maximum of around 1.8 × 10⁷. Figure 1(b) shows the distribution after clipping AraSenTi-ChiSq to a minimum of −10 and a maximum of 10; note the similarity between the plots for PMI and Entropy.

Fig. 1. (a) Boxplot of the distribution of the raw scores, as computed by the formulas defined previously. (b) Distribution of scores after clipping ChiSq, where 1, 2, and 3 are the PMI, Entropy, and ChiSq lexicons, respectively.
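The clipping step itself is a simple elementwise bound on the scores; a one-line sketch follows, assuming the lexicon is stored as a word-to-score mapping named chisq_lexicon (a hypothetical name).

```python
clipped = {w: max(-10.0, min(10.0, s)) for w, s in chisq_lexicon.items()}
```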

The results of classifying the datasets using this simple approach with varying threshold values, θ ∈ {0, 0.5, 1}, are displayed in Tables 5, 6, and 7, respectively. It is evident that the AraSenTi-PMI lexicon performs best on the AraSenTi dataset regardless of the chosen threshold (with a maximum Favg of 85.22%), with AraSenTi-Entropy close behind in all experiments. AraSenTi-ChiSq shows little variation across experiments, indicating that the differences between the chosen thresholds are negligible relative to the sum of chi-square scores that determines the class. AraSenTi-ChiSq has worse performance overall, most drastically on the AraSenTi dataset (with a maximum Favg of 71.82%).

Table 5. Results with θ = 0.
Table 6. Results with θ = 0.5.
Table 7. Results with θ = 1.

Table 8 shows the performance of ChiSq before clipping, which was static across experiments. We can see that, aside from Favg on the AraSenTi dataset, clipping the scores improved its performance. The degradation on the AraSenTi dataset can be attributed to the loss of relative polarity for words with scores beyond the limits.

Table 8. Performance of ChiSq before clipping (invariant across experiments).

All lexicons perform best on the AraSenTi dataset, with a difference of 20 points or more in Favg. The AraSenTi lexicons capture the idiosyncrasies of Twitter data, which apparently do not transfer well to the other benchmark datasets, which may contain Modern Standard Arabic or other dialects.

For AraSenTi-PMI and AraSenTi-Entropy, raising the threshold decreases the Favg on the AraSenTi dataset but improves it on the other datasets, ASTD and RR. This is expected since the lexicons, which were extracted from the AraSenTi dataset, have zero-median scores (as can be seen from the box plots above). Furthermore, raising the threshold decreases the number of false positives, which increases the number of true negatives. The amount of negative data in both ASTD and RR far exceeds the amount of positive data, so this effect is desirable.

7 Conclusion

In this paper, we attempted to address the lack of Arabic sentiment lexicons generated from Twitter data. A previous attempt exploited the PMI statistical measure [3]. Here, new statistical approaches were investigated, namely Chi-Square and Entropy. Intrinsic and extrinsic evaluations were conducted on the three lexicons. The results show that the lexicon generated using PMI outperforms the other lexicons; however, the accuracy achieved by the other lexicons on the experimental datasets was still satisfactory.