Introduction

The introduction of the Web made it easy for users to express their opinions online. The number of online opinions has snowballed over the past years and is still growing [1]. Processing and analyzing online opinions have emerged as an important task for organizations and researchers since these opinions contain valuable information. Manually identifying the sentiment of opinions and summarizing them is challenging and impractical [2]. Consequently, there is a rising demand for approaches that overcome the drawbacks of manually processing opinions. Sentiment mining approaches are computational approaches that automatically obtain the sentiment of an opinion [3]. Sentiment lexicons play a key role in these approaches since most of them use a sentiment lexicon [4]. These lexicons can be constructed manually or automatically. Manually creating lexicons ensures high quality because they are made by language and domain experts. However, this process is time-consuming, and such experts may not always be available. Consequently, the coverage of manually built sentiment lexicons is low. These disadvantages have shifted the focus to automatically built sentiment lexicons. Sentiment lexicons can be constructed for the general domain or a specific domain, such as the financial domain. Building a domain-specific sentiment lexicon is more challenging since words can have domain-specific meanings and sentiments.

In this paper, we focus on building sentiment lexicons for the financial domain. Financial investors make trades based on available information, some of which comes from social media. Previous research has shown that social media messages and news articles are useful sources for supporting stock market decisions [5, 6]. Consequently, sentiment analysis is increasingly used to predict stock market variables [7]. For example, Malandri et al. [8] use a financial sentiment lexicon to predict the best asset allocation. Xing et al. [9] use sentiment analysis to create market views, which are integrated into an asset allocation method. Picasso et al. [10] and Weng et al. [11] use, among other things, sentiment analysis on news articles to forecast stock prices. Although the interest in sentiment analysis in the stock market is rising, the domain lacks good sentiment lexicons. Manually made financial sentiment lexicons, like the one by Loughran and McDonald [12], have not always performed well compared to automatically built financial sentiment lexicons [13, 14].

In this research, we investigate existing automatic approaches that can be used to build financial sentiment lexicons. Furthermore, we investigate how they can be extended to account for negation while building a financial sentiment lexicon. These approaches all build sentiment lexicons without any prior domain or language knowledge; this kind of approach is also known as an a priori approach. We use three different types of a priori approaches to create sentiment lexicons for the financial domain, namely probability-based, information retrieval-based, and sentiment-aware word embedding-based approaches. The financial sentiment lexicons are built using messages from StockTwits, a financial microblogging platform. The messages are marked as either bullish or bearish. In the financial domain, bullish indicates positive sentiment, and bearish indicates negative sentiment. Hereafter, we use the terms bullish and positive interchangeably, as well as the terms bearish and negative. Moreover, we do not consider the neutral sentiment class for financial corpora in this research due to its ambiguity. However, it is still possible that words in the sentiment lexicon end up with a sentiment strength of zero, i.e., a neutral sentiment orientation. We define the sentiment orientation as the sign of the sentiment strength.

After building the financial sentiment lexicons, we evaluate them by classifying financial messages. We compare the built financial sentiment lexicons with general and financial sentiment lexicons created by other researchers in two different settings: unsupervised and supervised sentiment classification. For the evaluation, we use three different financial corpora, consisting of messages from StockTwits, financial-related tweets from Twitter, and financial headlines. The (unsupervised and supervised) classification tasks show that the probability-based approaches outperform the information retrieval-based and sentiment-aware word embedding-based approaches. Moreover, the proposed weighted versions of the Pointwise Mutual Information (PMI) approaches outperform other researchers’ general and financial sentiment lexicons in all the sentiment classification tasks. Furthermore, we notice that accounting for negation while building the sentiment lexicons, which other approaches neglect, leads to better performing sentiment lexicons.

The main contributions of this paper are as follows:

  • We propose weighted versions of the PMI approaches. The sentiment lexicons built by these weighted approaches outperform other lexicons in different sentiment classification tasks in the financial domain;

  • We discuss how to deal with negation in sentences, and we show how the sentiment lexicon building approaches could be extended to account for negation when determining the sentiment orientation and strength of a word. We propose two different methods, namely the Negated Word (NW) approach and the Flip Sentiment (FS) approach.

The remainder of this paper is structured as follows. In the next section, the “Related Work” section, we review the literature that is relevant to our research. The related work is followed by a description of the implementation of the various approaches that are used to automatically build financial sentiment lexicons in the “Methodology” section. The process of building the financial sentiment lexicons and the performed evaluation of these are described in the “Results” section. In the “Conclusion” section, we provide concluding remarks and suggest future research directions.

Related Work

Sentiment lexicons play a crucial role in sentiment analysis since most of the existing sentiment mining approaches use a sentiment lexicon [4]. There are multiple ways to create a sentiment lexicon. They can be divided into two main categories: manual and automatic approaches. The latter category can be further divided into two subcategories: dictionary-based and corpus-based approaches [15, 16].

The first category, manual approaches, consists of sentiment lexicons that are entirely made by hand. These approaches are the most labor-intensive and expensive because they require domain and language experts to manually assign sentiment orientations and sentiment strengths to words and phrases. Consequently, these sentiment lexicons are of high quality. On the other hand, they are time-consuming to build, hard to maintain, and not immune to the evolution of words and their sentiment orientation. Moreover, the coverage of manually built sentiment lexicons is low. The Harvard General Inquirer [17] and the MPQA subjectivity sentiment lexicon [18] are well-known examples of manually built sentiment lexicons. The Harvard General Inquirer is an extensive collection of words containing syntactic, semantic, and pragmatic information of part-of-speech tagged words. It also indicates whether a word has a positive or negative sentiment orientation. The MPQA subjectivity sentiment lexicon has the same structure as the Harvard General Inquirer, but it also contains the subjectivity strength of a word or phrase. The subjectivity strength is strong if the word or phrase has a strong meaning, like “excellent,” and weak if it has a weak meaning, like “fine.”

For the financial domain, the manually made lexicons by Loughran and McDonald [12] and Jegadeesh and Wu [19] are the best known manually built sentiment lexicons. Loughran and McDonald [12] made use of 10-K documents from the U.S. Securities and Exchange Commission. They built six lexicons named after the sentiment they represent: positive, negative, uncertainty, litigious, modal strong, and modal weak. Jegadeesh and Wu [19] also worked with the 10-K documents from the U.S. Securities and Exchange Commission. However, they focused on the importance of assigning a weight to the words in the sentiment lexicon.

The second category, the dictionary-based approaches, consists of approaches that exploit semantic relations, such as synonyms and antonyms, between words. Most of the approaches start with a small set of seed words. This set of seed words consists of a small group of words for which the sentiment orientation is already known. The small set of seed words is expanded by looking up the seed words’ synonyms and antonyms in a dictionary [20]. An example of a dictionary is the online (semantic) lexical resource WordNet [21]. In WordNet, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by utilizing conceptual-semantic and lexical relations. An example of a synset is the synset of the word “stock market.” This synset contains the synonyms “stock exchange” and “securities market.” Using a dictionary-based approach, one starts defining the set of seed words to build a sentiment lexicon. Thereafter, the process continues expanding the seed set by searching for synonyms and antonyms of the words that are contained in the seed set.
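To make the expansion step concrete, the following is a minimal Python sketch using NLTK’s WordNet interface (assuming the NLTK WordNet data is installed); the single expansion round and the +1/−1 orientation encoding are simplifications for illustration.

```python
# A minimal sketch of one round of seed-set expansion with WordNet (NLTK).
# Assumes the WordNet corpus has been downloaded: nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def expand_seed_set(seeds):
    """Expand a {word: orientation} seed set with synonyms and antonyms."""
    expanded = dict(seeds)
    for word, orientation in seeds.items():
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                # Synonyms in the same synset inherit the seed's orientation.
                expanded.setdefault(lemma.name(), orientation)
                # Antonyms receive the opposite orientation.
                for antonym in lemma.antonyms():
                    expanded.setdefault(antonym.name(), -orientation)
    return expanded

print(expand_seed_set({"good": 1, "bad": -1}))
```

In practice, this expansion is iterated until no new words are found.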

The third category, corpus-based approaches, consists of approaches that extract the sentiment lexicon’s words from one or more corpora. These approaches could also use a list of seed words, but the list is expanded using corpora instead of a dictionary. An advantage of corpus-based approaches is that they exploit the fact that corpora contain domain-specific knowledge. This domain-specific knowledge gives words a domain-specific sentiment orientation. There are multiple types of approaches in the category of corpus-based approaches. We point out the studies that are most related to our work. The first set of studies uses unsupervised techniques, such as information-theoretic techniques and other statistical measurements. The first significant work that uses these techniques is the work of Turney [22]. The author applies PMI and information retrieval measurements to estimate the semantic orientation of words or phrases. Later, other information-theoretic techniques and statistical measurements were used [13, 14, 23, 24]. These works show that approaches using information-theoretic techniques and statistical measurements belong to the state-of-the-art approaches for creating domain-specific sentiment lexicons. Next to the unsupervised techniques, many studies create sentiment lexicons using supervised techniques. Li and Shah [25], Tang et al. [26], and Wang and Xia [27] learn word embeddings by using a neural network to capture both the syntactic structure and semantics of a word. The approach of Vo and Zhang [28] consists of a simple neural network that learns the sentiment orientations of words by optimizing the accuracy of predicting the sentiment orientation of messages. The authors show that building sentiment lexicons by optimizing predictions improves the sentiment lexicon’s accuracy compared to sentiment lexicons built by counting-based methods. Recently, there has been growing interest in methods that adapt existing lexicons to a specific domain. An example of such an approach is the work of Xing et al. [29]. The authors introduce a cognitive-inspired approach that uses the wrongly predicted sentences to adjust the polarity scores. The newly constructed sentiment lexicons achieve higher accuracies in the sentiment classification tasks than the original sentiment lexicons.

Furthermore, other approaches make use of both a dictionary and a corpus. The dictionary-based approaches usually do not give domain or context-dependent meanings to words. In addition, employing a corpus-based approach makes it hard to find a large set of opinion words if the corpus is not large. The disadvantages of both types of approaches can be tackled by combining these types [16]. An example of a study that combines both types of approaches is the work of Hu and Liu [2]. Hu and Liu [2] start by extracting adjectives from corpora, which are, in this case, consumer reviews. Thereafter, the authors assign a sentiment orientation to these adjectives based on the known sentiment orientation of a list of original seed adjectives. The list of seed adjectives is iteratively expanded by using the seed adjectives’ semantic relations in WordNet. This way, it contains both domain-specific adjectives obtained from the corpus and general adjectives, which are the original seed adjectives.

Methodology

In this section, we discuss the methodology we use to create financial sentiment lexicons and evaluate the created financial sentiment lexicons. We start by discussing the probability-based approaches, information retrieval-based approaches, and the sentiment-aware word embedding-based approach. Thereafter, we elaborate on how to account for negation while building a sentiment lexicon. Last, we discuss the methods we use to evaluate the quality of the created financial sentiment lexicons.

Financial Sentiment Lexicon Approaches

In this section, we dive deeper into the different approaches we use to create financial sentiment lexicons. Before doing so, we introduce some general notation in Table 1.

Table 1 General definitions and notations

Probability-Based Approaches

The probability-based approaches are focused on the probabilities of a sentiment class given a word, i.e., the probabilities of a word being positive and negative. The different probabilities are obtained by counting the occurrences in a training set. Hence, we also refer to this type of approach as counting-based approaches. We start with the Bayes’ Theorem Benchmark (BTB) approach, which is the most intuitive approach. The BTB approach makes use of Bayes’ theorem and is focused on counting the frequencies of words. Thereafter, we continue with the PMI approach, which is similar to the BTB approach. However, the PMI approach is focused on counting the frequency of messages.

Bayes’ Theorem Benchmark.

Our first approach is defined by Labille et al. [23]. It is derived from the Bayes’ theorem introduced by Bayes and Price [30]. We define the sentiment strength of word w, computed by the BTB approach, \(\text {SS}_{\text {BTB}}(w)\), as the difference between the probability of being positive, p(pos|w), and the probability of being negative, p(neg|w). The \(\text {SS}_{\text {BTB}}(w)\) is stated in Eq. 1.

$$\begin{aligned} \text {SS}_{\text {BTB}}(w)= & \ p(pos|w) - p(neg|w),\\= & \ \frac{\sum _{m \in M_{pos}}^{}n_{wm}}{\sum _{m \in M}^{}n_{wm}} - \frac{\sum _{m \in M_{neg}}^{}n_{wm}}{\sum _{m \in M}^{}n_{wm}},\nonumber \end{aligned}$$
(1)

where \(n_{xy}\) denotes the number of word(s) x in the set y. The probabilities p(pos|w) and p(neg|w) can be interpreted as counting the number of times word w appears in messages with that specific sentiment class, divided by the total appearances of word w in all messages.
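As an illustration, the following is a minimal Python sketch of the BTB computation, assuming the training messages are available as (tokens, label) pairs with labels "pos" and "neg" (an illustrative input format):

```python
from collections import Counter

def btb_strengths(messages):
    """Compute SS_BTB(w) = p(pos|w) - p(neg|w) from word frequencies."""
    pos_counts, total_counts = Counter(), Counter()
    for tokens, label in messages:  # e.g., (["looking", "good"], "pos")
        for w in tokens:
            total_counts[w] += 1
            if label == "pos":
                pos_counts[w] += 1
    # p(pos|w) - p(neg|w) = (n_pos - (n - n_pos)) / n = (2 * n_pos - n) / n
    return {w: (2 * pos_counts[w] - n) / n for w, n in total_counts.items()}
```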

Pointwise Mutual Information.

PMI measures the association between two words or sets of words. The PMI measurement was derived by Church and Hanks [31] from Fano’s original definition of mutual information [32]. In this research, we follow the works of Turney [22] and Oliveira et al. [14] to adapt the PMI measure to the needs of sentiment analysis. However, our interpretation of the PMI measure slightly differs from that of the works mentioned above: we count the frequency of messages instead of the frequency of words. The sentiment strength \(\text {SS}_{\text {PMI}}(w)\) is defined as follows:

$$\begin{aligned} \text {SS}_{\text {PMI}}(w)= & \ \text {PMI}(w, pos) - \text {PMI}(w, neg),\\= & \ \log _{2}\bigg (\frac{M_{w,pos}\times M}{M_{w} \times M_{pos}}\bigg ) - \log _{2}\bigg (\frac{M_{w,neg}\times M}{M_{w} \times M_{neg}}\bigg ).\nonumber \end{aligned}$$
(2)

There are two significant drawbacks of the PMI approach, as defined in Eq. 2. The first drawback is that we could come across a word that only appears in messages that belong to one of the two sentiment classes. Consequently, the logarithm’s inner term in the PMI measure of the other sentiment class becomes equal to zero. Since the logarithm is undefined for zero, we are unable to compute the corresponding PMI measure. We tackle this problem by setting the PMI measure of the corresponding sentiment class to be equal to zero, as has been suggested by Bouma [33].

To illustrate the second drawback, we look at an example. In this example, we assume that \(M = 10,~ M_{pos} = 5,\) and \(M_{neg} = 5\). Further details of this example are stated in Table 2. If we look at \(w_2\), we see that it occurs in all the positive messages and in only one negative message. Furthermore, it has a \(\text {PMI}(w,pos)\) value of 0.74 and a \(\text {PMI}(w,neg)\) value of -1.58. If we compare the absolute values of both PMI measures, we notice that the value of \(\text {PMI}(w,neg)\) is more than twice as large as the \(\text {PMI}(w,pos)\) value. Consequently, the influence of \(\text {PMI}(w,neg)\) on \(\text {SS}_{\text {PMI}}(w)\) is disproportionate to the single occurrence of \(w_2\) in a negative message compared to its occurrences in all the positive messages. One expects that the value of \(\text {PMI}(w,pos)\) would be larger than the value of \(\text {PMI}(w,neg)\) for \(w_{2}\) and thus have a larger influence on \(\text {SS}_{\text {PMI}}(w)\). To tackle this problem, Bouma [33] suggests normalizing the PMI measure. The maximum value of the Normalized PMI (NPMI) measure is equal to one, which only occurs if a word solely appears in messages of a specific sentiment class.

Next, we compute \(\text {NPMI}(w,pos)\) and \(\text {NPMI}(w,neg)\) for \(w_2\). The NPMI measure values are 0.74 and -0.48, respectively. These values are in line with the values that one expects given \(M_{w,pos}\) and \(M_{w,neg}\) for \(w_2\). However, there is also a disadvantage of using the NPMI measure. To illustrate this disadvantage, we take a look at \(w_3\) and \(w_4\) from Table 2. Both words occur in the same ratio in positive and negative messages, namely 2:1 and 4:2, respectively. Since they have the same ratio, one could intuitively expect that they have the same PMI and NPMI values. Nevertheless, this holds only for the PMI measure and not for the NPMI measure.

Table 2 Example 1 drawback PMI - Part 1

To choose between using the PMI measure and the NPMI measure, we look at the sentiment strengths of \(w_1\), \(w_2\), \(w_3\), and \(w_4\), which are displayed in Table 3. In our example, \(w_1\) solely occurs in positive messages. Consequently, we expect \(w_1\) to have the highest sentiment strength. We use the sentiment strength of \(w_1\) as our benchmark to compare the sentiment strengths of \(w_2\), \(w_3\), and \(w_4\). First, we take a look at the \(\text {SS}_{\text {PMI}}(w)\) values. The \(\text {SS}_{\text {PMI}}(w)\) value for \(w_2\) is larger than the \(\text {SS}_{\text {PMI}}(w)\) value of \(w_1\) and does not reflect that \(w_2\) is also found in negative messages. Furthermore, the \(\text {SS}_{\text {PMI}}(w)\) values for \(w_3\) and \(w_4\) are approximately equal to one, which is also unwanted since \(w_3\) and \(w_4\) also occur in negative messages. Therefore, we do not want to use \(\text {SS}_{\text {PMI}}(w)\) to compute the sentiment strengths in our sentiment lexicon. Next, we compute the \(\text {SS}_{\text {NPMI}}(w)\) values for all the words. The \(\text {SS}_{\text {NPMI}}(w)\) is defined as follows:

$$\begin{aligned} \text {SS}_{\text {NPMI}}(w)= & \ \text {NPMI}(w, pos) - \text {NPMI}(w, neg),\\= & \ \frac{\log _{2}\bigg (\frac{M_{w,pos}\times M}{M_{w} \times M_{pos}}\bigg )}{-\log _{2}\bigg (\frac{M_{w,pos}}{M}\bigg )} - \frac{\log _{2}\bigg (\frac{M_{w,neg}\times M}{M_{w} \times M_{neg}}\bigg )}{-\log _{2}\bigg (\frac{M_{w,neg}}{M}\bigg )}. \nonumber \end{aligned}$$
(3)

Now, we see in Table 3 that only for \(w_2\), the \(\text {SS}_{\text {NPMI}}(w)\) value of 1.21 is not in line with our expectations. Therefore, we decide not to use \(\text {SS}_{\text {NPMI}}(w)\) for our sentiment lexicon creation either. Instead, we propose to use weighted versions of \(\text {SS}_{\text {PMI}}(w)\) and \(\text {SS}_{\text {NPMI}}(w)\), which is one of this paper’s contributions. We weigh the (N)PMI values by the ratio of occurrence in messages with the specific sentiment class. The Weighted PMI (W-PMI) sentiment strength, \(\text {SS}_{\text {W-PMI}}(w)\), is computed as follows:

$$\begin{aligned} \text {SS}_{\text {W-PMI}}(w)= & {} \frac{M_{w,pos}}{M_w} \times \text {PMI}(w, pos) - \frac{M_{w,neg}}{M_w} \times \text {PMI}(w, neg), \nonumber \\= & {} \frac{M_{w,pos}}{M_w} \times \log _{2}\bigg (\frac{M_{w,pos}\times M}{M_{w} \times M_{pos}}\bigg ) \nonumber \\&- \frac{M_{w,neg}}{M_w} \times \log _{2}\bigg (\frac{M_{w,neg}\times M}{M_{w} \times M_{neg}}\bigg ). \end{aligned}$$
(4)

The Weighted NPMI (W-NPMI) sentiment strength, \(\text {SS}_{\text {W-NPMI}}(w)\), is computed as follows:

$$\begin{aligned} \text {SS}_{\text {W-NPMI}}(w)= & {} \frac{M_{w,pos}}{M_w} \times \text {NPMI}(w, pos) \nonumber \\&- \frac{M_{w,neg}}{M_w} \times \text {NPMI}(w, neg), \nonumber \\= & {} \frac{M_{w,pos}}{M_w} \times \frac{\log _{2}\bigg (\frac{M_{w,pos}\times M}{M_{w} \times M_{pos}}\bigg )}{-\log _{2}\bigg (\frac{M_{w,pos}}{M}\bigg )} \nonumber \\&- \frac{M_{w,neg}}{M_w} \times \frac{\log _{2}\bigg (\frac{M_{w,neg}\times M}{M_{w} \times M_{neg}}\bigg )}{-\log _{2}\bigg (\frac{M_{w,neg}}{M}\bigg )}. \end{aligned}$$
(5)

After defining the weighted versions of \(\text {SS}_{\text {PMI}}(w)\) and \(\text {SS}_{\text {NPMI}}(w)\), we compute \(\text {SS}_{\text {W-PMI}}(w)\) and \(\text {SS}_{\text {W-NPMI}}(w)\) for words \(w_1\), \(w_2\), \(w_3\), and \(w_4\). The \(\text {SS}_{\text {W-PMI}}(w)\) and \(\text {SS}_{\text {W-NPMI}}(w)\) values are displayed in Table 3. One can see that all the \(\text {SS}_{\text {W-PMI}}(w)\) and \(\text {SS}_{\text {W-NPMI}}(w)\) values for words \(w_2\), \(w_3\), and \(w_4\) are smaller than the \(\text {SS}_{\text {W-PMI}}(w)\) and \(\text {SS}_{\text {W-NPMI}}(w)\) values of \(w_1\). The weighted versions give us the desired sentiment strength values. Furthermore, we obtained different values of \(\text {SS}_{\text {W-NPMI}}(w)\) for \(w_3\) and \(w_4\), even though the words have the same ratio between \(M_{w,pos}\) and \(M_{w,neg}\). On the one hand, one could argue that having the same ratio between \(M_{w,pos}\) and \(M_{w,neg}\) should result in an equal sentiment strength, which is the case for \(\text {SS}_{\text {W-PMI}}(w)\). On the other hand, one could argue that the sentiment strengths, in this case, should not be equal because one should also take into account the ratio between \(M_{w,c}\) and M, which is not the same for \(w_3\) and \(w_4\).

Table 3 Example 1 drawback PMI - Part 2

Similar to the discussion between \(\text {SS}_{\text {W-PMI}}(w)\) and \(\text {SS}_{\text {W-NPMI}}(w)\) above, one could argue that \(w_1\) and \(w_5\) should have equal sentiment strengths, which is the case for \(\text {SS}_{\text {W-PMI}}(w)\). The argument is that both words appear only in positive messages and, therefore, should have the same sentiment strength. However, one could again argue that one should take into account the relation between \(M_{w,c}\) and M. Since \(w_1\) occurs more often in the positive messages (i.e., higher \(M_{w,pos}\)), one could argue that you are more certain about the sentiment orientation and sentiment strength of \(w_1\) compared to \(w_5\). Therefore, \(w_1\) and \(w_5\) should have different sentiment strengths. In this research, we use both \(\text {SS}_{\text {W-PMI}}(w)\) and \(\text {SS}_{\text {W-NPMI}}(w)\) to compute the sentiment strengths for our sentiment lexicon.

Last, using \(\text {SS}_{\text {W-PMI}}(w)\) to compute the sentiment strengths has a small disadvantage in the case of a word that is highly unevenly distributed over \(M_{w,pos}\) and \(M_{w,neg}\). We illustrate this disadvantage with an example, which is displayed in Table 4. In this example, M is equal to 40, and \(M_c\) is equal to 20. In the case of \(w_7\), the \(\text {SS}_{\text {W-PMI}}(w)\) value is slightly larger than the \(\text {SS}_{\text {W-PMI}}(w)\) value of \(w_6\), which is unwanted since \(w_6\), in contrast to \(w_7\), only appears in positive messages. Therefore, we suggest clamping the sentiment strength such that it is between -1 and 1. These values are the minimum and maximum sentiment strengths in the cases of having a word that only occurs in either positive or negative messages. We clamp the sentiment strength with the following equation:

$$\begin{aligned} \text {SS}_{\text {W-PMI}}(w)&= \text {max}(\text {min}(\text {SS}_{\text {W-PMI}}(w), 1), -1). \end{aligned}$$
(6)
Table 4 Example 2 drawback \(\text {SS}_{\text {W-PMI}}(w)\)
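Putting Eqs. 4 and 6 together, the following is a minimal Python sketch of the clamped W-PMI computation, assuming the message-level counts \(M_{w,c}\) are available as dictionaries (an illustrative input format):

```python
import math

def w_pmi_lexicon(m_w_pos, m_w_neg, m_pos, m_neg):
    """Compute the clamped SS_W-PMI(w) for every word (Eqs. 4 and 6)."""
    m_total = m_pos + m_neg
    lexicon = {}
    for w in set(m_w_pos) | set(m_w_neg):
        n_pos, n_neg = m_w_pos.get(w, 0), m_w_neg.get(w, 0)
        m_w = n_pos + n_neg
        # A PMI term is set to zero when the word never occurs in that class [33].
        pmi_pos = math.log2(n_pos * m_total / (m_w * m_pos)) if n_pos else 0.0
        pmi_neg = math.log2(n_neg * m_total / (m_w * m_neg)) if n_neg else 0.0
        ss = (n_pos / m_w) * pmi_pos - (n_neg / m_w) * pmi_neg
        lexicon[w] = max(min(ss, 1.0), -1.0)  # clamp to [-1, 1] (Eq. 6)
    return lexicon
```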

Information Retrieval-Based Approach

In general, there exist many information retrieval techniques. One of the most popular is the Term Frequency-Inverse Document Frequency (TF-IDF) statistic proposed by Salton and Buckley [34]. The statistic reflects how important a specific term t is to a document d in a corpus. Wang and Zhang [35] introduced the Term Frequency-Inverse Category Frequency (TF-ICF) statistic, which is similar to the TF-IDF statistic but designed explicitly for categories instead of documents. The intuition behind the ICF term is that the more categories in which word w occurs, the less discriminative power word w has. Next to the TF-ICF measure, Wang and Zhang [35] propose an extension of it, namely the Inverse Category Frequency-based (ICF) measure. This ICF-based measure combines the TF-ICF measure and the Relevance Frequency (RF) measure introduced by Lan et al. [36].

To define a sentiment score based on the information retrieval measure, we follow the work of Oliveira et al. [14]. Oliveira et al. propose the following equation to compute the sentiment strength for word w using the TF-IDF measure:

$$\begin{aligned} \text {SS}_{\text {TF-IDF}}(w) = \frac{\text {TF-IDF}(w, pos) - \text {TF-IDF}(w, neg)}{\text {TF-IDF}(w, pos) + \text {TF-IDF}(w, neg)}. \end{aligned}$$
(7)

In our case, we adjust Eq. 7 to the following equation:

$$\begin{aligned} \mathrm {SS}_{\mathrm {ICF}}(w)= & \ \frac{\mathrm {ICF}(w, pos, neg) - \mathrm {ICF}(w, neg, pos)}{\mathrm {ICF}(w, pos, neg) + \mathrm {ICF}(w, neg, pos)},\\= & \ \frac{tf_{w,pos} \times \log _2\bigg (2+\frac{M_{w,pos}}{\mathrm {max}(1, M_{w,neg})} \times \frac{\vert C \vert }{cf_{w}} \bigg ) - tf_{w,neg} \times \log _2\bigg (2+\frac{M_{w,neg}}{\mathrm {max}(1, M_{w,pos})} \times \frac{\vert C \vert }{cf_{w}} \bigg )}{tf_{w,pos} \times \log _2\bigg (2+\frac{M_{w,pos}}{\mathrm {max}(1, M_{w,neg})} \times \frac{\vert C \vert }{cf_{w}} \bigg ) + tf_{w,neg} \times \log _2\bigg (2+\frac{M_{w,neg}}{\mathrm {max}(1, M_{w,pos})} \times \frac{\vert C \vert }{cf_{w}} \bigg )},\nonumber \end{aligned}$$
(8)

where \(tf_{w,c}\) is the number of times word w occurs across all messages with sentiment class c; |C| is the cardinality of the set of all sentiment classes, i.e., the number of sentiment classes, which equals two in our case; \(cf_{w}\) is the number of sentiment classes that contain word w.
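The following is a minimal Python sketch of Eq. 8, assuming the term frequencies and message counts per class are given (illustrative parameter names):

```python
import math

def icf(tf_wc, m_w_c1, m_w_c2, num_classes, cf_w):
    """ICF(w, c1, c2) for one ordered class pair, following Eq. 8."""
    return tf_wc * math.log2(2 + m_w_c1 / max(1, m_w_c2) * num_classes / cf_w)

def ss_icf(tf_pos, tf_neg, m_w_pos, m_w_neg, cf_w, num_classes=2):
    icf_pos = icf(tf_pos, m_w_pos, m_w_neg, num_classes, cf_w)
    icf_neg = icf(tf_neg, m_w_neg, m_w_pos, num_classes, cf_w)
    return (icf_pos - icf_neg) / (icf_pos + icf_neg)
```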

Sentiment-Aware Word Embedding-Based Approach

The final type of approach makes use of sentiment-aware word embeddings. Word embeddings map words or phrases to vectors of real numbers. Words with similar contexts appear closer to each other in the vector space than words with dissimilar contexts. The algorithms that create word embeddings use a large corpus to capture and process the words’ semantic and syntactic contexts. Popular algorithms for creating word embeddings are word2vec [37] and GloVe [38]. In the field of sentiment analysis, there is a demand for word embeddings that also contain the sentiment of the words. However, standard word embedding algorithms cannot always capture the sentiment successfully in the word embeddings [27]. Consequently, one cannot simply utilize these general word embeddings and should instead focus on sentiment-aware word embeddings, which also contain the sentiment of words.

In our research, we use the Simple Neural Network (SNN) approach of Vo and Zhang [28] to construct the sentiment-aware word embeddings. We start by defining the words as word embeddings. A word w takes the form of [np], where n stands for the negative sentiment value of the word and p for the positive sentiment value. The positive and negative sentiment values are the weights of the neural network obtained after training the neural network. We refer to the work of Vo and Zhang [28] for further details about the neural network that we use to train the sentiment-aware word embeddings. We compute the sentiment strength \(\text {SS}_{\text {SNN}}(w)\) of word w by simply subtracting n from p.
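As an impression of the idea (not the exact architecture of Vo and Zhang [28]), the following is a simplified PyTorch sketch in which each word embedding holds the two weights [n, p] and a message is scored by summing its words’ embeddings:

```python
import torch
import torch.nn as nn

class SentimentSNN(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Two weights [n, p] per word; a message's logits are their sum.
        self.embedding = nn.EmbeddingBag(vocab_size, 2, mode="sum")

    def forward(self, token_ids, offsets):
        return self.embedding(token_ids, offsets)

model = SentimentSNN(vocab_size=10_000)
loss_fn = nn.CrossEntropyLoss()  # optimize message-level classification accuracy
# ... train on (message, bearish/bullish) pairs ...

# After training, SS_SNN(w) = p - n for every word w:
weights = model.embedding.weight.detach()
ss_snn = weights[:, 1] - weights[:, 0]
```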

Adjustments for Negation

Taking negation into account when performing sentiment analysis can improve the determination of a message’s sentiment orientation [39]. For example, the sentiment of the sentence “It is looking good.” is the opposite of the sentiment of the sentence “It is not looking good.”, even though the sentences are word-wise very similar to each other. The challenge of detecting negation consists of two parts: negation cue detection and negation scope detection [40]. The negation cue is the keyword that indicates that there is a negation in a sentence. We can distinguish two types of negation cues: explicit and implicit negation cues [41]. Explicit negation cues are negation words, such as “not” and “never,” which affect the following words and change their meaning.

On the other hand, we have implicit negation cues, such as “dislike” and “hopeless.” The implicit negation cues can be recognized by their affixes, such as the prefixes “dis-” and “im-” and the suffix “-less,” and their negation affects only these single words. In this research, the implicit negation cues are treated as ordinary words, and therefore, they automatically receive their own sentiment orientation and strength in our sentiment lexicon. Hence, we only pay extra attention to the explicit negation cues and leave the implicit negation cues as future research. In this work, we focus on the explicit negation cues as defined by Jia et al. [40] and Councill et al. [41]. In Table 5, we state all the explicit negation cues that we use in this research. Since we are dealing with microblogging messages, we also take into account abbreviations of the explicit negation cues, such as “isnt” and “cant.”

Table 5 Explicit negation cues

After detecting a negation cue, we must still tackle the challenge of detecting the negation scope. The negation scope is the set of words affected by the negation cue, i.e., the words whose sentiment orientation is inverted. There exist many approaches to detect the negation scope, varying from simply setting a fixed window as the negation scope [40] to using machine learning approaches [42]. We follow the work of Hogenboom et al. [39] and consider the two words following the negation cue as the negation scope. The authors show that this is a simple and effective approach to use in sentiment classification. Finally, after determining the negation cues and negation scope, we treat the explicit negation cues as stopwords and remove them from the messages.

We propose two approaches to account for the negated words in the negation scope while computing the sentiment orientation and strength. The first approach creates two entries for a word in the sentiment lexicon, one for the original word and one for its negated version. We refer to this approach as the NW approach. We transform the negated word w to “NOT_w” and give it a separate entry in the sentiment lexicon. If we look at the example sentence, “It is not looking good.”, then we create a new entry for the negated version of “good,” namely “NOT_good.” This new entry, “NOT_good,” receives its own sentiment orientation and strength. An advantage of this approach is that we only have to change the words in the negation scope to their negated versions. In addition, we do not have to change any input for the previously described sentiment lexicon building approaches because the negated words get separate entries.

Consequently, if we come across a negated word in the sentiment classification task, we do not pay particular attention to the negated word. Nevertheless, this approach has a disadvantage. The negated version of a word may receive the same sentiment orientation as the original version, which is possible due to the low number of occurrences of the negated version of the word. We could tackle this by setting a threshold of minimal occurrences before the negated version of a word is included in the sentiment lexicon. However, this results in a loss of information since we do not add these negated versions to the sentiment lexicon.
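The following is a minimal Python sketch of the NW transformation with a two-word negation scope, assuming a small illustrative subset of the cues from Table 5:

```python
NEGATION_CUES = {"not", "never", "no", "isnt", "cant", "dont"}  # illustrative subset

def mark_negation(tokens, scope=2):
    """Prefix the `scope` words after a negation cue with NOT_; drop the cue."""
    output, remaining = [], 0
    for token in tokens:
        if token in NEGATION_CUES:
            remaining = scope  # the cue itself is treated as a stopword
            continue
        output.append("NOT_" + token if remaining > 0 else token)
        remaining = max(0, remaining - 1)
    return output

print(mark_negation("it is not looking good".split()))
# ['it', 'is', 'NOT_looking', 'NOT_good']
```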

The second approach considers the negated words to have a sentiment orientation that is the opposite of the message’s sentiment class. Thus, we consider the negated words in a message with a positive sentiment class as negative words and the negated words in a message with a negative sentiment class as positive words. In other words, we flip the sentiment orientation of the negated words. We refer to this approach as the FS approach. To clarify the FS approach further, we look at the example sentence: “It is not looking good.”. The message’s sentiment class is negative, but since “good” is in the negation scope, we consider “good” to be the opposite of negative, i.e., positive. The probability-based and information retrieval-based sentiment lexicon building approaches specifically rely on the number of occurrences of word w in sentiment class c (= \(\sum _{m \in M_c}^{}n_{wm}\) and \(tf_{w, c}\)) and on the number of messages of sentiment class c in which word w occurs (= \(M_{w,c}\)).

Consequently, we adjust the values of \(\sum _{m \in M_c}^{}n_{wm}\), \(tf_{w, c}\), and \(M_{w,c}\) for the negated word w. If we take the word “good” from the example sentence, then we adjust \(M_{w,pos}\) by adding one message to it and \(M_{w,neg}\) by subtracting one message from it. For the sentiment-aware word embedding-based approach, we treat the negated scope(s) of a message as a separate message, which has the opposite sentiment class of the original message. Furthermore, the FS approach tackles the NW approach’s disadvantage of negated words possibly receiving the same sentiment orientation as their original versions, since each word keeps a single entry.
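A minimal Python sketch of the FS count adjustment described above, assuming the message counts per word and class are kept in a nested dictionary (an illustrative format):

```python
def flip_negated(m_w, word, message_class):
    """Move one message count of a negated word to the opposite class."""
    opposite = "neg" if message_class == "pos" else "pos"
    m_w[word][message_class] -= 1
    m_w[word][opposite] += 1

m_w = {"good": {"pos": 10, "neg": 5}}
flip_negated(m_w, "good", "neg")  # "It is not looking good." is bearish
print(m_w)  # {'good': {'pos': 11, 'neg': 4}}
```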

In Table 6, we state all the automatic sentiment lexicon building approaches discussed in this paper. We select the five most advanced approaches (per type and per category), which we use for evaluation. For each approach, we create three different versions. The first version is made without adjusting for negation, i.e., the benchmark sentiment lexicon. The second and third versions of the sentiment lexicon are created while accounting for negation using the two previously described approaches. Hence, we construct fifteen financial sentiment lexicons in total. We compare the different financial sentiment lexicons made while accounting for negation to their benchmarks to analyze whether there is a significant difference in performance on the sentiment classification tasks.

Table 6 All sentiment lexicon building approaches

Evaluation

We evaluate the created financial sentiment lexicons in different ways. We first discuss the supervised and unsupervised classification evaluation of these lexicons. The evaluation is done internally by comparing the built financial sentiment lexicons with each other and externally by comparing them with different existing lexicons. The external comparison is made with the following general and financial sentiment lexicons:

  • Harvard General Inquirer Lexicons (GI) - The Harvard General Inquirer [17] contains a positive and a negative lexicon, which were originally derived from the Harvard-IV dictionary. Since the words lack a sentiment strength, we assign a value of \(+\)1 to the words in the positive lexicon and a value of −1 to the words in the negative lexicon.

  • MPQA Subjectivity Lexicon - The MPQA Subjectivity Lexicon has been manually built by Wilson et al. [18]. We use the prior polarity of words as the sentiment orientation. The prior polarity can either be positive, negative, neutral, or both positive and negative. An example of a word that is both positive and negative is the word “demand.” In this research, we only use the words with either a positive, neutral, or negative prior polarity. We assign a value of 0 to the words that have a neutral prior polarity. The words’ subjectivity strength indicates whether the meaning of a word is either strong or weak. Similar to Oliveira et al. [14], we use the words’ subjectivity strength to adjust the weak words’ sentiment strength to \(+\)0.5 or −0.5. In the case of a strong word, we keep the sentiment strengths of \(+\)1 and −1.

  • Hu and Liu Lexicons (HL) - Hu and Liu [2] built two lexicons, a positive and a negative lexicon. Similar to the GI lexicons, we assign a value of \(+\)1 to the words in the positive lexicon and a value of −1 to the words in the negative lexicon.

  • NRC Hashtag Sentiment Lexicon (NRC-H) - Mohammad et al. [43] created their first sentiment lexicon using the PMI measure, as described in the “Probability-Based Approaches” section. The PMI measure was applied to the words of 775,000 tweets, which were marked as either positive or negative by their hashtags. The authors used positive hashtags, such as #good, and negative hashtags, such as #bad, to identify the tweets’ sentiment orientation. In our research, we use the sentiment lexicon that consists of unigrams.

  • NRC Emoticon Sentiment Lexicon (NRC-E) - The second sentiment lexicon generated by Mohammad et al. [43] is constructed by applying the PMI measure on the sentiment140 corpus [51]. The tweets in the corpus were classified as either positive or negative based on the emoticon(s) in the tweet.

  • VADER Sentiment Lexicon - Ten individual raters rated more than 7,500 words to create the VADER sentiment lexicon [44]. The raters rated the words on a scale of [−4, \(+\)4]. Thereafter, the average of these ten ratings is taken as the sentiment strength of a word.

  • Loughran and McDonald Lexicons (LM) - Loughran and McDonald [12] constructed six lexicons out of financial 10-K documents. The lexicons are named after the sentiment they represent. These lexicons only contain words and do not contain any specific sentiment orientations or strengths. In this research, we only use the positive and negative lexicons because it is unclear which sentiment orientation we should assign to the other lexicons. We assign a positive sentiment orientation and a sentiment strength of \(+\)1 to the words in the positive lexicon. In addition, we assign a negative sentiment orientation and a sentiment strength of −1 to the words in the negative lexicon.

  • SenticNet 6.0 Lexicon - Cambria et al. [45] introduced an approach that combines both symbolic and subsymbolic models and leverages their strengths. In this research, we make use of the sixth version of the SenticNet knowledge base.

  • Stock Market Sentiment Lexicon (SM) - Oliveira et al. [14] generated a financial sentiment lexicon using the PMI measure. This sentiment lexicon was constructed by leveraging messages from StockTwits. The SM sentiment lexicon is the only external sentiment lexicon that considered negation. The authors account for negation by dividing the messages of StockTwits into an affirmative corpus and a negated corpus. Thereafter, they learn two separate sentiment strengths for each word, one without negation and one with negation.

Sentiment Classification Evaluation

In the sentiment classification evaluation, we use the obtained sentiment lexicons to classify unseen messages as either positive or negative. The evaluation is done internally by comparing the created financial sentiment lexicons and externally by comparing them with the earlier mentioned lexicons constructed by other researchers. The comparisons are made in a supervised and unsupervised manner. In the comparisons, we use different metrics, which are all based on the well-known confusion matrix.

In the unsupervised setting, the sentiment lexicon may be unable to classify a message as either positive or negative due to the insufficient coverage of the sentiment lexicon or because the sentiment strengths cancel each other out. Consequently, we can distinguish two groups of test messages in all the unseen messages (A). The first group consists of the unclassified messages (U), and the second group consists of the classified messages (C). Based on this differentiation, we compute the following evaluation metrics:

  • Overall Accuracy (ACC1): the overall percentage of correctly classified messages, \(\text {ACC1} = \frac{\text {TP} + \text {TN}}{\text {A}} = \frac{\text {TP} + \text {TN}}{\text {U} + \text {C}} = \frac{\text {TP} + \text {TN}}{\text {U} + \text {TP} + \text {FP} + \text {TN} + \text {FN}}\);

  • Unclassified (UNCL): the percentage of unclassified messages due to the insufficient coverage of the sentiment lexicon, \(\text {UNCL} = \frac{\text {U}}{\text {A}} = \frac{\text {U}}{\text {U} + \text {C}}\);

  • Classification Accuracy (ACC2): the percentage of correctly classified messages after adjusting for the unclassified messages, \(\text {ACC2} = \frac{\text {TP} + \text {TN}}{\text {C}} = \frac{\text {TP} + \text {TN}}{\text {TP} + \text {FP} + \text {TN} + \text {FN}}\);

  • Balanced Accuracy (BA): the balanced accuracy of the classified messages, \(\text {BA} = \frac{\text {TP} \times (\text {TN} + \text {FP}) + \text {TN} \times (\text {TP} + \text {FN})}{2 \times (\text {TP} + \text {FN}) \times (\text {TN} + \text {FP})}\);

  • F\(_1\) Positive (F\(_1\)Pos): the F\(_1\) measure for the positive sentiment class, \(\text {F}_1\text {Pos} = \frac{2 \times \text {TP}}{2 \times \text {TP} + \text {FP} + \text {FN}}\);

  • F\(_1\) Negative (F\(_1\)Neg): the F\(_1\) measure for the negative sentiment class, \(\text {F}_1\text {Neg} = \frac{2 \times \text {TN}}{2 \times \text {TN} + \text {FN} + \text {FP}}\);

  • Macro F\(_1\): the macro F\(_1\) measure, \(\text {Macro F}_1 = \frac{\text {F}_1\text {Pos} + \text {F}_1\text {Neg}}{2}\).
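A minimal Python sketch of these metrics, assuming the confusion-matrix counts and the number of unclassified messages are given:

```python
def evaluation_metrics(tp, fp, tn, fn, u):
    classified = tp + fp + tn + fn  # C
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    return {
        "ACC1": (tp + tn) / (u + classified),
        "UNCL": u / (u + classified),
        "ACC2": (tp + tn) / classified,
        "BA": (tp * (tn + fp) + tn * (tp + fn)) / (2 * (tp + fn) * (tn + fp)),
        "F1Pos": f1_pos,
        "F1Neg": f1_neg,
        "MacroF1": (f1_pos + f1_neg) / 2,
    }
```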

In this research, we mainly focus on the balanced accuracy and the macro F\(_1\) metric because they are combinations of the other metrics. The balanced accuracy combines the true positive rate and the true negative rate. The macro F\(_1\) measure combines the recall and precision metrics. Last, we also account for negation in the sentiment classification tasks. Here, we again follow the work of Hogenboom et al. [39] and define the negation scope as the two words following the negation cue. If we come across a negated word in a sentiment classification task that does not have a separate entry in the sentiment lexicon, we flip the word’s sentiment orientation while maintaining its sentiment strength.

Unsupervised Classification.

In the unsupervised setting, we look up the messages’ words in the sentiment lexicon and take the sum of all the individual words’ sentiment strengths to obtain an overall sentiment score of the message. If a word is not in the sentiment lexicon, its sentiment strength is set to zero such that it does not influence the overall sentiment score. If the overall sentiment score of a message is positive, then we classify the message as positive. On the other hand, if the overall sentiment score is negative, we classify the message as negative.
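A minimal Python sketch of this unsupervised classification rule, assuming `lexicon` maps words to sentiment strengths:

```python
def classify(tokens, lexicon):
    score = sum(lexicon.get(w, 0.0) for w in tokens)  # unknown words contribute 0
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return None  # unclassified: no coverage, or the strengths cancel out
```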

Supervised Classification.

Next to the unsupervised evaluation of the financial sentiment lexicons, we also evaluate the financial sentiment lexicons in a supervised manner. We start by extracting pre-defined sentiment lexicon features from the test dataset, as defined by Zhu et al. [46]. The sentiment lexicon features are as follows (a sketch of the feature extraction follows the list):

  • The number of words in message m that have a sentiment strength in the sentiment lexicon;

  • The total sentiment value of message m, which is computed by taking the sum over the sentiment strengths of all the words in m;

  • The largest sentiment strength of the words in message m;

  • The sum of sentiment strengths of the words in message m that have a positive sentiment orientation;

  • The sum of sentiment strengths of the words in message m that have a negative sentiment orientation;

  • The sentiment strength of the last word in message m.
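A minimal Python sketch of this feature extraction, assuming `lexicon` maps words to sentiment strengths:

```python
def lexicon_features(tokens, lexicon):
    strengths = [lexicon[w] for w in tokens if w in lexicon]
    return [
        len(strengths),                              # words found in the lexicon
        sum(strengths),                              # total sentiment value
        max(strengths, default=0.0),                 # largest sentiment strength
        sum(s for s in strengths if s > 0),          # sum of positive strengths
        sum(s for s in strengths if s < 0),          # sum of negative strengths
        lexicon.get(tokens[-1], 0.0) if tokens else 0.0,  # strength of last word
    ]
```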

We use the sentiment lexicon features as input for the supervised sentiment classification. We train a linear classifier with LibLinear [47]. A linear classifier works well on a large number of features, and it supports interpretability. We perform a grid search on accuracy, using five-fold cross-validation, to tune the type of classifier and the hyperparameter \(\alpha\), which represents the cost of constraint violations. As described by Fan et al. [47], we consider six different types of multi-class classifiers.
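As an impression of this step, the following is a minimal sketch with scikit-learn, whose LogisticRegression wraps LIBLINEAR; here C plays the role of the cost hyperparameter, and only one of the classifier types considered in the paper is shown:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},  # cost of constraint violations
    scoring="accuracy",
    cv=5,  # five-fold cross-validation
)
# grid.fit(X_train, y_train); grid.best_estimator_.predict(X_test)
```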

Results

In this section, we discuss the created financial sentiment lexicons and their evaluation. First, we give more details about constructing the financial sentiment lexicons and provide more insight into them. Thereafter, we look at the performance of the obtained financial sentiment lexicons in various sentiment classification tasks. The constructed financial sentiment lexicons and the R and Python implementation codes used to produce these are made available at https://github.com/ThomasJABos/Financial-Sentiment-Lexicons-Negation.

Building Financial Sentiment Lexicons

In this research, we make use of three datasets. The first dataset is used to construct the sentiment lexicons and test the sentiment lexicons. The second and third datasets are solely meant as complementary datasets for the sentiment classification tasks. The first dataset consists of collected messages from StockTwits. We received permission from StockTwits to use their database to collect these messages. StockTwits users can mark their messages as bullish or bearish. We set the overall sentiment values of the messages to +1 and -1, respectively. We collect 10,000 bullish and 10,000 bearish messages for each month in the year 2019. Hence, we collect a total of 240,000 messages. An advantage of collecting messages each month is that topics differ monthly, which results in a richer vocabulary of words. In addition, the advantage of having an equal number of messages in each sentiment class is that there are words that only occur in messages that belong to one of the two sentiment classes. Both advantages lead to increased coverage of the financial sentiment lexicons. The second and third datasets are made available by Cortis et al. [48]. The second dataset is the microblogging dataset, and the third dataset is the financial headlines dataset. We state an overview of the number of messages in each dataset in Table 7.

Table 7 Number of messages in the datasets

Before we can use the StockTwits messages to construct the financial sentiment lexicons, we undertake some preprocessing steps to clean the messages. We start by removing URLs, user mentions, and cashtags. One reason for removing cashtags is to prevent cashtags from being labeled with a sentiment orientation and strength related to the time period. We also remove punctuation, emoticons, and emojis to ease the (pre)processing steps. The emoticons and emojis could be indicators of sentiment, but these are outside the scope of this research. In addition, intentional spelling mistakes, such as “boreddd” and “cooool,” could also carry sentiment. In this study, we do not correct these intentional spelling mistakes. Furthermore, we remove simple stopwords from the messages. Stopwords are words that often do not provide any additional information or insight [49]. Examples of stopwords are “a” and “the.” We use the list of stopwords introduced by Feinerer et al. [50].

Finally, we process the numbers in the messages. The numbers in the messages can be classified into three categories. The first category contains numbers that are followed by a percentage sign, %. This category contains all the percentage increases and decreases. We replace the percentage increases, e.g., +15%, with posperc, and the percentage decreases, e.g., -18%, with negperc. The second category consists of numbers that indicate increases or decreases but are not followed by a percentage sign. We replace the increases, e.g., +15, with posnum and the decreases, e.g., -18, with negnum. The reason for replacing the increases and decreases with a tag is that we want to prevent single numbers from receiving a sentiment orientation and strength. By replacing them with tags, we still maintain the sentiment value of the numbers. The last category consists of single numbers without a sign, i.e., without a \(+\) or −. An example of a phrase that does not contain a sign is the following phrase: “selling at 50.2”. The number does not have a meaning without knowing the context, which is, in this case, the stock price of a particular stock. We remove the numbers of the third category.
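A minimal Python sketch of this number preprocessing; the tag names follow the text above, and the regular expressions are illustrative:

```python
import re

def replace_numbers(text):
    text = re.sub(r"\+\d+(\.\d+)?%", "posperc", text)  # e.g., +15%
    text = re.sub(r"-\d+(\.\d+)?%", "negperc", text)   # e.g., -18%
    text = re.sub(r"\+\d+(\.\d+)?", "posnum", text)    # e.g., +15
    text = re.sub(r"-\d+(\.\d+)?", "negnum", text)     # e.g., -18
    return re.sub(r"\b\d+(\.\d+)?\b", "", text)        # drop unsigned numbers

print(replace_numbers("up +15% today, selling at 50.2"))
# 'up posperc today, selling at '
```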

The financial sentiment lexicons are constructed using 200,000 messages from StockTwits. The training set consists of 100,000 messages with a positive sentiment class and 100,000 messages with a negative sentiment class. In this research, we focus on building financial sentiment lexicons that only contain unigrams, i.e., single words. The reason for focusing only on unigrams is that the computation time significantly increases if we also consider n-grams, i.e., sets of n words, as entries for our financial sentiment lexicon. We only consider unigrams that occur at least five times in our dataset. We refer to the lexicons built without accounting for negation as the original financial sentiment lexicons. Next, we refer to the lexicons constructed using the first negation approach, which creates separate entries for the negated words, as the NW financial sentiment lexicons. Last, the FS financial sentiment lexicons are the lexicons created with the second negation approach, which is based on the principle that the negated words belong to the opposite sentiment class of the message.

Sentiment Classification Evaluation

We start by performing the sentiment classification in an unsupervised setting. In the unsupervised setting, we evaluate the sentiment lexicons of each category on the three test datasets. We start with the StockTwits test dataset, then evaluate on the microblogging dataset, and finally discuss the financial headlines dataset. The unsupervised setting is followed by the sentiment classification in a supervised setting. In the supervised setting, we need to train the linear classifiers and test the sentiment lexicons using a test set. In order to have a well-trained classifier and, at the same time, enough test messages remaining, we need a sufficiently large test set. Therefore, we only evaluate the financial sentiment lexicons on the StockTwits test dataset in the supervised setting.

Unsupervised Sentiment Classification Evaluation

Table 8 shows the evaluation metrics of the fifteen financial sentiment lexicons on the StockTwits test dataset. The sentiment lexicons built using the BTB, W-PMI, W-NPMI, and ICF approaches all have similar values for the evaluation metrics. However, the sentiment lexicons constructed using the SNN approach have dissimilar values for the evaluation metrics compared to the other approaches. Looking at the balanced accuracy and the macro F\(_1\) measure, we see that the sentiment lexicons of the BTB, W-PMI, and W-NPMI approaches, the probability-based approaches, perform slightly better than the sentiment lexicons of the ICF approach. In the category of original sentiment lexicons, we notice that the BTB and W-NPMI sentiment lexicons have a balanced accuracy of 73.2% and a macro F\(_1\) measure of 72.7%.

If we look at the NW category’s sentiment lexicons, we notice that all the evaluation metrics of the BTB, W-PMI, W-NPMI, and ICF sentiment lexicons have been improved. Hence, accounting for negation while building the sentiment lexicons pays off. Finally, looking at the FS lexicons, we notice that they perform slightly worse than the other two categories’ sentiment lexicons. Overall, based on the balanced accuracy and the macro F\(_1\) measure, the BTB and W-PMI NW sentiment lexicons perform the best on the StockTwits dataset. The BTB and W-PMI NW sentiment lexicons have a balanced accuracy of 73.5% and a macro F\(_1\) measure of 73.0%.

Table 8 Evaluation metrics of the financial sentiment lexicons in unsupervised sentiment classification on the StockTwits dataset

Table 9 shows the evaluation metrics of the financial sentiment lexicons on the microblogging dataset. Similar to Table 8, we notice that the sentiment lexicons of the SNN approach have dissimilar evaluation metrics compared to the sentiment lexicons of the other approaches. Furthermore, in all the categories, the sentiment lexicons created using the W-NPMI approach have the highest values compared to the other sentiment lexicons. Moreover, the W-NPMI sentiment lexicon of the FS category overall has the highest values for the evaluation metrics with a balanced accuracy of 72.5% and a macro F\(_1\) measure of 73.5%. In addition, we notice that the balanced accuracy and macro F\(_1\) measure of all approaches are similar or higher for the sentiment lexicons that account for negation. Looking at the probability-based approaches, we notice that the FS approach leads to a higher balanced accuracy and macro F\(_1\) measure compared to the NW approach. However, the opposite is true for the ICF sentiment lexicons.

Table 9 Evaluation metrics of the financial sentiment lexicons in unsupervised sentiment classification on the microblogging dataset

In Table 10, one finds the evaluation metrics of the financial sentiment lexicons on the headlines dataset. We notice again that the values of the SNN sentiment lexicons’ evaluation metrics are dissimilar compared to the other sentiment lexicons. In the category with the original sentiment lexicons, we see that the BTB sentiment lexicon slightly outperforms the other lexicons in this category. In the second category, the NW sentiment lexicons, the W-NPMI sentiment lexicon slightly outperforms the other lexicons based on the balanced accuracy and macro F\(_1\) metric. The W-PMI sentiment lexicon of the FS category performs slightly better than the other sentiment lexicons that belong to this category. Based on the balanced accuracy and the macro F\(_1\) measure, we select the W-PMI sentiment lexicon of the FS category as our best performing sentiment lexicon on the headlines dataset. This sentiment lexicon has a balanced accuracy of 62.3% and a macro F\(_1\) metric of 62.1%. In addition, we notice that accounting for negation while constructing the financial sentiment lexicons leads to an increase in the F\(_1\)Pos measure. Looking at the balanced accuracy and the macro F\(_1\) measure, we see that the FS lexicons outperform the original sentiment lexicons, except for the W-NPMI and SNN sentiment lexicons. Furthermore, we notice that the W-NPMI NW sentiment lexicon performs better than the other two categories’ W-NPMI sentiment lexicons.

Overall, we notice that the probability-based approaches perform better than the information retrieval-based and the sentiment-aware word embedding-based approaches. Moreover, our newly introduced weighted versions of the PMI approaches perform better than the other approaches, and the lexicons’ quality can be improved by accounting for negation while building them. Looking at the balanced accuracy and the macro F\(_1\) measure across the three test datasets, the SNN approach benefits from the NW approach, while the other four approaches benefit from either the NW or the FS approach, depending on the test dataset.
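For context, the classic unweighted PMI sentiment score, which the W-PMI and W-NPMI approaches reweight, can be sketched as follows. The exact weighting and normalization the paper introduces are defined in its method sections and are not reproduced here; the corpus format, label names, and message-level counting are assumptions.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

def pmi_lexicon(corpus: Iterable[Tuple[List[str], str]]) -> Dict[str, float]:
    """Classic PMI-based sentiment strengths:
    score(w) = PMI(w, bullish) - PMI(w, bearish),
    with co-occurrences counted at the message level."""
    word_class, word, cls = Counter(), Counter(), Counter()
    total = 0
    for tokens, label in corpus:
        for w in set(tokens):               # one count per message
            word_class[(w, label)] += 1
            word[w] += 1
            cls[label] += 1
            total += 1
    eps = 1e-12                             # smoothing to avoid log(0)

    def pmi(w: str, c: str) -> float:
        return math.log((word_class[(w, c)] / total + eps)
                        / ((word[w] / total) * (cls[c] / total) + eps))

    return {w: pmi(w, "bullish") - pmi(w, "bearish") for w in word}
```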

Table 10 Evaluation metrics of the financial sentiment lexicons in unsupervised sentiment classification on the headlines dataset

After selecting the best financial sentiment lexicon for each test set, we compare these financial sentiment lexicons with the external sentiment lexicons mentioned in the “Evaluation” section. In Table 11, one finds the evaluation metrics of the external lexicons. We notice that the manually made sentiment lexicons, such as the GI and LM lexicons, struggle to classify the messages as either positive or negative. The LM sentiment lexicon is, on average, unable to classify approximately 70% of the test messages. These high percentages of unclassified messages confirm the low-coverage disadvantage of manually made sentiment lexicons discussed in the “Related Work” section. Therefore, we only compare external sentiment lexicons whose percentage of unclassified messages is low and similar to that of our best financial sentiment lexicons: the NRC-H lexicon [43] and the SM sentiment lexicon [14]. In addition, we notice that the evaluation metrics of the NRC-H sentiment lexicon are lower than those of the SM sentiment lexicon and the best financial sentiment lexicon. This result is in line with our expectations, because the NRC-H sentiment lexicon, unlike the other two lexicons, was not explicitly constructed for the financial domain.
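The unclassified percentage discussed above follows directly from the classification sketch given earlier; a minimal helper (names are ours) could look like this:

```python
from typing import Dict, List

def unclassified_percentage(messages: List[List[str]],
                            lexicon: Dict[str, float]) -> float:
    """Share of messages the lexicon cannot classify, i.e., where
    classify_message (see the earlier sketch) returns None."""
    predictions = [classify_message(tokens, lexicon) for tokens in messages]
    return 100.0 * sum(p is None for p in predictions) / len(predictions)
```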

Table 11 Evaluation metrics of the external sentiment lexicons in unsupervised sentiment classification on the test datasets

Similar to the previous unsupervised comparisons, we focus on the balanced accuracy and the macro F\(_1\) measure. The BTB and W-PMI NW sentiment lexicons outperform both considered external sentiment lexicons on the StockTwits dataset, with a balanced accuracy of 73.5% and a macro F\(_1\) measure of 73.0%. On the microblogging dataset, the W-NPMI FS lexicon slightly outperforms the NRC-H and SM sentiment lexicons. On the headlines dataset, however, the SM sentiment lexicon slightly outperforms the W-PMI FS lexicon. Overall, the newly introduced W-PMI and W-NPMI sentiment lexicons built while accounting for negation perform very well compared to both the other internally built lexicons and the external sentiment lexicons.

Supervised Sentiment Classification Evaluation

Next to the unsupervised sentiment classification, we also perform supervised sentiment classification, using a linear classifier introduced by Fan et al. [47]. First, we extract the six sentiment lexicon features for each message in the StockTwits test dataset, as described in the “Sentiment Classification Evaluation” section. Hereafter, we split the StockTwits test set, using a fixed seed, into an 80% training set to train the linear classifier and a 20% test set to evaluate the sentiment lexicons. The training set consists of 16,000 positive and 16,000 negative messages; the test set consists of 4,000 positive and 4,000 negative messages.
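A minimal sketch of this feature extraction and split is given below. The six features shown are an assumption (counts and score aggregates per polarity), since the paper’s exact feature set is defined in its earlier section; the toy lexicon and messages are purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def lexicon_features(tokens, lexicon):
    """Six illustrative lexicon features per message; an assumption, as
    the paper's exact six features are defined in its 'Sentiment
    Classification Evaluation' section."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    pos = [s for s in scores if s > 0]
    neg = [s for s in scores if s < 0]
    return [len(pos), len(neg), sum(pos), sum(neg),
            max(scores, default=0.0),
            scores[-1] if scores else 0.0]   # score of last covered token

# Hypothetical toy data for illustration only.
lexicon = {"beat": 0.8, "rally": 0.7, "miss": -0.9, "plunge": -0.8}
base = [(["earnings", "beat"], "bullish"), (["big", "miss"], "bearish"),
        (["rally", "ahead"], "bullish"), (["shares", "plunge"], "bearish")]
messages, labels = zip(*(base * 5))          # 20 balanced messages

X = np.array([lexicon_features(m, lexicon) for m in messages])
y = np.array(labels)

# Stratified 80%/20% split with a fixed seed, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```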

We perform a grid search on the accuracy to tune the type of classifier and the regularization hyperparameter \(C\), using five-fold cross-validation on the training set. We consider the six different types of multi-class classifiers described by Fan et al. [47] and let the values of \(C\) vary from 0.0001 to 1000. The optimal classifier is the L2-regularized logistic regression, which is the same type of classifier as Tang et al. [26] used.
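A hedged sketch of this grid search follows, using scikit-learn’s LIBLINEAR-backed estimators as stand-ins for (some of) the six classifier types of Fan et al. [47]; the intermediate grid values beyond the stated 0.0001–1000 range, and the synthetic data, are our assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the six lexicon features and binary labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 6))
y_train = rng.choice(["bullish", "bearish"], size=200)

# Candidate LIBLINEAR-style linear classifiers.
candidates = [
    LogisticRegression(penalty="l2", solver="liblinear"),  # L2-regularized LR
    LinearSVC(penalty="l2", loss="squared_hinge"),         # L2-reg., L2-loss SVC
    LinearSVC(penalty="l2", loss="hinge"),                 # L2-reg., L1-loss SVC
]
param_grid = {"C": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# Select the classifier type and C with the best five-fold CV accuracy.
best_model, best_score = None, -1.0
for clf in candidates:
    search = GridSearchCV(clf, param_grid, scoring="accuracy", cv=5)
    search.fit(X_train, y_train)
    if search.best_score_ > best_score:
        best_model, best_score = search.best_estimator_, search.best_score_
```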

In Table 12, one finds the evaluation metrics of the financial sentiment lexicons on the StockTwits test set. In general, the lexicons of the BTB, W-PMI, W-NPMI, and ICF approaches score similarly on the evaluation metrics across all three categories, as in the unsupervised setting. However, the evaluation metrics of the SNN lexicons of the original and NW categories are now much closer to those of the other approaches, which was not the case in the unsupervised sentiment classification. This difference indicates that the SNN sentiment lexicons’ sentiment strengths are better suited for supervised than for unsupervised sentiment classification. A possible explanation for the SNN sentiment lexicons’ lower performance in both sentiment classification tasks is the limited number of training messages used in the neural network to compute the sentiment strengths, as neural networks tend to perform better when trained on large datasets.

Similar to the unsupervised setting, we focus on the balanced accuracy and the macro F\(_1\) measure. The W-NPMI and ICF sentiment lexicons slightly outperform the BTB and W-PMI sentiment lexicons in the original category, and the W-NPMI sentiment lexicons slightly outperform the other sentiment lexicons in the negation categories. In addition, the lexicons’ quality can again be improved by accounting for negation while building them: the balanced accuracy and macro F\(_1\) measure of the NW sentiment lexicons are higher than those of the other two categories, while those of the FS lexicons are similar to those of the original sentiment lexicons. Overall, the W-NPMI NW sentiment lexicon performs slightly better than the other sentiment lexicons, with a balanced accuracy of 75.1% and a macro F\(_1\) measure of 75.1%.

Table 12 Evaluation metrics of the financial sentiment lexicons in supervised sentiment classification on the StockTwits test set

After selecting the best financial sentiment lexicon for the StockTwits test set in the supervised sentiment classification setting, we compare this financial sentiment lexicon with the external sentiment lexicons mentioned in the “Evaluation” section. Table 13 shows the external sentiment lexicons’ evaluation metrics and the best financial sentiment lexicon on the StockTwits test set.

Table 13 Evaluation metrics of the external sentiment lexicons and the best financial sentiment lexicon in supervised sentiment classification on the StockTwits test set

In general, the balanced accuracy and macro F\(_1\) measure of all the external sentiment lexicons are very similar, except for the SM lexicon. Based on these two metrics, the SM lexicon is the best performing external sentiment lexicon, with a balanced accuracy of 66.5% and a macro F\(_1\) measure of 66.5%. However, the W-NPMI NW sentiment lexicon attains significantly higher values for the balanced accuracy and the macro F\(_1\) measure than all the external sentiment lexicons.

Conclusion

The financial domain currently lacks dedicated sentiment lexicons. In this research, we discuss several approaches to build financial sentiment lexicons automatically and introduce two new approaches, namely the W-PMI and W-NPMI approaches. Furthermore, we propose two different methods to account for negation while building the sentiment lexicons. The first method, the NW approach, creates a separate entry in the lexicon for the word’s negated version. The second method, the FS approach, considers the negated word to have a sentiment orientation opposite to the message’s sentiment; this way, the method corrects for the negation without creating a new entry for the negated version of the word in the sentiment lexicon. We evaluate the constructed sentiment lexicons in two different sentiment classification tasks by comparing them with each other and with external sentiment lexicons created by other researchers.
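To make the two negation methods concrete, a minimal preprocessing sketch under our own naming is given below: NW rewrites a negated token into a separate lexicon entry, while FS keeps the token but counts its occurrence toward the opposite class. The cue list and the fixed negation scope are assumptions, not the paper’s exact rules.

```python
NEGATION_CUES = {"not", "no", "never", "n't"}   # explicit cues, as in the text
OPPOSITE = {"bullish": "bearish", "bearish": "bullish"}

def preprocess(tokens, label, method="NW", scope=2):
    """Yield (token, label) pairs used to count word-class co-occurrences.
    'scope' (how many tokens after a cue count as negated) is an
    assumption, not the paper's exact scope rule."""
    negated_until = -1
    for i, tok in enumerate(tokens):
        if tok in NEGATION_CUES:
            negated_until = i + scope
            continue
        if i <= negated_until:
            if method == "NW":
                yield tok + "_NEG", label       # separate negated entry
            else:                               # "FS": flip the sentiment
                yield tok, OPPOSITE[label]      # count toward opposite class
        else:
            yield tok, label

# E.g., list(preprocess(["not", "good"], "bullish", "NW"))
#   -> [("good_NEG", "bullish")]
# list(preprocess(["not", "good"], "bullish", "FS"))
#   -> [("good", "bearish")]
```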

The first sentiment classification task evaluates the sentiment lexicons in an unsupervised setting across three different test sets, consisting of StockTwits messages, microblogging messages, and financial headlines. In this unsupervised setting, we focused on the balanced accuracy and the macro F\(_1\) measure. We noticed that the probability-based approaches achieved higher metrics than the other types of approaches, and that the sentiment lexicons achieved higher scores when either of the two proposed negation approaches was applied while building them. Moreover, the W-PMI and W-NPMI sentiment lexicons outperformed all the internal and external sentiment lexicons in the unsupervised sentiment classification task.

In the second sentiment classification task, we evaluate the financial sentiment lexicons in a supervised setting. Again, we noticed that the quality of the sentiment lexicons could be improved by accounting for negation while building them; in particular, the lexicons that account for negation using the NW approach achieve higher scores for the evaluation measures on the test set. The W-NPMI NW sentiment lexicon slightly outperforms the other financial sentiment lexicons and attains significantly higher scores than all the external sentiment lexicons.

In both considered sentiment classification tasks (unsupervised and supervised), the probability-based approaches outperformed the other types of approaches. Compared to the baseline of not accounting for negation, building the financial sentiment lexicons while accounting for negation using either the proposed NW approach or the FS approach improves the lexicons. In general, the financial sentiment lexicon obtained using the proposed W-NPMI approach combined with the NW approach performs best.

The constructed financial sentiment lexicons could be further improved in different ways. In this research, we focused on explicit negation cues, such as “not” and “never.” A possible future research direction is to also consider implicit negation cues, such as “dislike” and “hopeless.” Furthermore, the financial sentiment lexicons could be improved by taking into account intensifiers, such as “really” and “very,” and downtoners, such as “hardly” and “slightly,” while constructing the sentiment lexicons. In addition, the sentiment lexicons could be refined by taking into account emoticons and emojis, which are becoming increasingly popular in microblogs [52, 53]. Last, we plan to apply the introduced W-PMI and W-NPMI lexicon-building approaches, together with the negation handling, to other domains, such as the consumer product domain.