Introduction

With the tremendous increase in the use of social networking sites such as Twitter and Facebook, online communities are exchanging information in the form of opinions, sentiments, emotions, and intentions, which reflect their affiliations and attitudes towards entities, events and policies [1,2,3]. The propagation of extremist content has also been increasing and is considered a serious issue in the recent era due to the rise of militant groups such as the Irish Republican Army, Revolutionary Armed Forces of Colombia (FARC), Al Qaeda, ISIS (Daesh), Al Shabaab, Taliban, Hezbollah and others [4]. These groups have not only spread their roots at the community level; their networks are also gaining control of social networking sites [5]. These networking sites are vulnerable and approachable platforms for group strengthening, propaganda, brainwashing, and fundraising due to their massive impact on public sentiments and opinions.

Opinions expressed on such sites give an important clue about the activities and behavior of online users. Detection of such extremist content is important to analyze user sentiment towards some extremist group and to discourage such associated unlawful acts. It is also beneficial in terms of classifying user’s extremist affiliation by filtering tweets prior to their onward transmission, recommendation or training AI Chatbot from tweets [6].

The traditional techniques of filtering extremist tweets are not scalable, inspiring researchers to develop automated techniques. In this study, we focus on the problem of classifying a tweet as extremist or non-extremist. The task faces different challenges, such as different kinds of extremism, various targets and multiple ways of representing the same semantics. The existing studies of extremism informatics are based on classical machine learning techniques [7, 8] or use classical feature representation schemes followed by a classifier.

In their work, Wei et al. [8] proposed a machine learning-based classification system for identifying extremist-related conversations on Twitter. Different features are investigated for identifying extremist behavior on Twitter based on public tweets by applying a KNN classifier. Based on social media communication, Azizan and Aziz [7] conducted a study for the detection of extremist affiliations using a machine learning technique, namely the Naïve Bayes algorithm, which showed the best results over other ML classifiers. However, in the state-of-the-art work [7] on extremist affiliation detection, the authors applied a machine learning classifier with classical features. Furthermore, they classified user reviews into positive and negative sentiments reflecting affiliations with extremist groups. However, classifying tweets into positive and negative classes does not provide an efficient way of distinguishing between extremist and non-extremist tweets. Another major limitation of their approach is that it lacks the ability to capture the overall dependencies of a sentence in a document. Therefore, the machine learning model does not provide an efficient way of classifying text into extremist and non-extremist.

To overcome the aforementioned limitations of the state-of-the-art study [7], we investigate deep learning-based sentiment analysis techniques, which have already shown promising performance across a large number of complicated problems in different domains like vision, speech and text analytics [9, 10]. We propose to apply an LSTM-CNN model, which works as follows: (i) the CNN model is applied for feature extraction, and (ii) the LSTM model receives input from the output of the CNN model and retains the sequential correlation by taking into account the previous data, capturing the global dependencies of a sentence in the document with respect to tweet classification into extremist and non-extremist.

We treat extremist affiliation detection as a binary classification task. We take the training set Tr = {t1, t2, t3, …, tn} with class tags (labels) Extrimist_affliation = {yes, no}. Each tweet is assigned a tag. The aim is to design a model which can learn from the training data set and can classify a new tweet as either extremist or non-extremist. Twitter-based messaging is a major element of communication among individuals and groups, including extremists and extremist groups. Using this sort of communication, future terrorist activities can potentially be traced. We propose a technique to identify tweets containing such content. Additionally, we classify sentiments of users in terms of emotional affiliations expressed towards individuals and groups having extremist thoughts. For this purpose, we apply the IBM Watson API for tone analysis [11].

In this work, we experiment with multiple Machine Learning (ML) classifiers such as Random Forest, Support Vector Machine, K-Nearest Neighbors, and Naïve Bayes, as well as deep learning (DL) classifiers. The feature set for such classifiers is encoded by task-driven embedding trained over different classifiers: CNN, LSTM, and CNN + LSTM. As baselines, we compare with a feature set which consists of n-grams [12], TF-IDF, and bag of words (BoW) [13].

The proposed system aims at applying deep learning-based sentiment analysis technique to answer following research questions:

  • RQ#1: How to recognize and classify tweets as extremist vs non-extremist, by applying deep learning-based sentiment analysis techniques?

  • RQ#2: What is the performance of classical feature sets like n-grams, TF-IDF, and bag-of-words (BoW) over word embedding learned using CNN, LSTM, FastText, and GRU?

  • RQ#3: What is the performance of proposed technique for extremist affiliation classification with respect to the state-of-the-art methods?

  • RQ#4: How to perform the sentiment classification of user reviews w.r.t emotional affiliations of Extremists on Twitter and Deep Web?

Following contributions are made in this study:

  i. Classifying user reviews (tweets) as extremist or non-extremist affiliations using deep learning-based sentiment analysis techniques.

  ii. Investigating classical feature sets like n-grams, TF-IDF, and bag-of-words (BoW) over word embedding learned using CNN, LSTM, FastText, and GRU, for tweet classification as extremist and non-extremist.

  iii. Sentiment classification of user reviews w.r.t. emotional affiliations of extremists on Twitter and the Deep Web.

  iv. Comparing the efficiency of the proposed model with other baseline methods.

  v. Our method outperforms baseline methods by a significant margin in terms of improved precision, recall, F-measure and accuracy.

The rest of the article is organized as follows: related work is presented in the “Related work” section; the “Proposed approach” section presents the proposed methodology; the “Experimental setup” section presents the experimental setup; the section that follows it analyzes the results obtained from the experiments; and the “Conclusions and future work” section concludes the study and gives recommendations for future work.

Related work

In this section, we present a review of relevant studies conducted on the classification of social media-based extremist affiliations.

With the development of machine learning, it has gradually been applied to the analysis of extremist content and sentiments. Ferrara et al. [5] applied machine learning techniques to social media text to detect the interaction of extremist users. The proposed system was evaluated on a set of more than 20,000 tweets generated from extremist accounts, which were later suspended by Twitter. The main emphasis was on three tasks, namely: (i) detection of extremist users, (ii) identifying users with extremist content, and (iii) predicting users’ responses to extremists’ postings. The experiments are conducted in two dimensions, i.e. time-independent and real-time prediction tasks. An accuracy of about 93% is achieved with respect to extremist detection. With the same purpose, a machine learning-based technique is proposed by [7] for classifying extremist affiliations. The Naïve Bayes algorithm is applied with the classical feature set. The system is based on the classification of user reviews into positive and negative classes, with less focus on identifying which sentiment class (positive or negative) is associated with extremist communication. In contrast to the work of Ferrara et al. [5], which mainly emphasizes the classification of extremist affiliations on skewed data, their method applies the NB algorithm on balanced data, giving more robust results. However, the overall dependencies in the sentence are not considered. This issue can be handled by applying deep learning models based on word embedding features. Researchers have also begun to investigate various ways of automatically analyzing extremist affiliations in languages other than English. In this connection, Hartung et al. [13] proposed a machine learning technique for detecting extremist posts in German Twitter accounts. Different features are experimented with, such as emotions, linguistic patterns, and textual clues. The system yielded improved results over the state-of-the-art works.

Studies on classifying extremist affiliations in the context of social media content are also noticeable in illegal drug usage. For example, in their work on marijuana-related microblogs, Nguyen et al. [14] collected more than thirty thousand tweets pertaining to marijuana during 2016. The text mining technique provides some useful insights into the acquired data, such as (i) user attitude can be categorized as positive or negative, (ii) more than 65% of tweets originate from mobile phones, and (iii) the frequency of tweets on weekends is higher than on other days.

Lexicon-based unsupervised techniques for sentiment classification mainly rely on a sentiment lexicon and sentiment scoring modules [15]. Like other areas of sentiment analysis, extremist affiliation has been investigated by Ryan et al. [16], who proposed a novel technique based on part-of-speech tagging and sentiment-driven detection of extremist writers from web forums. The study was based on about 1 million posts from more than 25,000 distinct users on four extremist forums. The proposed method was based on the user’s sentiment score, computed by aggregating scores for the number of negative posts, the duration of negative posts and the severity of negative posts. The system is flexible enough to detect suspicious online activities of extremist users. In 2012, Chalothorn and Ellman [17] proposed a sentiment analysis model to analyze online radical posts using different lexical resources such as SentiWordNet, WordNet and the NLTK toolkit. The sentiment class and intensity of the text are computed. Initially, textual data were acquired from different web forums such as Montada and Qawem, and after performing the necessary pre-processing tasks, different feature-driven measures were applied to detect and manipulate religious and extremist content. Experimental results show that the Montada forum has more positive postings than the Qawem forum. It was concluded that the Qawem forum suffers from more radical postings. Another noteworthy work was carried out by [18], who collected a huge dataset from a YouTube group with extremist ideology. Different sentiment analysis techniques were applied to examine the topics under discussion, which were classified into positive (radical) and negative (non-radical) classes. Furthermore, gender-wise sentiments were also highlighted to observe the opinions expressed by male and female users.

Unsupervised techniques like clustering, have successfully been applied in different domains like aspect-based sentiment analysis [19], stock prediction [20] and sentiment classification. Skillicorn [21], in his work on crime investigations, proposed a framework for the adversarial analysis of data. The framework is comprised of three major segments including data collection, detection of suspects, and finding of suspicious individuals using network-driven association methods. Another method is based on the data clustering and visual analysis techniques for the investigation and implementation of terrorism informatics [22]. For this purpose, authors have used Twitter to detect and classify terrorist events by utilizing civilian sentiments.

Hybrid approaches for developing sentiment-based applications have received considerable attention from researchers in different domains, such as business, health-care and politics [23]. In such approaches, different features of supervised, unsupervised and semi-supervised techniques are adopted [19]. In the context of extremist affiliation classification, Zeng et al. [24] worked on the Chinese text segmentation issue in the terrorism domain using a suffix tree and mutual information. The core module uses mutual information and the suffix tree for manipulating data in the terrorism domain. The technique is applicable to processing huge amounts of Chinese textual data. Analyzing militant conversations online, Prentice et al. [25] investigated the intents and content generated during militants’ conversations on social media with respect to the Gaza violence in 2008/2009. Over 50 online text conversations were analyzed by applying both qualitative and quantitative techniques. Their proposed system includes a manual coding approach to detect the presence of a persuasive metaphor and the semantics of the underlying text.

The aforementioned studies on detection and classification of social media-based extremist affiliations have used different approaches, such as supervised machine learning, an unsupervised technique like lexicon-based and clustering-based, and hybrid models. However, there is a need to investigate the applicability of state of the art sentiment-based deep learning models for classifying extremist affiliations using social media content.

Proposed approach

Data collection

We used the Twitter streaming API [26] to scrape tweets containing one or more extremism-related keywords (ISIS, bomb, suicide etc.). Furthermore, we also investigated different Dark Web forums, such as Al-Firdaws, Montada, alokab, and Islamic Network [27]. The first three are Arabic forums and the fourth one is in English. We collected over 25,000 postings, translating the non-English postings to English using the Python-based Google Translate API (https://pypi.org/project/googletrans/). Each review is matched against the seed words present in the manually built extremist vocabulary lexicon acquired from BiSAL [28], a bilingual sentiment lexicon for analyzing dark web forums. In this way, all postings containing one or more keywords from the manually acquired lexicon are collected. For this purpose, we used a Python-based Beautiful Soup script (https://codeburst.io/web-scraping-101-with-python-beautiful-soup-bb617be1f486).
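The keyword-matching step described above can be sketched as follows. This is only a minimal illustration, not the script used in the study: the function names `contains_seed_word` and `filter_postings`, the three-word lexicon and the sample postings are hypothetical stand-ins for the BiSAL-derived vocabulary and the collected data.

```python
import re

# Hypothetical seed words; the study uses a lexicon derived from BiSAL [28].
SEED_WORDS = {"isis", "bomb", "suicide"}


def contains_seed_word(text, lexicon=SEED_WORDS):
    """Return True if the posting contains at least one lexicon keyword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in lexicon for token in tokens)


def filter_postings(postings, lexicon=SEED_WORDS):
    """Keep only the postings that mention one or more seed words."""
    return [p for p in postings if contains_seed_word(p, lexicon)]


# Hypothetical sample postings.
sample = [
    "ISIS claims responsibility for the attack",
    "Lovely weather in the city today",
]
kept = filter_postings(sample)
```

Matching is done on whole tokens, so a word such as "bombastic" is not mistaken for the keyword "bomb"; whether the original script matched whole words or substrings is not stated in the text.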

The acquired data is stored in a machine-readable “.CSV” file. In this way, we acquired manually tagged training datasets for conducting experiments. The training dataset comprises 12,754 tweets labeled as “extremist” and 8432 as “non-extremist” [29]. Table 1 shows the details of the dataset used. Table 2 shows a sample list of frequently occurring terms in the dataset, showing term frequency (tf), document frequency (df) and user frequency (uf).

Table 1 Dataset statistics
Table 2 Top 25 frequently occurring terms

Preprocessing

We applied different preprocessing techniques, such as tokenization, stop word removal, case conversion, and special symbol removal [30]. The tokenization yields a set of unique tokens (356,242), which assist in building a vocabulary from the training set, used for encoding the text.
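The preprocessing steps listed above can be sketched in a few lines. The sketch is illustrative only: the stop-word list is a deliberately tiny, hypothetical one, the helper names are assumptions, and the order of the steps is one reasonable choice rather than the order used in the study.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"our", "and", "i", "you", "the", "a", "is"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case, strip special symbols, tokenize, and drop stop words."""
    text = text.lower()                       # case conversion
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # special symbol removal
    tokens = text.split()                     # whitespace tokenization
    return [t for t in tokens if t not in stop_words]


def build_vocabulary(token_lists):
    """Map each unique token to an integer index, starting at 1."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


tokens = preprocess("Baghdadi… our last and only hope, I simply love you")
vocab = build_vocabulary([tokens])
```

The resulting vocabulary is what the encoding step uses to turn each tweet into a sequence of integer indices.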

Training, validation and testing

We divided the dataset into three parts: train, validate, and test. The DL model is trained with the Keras library [31] based on TensorFlow. The hardware comprises a node with 4 Titan X GPUs, 128 GB of memory and an Intel Core i7 processor. Figure 1 shows a diagrammatic representation of the train, validation and test split.

Fig. 1 Train, validation and test split

Training data

Training data is used to train the model. In this work, 80% of the data is used for training, although this share may vary with the requirements of the experiment. The training data includes both the input and the corresponding expected output [32].

Table 3 shows a sample list of review sentences in training data.

Table 3 A partial listing of training data

Data validation

Data validation is used to minimize overfitting and underfitting [33], which usually arise because accuracy in the training phase is often high while performance degrades on the test data. Therefore, a 10% validation set is used to avoid performance errors by applying parameter tuning. For this purpose, we applied automatic verification of the dataset [34], which provides an unbiased evaluation of the model and minimizes overfitting [35].

Testing data

The test data (20%) is used to check whether the trained model performs well on the unseen data. It is used for the final evaluation of the model when it is trained completely. A list of sample entries in the test data is presented in Table 4.
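A three-way split of this kind can be sketched as below. The function name `split_dataset` and the fixed seed are hypothetical, and the fractions are left as parameters; the 80/10/10 defaults in the usage line are an assumption chosen so that the three parts add up to the whole dataset.

```python
import random


def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and split samples into train, validation and test lists."""
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must lie strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # the remainder is held out for testing
    return train, val, test


# Assumed fractions: 80% train, 10% validation, remaining 10% test.
train, val, test = split_dataset(range(1000))
```

Keeping the test portion as "whatever remains" guarantees that no sample is lost or counted twice, whatever fractions are chosen.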

Table 4 A partial listing of test data

Proposed network model

The proposed method implements and evaluates the performance of a Long Short-Term Memory (LSTM) network combined with a Convolutional Neural Network (CNN) model to identify tweets/reviews containing content with extremist clues. We train the neural classifier for the classification of extremist affiliation content. The working flow of the network comprises the following steps: (i) word embedding, in which each word in a sentence is assigned a unique index to form a fixed-length vector, (ii) a dropout layer is used to avoid overfitting, (iii) an LSTM layer is incorporated to capture long-distance dependencies across tweets/reviews, (iv) feature extraction is performed using a convolution operation, (v) a pooling layer minimizes the dimension of the feature map by aggregating the information, (vi) a flatten layer converts the pooled feature map into a column vector, and (vii) at the output layer, the softmax function is used for the classification. Figure 2 presents a network diagram for classifying the sentence as “extremist” or “non-extremist content”. In the rest of this section, we describe the working of these layers in detail.

Fig. 2 LSTM + CNN architecture for extremist affiliation classification

Word embedding (input) layer

The embedding or input layer is the first layer of the LSTM + CNN model. It transforms the words into a real-valued vector representation, i.e. a vocabulary of the words is created, which is then converted into a numeric form known as word embedding. The word embedding is given as input (sentence matrix) to the next layer. As shown in the pseudocode, there are different parameters, namely (i) “max_features”, (ii) “embed_dim”, and (iii) “input_length”. The “max_features” parameter holds the top words and represents the size of the vocabulary; “embed_dim” is the dimension of the real-valued vector; and “input_length” is the length of each input sequence.

The sentence consists of a sequence of words x1, x2, …, xn, and each word is assigned a unique index number. The embedding layer transforms such indices into D-dimensional word vectors. For this purpose, an embedding matrix of size [vocabulary size × embedding size] is learned, here a 10 × 4 matrix, i.e. [V × D] = [10 × 4]. As the vocabulary size is 10 and the embedding size is 4 in this case, the individual word “Baghdadi” is represented as a 1 × 4 vector, i.e. 1 × D = [1 × 4]. For example, the word “Baghdadi” with index “1” has the embedding vector [0.2, 0.4, 0.1, 0.2], represented by the first row shown in Fig. 3. Similarly, the second row is [0.6, 0.2, 0.8, 0.8], and the same holds for the others. Thus, each word has an embedding of size 1 × D, as depicted in Fig. 3. The embedding matrix is denoted as \(E \in R^{V \times D}\). The word embedding process is illustrated as follows:

Fig. 3 Word representation in input layer
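The lookup performed by the embedding layer can be sketched as a simple row selection. The first two rows below are the values quoted in the worked example; the third row and the helper name `embed` are hypothetical, and a trained layer would learn all rows rather than have them written out.

```python
# First two rows come from the worked example; the third row is a made-up filler.
EMBEDDING_MATRIX = [
    [0.2, 0.4, 0.1, 0.2],   # index 1, the word "Baghdadi"
    [0.6, 0.2, 0.8, 0.8],   # index 2
    [0.1, 0.3, 0.5, 0.7],   # index 3 (hypothetical values)
]


def embed(indices, matrix=EMBEDDING_MATRIX):
    """Look up the D-dimensional vector for each 1-based word index."""
    return [matrix[i - 1] for i in indices]


# A three-word sentence becomes a 3 x 4 sentence matrix.
sentence_matrix = embed([1, 2, 3])
```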

Dropout layer

The function of the dropout layer is to avoid overfitting. The value 0.5 is the “rate” parameter of the dropout layer; this parameter takes a value between 0 and 1 [36]. Because dropout is applied to the embedding layer, it randomly deletes (turns off) the activations of neurons in that layer, where each neuron holds part of the dense representation of a word in a sentence. The modeling of dropout on a single neuron is presented in Eq. (1):

$$f(k,p) = \left\{ {\begin{array}{ll} p & \quad {if \ k = 1} \\ {1 - p} & \quad {if \ k = 0} \\ \end{array} } \right.$$
(1)

Here k denotes the possible outcomes and p is the probability associated with the real-valued word representation. So, when the outcome k is 1, the neuron holding a real value is deleted (deactivated); otherwise it remains active. Figure 4 shows the working of the dropout layer.

Fig. 4 Operation of dropout layer

Figure 4 illustrates the embedding layer, which holds the real-valued representation of the given sentence: “Baghdadi… our last and only hope, I simply love you”. After adding a dropout layer, some of the values in the embedding layer are deactivated randomly (Fig. 4).
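The random deactivation can be sketched as follows. This is a minimal sketch, assuming the "inverted dropout" convention that Keras applies during training, in which surviving values are rescaled by 1/(1 - rate); the function name `dropout` and the fixed seed are hypothetical.

```python
import random


def dropout(values, rate=0.5, rng=None):
    """Zero each value with probability `rate`; rescale survivors by 1/(1 - rate)."""
    if not 0 <= rate < 1:
        raise ValueError("rate must be in [0, 1)")
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


embedding_row = [0.2, 0.4, 0.1, 0.2]   # embedding of "Baghdadi" from the worked example
dropped = dropout(embedding_row, rate=0.5, rng=random.Random(0))
```

Dropout is only active during training; at inference time the layer passes its input through unchanged.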

Long short term memory

We used a single LSTM layer, which consists of 100 LSTM cells (units). The LSTM performs some pre-calculations before it produces an output. In each cell, four independent calculations are performed using four gates: forget (\(f_t\)), input (\(i_t\)), candidate (\(\tilde{c}_t\)) and output (\(o_t\)). The equations for these gates are given below [37]:

$$f_{t} = \sigma \left( {W_{f} x_{t} + U_{f} h_{t - 1} + b_{f} } \right)$$
(2)
$$i_{t} = \sigma \left( {W_{i} x_{t} + U_{i} h_{t - 1} + b_{i} } \right)$$
(3)
$$o_{t} = \sigma \left( {W_{o} x_{t} + U_{o} h_{t - 1} + b_{o} } \right)$$
(4)
$$\tilde{c}_{t} = \tau \left( {W_{c} x_{t} + U_{c} h_{t - 1} + b_{c} } \right)$$
(5)
$$c_{t} = f_{t} \circ c_{t - 1} + i_{t} \circ \tilde{c}_{t}$$
(6)
$$h_{t} = o_{t} \circ \tau \left( {c_{t} } \right)$$
(7)

The graphical illustration of entire LSTM cell in green block is shown in Fig. 5.

Fig. 5 Long short term memory cell

Now, the example sentence “Baghdadi… our last and only hope, I simply love you” is passed through the LSTM cell. The execution starts with the forget gate using Eq. (2), in which the input \(x_t\) and the previous output \(h_{t-1}\) are multiplied by their respective weights \(W_f\) and \(U_f\). Next, the bias \(b_f\) is added, which yields a (4 × 1) vector. The sigmoid activation function is then applied to map the values into the range between 0 and 1, and values greater than 0.5 are taken as 1 for the three gates \(f_t\), \(i_t\) and \(o_t\), as shown in the computation below [38]. The same procedure is repeated for the input and output gates; for the candidate gate, the hyperbolic tangent function is used instead of the sigmoid. Finally, the next cell state \(c_t\) and the output \(h_t\) are calculated using Eqs. (6) and (7).

Putting values in Eqs. (2), (3), (4), (5), (6), (7)

$${\text{f}}_{\text{t}} = \sigma \left( {\left[ {\begin{array}{*{20}c} {0.1} & {0.3} & {0.2} & {0.4} \\ {0.2} & {0.1} & {0.5} & {0.3} \\ {0.3} & {0.6} & {0.3} & {0.2} \\ {0.1} & {0.2} & {0.5} & {0.3} \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} {0.2} \\ {0.4} \\ {0.1} \\ {0.2} \\ \end{array} } \right] + \left[ {\begin{array}{*{20}c} {0.1} & {0.2} & {0.5} & {0.6} \\ {0.3} & {0.4} & {0.1} & {0.7} \\ {0.2} & {0.3} & {0.5} & {0.4} \\ {0.4} & {0.1} & {0.3} & {0.2} \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} 0 \\ 0 \\ 0 \\ 0 \\ \end{array} } \right] + \left[ {\begin{array}{*{20}c} {0.1} \\ {0.2} \\ {0.3} \\ {0.4} \\ \end{array} } \right]} \right) = \left[ {\begin{array}{*{20}c} 1 & 1 & 1 & 1 \\ \end{array} } \right]$$
$${\text{i}}_{\text{t}} = \sigma \left( {\left[ {\begin{array}{*{20}c} {0.4} & {0.3} & {0.1} & {0.5} \\ {0.2} & {0.6} & {0.4} & {0.1} \\ {0.3} & {0.5} & {0.6} & {0.7} \\ {0.2} & {0.1} & {0.4} & {0.3} \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} {0.2} \\ {0.4} \\ {0.1} \\ {0.2} \\ \end{array} } \right] + \left[ {\begin{array}{*{20}c} {0.5} & {0.1} & {0.4} & {0.6} \\ {0.4} & {0.3} & {0.2} & {0.7} \\ {0.6} & {0.2} & {0.4} & {0.8} \\ {0.2} & {0.7} & {0.8} & {0.1} \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} 0 \\ 0 \\ 0 \\ 0 \\ \end{array} } \right] + \left[ {\begin{array}{*{20}c} {0.1} \\ {0.3} \\ {0.4} \\ {0.2} \\ \end{array} } \right]} \right) = \left[ {\begin{array}{*{20}c} 1 & 1 & 1 & 1 \\ \end{array} } \right]$$
$${\text{c}} \sim {\text{t}} = \tau \left( {\left[ {\begin{array}{*{20}c} {0.3} & {0.2} & {0.5} & {0.6} \\ {0.1} & {0.3} & {0.4} & {0.2} \\ {0.5} & {0.6} & {0.7} & {0.1} \\ {0.4} & {0.3} & {0.2} & {0.7} \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} {0.2} \\ {0.4} \\ {0.1} \\ {0.2} \\ \end{array} } \right] + \left[ {\begin{array}{*{20}c} {0.5} & {0.1} & {0.4} & {0.6} \\ {0.4} & {0.3} & {0.2} & {0.7} \\ {0.6} & {0.2} & {0.4} & {0.8} \\ {0.2} & {0.7} & {0.8} & {0.1} \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} 0 \\ 0 \\ 0 \\ 0 \\ \end{array} } \right] + \left[ {\begin{array}{*{20}c} {0.4} \\ {0.1} \\ {0.2} \\ {0.3} \\ \end{array} } \right]} \right) = \left[ {\begin{array}{*{20}c} {0.6} & {0.3} & {0.5} & {0.5} \\ \end{array} } \right]\,$$
$${\text{o}}_{t} = \sigma \left( {\left[ {\begin{array}{*{20}c} {0.2} & {0.1} & {0.5} & {0.6} \\ {0.3} & {0.4} & {0.2} & {0.1} \\ {0.1} & {0.5} & {0.6} & {0.7} \\ {0.6} & {0.4} & {0.3} & {0.2} \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} {0.2} \\ {0.4} \\ {0.1} \\ {0.2} \\ \end{array} } \right] + \left[ {\begin{array}{*{20}c} {0.1} & {0.2} & {0.5} & {0.3} \\ {0.6} & {0.3} & {0.2} & {0.4} \\ {0.7} & {0.4} & {0.1} & {0.8} \\ {0.8} & {0.5} & {0.2} & {0.3} \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} 0 \\ 0 \\ 0 \\ 0 \\ \end{array} } \right] + \left[ {\begin{array}{*{20}c} {0.7} \\ {0.6} \\ {0.5} \\ {0.4} \\ \end{array} } \right]} \right) = \left[ {\begin{array}{*{20}c} 1 & 1 & 1 & 1 \\ \end{array} } \right]$$
$${\text{c}}_{\text{t}} = \left[ {\begin{array}{*{20}c} 1 & 1 & 1 & 1 \\ \end{array} } \right] \cdot \left[ {\begin{array}{*{20}c} 0 & 0 & 0 & 0 \\ \end{array} } \right] + \left[ {\begin{array}{*{20}c} 1 & 1 & 1 & 1 \\ \end{array} } \right] \cdot \left[ {\begin{array}{*{20}c} {0.6} & {0.3} & {0.5} & {0.5} \\ \end{array} } \right]$$
$${\text{h}}_{\text{t}} = \left[ {\begin{array}{*{20}c} 1 & 1 & 1 & 1 \\ \end{array} } \right] \cdot \tau \left[ {\begin{array}{*{20}c} {0.6} & {0.3} & {0.5} & {0.5} \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {0.5} & {0.2} & {0.4} & {0.4} \\ \end{array} } \right]$$

From here, we pass \(h_t\) and \(c_t\) on to begin the calculations of the next LSTM cell. As a result, the LSTM produces an output sequence P = [p0, p1, p2, …, pl], forming a matrix \(P \in R^{l \times w}\). Finally, this representation is fed to the CNN layer.
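The cell update of Eqs. (2)–(7) can be reproduced with a short sketch. The W matrices, biases and input vector below are the ones from the worked example. Two simplifications are assumptions of this sketch: the U matrices are replaced by zeros (their contribution vanishes anyway because the previous output is the zero vector in the first step), and the gate activations are kept continuous instead of being rounded to 1, so the numbers differ slightly from the rounded vectors shown above. The helper names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(W, U, b, x_t, h_prev, activation):
    """Compute activation(W x_t + U h_{t-1} + b) element-wise."""
    wx, uh = matvec(W, x_t), matvec(U, h_prev)
    return [activation(a + c + d) for a, c, d in zip(wx, uh, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM cell update following Eqs. (2)-(7)."""
    f_t = gate(*params["f"], x_t, h_prev, sigmoid)
    i_t = gate(*params["i"], x_t, h_prev, sigmoid)
    o_t = gate(*params["o"], x_t, h_prev, sigmoid)
    c_tilde = gate(*params["c"], x_t, h_prev, math.tanh)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Assumption: U is zeroed out; with h_{t-1} = 0 it does not affect the first step.
ZERO_U = [[0.0] * 4 for _ in range(4)]
params = {
    "f": ([[0.1, 0.3, 0.2, 0.4], [0.2, 0.1, 0.5, 0.3],
           [0.3, 0.6, 0.3, 0.2], [0.1, 0.2, 0.5, 0.3]], ZERO_U, [0.1, 0.2, 0.3, 0.4]),
    "i": ([[0.4, 0.3, 0.1, 0.5], [0.2, 0.6, 0.4, 0.1],
           [0.3, 0.5, 0.6, 0.7], [0.2, 0.1, 0.4, 0.3]], ZERO_U, [0.1, 0.3, 0.4, 0.2]),
    "c": ([[0.3, 0.2, 0.5, 0.6], [0.1, 0.3, 0.4, 0.2],
           [0.5, 0.6, 0.7, 0.1], [0.4, 0.3, 0.2, 0.7]], ZERO_U, [0.4, 0.1, 0.2, 0.3]),
    "o": ([[0.2, 0.1, 0.5, 0.6], [0.3, 0.4, 0.2, 0.1],
           [0.1, 0.5, 0.6, 0.7], [0.6, 0.4, 0.3, 0.2]], ZERO_U, [0.7, 0.6, 0.5, 0.4]),
}
x_t = [0.2, 0.4, 0.1, 0.2]   # embedding of "Baghdadi"
h0 = [0.0] * 4               # initial output
c0 = [0.0] * 4               # initial cell state
h_t, c_t = lstm_step(x_t, h0, c0, params)
```

Feeding the returned `h_t` and `c_t` back in as `h_prev` and `c_prev` together with the next word vector gives the next step of the sequence.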

Convolutional layer

In this layer, a convolutional operation is performed which is a mathematical operation implemented on two functions, yielding a third function. To perform the convolutional operation, the dimensions of an input matrix (P), filter matrix (F) and output matrix (T), are represented as follows:

$$P \in R^{l \times w}$$
(8)

In Eq. (8), P represents the input matrix produced by the LSTM layer, R denotes the set of real numbers, l is the length and w is the width of the input matrix, which here is \(R^{10 \times 4}\).

$$F \in R^{n \times m}$$
(9)

In Eq. (9), F represents the filter matrix, R denotes the set of real numbers, n is the length and m is the width of the filter matrix, which here is \(R^{2 \times 2}\), and

$$T \in R^{s \times d}$$
(10)

In Eq. (10), T represents the output matrix, R denotes the set of real numbers, s is the length and d is the width of the output matrix, which here is \(R^{10 \times 4}\).

The convolutional operation is formulated as shown in Eq. (11):

$$t_{i,j} = \mathop \sum \limits_{l = 1}^{n} \mathop \sum \limits_{w = 1}^{m} f_{l,w} \otimes p_{i + l - 1,j + w - 1}$$
(11)

where \(t_{i,j}\) is an element of the output matrix \(T \in R^{s \times d}\), \(f_{l,w}\) is an element of the filter (weight) matrix \(F \in R^{n \times m}\), ⊗ denotes element-wise multiplication, and \(p_{i + l - 1,j + w - 1}\) is an element of the input matrix \(P \in R^{l \times w}\).

For the example sentence: “Baghdadi… our last and only hope, I simply love you”, the convolutional operation is executed as follows: (i) elements of the input matrix: 0.5, 0.2, 0.2, 0.7, (ii) elements of the filter matrix: 0.7, 0.4, 0.7, 0.5, and (iii) convolutional operation: 0.5 × 0.7 + 0.2 × 0.4 + 0.2 × 0.7 + 0.7 × 0.5 = 0.92, where 0.92 is the first element of the output matrix. The process of element-wise multiplication and addition then continues, by sliding the filter over the input matrix, until all the values of the input matrix are covered.
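The sliding-window computation of Eq. (11) can be sketched in plain Python. The window and filter values are those appearing in the four written-out products of the worked example. The sketch uses no padding, so the output shrinks by one row and one column for a 2 × 2 filter; reproducing a 10 × 4 output from a 10 × 4 input would additionally need “same” padding, which is omitted here. The function name is hypothetical.

```python
def convolve2d_valid(matrix, kernel):
    """Slide the kernel over the matrix (stride 1, no padding) and sum the products."""
    n, m = len(kernel), len(kernel[0])
    rows, cols = len(matrix) - n + 1, len(matrix[0]) - m + 1
    return [
        [
            sum(kernel[a][b] * matrix[i + a][j + b] for a in range(n) for b in range(m))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


window = [[0.5, 0.2], [0.2, 0.7]]   # first 2 x 2 patch of the input matrix
kernel = [[0.7, 0.4], [0.7, 0.5]]   # filter values used in the worked products
first_element = convolve2d_valid(window, kernel)[0][0]   # approximately 0.92
```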

Feature map After adding bias and an activation function to the output matrix, a feature map (A) for a given sentence is computed as follows [see Eq. (12)]:

$$A = a_{i,j} = f\left( {t_{i,j} + b} \right)$$
(12)

where the dimension of the feature map (A) for a given sentence is \(R^{q \times r} = R^{10 \times 4}\), \(a_{i,j}\) is an element of the feature map \(A \in R^{q \times r}\), b is a bias term, and f is an activation function.

In Fig. 2, each element of the output matrix is then added to the bias term after the convolutional operation. For example, adding a bias value to the first element of the output matrix that is 0.92 + 1 = 1.92 which represents the first element of the feature map for a given sentence (Fig. 2).

As shown in Algorithm 1, the parameters used in this layer are: (i) “filters”, the number of filters in the convolutional layer; (ii) “kernel_size”, the dimensionality of the convolutional window; (iii) “padding”, which holds one of three values: “valid”, “same”, or “causal”. The value “valid” means no padding, and “same” means that the output length equals the original input length. The value “causal” produces causal (dilated) convolutions; and (iv) “activation = relu”, the activation function used to introduce nonlinearity.

Finally, the relu activation function is applied to the generated feature map to introduce non-linearity by replacing negative values with zero. Its mathematical expression is: Output = max (0, Input), where Input is an element of the feature map.

For example, in the input sentence, the first element of the feature map is 1.92. If we apply the relu activation function to it, then Output = max (0, 1.92) = 1.92, since 1.92 > 0. The other elements of the rectified feature map for a given sentence are calculated in the same way.
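The bias addition of Eq. (12) followed by relu can be sketched as below. The value 0.92 and the bias of 1 come from the worked example; the negative entry is a hypothetical value added only to show the clipping, and the helper names are assumptions.

```python
def relu(value):
    """Rectified linear unit: max(0, value)."""
    return max(0.0, value)


def feature_map(conv_output, bias):
    """Add the bias to every element of the convolution output, then apply relu."""
    return [[relu(t + bias) for t in row] for row in conv_output]


# 0.92 is from the worked example; -1.5 is a hypothetical negative entry.
rectified = feature_map([[0.92, -1.5]], bias=1.0)
```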

Pooling layer

The pooling layer is used to minimize the dimension of the feature map by aggregating the information. So, the max pooling is applied to every sentence in a dataset. We used max pooling to get the required feature of a sentence by selecting the maximum value. Equation (13) presents the formula for pooling layer.

$$g_{p,h} = max_{{i,j \in Z_{p,h} }} a_{i,j}$$
(13)

Suppose \(Z_{p,h}\) is a small window matrix of size 2 × 2, \(g_{p,h}\) is an element of the matrix \(G \in R^{5 \times 2}\), and \(a_{i,j}\) is an element of matrix A. Each element \(g_{p,h}\) of matrix G is obtained by picking the maximum element of matrix A, the rectified feature map for a given sentence, within the window \(Z_{p,h}\). Hence, the matrix G is the pooled feature map of the given sentence, which is illustrated in Fig. 6.

Fig. 6 Pooling layer

A pooled feature map is created by setting a window size (2 × 2), placing the window on the feature map, and extracting the maximal element inside it. In our case, the first window gives max (1.92, 2.3, 2.33, 2.22) = 2.33, which is the first element of the pooled feature map for the given sentence. The same procedure is performed for the other values of the pooled feature map.
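Max pooling over non-overlapping 2 × 2 windows, as in Eq. (13), can be sketched as follows. The first window holds the four values quoted above; the values in the second window and the function name are hypothetical.

```python
def max_pool_2d(feature_map, size=2):
    """Take the maximum of each non-overlapping size x size window."""
    rows, cols = len(feature_map) // size, len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + a][c * size + b]
                for a in range(size) for b in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


rectified = [
    [1.92, 2.30, 0.10, 0.40],   # right-hand window values are hypothetical
    [2.33, 2.22, 0.50, 0.20],
]
pooled = max_pool_2d(rectified)
```

With a 10 × 4 rectified feature map, the same function returns a 5 × 2 pooled map, matching the dimension of G stated above.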

Flatten layer

The flatten layer of the Convolutional Neural Network transforms the pooled feature map into a column vector, which is given as input to the neural network for the classification task [39]. The column vector represents the feature map for the desired sentence. To concatenate the rows of the feature map, the pooled feature map is flattened through the reshape function of numpy, as shown in Eq. (14):

$$Flattening \, = \, pooled.reshape \, (4*2, \, 1)$$
(14)

This operation takes row 1, row 2, row 3, and so on, and appends them to form a single column vector. Figure 7 illustrates the flatten layer: the matrix shows the pooled feature map for a sentence, and the single column vector shows the result of flattening that pooled feature map.
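The row-by-row concatenation can be sketched without any dependency, as below. The 4 × 2 shape follows Eq. (14); the values in `pooled` are hypothetical, and the helper name `flatten_to_column` is introduced here only for illustration.

```python
def flatten_to_column(pooled):
    """Concatenate the rows in order and return an (n*m) x 1 column vector."""
    return [[value] for row in pooled for value in row]

# Hypothetical 4 x 2 pooled feature map (values chosen for illustration).
pooled = [
    [2.33, 1.10],
    [1.50, 3.10],
    [0.80, 2.05],
    [1.75, 0.60],
]

# Same row-major ordering that numpy's reshape(4*2, 1) produces by default.
column = flatten_to_column(pooled)
print(len(column), len(column[0]))
```

The result has 8 rows and 1 column, with row 1 of the pooled map first, followed by row 2, and so on.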

Fig. 7 Flatten layer

Output layer

At the output layer, an activation function such as sigmoid, tanh, or softmax is applied to compute the probability of the two classes, i.e., extremist and non-extremist. For example, the input text “Baghdadi… our last and only hope, I simply love you” is tagged as “extremist” when passed through the proposed CNN model.

In Fig. 8 the classification of the input vector is performed using the softmax function. The net input is obtained by applying Eq. (15).

$$u_{j} = \sum_{i=1}^{l} w_{i} x_{i} + b$$
(15)

where “w” represents weight vector, “x” represents an input vector and “b” is a bias term.
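As a rough illustration, the net input of Eq. (15) followed by a softmax over the two classes could be computed as follows. The input vector, the per-class weights, and the biases are hypothetical values, and the softmax is written in its standard, numerically stabilized form rather than taken from the paper's implementation.

```python
import math

def net_input(weights, x, bias):
    """Eq. (15): weighted sum of the inputs plus a bias term."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical flattened feature vector and per-class parameters.
x = [2.33, 1.10, 1.50, 3.10]
classes = ["extremist", "non-extremist"]
class_weights = [[0.4, -0.1, 0.2, 0.3], [-0.2, 0.1, 0.0, -0.1]]
class_bias = [0.1, 0.0]

scores = [net_input(w, x, b) for w, b in zip(class_weights, class_bias)]
probs = softmax(scores)
predicted = classes[probs.index(max(probs))]
print(predicted)
```

The class with the highest probability is taken as the predicted label.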

Fig. 8 Applying softmax function for classification

Softmax layer

The following additional functions are used in the LSTM + CNN model, as shown in Algorithm 1.

Compile function

The compile method is used for model configuration. It takes the following parameters: (i) loss, the objective function to be minimized, (ii) optimizer, the name or an instance of the optimizer used for model compilation, and (iii) metrics, the evaluation metrics to be reported.

Summary of the model

The summary of the model will be shown using summary function after model creation.

Fitting a model

This section of the pseudo-code trains the model on the training dataset and then evaluates it on the test dataset. Finally, the accuracy is the output of the model.

figure a

A sample implementation code is given in Additional file 1: Appendix A.

Analyzing sentiments of users w.r.t emotional affiliation with extremists

This module deals with emotion classification of the users’ sentiments showing affiliation with the extremists’ postings. Each input text is tagged with an emotion category using the Python-based tone analyzer API [11], which returns a label from the emotion set {anger, sadness, fear, joy, confident, analytical, tentative}. For example, when the input text “Great news, ISIS fight Afghan forces to capture Helmand..” is passed through the emotion analyzer, it returns the emotion class “joy”. A sample set of user reviews and the detected emotion classes is shown in Table 5.

Table 5 Extremist-related emotion classification

Experimental setup

To conduct the experiments, we used Python with the Anaconda distribution (https://anaconda.org/anaconda/python).

Results and discussion

In this section, we discuss results obtained by conducting different experiments to answer the posed research questions.

Answer to RQ1: How to recognize and classify tweets as extremist vs non-extremist, by applying deep learning-based sentiment analysis techniques?

To answer this research question, we applied a deep-learning-based sentiment analysis technique, namely LSTM + CNN (discussed in detail in the proposed methods section). Additionally, we conducted experiments on a 1-layer CNN and a 1-layer LSTM. Table 6 shows the results obtained by applying the different DL models; the proposed LSTM + CNN model achieves the best results.

Table 6 Experimental results of different DL-based SA models

The LSTM + CNN model attained the best results because the LSTM layer generates a new representation of the input tweet received from the embedding layer by capturing information from both the current and previous inputs, which reduces information loss. The LSTM layer retains context information from the current and previous states long enough to make predictions. At the next level, the CNN layer captures additional (n-gram) features from this richer representation, yielding improved performance.

Parameter settings

Table 7 presents the parameter settings for the proposed model (LSTM + CNN). The experiment is conducted as follows: the parameter ‘number of filters’ takes values varying from 2 to 16, while the values of the other parameters, such as kernel size, padding, pooling size, optimizer, batch size, epochs, and units, are fixed.

Table 7 Parameter setting regarding proposed LSTM + CNN model

The configuration settings of the 8 LSTM + CNN models with the selected parameters (number of filters, kernel size, pooling size, and LSTM units) are shown in Table 8.

Table 8 All variants of LSTM + CNN with parameter setting

We have listed the accuracy, loss score, and training time in Table 9. After conducting experiments with varying parameter settings of the LSTM + CNN models, we note that the LSTM + CNN8 model performs best, with LSTM units = 100 (cells), pooling size = 2 × 2, and number of filters = 16; its achieved accuracy is 92.66%. We also note that the accuracy of the model increases as the number of filters increases.

Table 9 LSTM + CNN models training time, loss score and accuracy

Answer to RQ2: What is the performance of classical feature sets like n-grams, bag-of-words (BoW), and TF-IDF over word embedding learned using CNN, LSTM, FastText, and GRU?

First, we discuss a few baseline/state-of-the-art methods. In all such techniques, a feature vector is created for a given tweet and passed to the classifier as its feature set.

Baseline methods

To conduct experiments, we used different classifiers with the following three feature representation techniques from the baselines [7, 8, 12]: (i) n-gram: the state-of-the-art technique [12]; (ii) bag-of-words: also called the count vectorizer technique, it makes use of word frequency [7, 13]; and (iii) TF-IDF: a TF-IDF weighted feature vector is created for text classification [8].
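The three baseline representations can be illustrated with a small, dependency-free sketch. The toy corpus is hypothetical, and the TF-IDF variant shown here (term frequency normalized by document length, multiplied by log(N/df)) is only one common formulation; the cited baselines may use a different weighting.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens):
    """Count how often each word occurs (count-vectorizer style)."""
    return Counter(tokens)

def tf_idf(corpus_tokens):
    """One TF-IDF variant: (count / doc length) * log(N / document frequency)."""
    n_docs = len(corpus_tokens)
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    vectors = []
    for doc in corpus_tokens:
        counts = Counter(doc)
        vectors.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                        for term, count in counts.items()})
    return vectors

# Hypothetical toy corpus, used only to exercise the functions.
corpus = ["the news is good", "the news is bad", "good people help"]
tokens = [doc.split() for doc in corpus]

bigrams = ngrams(tokens[0], 2)
bow = bag_of_words(tokens[0])
weights = tf_idf(tokens)
print(bigrams)
```

In this variant, a word that occurs in fewer documents (such as "bad") receives a higher weight than a word shared by several documents (such as "the").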

Deep learning methods with variants

Different variants of DL techniques (CNN + Random Embedding, LSTM + Random Embedding, FastText + Random Embedding, and GRU + Random Embedding) are used to classify tweets as extremist or non-extremist. The results are reported in Table 10. The proposed LSTM + CNN performs better than the other DL methods.

Table 10 Comparison of proposed work with baseline methods

Proposed method

For the extremist classification task, we experimented with different DL-based SA models, namely CNN, LSTM, FastText, and GRU, initialized with random word embeddings. The proposed LSTM + CNN model for extremist classification outperforms the baseline methods (Part A of Table 10), and it also performs better than the other DL models (Part B of Table 10).

Parameter setting for LSTM + CNN

The following parameters are used for the LSTM + CNN model: (i) max features (10 000), (ii) embedding dimension (128), (iii) LSTM unit size (200), (iv) convolutional filter size (200), and (v) a batch size of 32 with 3 epochs. These settings yielded the best performance results, as shown in Part C of Table 10.

Answer to RQ3: What is the performance of the proposed model for extremist affiliation classification with respect to state-of-the-art methods?

To answer RQ3, we conducted experiments using supervised techniques, a lexicon-based technique, and the proposed technique for extremist classification on Twitter.

Supervised techniques

In Table 11, the results of the proposed method (LSTM + CNN) are compared with different state-of-the-art techniques [7, 8] based on machine learning classifiers, namely K-Nearest Neighbors (KNN) and the Naïve Bayes classifier. Furthermore, we also applied other machine learning and deep learning classifiers, namely Random Forest, Support Vector Machine, LSTM, and CNN. The objective of the experimentation is to perform extremist classification of tweets using different ML and DL classifiers. The performance of the classifiers is reported in terms of accuracy, precision, recall, and F-measure. KNN exhibited the lowest performance (72% accuracy).
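For reference, the four reported measures can be computed from true and predicted labels as sketched below. The label lists are hypothetical and are not drawn from the experimental data; treating "extremist" as the positive class is an assumption of this sketch.

```python
def binary_metrics(y_true, y_pred, positive="extremist"):
    """Compute accuracy, precision, recall and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

E, N = "extremist", "non-extremist"
# Hypothetical gold labels and classifier predictions.
y_true = [E, E, E, N, N, N, N, E]
y_pred = [E, E, N, N, N, E, N, E]

scores = binary_metrics(y_true, y_pred)
print(scores)
```

Here the classifier makes one false-positive and one false-negative error out of eight predictions, so all four measures equal 0.75.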

Table 11 Comparison with state-of-art techniques

Lexicon-based technique

SentiWordNet [17] is used to classify the reviews as extremist or non-extremist using sentiment analysis. Reviews tagged as positive are treated as having affiliations with extremists, whereas negative reviews are treated as non-extremist. The experimental results are presented in Table 11.
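The decision rule of this lexicon-based baseline can be sketched as follows. The tiny `TOY_LEXICON` and its scores are hypothetical stand-ins for SentiWordNet, the input is assumed to be a review that already concerns an extremist group, and treating a zero score as non-extremist is an assumption of this sketch.

```python
# Hypothetical word polarities; real SentiWordNet scores are not reproduced here.
TOY_LEXICON = {
    "love": 0.6, "great": 0.5, "hope": 0.4,
    "hate": -0.7, "terrible": -0.6, "sad": -0.4,
}

def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the polarity of every known word; unknown words contribute 0."""
    tokens = [tok.strip(".,!?") for tok in text.lower().split()]
    return sum(lexicon.get(tok, 0.0) for tok in tokens)

def lexicon_label(text):
    """Positive sentiment -> extremist affiliation, otherwise non-extremist."""
    return "extremist" if sentiment_score(text) > 0 else "non-extremist"

print(lexicon_label("Great news, love it!"))
print(lexicon_label("Terrible news."))
```

Because the rule relies on polarity alone, any positive review is labelled as affiliated, which is consistent with the limitation of positive/negative classification noted in the introduction.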

Proposed

The proposed technique (LSTM + CNN) produced the best results among the compared methods (Table 11).

Answer to RQ4: How to perform the sentiment classification of user reviews w.r.t emotional affiliations of Extremists on Twitter and Deep Web?

To answer RQ4, we conducted experiments using the tone analyzer API [11] for emotion classification of user reviews w.r.t. extremists’ affiliation on the Dark Web. The results reported in Table 12 show that the proposed module for emotion classification outperforms the compared supervised machine learning classifiers in terms of accuracy, precision, recall, and F-measure.

Table 12 Comparative results of sentiments of users w.r.t emotional affiliation with extremists

Conclusions and future work

This study presents a sentiment-based extremist classification system based on users’ postings made on Twitter. The proposed work operates in three modules: (i) users’ tweet collection, (ii) preprocessing, and (iii) classification with respect to extremist and non-extremist classes using LSTM + CNN model and other ML and DL classifiers.

The experimental results show that the proposed system outperformed the compared methods in terms of precision, recall, F-measure, and accuracy. However, the system has certain limitations: (i) it lacks an automated method for crawling, cleaning, and storing Twitter content, (ii) it does not consider visual and social context features, which could yield more robust results, and (iii) it does not investigate other types of extremism by applying the DL methods to multi-class label classification. Our future work aims at applying more advanced techniques, such as attention-based mechanisms, for extremist affiliation detection with multi-class labels. Furthermore, the inclusion of context-aware features can also improve the performance of the system.