1 Introduction

Online social media now plays a vital role in daily life and is the main way in which individuals interact on the Internet. Social networking sites such as Facebook, Twitter, LinkedIn, and MySpace enable users to communicate with one another and to find people with similar interests. Users can also create online profiles on which they post daily updates about their lives in the form of pictures, videos, and related content. Facebook and Twitter have enormous user bases that grow every day. Social media is used by everyone from ordinary people to celebrities, politicians, and media houses. These platforms have become a prominent news source and can disseminate information much faster than traditional news media. Many real-world examples have shown the effectiveness and timeliness of the information reported on Twitter during disasters and social movements. Representative examples include the bomb blasts in Mumbai in November 2008 [1], the flooding of the Red River Valley in the United States and Canada in March and April 2009, the U.S. Airways plane crash on the Hudson River in January 2009, the devastating earthquake in Haiti in 2010, the demonstrations following the Iranian presidential elections in 2009, and the “Arab Spring” in the Middle East and North Africa region.

As the use of online social networks has become increasingly intertwined with current events, individuals and organizations have found ways to exploit these platforms to spread misinformation [2], to attack and defame others [3], or to deceive and manipulate. Users with malicious intent may use them to spread rumors, issue threats, misdirect their followers, and share their plans with their community [4, 5]. Criminal gangs and terrorist organizations such as ISIS adopt social media for propaganda and recruitment [6]. Fraudulent activity and social bots have been used to coordinate planned protest campaigns and to manipulate political elections and stock markets [7]. The absence of effective content verification systems and the lack of technical solutions that detect and stop misuse in a timely manner on many of these platforms, including Twitter and Facebook, raise concerns as younger users are exposed to cyberbullying, harassment, or hate speech, which can lead to risks such as depression and suicide. Moreover, social media is often used by people to announce their intentions before committing acts of violence and to coordinate criminal activities [8]. Being able to detect negative material automatically is beneficial to the managers of websites that allow users to post content, and it can serve as part of an early warning system that alerts authorities to possible threats to public safety [9]. The automatic detection of potentially dangerous words can help to ensure public safety with minimum disruption. Likewise, monitoring social media posts and discussions to find out how participants are reacting to a brand or event can benefit businesses [10,11,12].

Extracting useful information from social media is more challenging than classic information extraction, i.e., extraction from trusted sources such as traditional news media and well-formed grammatical text. The real challenge lies in accessing the data and transforming it into something usable and actionable. Social media text [13] is typically very short and noisy, the reliability of the information it conveys is far less certain than that of conventional news media, and many social media platforms support multiple languages.

In this paper, we propose a keyword-based approach for detecting civil unrest events from a Twitter dataset. The system automatically learns keywords from the dataset and then filters the dataset using the identified keywords. Clustering analysis is then performed to detect tweets promoting civil unrest and to analyze the impact of the protest on the public. Finally, an experimental evaluation and performance analysis are presented.

2 Related Work

In recent years, much attention has been given to online social network mining owing to the enormous volume of uncensored data posted by people. Research in this area focuses on social recommendation, opinion mining, sentiment analysis, topic detection and tracking, community detection, event detection, and forecasting. This section presents related work in the following areas: (1) spatiotemporal mining of social media; (2) event detection and forecasting; (3) early detection of suspicious behaviors in social media; and (4) civil unrest event forecasting from social media.

2.1 Spatiotemporal Mining of Social Media

Considerable research has been devoted to spatiotemporal events, which are mainly relevant to tweets posted within a certain geographical neighborhood. Forecasting such events therefore requires an examination of spatial features and their correlations in addition to the temporal dimension. Ting Hua [14] reviewed several methods of spatiotemporal event detection and event forecasting. Judith Gelernter proposed a method for identifying locations and associating them with people by mining social media text conversations. Bo Hu [15] developed a probabilistic model for location recommendation that captures the spatiotemporal aspects of user check-ins. Andrade [16] adopted a temporal approach to analyze the cross-correlation between rainfall gauge data and rainfall-related Twitter messages by means of temporal units and their lag time.

2.2 Event Detection and Forecasting

Most prior event detection research has focused on keywords present in the text and relies on templates, dictionaries, or the presence of a specific pattern in the text. Wei Wang [17] used multiple instance learning to extract key sentences promoting civil unrest, which contain fields such as participants, purpose, location, and time. Yiming Yang [18] adapted traditional hierarchical and non-hierarchical clustering techniques for online event detection based on the semantic and temporal properties of events. Fang Jin [19] detected civil unrest events by representing the spatiotemporal structure of user activity on Twitter in the form of graph wavelets. Minglai Shao [20] proposed a method to indicate forthcoming or ongoing events in dynamic multivariate networks by measuring the significance of evolving subgraphs and subsets of attributes.

2.3 Early Detection of Suspicious Behaviors in Social Media

Considerable research has been carried out in the area of social media analysis. However, there has been relatively little work on the early detection of suspicious behaviors targeting civil unrest by observing users’ text-based conversations. Some of the significant works are presented in this section. Myriam Munezero [21] developed a framework to search for linguistic features that pertain to antisocial behaviors (ASBs) in order to use those features for the automatic identification of suspicious activities in texts. Dongjin Choi [22] proposed a method that uses word similarity based on the WordNet hierarchy and n-gram frequency data to distinguish articles about terrorism. Burnap [23] built models that predicted information flow size and survival on Twitter following the terrorist event in Woolwich, London in 2013. Emilio Ferrara [24] proposed a method to identify criminal networks from communication media such as mobile phones and online social networks, which leave digital traces in the form of metadata.

2.4 Civil Unrest Event Forecasting from Social Media

Many events in which a large number of people gather to support a common cause are not civil unrest events [25]; rather, civil unrest is typically defined by law enforcement as a gathering of three or more people, in reaction to an event, with the intention of causing a public disturbance in violation of the law. Ryan Compton and Jiejun Xu [26] proposed a strategy that applies filters such as a keyword filter, a future-dates filter, and a location filter for the early detection of civil unrest from social media. Congyu [27] proposed to locate the predictive power of social media in its function as a protest advertisement and organization mechanism, using the Global Database of Events, Language, and Tone (GDELT).

3 System Framework for Civil Unrest Detection

Social network analysis (SNA) has long been used for identifying social groups and for determining the relationships among their members. Figure 29.1 depicts the overall architecture of the civil unrest detection system, which is divided into the following steps. First, all tweets between two dates are collected and preprocessed, where basic preprocessing steps clean the tweets and make them suitable for further processing. Second, keywords are learned automatically based on the highest term frequency, and the significant keywords representing a particular protest are identified. Third, the preprocessed tweets are filtered using this set of keywords, and the features used for detecting civil unrest are extracted from the resulting tweets. Fourth, clustering analysis is performed to detect the essence of the unrest content in those tweets in order to understand the influence of the protest on society.

Fig. 29.1 The overall process of civil unrest detection

3.1 Preprocessing

The extracted tweets contain many unwanted words, symbols, white spaces, acronyms, etc., which must be eliminated so that the tweets can be processed easily in later stages and yield results with maximum accuracy. The raw tweets were therefore cleaned and preprocessed to remove stop words, punctuation, and unwanted symbols. Tweets written in other languages were translated into English using Google Translate so that they could be processed in the subsequent steps.
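The cleaning described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop-word list, the regular expressions, and the function names `preprocess` and `deduplicate` are all assumptions, and the translation step is omitted.

```python
import re
import string
from typing import List

# Tiny illustrative stop-word list (an assumption; a real system would use a fuller one).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "and", "for", "rt"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(tweet: str) -> List[str]:
    """Lower-case a tweet, strip URLs, mentions, and punctuation, and drop stop words."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Hashtag words are kept, but the '#' itself is removed with the other punctuation.
    text = text.translate(PUNCT_TABLE)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def deduplicate(token_lists: List[List[str]]) -> List[List[str]]:
    """Drop repeated tweets (identical token sequences), keeping the first occurrence."""
    seen = set()
    unique = []
    for tokens in token_lists:
        key = tuple(tokens)
        if key not in seen:
            seen.add(key)
            unique.append(tokens)
    return unique


cleaned = preprocess("RT @user: The bus fare is HIGH! #BusFareHike https://t.co/x")
```

Stripping the `#` makes `#BusFareHike` and `busfarehike` compare equal in later keyword matching; whether that is desirable depends on how the keywords are stored.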

3.2 Keyword Learning and Filtering

The average term frequency-inverse document frequency (TF-IDF) score of each word was then calculated, and the words were listed in decreasing order of score. The top-ranked 100 words were selected; they were highly related to the cause of the protest, the place of the protest, and its key actors. Keyword matching with these protest-related terms was then applied to the complete dataset to identify tweets containing information about an upcoming protest. We measured the volume of tweets containing protest-related keywords and future-oriented words. Since the tweets were extracted during the period of the BusFareHike protest, we started with basic keywords related to that protest, such as #BusFareHike and #TNBusStrike, which were its most popular hashtags. The tweets containing these keywords were selected and aggregated by day, giving a large volume of tweets covering a period of 8 days for each protest.
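A minimal sketch of the keyword matching and per-day aggregation described above is given below. The sample tweets are invented toy data, and the names `contains_keyword` and `daily_volume` are hypothetical; only the two hashtags come from the text.

```python
from collections import Counter
from datetime import date

# Hashtags named in the text, lower-cased with the '#' stripped.
PROTEST_KEYWORDS = {"busfarehike", "tnbusstrike"}


def contains_keyword(tokens, keywords):
    """True if any token, ignoring case and a leading '#', is a protest keyword."""
    return any(tok.lower().lstrip("#") in keywords for tok in tokens)


def daily_volume(tweets, keywords):
    """Count matching tweets per day; `tweets` is an iterable of (date, tokens) pairs."""
    counts = Counter(day for day, tokens in tweets if contains_keyword(tokens, keywords))
    return dict(sorted(counts.items()))


# Toy data for illustration only.
sample = [
    (date(2018, 1, 22), ["#BusFareHike", "rollback", "now"]),
    (date(2018, 1, 22), ["good", "morning"]),
    (date(2018, 1, 23), ["#TNBusStrike", "tomorrow"]),
    (date(2018, 1, 23), ["students", "join", "#busfarehike"]),
]
volume = daily_volume(sample, PROTEST_KEYWORDS)
```

A rise in the daily count of matching tweets is the kind of signal the aggregation is meant to expose; matching future-oriented words would need an additional word list, which is not shown here.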

3.3 Clustering Model for Civil Unrest Detection

Unsupervised learning is highly useful in social media monitoring, as applying clustering techniques gives an overview of public opinion about an event. Clustering is the technique of grouping similar items into the same cluster. Tweets containing information about the same event express collective behavior. This can be used to form clusters whose keywords represent different civil unrest events, such as #SaveFisherMen, #BusFareHike, and #Jallikattu. A simple TF-IDF weighting is used as the basis for forming the clusters.

Algorithm

Civil unrest detection based on keyword extraction is performed in four general steps:

Input: Document containing tweets.

Output: A number of clusters, each representing a different protest event.

Step 1: Remove stop words and repeated tweets from the posts.

Step 2: Extract the keywords of the user tweets using the TF-IDF method:

The TF-IDF value is composed of two components, the TF and IDF values. The rationale for the TF value is that more frequent words in a document are more important than less frequent words. As Eq. (29.1) shows, the TF value is the number of times a given term appears in a document, normalized by the total number of term occurrences in that document. The IDF measures the importance of a term in the collection: it is obtained by dividing the number of all documents by the number of documents containing the term and then taking the logarithm of that quotient.

$$ tf\left(i,j\right)=\frac{n\left(i,j\right)}{\sum \limits_kn\left(k,j\right)} $$
(29.1)

n(i, j): The number of occurrences of the term \( {t}_i \) in document \( {d}_j \)

\( \sum \limits_kn\left(k,j\right): \) The number of occurrences of all terms in document \( {d}_j \)

$$ idf(i)=\log \left(\frac{\left|D\right|}{\left|\left\{{d}_j:{t}_i\in {d}_j\right\}\right|}\right) $$
(29.2)

|D|: The total number of documents in the corpus

\( \left|\left\{{d}_j:{t}_i\in {d}_j\right\}\right|: \) The number of documents in which the term \( {t}_i \) appears

$$ tfidf\left(i,j\right)= tf\left(i,j\right)\times idf(i) $$
(29.3)

Step 3: Calculate the cosine similarity between each pair of tweets:

$$ \cos \theta =\frac{x\cdot y}{\left\Vert x\right\Vert \left\Vert y\right\Vert } $$
(29.4)

where x and y are the term frequency-inverse document frequency (TF-IDF) vectors of the two tweets being compared.

Step 4: Cluster the tweets using the K-means clustering algorithm.
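Steps 2 and 3 can be sketched in a few lines of plain Python. The block below follows Eqs. (29.1) to (29.4) directly on tokenized tweets; the sparse dictionary representation, the function names, and the three toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf(doc):
    """Eq. (29.1): occurrences of each term divided by all term occurrences in the document."""
    counts = Counter(doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def idf(docs):
    """Eq. (29.2): log of |D| over the number of documents that contain the term."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vectors(docs):
    """Eq. (29.3): one sparse {term: weight} vector per document."""
    idf_scores = idf(docs)
    return [{t: w * idf_scores[t] for t, w in tf(doc).items()} for doc in docs]


def cosine(x, y):
    """Eq. (29.4) on sparse vectors; returns 0.0 if either vector has zero length."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


# Toy tokenized tweets for illustration only.
docs = [
    ["bus", "fare", "hike", "protest"],
    ["bus", "fare", "strike"],
    ["save", "fishermen", "protest"],
]
vectors = tfidf_vectors(docs)
sim_same_event = cosine(vectors[0], vectors[1])   # share "bus" and "fare"
sim_other_event = cosine(vectors[1], vectors[2])  # share no terms
```

Note that a term appearing in every document gets an IDF of zero under Eq. (29.2), so it contributes nothing to the similarity; this is the intended behavior of the weighting rather than a defect of the sketch.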

4 Results and Discussion

The implementation starts with data collection. The Twitter API allows users to extract the information they need by providing them with separate login and access credentials. These credentials are used to authenticate (handshake) with the R tool. The tweets were extracted using the Twitter API and the R tool. Tweets were collected from 22/01/2018 to 29/01/2018 for the #TNBusFareHike protest. We retrieved about 35,000 tweets, which contain people’s opinions against the Tamil Nadu government for suddenly increasing the bus fare. Similarly, the datasets for the #SaveFisherMen and #HydroCarbon protests were collected during the days of protest and aggregated by day. We thus collected a large volume of tweets for different protests.

Figure 29.2 shows the word cloud formed from the protest-related keywords identified in the tweets. Words shown in a larger size appear more frequently in the tweets. TF-IDF is the product of TF and IDF: when the term frequency is high and the document frequency is low (i.e., the IDF is high), a high TF-IDF score is obtained. TF-IDF and many clustering techniques work well when applied to a large dataset. A bar chart of the frequent words appearing in tweets that promote protest was also prepared; it allows the word frequency counts to be compared directly.

Fig. 29.2 Word cloud of #BusFareHike protest

The top-ranked frequent words in each document containing the tweets of a particular event are shown in Fig. 29.3. Based on this ranking, the words that occur most frequently are taken as the keywords used to cluster the tweets of a particular protest. Table 29.1 lists the keywords extracted from sample data for different civil unrest events.

Fig. 29.3 Frequent words that appear on #busfarehike tweets
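A frequency ranking of the kind plotted in such a bar chart can be sketched with the standard library alone. The helper name `top_terms` and the toy tweets below are assumptions for illustration, not the data behind the figure.

```python
from collections import Counter


def top_terms(token_lists, n=10):
    """Return the n most frequent terms across all tweets as (term, count) pairs."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return counts.most_common(n)


# Toy preprocessed tweets for illustration only.
tweets = [
    ["bus", "fare", "hike", "protest"],
    ["bus", "fare", "students", "protest"],
    ["bus", "strike", "chennai"],
]
keywords = top_terms(tweets, n=3)
```

The resulting pairs can be fed directly to a plotting or word-cloud routine, with the count controlling bar height or font size.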

Table 29.1 Keywords extracted to identify tweets of different protests

The clustering analysis is completed by cluster validation, which evaluates the clustering algorithm used. This study uses internal validation, since no prior knowledge is available for the dataset and the validation relies only on the information residing in the data. A user study was conducted on 150 real-time tweets to validate the clusters. Several indices are available for determining the optimal number of clusters in internal validation; one of them is the Sum of Squared Error (SSE). When the clusters are well separated, the “goodness” of the resulting clusters can be evaluated using the SSE, which measures the compactness of the clusters. The SSE is calculated as

$$ \mathrm{SSE}=\sum \limits_{i=1}^n{\left({x}_i-\hat{x_i}\right)}^2 $$
(29.5)

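where each \( {x}_i \) is the actual value of an observation and each \( \hat{x_i} \) is its estimated or forecast value; in the clustering setting, the estimate is the centroid of the cluster to which \( {x}_i \) is assigned.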

Comparing the SSE for different numbers of clusters is one way to determine the appropriate number of clusters. The SSE is defined as the sum of the squared distances between each member of a cluster and its cluster centroid. The plot of the SSE against the number of clusters k in Fig. 29.4 shows that the SSE decreases as k increases, since the clusters become smaller. In Fig. 29.4, the first elbow is found at k = 3. Thus the optimum number of clusters for the dataset is 3. To make the detection and the probability estimation feasible, we repeated the experiment on various datasets, and the results improved.
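The elbow procedure can be illustrated with a small, self-contained K-means sketch. It makes several simplifying assumptions that are not taken from the paper: it clusters toy 2-D points with Euclidean distance rather than TF-IDF vectors of tweets, and it picks initial centroids deterministically by farthest-point selection.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm; returns (centroids, labels).

    Initial centroids: the first point, then repeatedly the point farthest
    from the centroids chosen so far (an assumption made for determinism).
    """
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(mean(members) if members else centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


def sse(points, centroids, labels):
    """Sum of squared distances between each point and its cluster centroid."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Three well-separated toy groups, for illustration only.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
sse_by_k = {}
for k in range(1, 6):
    centroids, labels = kmeans(points, k)
    sse_by_k[k] = sse(points, centroids, labels)
```

On this toy data the SSE drops sharply up to k = 3 and only marginally afterwards, which is the elbow shape the text describes. With real tweet vectors the curve is usually less clean, and the result of K-means depends on the initialization.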

Fig. 29.4 SSE of clusters formed using K-Mean clustering algorithm

5 Conclusion

In this paper, we investigated existing text-mining methods for detecting civil unrest content in order to anticipate upcoming protests. Specifically, we proposed a keyword-based approach to detect civil unrest from social media before it occurs. We learned civil unrest keywords, used them to filter real-time tweets, and clustered the filtered tweets to tackle the problem of detecting civil unrest events. We integrated our ideas in a modular framework and experimentally demonstrated the validity and scalability of the method. The performance of the system could be improved in two ways: (1) by including a location extraction method, for example by applying a more advanced geotagging scheme, using GPS signals, or using information about the Twitter graph to estimate the location of a tweet from the locations of related Twitter users; and (2) by applying multilingual text analysis to improve the clustering accuracy.