
1 Introduction

Launched in 2006, Twitter is a microblogging platform on which people can publish tweets of at most 140 characters or direct messages of up to 10,000 characters [1]. Due to its popularity, portability, and ease of use, Twitter has quickly grown into a platform where people share daily life updates, chat, and record or spread news. As of September 2015, Twitter reported more than 320 million monthly active users worldwide. In comparison to conventional news sources, Twitter favors real-time content and breaking news, and it thus plays an important role as a dynamic information source for individuals, companies, and organizations [2].

Since its establishment, Twitter has generously opened a portion of its data to the public and has attracted extensive research in many areas [3,4,5]. In many studies, the primary task is to identify event-related tweets and then exploit them to build domain-knowledge-related models for analysis. As defined by Atefeh [2], events are generally considered as “real-world occurrences that unfold over space and time”. Compared to many other data sources, tweets serve as a massive and timely collection of facts and controversial opinions related to specific events [2]. Furthermore, events discussed on Twitter vary in both scale and category: some, such as presidential elections [6], reach global audiences, while others, such as wildfires [7, 8], appeal to local users. In general, studies of Twitter events can be categorized into natural events [3], political events [9], social events [10], and others [11].

Originating from the Topic Detection and Tracking (TDT) program, the detection of retrospective or new events from collections of news stories has been studied for over two decades [12]. Historically, a number of systems have been developed to automatically detect events from online news [13,14,15].

An event usually consists of many sub-events, which describe its various facets [7]. Furthermore, users tend to post new status updates about an event to keep track of its dynamics. Within an event, unexpected situations or outcomes may occur and surprise users, such as the bombing during the Boston Marathon or the verdict moment of the Zimmerman trial. By building an intelligent system, we can identify these sub-events and respond to them quickly, thereby mitigating crisis situations or maximizing marketing impact.

2 Background

Traditionally, unsupervised and supervised models have been widely applied to detect events from news sources. Clustering has been a classic approach for both Retrospective Event Detection (RED) and New Event Detection (NED) since the 1990s. Allan et al. [12] designed a single-pass clustering method with a threshold model to detect and track events from a collection of digitized news sources. Chen and Roy [16] applied clustering approaches such as DBSCAN to identify events in other user-generated content, such as photos.

Additionally, supervised algorithms such as naive Bayes, SVM, and gradient boosted decision trees have been proposed for event detection. Becker et al. [17] employed a naive Bayes classifier to label clustered tweets as event or non-event tweets using derived temporal, social, topical, and Twitter-centric features, where the tweets were first grouped by an incremental clustering algorithm. Sakaki et al. [3] applied a support vector machine with three key features to classify whether tweets were related to target events. Subsequently, they designed a spatial-temporal model to estimate the center of an earthquake and forecast the trajectory of a hurricane using Kalman filtering and particle filtering. Popescu and Pennacchiotti [18] proposed a gradient boosted decision tree based model integrated with a number of custom features to detect controversial events from Twitter streams.

Furthermore, ensemble approaches have also been employed to address the event detection problem. Sankaranarayanan et al. [19] first employed a classification scheme to classify tweets into different groups and then applied a clustering algorithm to identify events.

As argued by Meladianos et al. [1], sub-event detection has been receiving increasing attention from the event research community. To date, a number of studies have dealt with sub-event detection in an offline mode [20]. Zhao et al. [21] adopted a simple statistical approach to detect sub-events during NFL games when the tweeting rate suddenly rose above a predefined threshold. Chakrabarti and Punera [22] developed a two-phase model with a modified Hidden Markov Model to identify sub-events and then derived a summary of the tweet stream. However, their approach has a severe deficiency: it fails to work properly when unseen event types are involved. Zubiaga et al. [20] compared two different approaches for sub-event detection. The first approach measured recent tweeting activity and identified sub-events when the tweeting rate increased by at least 1.7 times compared to the previous period. The second approach relied on all previous tweeting activity and detected sub-events when the tweeting rate within 60 s exceeded 90% of all previously observed tweeting rates. As claimed by the authors, the latter outlier-based approach outperformed the increase-based approach because it avoided spurious detections in situations where low tweeting rates were preceded by even lower rates [20].

Nichols et al. [23] provided both an online and an offline approach to detect sub-events and summarize important event moments by comparing the slopes of status-update volume against a specific slope threshold, defined in their experiment as the median plus three times the standard deviation. Shen et al. [24] incorporated the “burstiness” and “cohesiveness” properties of tweets into a participant-based sub-event detection framework and developed a mixture model, tuned by EM, that identifies the important moments of an event. Chierichetti et al. [25] proposed a logistic regression classifier that uses tweet and retweet rates as features to capture new sub-events.

In this study, we formalize sub-event detection as an outlier detection problem, where a set of statistical models, the Kalman filter (KF), the Gaussian process (GP), and probabilistic principal component analysis (PPCA), is used to construct the probability distribution of future observations. Outliers are identified as observations that do not fit these predicted probability distributions. Three real-world case studies (2013 Boston marathon, 2013 NBA AllStar, Zimmerman trial) are investigated to test the effectiveness of the methods. Finally, we discuss the limitations of the proposed framework and provide future directions for improvement.

3 Methodology

Our goal is to model the evolution of the probability distribution of the tweeting change rate (increase/decrease) from period t − 1 to t, as defined in the following equation. Each period t spans 30 min, and #tweets represents the total number of tweets within that period, filtered for the particular event of interest.

$$ v_{t} = \frac{\# tweets_{t} - \# tweets_{t - 1}}{\# tweets_{t - 1} + 1} $$
(1)
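As a minimal illustration, the change rate of Eq. (1) can be computed from event-filtered tweet timestamps as sketched below; the pandas layout and the "created_at" column name are our own assumptions, not part of the original pipeline.

```python
# Minimal sketch of Eq. (1): bin event-filtered tweets into 30-minute periods
# and compute the percentage change v_t. The "created_at" column name and the
# pandas-based layout are illustrative assumptions.
import pandas as pd

def tweeting_change_rate(tweets: pd.DataFrame) -> pd.Series:
    counts = (
        tweets.set_index("created_at")   # datetime index of tweet timestamps
              .resample("30min")         # one bin per period t
              .size()
    )
    # v_t = (#tweets_t - #tweets_{t-1}) / (#tweets_{t-1} + 1)
    return (counts - counts.shift(1)) / (counts.shift(1) + 1)
```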

Three methods (KF, GP, PPCA), described in the following subsections, are evaluated for constructing the probability density function \( p(v_{t + 1} |v_{1:t} = \left\{ {v_{t}^{*} \ldots v_{1}^{*} } \right\}) \). All three approximate the target using a Gaussian density function. This probability distribution is then used to determine whether an observation \( v_{t + 1}^{*} \) is an outlier (* denotes the actual observed percentage change). An observation is labeled as an unexpected sub-event when it is identified as an outlier in a one-tailed test at the 0.025 significance level.

$$ p\left( v_{t + 1} \ge \left| v_{t + 1}^{*} \right| \mid v_{1:t} \right) < 0.025 $$
(2)
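Once a model supplies a Gaussian predictive mean and standard deviation, the test in Eq. (2) reduces to a one-tailed tail-probability check. A hedged sketch follows; the function and argument names are ours, not the paper's.

```python
# Sketch of the outlier test in Eq. (2): flag an observation as an unexpected
# sub-event when its upper-tail probability under the Gaussian predictive
# distribution is below 0.025. `pred_mean` and `pred_std` would come from the
# KF, GP, or PPCA model described in the following subsections.
from scipy.stats import norm

def is_sub_event(v_obs, pred_mean, pred_std, alpha=0.025):
    tail_prob = norm.sf(abs(v_obs), loc=pred_mean, scale=pred_std)  # P(v_{t+1} >= |v*|)
    return tail_prob < alpha
```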

3.1 Kalman Filter (KF)

The Kalman filter and its variants are widely applied in dynamic systems to estimate the state of a system [26, 27]. In this study, we assume that a latent variable \( h_{t} \), related to our quantity of interest, the percentage change \( v_{t} \), evolves over time according to the following linear dynamical system.

$$ h_{t} = Ah_{t - 1} + \eta_{t}^{h} $$
(3)
$$ v_{t} = Bh_{t} + \eta_{t}^{v} $$
(4)
$$ h_{1} \sim N(\mu_{0} , \sigma_{0}^{2} ) $$
(5)

Here, \( \eta_{t}^{h} \) is the process noise and \( \eta_{t}^{v} \) is the measurement noise. They are assumed to be independent of one another, temporally independent, and normally distributed according to \( N(0,\Sigma _{H} ) \) and \( N\left( {0,\Sigma _{V} } \right) \), respectively. The model parameters \( A, B,\Sigma _{H} ,\Sigma _{V} , \mu_{0} , \sigma_{0}^{2} \) are learned from the data using the Expectation Maximization (EM) algorithm [28].

The initial mean \( \mu_{0} \) and variance \( \sigma_{0}^{2} \) are obtained using data from a 12 h window, and EM is run on a 12 h moving window to determine the remaining parameters. Once the parameters are obtained, we make a prediction for the next 30 min to compute the probability \( p(v_{t + 1} |v_{1:t} = \left\{ {v_{t}^{*} \ldots v_{1}^{*} } \right\}) \) and test whether the next incoming observation is a sub-event.
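A sketch of this step using the pykalman library is given below; the window handling and the number of EM iterations are illustrative assumptions rather than the exact settings used in our experiments.

```python
# Hedged sketch of the KF step: learn the linear dynamical system of Eqs. (3)-(5)
# by EM on a 12 h window of half-hour changes, then form the one-step-ahead
# Gaussian predictive distribution for v_{t+1}. Uses pykalman; n_iter and the
# window construction are illustrative assumptions.
import numpy as np
from pykalman import KalmanFilter

def kf_predictive_distribution(window):
    """window: 1-D array of observed percentage changes v_1 ... v_t."""
    obs = np.asarray(window).reshape(-1, 1)
    kf = KalmanFilter(
        n_dim_state=1, n_dim_obs=1,
        em_vars=["transition_matrices", "observation_matrices",
                 "transition_covariance", "observation_covariance",
                 "initial_state_mean", "initial_state_covariance"],
    )
    kf = kf.em(obs, n_iter=10)            # learn A, B, Sigma_H, Sigma_V, mu_0, sigma_0^2
    means, covs = kf.filter(obs)          # posterior over the latent state h_t
    A, B = kf.transition_matrices, kf.observation_matrices
    Q, R = kf.transition_covariance, kf.observation_covariance
    h_mean = A @ means[-1]                # predict h_{t+1}
    h_cov = A @ covs[-1] @ A.T + Q
    v_mean = (B @ h_mean).item()          # predictive mean of v_{t+1}
    v_var = (B @ h_cov @ B.T + R).item()  # predictive variance of v_{t+1}
    return v_mean, np.sqrt(v_var)
```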

3.2 Gaussian Process (GP)

To better capture the non-linearity in the data, we have also tested Gaussian processes. A GP is a generalization of the multivariate Gaussian distribution to infinitely many variables [29]. Specifically, a GP defines a distribution over functions \( p(f) \), where \( f \) is a mapping function. In this study, we use a GP to capture the nonlinear relation between several past observations of the percentage change and future ones. Namely, we consider the following model.

$$ v_{t} = f\left( {v_{t - 1} , v_{t - 2} , v_{t - 3} } \right) + \epsilon_{t} $$
(6)

Here \( f\left( \cdot \right) \sim GP( \cdot |0, k) \) and \( \epsilon \sim N( \cdot |0, \sigma^{2} ) \), where \( k( \cdot , \cdot ) \) is the kernel function. Common choices for the kernel include the squared exponential, polynomial, and sigmoid kernel functions. In this work we use a cubic covariance function, whose parameters are determined by maximum likelihood estimation for each 24 h moving window, where the training data consist of inputs \( X = \left( {v_{t}^{*} ,v_{t - 1}^{*} ,v_{t - 2}^{*} } \right)_{t = 3 \ldots 47} \) and targets \( Y = \left( {v_{t}^{*} } \right)_{t = 4 \ldots 48} \).
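As a concrete sketch of the training-set construction (0-based indexing over a window of consecutive changes; the helper name is ours):

```python
# Sketch of assembling the autoregressive training pairs described above from a
# 24 h window of consecutive half-hour changes: each input row is
# (v_t, v_{t-1}, v_{t-2}) and the corresponding target is the next change v_{t+1}.
import numpy as np

def lagged_training_set(v, n_lags=3):
    v = np.asarray(v)
    X = np.column_stack([v[n_lags - 1 - k : len(v) - 1 - k] for k in range(n_lags)])
    y = v[n_lags:]
    return X, y
```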

Once the training is completed, the probability density function corresponding to a new input \( x_{*} = (v_{t}^{*} ,v_{t - 1}^{*} ,v_{t - 2}^{*} ) \) is obtained via conditioning of the joint as follows.

$$ p(v_{t + 1} \mid x_{*} , X, y) = N(\mu_{*} , \sigma_{*}^{2} ) $$
$$ \mu_{*} = K_{*N} (K_{N} + \sigma^{2} I)^{ - 1} y $$
$$ \sigma_{*}^{2} = K_{**} - K_{*N} \left( K_{N} + \sigma^{2} I \right)^{ - 1} K_{N*} + \sigma^{2} $$

Here, \( K_{N} \) denotes the Gram matrix whose entries are the kernel function evaluated at the corresponding pairs of inputs in the training data, \( K_{*N} \) is a row vector whose entries are the kernel function evaluated between the new input \( x_{*} \) and each of the training data points, and \( K_{**} \) is the kernel function evaluated at the new input point.
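A hedged scikit-learn sketch of this step is shown below; the degree-3 DotProduct kernel is a stand-in for the cubic covariance function, the WhiteKernel plays the role of the noise variance \( \sigma^{2} \), and `lagged_training_set` is the helper sketched above. Kernel choice and settings are illustrative, not the exact implementation.

```python
# Hedged sketch of the GP prediction using scikit-learn. The degree-3 DotProduct
# kernel stands in for the cubic covariance function, and WhiteKernel models the
# noise variance sigma^2; hyperparameters are fit by maximizing the marginal
# likelihood inside GaussianProcessRegressor.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

def gp_predictive_distribution(v_window):
    """v_window: 1-D array of consecutive half-hour changes (24 h window)."""
    X, y = lagged_training_set(v_window)              # helper sketched above
    kernel = DotProduct() ** 3 + WhiteKernel()        # cubic kernel + noise term
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X, y)
    x_new = np.asarray(v_window)[-3:][::-1].reshape(1, -1)  # (v_t, v_{t-1}, v_{t-2})
    mean, std = gp.predict(x_new, return_std=True)    # mu_* and sigma_* of the text
    return mean.item(), std.item()
```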

3.3 Probabilistic Principal Component Analysis (PPCA)

A third model is obtained by simply approximating the joint distribution \( p\left( {v_{t} , v_{t - 1} , v_{t - 2} ,v_{t - 3} } \right) \) with a Gaussian distribution based on 48 samples corresponding to each 24 h moving window. The prediction for the quantity of interest is obtained by conditioning this joint distribution on the past three observations. Since we need to approximate a covariance in 4 dimensions from such a small number of samples, we use a more robust estimator, PPCA, rather than simply computing the sample covariance matrix.

The PPCA model is defined as a linear relationship between the 4-dimensional observable \( [v_{t} , v_{t - 1} , v_{t - 2} ,v_{t - 3} ]^{T} \) and an M-dimensional latent variable \( z_{n} \), which follows a zero-mean normal distribution with unit covariance matrix [30]. In this study we set \( M = 2 \).

$$ [v_{t} , v_{t - 1} , v_{t - 2} ,v_{t - 3} ]^{T} = Wz_{n} + \mu + \epsilon_{n} $$
(7)

Here, \( W \) is a \( 4 \times 2 \) matrix, \( \mu \) is the data offset, and \( \epsilon_{n} \) is the projection error, which is assumed to follow an isotropic Gaussian distribution, \( \epsilon_{n} \sim N(0, \sigma^{2} I) \). We can then obtain the joint distribution of the features by integrating out the latent variable:

$$ [v_{t} , v_{t - 1} , v_{t - 2} ,v_{t - 3} ]^{T} \sim N(\mu , C) $$
(8)

Here, the covariance matrix \( C = WW^{T} + \sigma^{2} I \). The parameters \( W \), \( \mu \), and \( \sigma^{2} \) can be estimated either with the EM approach or by maximizing the following log-likelihood function [30].

$$ L = - \frac{N}{2}\left[ d\ln \left( 2\pi \right) + \ln \left| C \right| + \mathrm{tr}\left( C^{ - 1} S \right) \right] $$
(9)
$$ {\text{S}} = \frac{1}{N}\sum\nolimits_{n = 1}^{N} {(t_{n} - \mu )(t_{n} - \mu )^{T} } $$
(10)

4 Experiments

Twitter data were collected from Jan. 2, 2013 to Oct. 7, 2014 using the Twitter streaming APIs. We then handpicked three national events during this period: the 2013 Boston marathon event, the 2013 NBA AllStar event, and the Zimmerman trial event. For these events, we filtered relevant tweets using pre-specified keywords and hashtags; a basic summary of the events is shown in Table 1.
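The keyword filter can be sketched as below; the actual per-event keyword lists are not reproduced here, so the example terms are placeholders.

```python
# Illustrative sketch of the keyword/hashtag filter; the example terms are
# placeholders, not the keyword lists used for the three events.
import pandas as pd

def filter_event_tweets(tweets, keywords):
    """Keep tweets whose text mentions any of the event keywords or hashtags."""
    pattern = "|".join(keywords)
    mask = tweets["text"].str.contains(pattern, case=False, na=False)
    return tweets[mask]

# e.g. filter_event_tweets(tweets, ["boston marathon", "#bostonmarathon"])
```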

Table 1. Basic information for the picked events

For two of the three events, we detected sub-events using data retrieved over one week. For the Zimmerman trial event, however, part of the data was missing, so we used data collected over three days relevant to the event. Based on these data, we developed an online detection system that captures outliers. Figure 1 shows the daily pattern of the number of users and the number of tweets in the collected data. As the figure indicates, both the number of tweets and the number of users exhibit periodic patterns.

Fig. 1.

Daily patterns of the collected tweets between 04/12/2013 and 04/18/2013.

5 Results

As shown in Figs. 2, 3 and 4, the sub-plots from top to bottom show the outliers identified by the KF, GP, and PPCA algorithms, respectively. Red indicates the actual percentage change of tweets, green indicates the confidence interval, and cyan indicates the outliers identified by each algorithm. For the Boston marathon event, as shown in Fig. 2, there were 90, 4, and 7 sub-events detected by the KF, GP, and PPCA algorithms, respectively. Of these, 4 of the 90 sub-events identified by KF, 2 of those identified by GP, and 3 of those identified by PPCA were labelled as real sub-events. For this particular event, GP yielded the best precision, with 2 of its 4 identified sub-events being real. Meanwhile, KF achieved the best recall but with many false positives (Table 3).

Fig. 2.

Predicted sub-events with KF, GP, and PPCA, for the 2013 Boston marathon event. The cyan color indicates the sub-events identified by each algorithm. (Color figure online)

Fig. 3.

Predicted sub-events with KF, GP, and PPCA, for the Zimmerman trial event. The cyan color indicates the sub-events identified by each algorithm. (Color figure online)

Fig. 4.

Predicted sub-events with KF, GP, and PPCA, for the 2013 NBA AllStar event. The cyan color indicates the sub-events identified by each algorithm. (Color figure online)


Outliers of the Zimmerman trial event are visualized in Fig. 3. In terms of recall, both KF and PPCA captured 3 of the 10 sub-events, with PPCA achieving slightly better precision than KF. GP achieved the best precision overall.

Figure 4 shows the identified outliers for the NBA AllStar event. For this event, KF outperformed the other two methods, yielding slightly better recall and precision values. It captured 3 of the 12 sub-events, and 3 of its 5 predicted sub-events were real.

A summary of the three picked events is provided in Table 3. Overall, GP and PPCA yield similar F1 scores, with GP achieving better recall and PPCA achieving better precision. KF, compared to the other two methods, yields the best recall. This result is heavily influenced by the Boston event, for which KF identified many false positives. More interestingly, we notice that GP provides robust estimates of the uncertainty, while the other two methods yield inflated uncertainty estimates for the time windows immediately following outliers. This observation is illustrated by the large green confidence bounds after the two spikes in Figs. 2 and 3.

Table 3. Evaluation metrics of the picked events

6 Conclusion

In this study, we explore building an intelligent system for sub-event detection with three probabilistic models. The sub-events of an event of interest are captured by the system when a new observation falls outside the confidence bound of the predictive distribution. We demonstrate that the proposed system can capture sub-events with varying performance. The KF model produces slightly better recall, while the GP model is the most robust to outliers and yields the best precision. Compared to these two models, PPCA achieves a balanced performance on recall and precision, yielding the best overall F1 score. Nevertheless, the aggregated evaluation must be interpreted with caution because the performance is affected by individual events, outliers, and the proper choice of parameters. In future work, we will further tune the parameters, incorporate robust distributions (e.g., the t distribution), and take content features into consideration.