
1 Introduction

As an up-to-date information medium where people voice their opinions on social events, micro-blogging has become an important place where hot issues emerge and are discussed, owing to its short texts, rich content, relatively low barrier to entry, and fast propagation. In particular, users' following, reposting, and commenting help micro-blog events propagate. Because the information explosion floods out much valuable data, it has become hard for users to find the information they are interested in. Studying how to extract valuable information from vast numbers of micro-blog posts has therefore become a hot spot in computer science. Meanwhile, the detection of hot micro-blog events, an important branch of web public sentiment monitoring, has attracted both domestic and foreign academia, showing its great research value.

To date, researchers have done extensive work on micro-blog hot event detection, which can be divided into the following two categories [1]: (1) Methods focused on texts, in which micro-blogs are clustered into several clusters to identify hot events. For example, Shi [2] proposed a hot event evolution model to discover the user interest distribution, together with a hot event filtering algorithm to detect important events. Yang [3], devoted to clustering hot topics based on timing characteristics, presented the K_SC algorithm based on the hotness tendency of topics. However, the data sparsity caused by the shortness of micro-blogs, together with their abundant noise, makes identifying burst words after clustering relatively inefficient. (2) Methods focused on burst features, in which burst features are extracted and divided into groups, and unexpected events are identified from the feature groups. Yang [4] detected hot events through changes in the amount of emotional key words. Chen [5] detected burst features with an analysis method based on timing windows, and then used the Affinity Propagation algorithm to cluster them. Similarly, Zhao [6] proposed a real-time event detection method, named MC, that generates an intermediate semantic level from social multimedia data and is able to explore the high correlations among different micro-blogs. The aforementioned methods improve the effectiveness of burst event detection only theoretically and do not achieve satisfying results in real-life applications. The most fundamental cause is topic drift as time changes during event detection.

To improve the accuracy of micro-blog hot event detection and reduce its complexity, we propose a micro-blog hot event detection algorithm based on the restart random walk model and modularity, which divides detection into two phases. Phase 1: learn hidden semantic relations among terms through the restart random walk algorithm. Phase 2: cluster terms with the idea of modularity based on the former result, and find hot events. The main contributions of this work are as follows:

  1. To learn hidden semantic relations among terms, we construct an undirected weighted graph and run the restart random walk algorithm on it.

  2. We apply the idea of modularity to cluster terms for hot event detection, establishing a correspondence between hot words and hot events.

  3. We conduct three experiments on two datasets to verify the effectiveness of the hot event detection algorithm, which demonstrate promising results compared to kindred methods.

The remainder of this paper is organized as follows. Section 2 discusses the preliminary knowledge of our proposed method. Section 3 constructs the graph and acquires association relationships between words. Section 4 finds hot events based on the restart random walk and modularity. Experimental results are discussed in Sect. 5, and conclusions are drawn in Sect. 6.

2 Preliminary Knowledge

2.1 Hot Degree

Micro-blogs are sensitive to hot events and can therefore reflect them. A popular micro-blog, whose numbers of comments and reposts grow steadily, spreads very quickly, which is why we need a metric to measure how much attention a micro-blog receives [7].

Assume that user ui posted a micro-blog mb and that user uj reposted it within time ∆t. The repost value of uj for this micro-blog is then defined as ret(mb, uj):

$$ ret(mb,u_{j} ) = \begin{cases} 1 - sim(u_{i} ,u_{j} ) & \text{if } u_{i} \text{ is similar to } u_{j} \\ 1 & \text{otherwise} \end{cases} $$
(1)

Similarly, the comment value com(mb, uj) of uj for this micro-blog is defined as follows:

$$ com(mb,u_{j} ) = \begin{cases} 1 - sim(u_{i} ,u_{j} ) & \text{if } u_{i} \text{ is similar to } u_{j} \\ 1 & \text{otherwise} \end{cases} $$
(2)

Here sim(ui, uj) denotes the similarity between users, computed with the user similarity method of [8]: \( sim(u_{i} ,u_{j} ) = \frac{{|F(u_{i} ) \cap F(u_{j} )|}}{{|F(u_{i} ) \cup F(u_{j} )|}} \), where F(ui) denotes the set of users that ui follows.

Based on formulas (1) and (2), we define the hot degree as follows.

Definition:

The hot degree Hot(mb) of a micro-blog equals the weighted sum of its repost values ret(mb, uj) and comment values com(mb, uj). After normalization it is:

$$ Hot(mb) = \frac{{\lambda \sum\limits_{j = 1}^{l} {ret(mb,u_{j} )} + (1 - \lambda )\sum\limits_{j = 1}^{h} {com(mb,u_{j} )} }}{l + h} $$
(3)

Here λ is an adjustment parameter with 0 < λ < 1, l is the number of reposts, and h is the number of comments. By this definition, whether an event is hot is directly related to the hot degree rather than to the content of the micro-blog itself.
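For concreteness, the following is a minimal Python sketch of the computation in Eqs. (1)-(3). The follow-set representation and the similarity threshold `theta` deciding when two users count as "similar" are illustrative assumptions, since the paper does not fix them.

```python
# Sketch of Eqs. (1)-(3); data structures and the threshold theta are assumptions.

def sim(follows_i: set, follows_j: set) -> float:
    """Jaccard similarity of two users' follow sets (method of [8])."""
    union = follows_i | follows_j
    return len(follows_i & follows_j) / len(union) if union else 0.0

def interaction_value(follows_poster: set, follows_actor: set,
                      theta: float = 0.5) -> float:
    """ret(mb, u_j) or com(mb, u_j): 1 - sim if the users are similar, else 1."""
    s = sim(follows_poster, follows_actor)
    return 1.0 - s if s >= theta else 1.0

def hot_degree(repost_values: list, comment_values: list,
               lam: float = 0.5) -> float:
    """Hot(mb) per Eq. (3): weighted sum of repost and comment values over l + h."""
    l, h = len(repost_values), len(comment_values)
    if l + h == 0:
        return 0.0
    return (lam * sum(repost_values) + (1 - lam) * sum(comment_values)) / (l + h)
```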

2.2 Co-occurrence Degree Between Words

Given a micro-blog mb, the co-occurrence degree of terms ti and tj is denoted c(ti, tj) and defined as follows [9]:

$$ c(t_{i} ,t_{j} ) = e^{{ - dist(t_{i} ,t_{j} )}} \quad \text{if } t_{i} \in mb \text{ and } t_{j} \in mb $$
(4)

dist(ti, tj) is the co-occurrence distance between ti and tj, i.e., the number of words between ti and tj in micro-blog mb. The co-occurrence degree c(ti, tj) reflects that two words are correlated if they often appear in the same micro-blog.
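A minimal sketch of Eq. (4) follows; taking the closest pair of occurrences when a term appears several times is an assumption, as the paper does not specify that case.

```python
import math

def cooccurrence(tokens: list, ti: str, tj: str) -> float:
    """c(t_i, t_j) = exp(-dist(t_i, t_j)) per Eq. (4) for one tokenized micro-blog."""
    pos_i = [k for k, t in enumerate(tokens) if t == ti]
    pos_j = [k for k, t in enumerate(tokens) if t == tj]
    if not pos_i or not pos_j:
        return 0.0                              # both terms must occur in mb
    # number of words strictly between the closest occurrences (assumption)
    dist = min(abs(a - b) - 1 for a in pos_i for b in pos_j)
    return math.exp(-max(dist, 0))
```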

3 Acquire Association Relationship Between Words

\( MB = \{ mb_{1} ,mb_{2} , \ldots ,mb_{N} \} \) is the micro-blog set, \( mb_{i} = \left\{ {t_{i1} ,t_{i2} , \ldots ,t_{{i\left| {mb_{i} } \right|}} } \right\} \) is the i-th micro-blog, and the candidate term set is \( MT = \{ t_{1} ,t_{2} , \ldots ,t_{m} \} \), where m is the size of the dictionary.

3.1 Construct Graph Model

We construct an undirected weighted graph G = (V, E), where \( V = \left\{ {v_{1} ,v_{2} , \ldots ,v_{M} } \right\} \) is the vertex set, M is the number of vertices remaining after filtering, and each vertex vi corresponds to a candidate term in MT. We then connect any two vertices in V that come from the same micro-blog, so the edge set is \( E = \left\{ {\left( {v_{i} ,v_{j} } \right)|v_{i} \in mb\;{\text{and}}\;v_{j} \in mb} \right\} \). Note: in the rest of this paper, vi denotes the vertex corresponding to term ti.

First, we build the weight matrix A′ shown in Fig. 1. In A′, element \( w_{ij} = w(v_{i} ,v_{j} ) \) is the weight of edge \( (v_{i} ,v_{j} ) \), defined as the sum of the co-occurrence degrees of terms vi and vj over the micro-blog set, as Eq. (5) shows:

Fig. 1. The weighted adjacency matrix

Fig. 2. The weighted matrix after normalization

$$ w(v_{i} ,v_{j} ) = \begin{cases} \sum\limits_{mb \in MB} {c(v_{i} ,v_{j} )} & (v_{i} ,v_{j} ) \in E \\ 0 & \text{otherwise} \end{cases} $$
(5)

Afterwards, we apply a normalization (an asymmetric operation) to matrix A′ to obtain matrix A [10]. The value of element cij is calculated through formula (6).

$$ c_{ij} = \frac{{w_{ij} }}{{n_{j} + 0.01}} $$
(6)

Here 0 ≤ cij ≤ 1, each column of A approximately sums to one, and \( n_{j} = \sum\limits_{i} {w_{ij} } \) is the sum of the elements of the j-th column of A′. The rest of this paper is developed on graph G.
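As a sketch under the definitions above (reusing the hypothetical `cooccurrence` helper from Sect. 2.2), the weight matrix of Eq. (5) and the column normalization of Eq. (6) might be assembled as follows:

```python
import numpy as np

def build_transition_matrix(microblogs: list, vocab: list) -> np.ndarray:
    """Accumulate W per Eq. (5), then column-normalize per Eq. (6) to get A."""
    idx = {t: k for k, t in enumerate(vocab)}
    M = len(vocab)
    W = np.zeros((M, M))
    for tokens in microblogs:                    # each micro-blog as a token list
        present = [t for t in set(tokens) if t in idx]
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                c = cooccurrence(tokens, present[a], present[b])
                i, j = idx[present[a]], idx[present[b]]
                W[i, j] += c                     # undirected: symmetric weights
                W[j, i] += c
    n = W.sum(axis=0)                            # n_j: column sums of A' (= W)
    return W / (n + 0.01)                        # smoothing constant from Eq. (6)
```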

3.2 Restart Random Walk on Graph

The random walk model [11] traverses a graph beginning from one vertex or a series of vertices. At any vertex, the traverser randomly selects an edge incident to that vertex with a certain probability and jumps to the next vertex along it, or jumps back to the starting point with a certain probability. Its mathematical expression is:

$$ r^{(t + 1)} = (1 - \alpha ) * C * r^{(t)} + \alpha * d $$
(7)

Here C is the transition probability matrix, r(t) is the probability distribution at step t, d is the restart vector, i.e., the probability distribution over vertices when a jump back occurs, and α is an adjusting factor that controls the degree of reliance among terms.

First, assume the random walk starts from vertex v in graph G. The closer vj is to v, the more likely the walk moves to vj. Matrix A captures the co-occurrence relations between any two words, which is consistent with this walking tendency, so A is selected as the transition probability matrix, i.e., C = A.

Next, we determine the initial vector r(0); its values are given in formula (8). Assuming h = index(v) gives the index of vertex v in G, it can be seen from the formula that r(0) is in fact the transpose of the h-th row vector of matrix A.

$$ r^{(0)} (j) = \left\{ {\begin{array}{*{20}l} 0 \hfill & {(v,v_{j} ) \notin E} \hfill \\ {c_{hj} } \hfill & {(v,v_{j} ) \in E} \hfill \\ \end{array} } \right. $$
(8)

Finally, we hypothesize that the starting point is selected uniformly at random, so the initial probability distribution is \( d = \left[ {\frac{1}{m},\frac{1}{m}, \ldots ,\frac{1}{m}} \right]^{T} \) with m entries.

With all the parameters of formula (7) determined, we substitute them into formula (7) and iterate until r converges to a stable state. Ultimately, vector r describes the comprehensive semantic relations between vertex v and the other vertices. Taking each vertex of the graph as the start point in turn and repeating this process yields a matrix, denoted P, that reflects the semantic relations among all term pairs.
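A compact sketch of this procedure is shown below; the convergence tolerance and iteration cap are assumptions, since the paper leaves the stopping criterion open.

```python
import numpy as np

def restart_random_walk(A: np.ndarray, alpha: float = 0.15,
                        tol: float = 1e-8, max_iter: int = 1000) -> np.ndarray:
    """Iterate Eq. (7) from every start vertex; row h of the returned matrix P
    holds the converged distribution for the vertex with index h."""
    m = A.shape[0]
    d = np.full(m, 1.0 / m)                      # uniform restart vector
    P = np.zeros((m, m))
    for h in range(m):
        r = A[h, :].copy()                       # r^(0): h-th row of A, per Eq. (8)
        for _ in range(max_iter):
            r_next = (1 - alpha) * (A @ r) + alpha * d
            converged = np.abs(r_next - r).sum() < tol
            r = r_next
            if converged:                        # stop once r is stable (assumption)
                break
        P[h, :] = r
    return P
```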

4 Find Hot Event Using Modularity

In this section we introduce the idea of modularity to cluster words using matrix P; hot events are then found by filtering.

4.1 Modularity

Further research on the web has shown that many large complex networks are made up of communities: nodes are connected densely within each community, while connections among communities are relatively sparse. Modularity [12] is a common metric to evaluate the quality of a community partition in a complex network; its formula is as follows:

$$ Q = \sum\limits_{i = 1}^{k} {(e_{ii} - a_{i}^{2} } ) $$
(9)

eii is the fraction of edges that connect vertices within community Ci, and \( a_{i} = \sum\limits_{j} {e_{ij} } \), i ≠ j, is the fraction of edges that link vertices in community Ci to vertices in other communities Cj; k is the number of communities. Newman points out that it is difficult to recognize whether Q has reached its maximum. Therefore, this paper introduces the modularity increment to determine whether communities are partitioned properly and to decide when the partition terminates. The modularity increment is defined as:

$$ \Delta Q = 2(e_{ij} - a_{i} * a_{j} ) $$
(10)

4.2 Hot Event Detection Algorithm

In this paper we adopt the idea of modularity, take the graph as the partition object, and use the correlation matrix P of Sect. 3.2 as prior information. The results of the graph partitioning are the hot events. As a prerequisite, each node of the graph is initially treated as an independent cluster. First, we find the maximum element of matrix P; assuming max(P) = pij, the two vertices vi and vj are merged into the same cluster. We then continue searching for the maximum in P and calculate the modularity increment ΔQ: if ΔQ > 0, the clusters are merged; otherwise, the partition terminates.

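The original pseudocode is given as a figure; the following is a minimal Python sketch of the merging procedure just described, assuming the weighted adjacency matrix W of graph G supplies the e and a quantities of Eqs. (9) and (10).

```python
import numpy as np

def detect_hot_events(P: np.ndarray, W: np.ndarray) -> list:
    """Greedy merging: start from singleton clusters and merge pairs in
    decreasing order of P while the increment of Eq. (10) stays positive."""
    m = P.shape[0]
    cluster = list(range(m))                        # each node starts alone
    total = W.sum()                                 # total (doubled) edge weight
    pairs = sorted(((P[i, j], i, j) for i in range(m) for j in range(i + 1, m)),
                   reverse=True)                    # strongest relations first
    for _, i, j in pairs:
        ci, cj = cluster[i], cluster[j]
        if ci == cj:
            continue
        in_i = [v for v in range(m) if cluster[v] == ci]
        in_j = [v for v in range(m) if cluster[v] == cj]
        e_ij = W[np.ix_(in_i, in_j)].sum() / total  # fraction between C_i and C_j
        a_i = W[in_i, :].sum() / total              # fraction of edge ends in C_i
        a_j = W[in_j, :].sum() / total
        if 2 * (e_ij - a_i * a_j) <= 0:             # Delta Q of Eq. (10)
            break                                   # partition terminates
        for v in in_j:                              # merge C_j into C_i
            cluster[v] = ci
    return cluster                                  # cluster ids = detected events
```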

5 Experimental Results and Analysis

5.1 Experimental Data

Data Set 1: Messages posted on Sina Weibo from January to June 2016 were sampled manually as experimental data. To match real conditions as closely as possible, a large amount of noise data was added during manual sampling, yielding a noise-contaminated data set of 2541 Weibo posts covering 8 hot events in all. Of these micro-blogs, 1749 describe events and 792 are noise. Pre-processing steps including word segmentation and stop word removal were then applied, and isolated words were filtered according to the affinities among terms. Finally, 12,000 terms remain.

Data Set 2: In total, the titles of 3755 papers in 6 categories of data mining were drawn from DBLP for the experiments: text clustering (614), text classification (484), video processing (516), speech recognition (685), image processing (960), and graphical models (496). After pre-processing such as removal of stop words and HTML tags, we obtain the final experimental data set.

5.2 Comparative Analysis on the Results

Three experiments are designed to verify the effectiveness of the hot event detection algorithm. Experiment 1 uses dataset 1 to extract hot topics. Experiment 2 adjusts important parameters of our algorithm to observe their influence on the detected hot events. Experiment 3 compares our method with existing methods of the same kind. We adopt NMI and ARI [13, 14] as the evaluation criteria.

In Experiment 1 we use the same parameter for the restart random walk model as in work [10], i.e., α = 0.15, and construct the matrix representation of the micro-blogs from data set 1. Hot events are obtained with the proposed algorithm; selected results are shown in Table 1. Key terms with stronger correlations are selected to describe the hot events and are compared against hot events published by authoritative institutions, showing good agreement with actual network hot events.

Table 1. Comparison of real hot events and hot words detected by our method

Experiment 2: There are two parameters, λ and β. λ balances the contributions of the numbers of reposts and comments to hotness, and β controls how much the affinities among words contribute to the hot word extraction results. We study their influence on topic word extraction by setting different values: λ is set to 0.5, 0.55, and 0.6, while β ranges from 0.01 to 0.08.

The experimental results are illustrated in panels (a) and (b) of Fig. 3. The effect of the repost value on the results is slightly stronger than that of the comment value, and the worst performance occurs at λ = 0.6, so its comparison diagram is omitted. It can also be observed that NMI and ARI rise slowly before β = 0.03 and reach their maximum at β = 0.03. As β keeps rising toward the maximum value allowed by theory, effectiveness starts decreasing; in particular, NMI and ARI fall quickly after β = 0.05.

Fig. 3. Effect of parameters on the result of hot event detection

Experiment 3: We compare our method with the DPSO algorithm of literature [15] and the MCF method of work [16] on the two data sets. By mining mutual information between words and internal/external correlation information, DPSO finds micro-blog hot events from the most suitable angle. MCF uses a topic model to extract micro-blog themes and introduces a word activation force model to generate hot events. The comparative results of our method and the other two methods are illustrated in panels (a) and (b) of Fig. 4.

Fig. 4. Effect of the three methods on clustering results (a: dataset 1, b: dataset 2)

Figure 4 shows that our method achieves slightly higher NMI and ARI than the other two methods. A possible cause is that our method mines both surface and hidden semantic relations among terms as thoroughly as possible, which makes the semantic expression of micro-blogs clear, whereas drawbacks of the other methods, such as sensitivity to noise, result in many low-quality feature items and few thematic words. The results in Fig. 4 also show the superiority of our method. Meanwhile, since data set 2 is cleaner and introduces less disturbance, the results obtained on it are higher than those on dataset 1.

6 Conclusions

This paper proposes a hot event detection method based on the restart random walk model and community partition. The main design idea is to compute the explicit and hidden semantic relations among lexical items by running the restart random walk algorithm iteratively on a graph and constructing a semantic correlation matrix. Meanwhile, the idea of community partition is introduced, and an algorithm that performs word clustering with the semantic correlation matrix is designed to obtain the set of hot events. The experimental results show that the detected hot events are consistent with real-time events, so the detection is effective. In future work, research on reducing outliers in the feature word sets, on the initialization of the random walk model matrices, and on the convergence criteria of the community partition can be performed, and expert dictionaries or lexicons could be introduced, to further raise the accuracy of hot event detection.