
1 Introduction

As an up-to-date information medium where people voice their opinions on social events, micro-blogging has become an important place where hot issues emerge and are discussed, owing to its short texts, rich content, relatively low barrier to entry, and fast propagation. In particular, users' following, reposting, and commenting help micro-blog events propagate. Because the information explosion floods out much valuable data, it has become hard for users to find the information they are interested in. Studying how to extract valuable information from vast numbers of micro-blog posts has therefore become a hot spot in computer science. Meanwhile, the detection of hot micro-blog events, an important branch of web public sentiment monitoring, has attracted both domestic and foreign academia, showing its great research value.

To date, researchers have done extensive work on micro-blog hot event detection, which can be divided into the following two categories [1]: (1) Methods focused on texts, in which micro-blogs are clustered into several clusters to identify hot events. For example, Shi [2] proposed a hot event evolution model to discover the user interest distribution, together with a hot event filtering algorithm to detect important events. Yang [3], devoted to clustering hot topics based on timing characteristics, presented the K_SC algorithm based on the hotness tendency of topics. However, the data sparsity caused by the shortness of micro-blogs, together with their abundant noise, makes identifying burst words after clustering relatively inefficient. (2) Methods focused on burst features, in which burst features are extracted and divided into groups, and unexpected events are identified from the feature groups. Yang [4] detected hot events through changes in the amount of emotional key words. Chen [5] detected burst features with an analysis method based on timing windows, and then used the Affinity Propagation algorithm to cluster them. Similarly, Zhao [6] proposed a real-time event detection method, named MC, that generates an intermediate semantic level from social multimedia data and is able to explore the high correlations among different micro-blogs. The aforementioned methods improve the effectiveness of burst event detection only theoretically and do not achieve satisfying results in real-life applications. The most fundamental cause is topic drift as time changes during event detection.

To improve the accuracy of micro-blog hot event detection and reduce its complexity, we propose a micro-blog hot event detection algorithm based on the restart random walk model and modularity, which divides detection into two phases. Phase 1: learn hidden semantic relations among terms through the restart random walk algorithm. Phase 2: cluster terms with the idea of modularity based on the former result, and find hot events. The main contributions of this work are as follows:

  1. To learn hidden semantic relations among terms, we construct an undirected weighted graph and run the restart random walk algorithm on it.

  2. We apply the idea of modularity to cluster terms for hot event detection, establishing a correspondence between hot words and hot events.

  3. We conduct three experiments on two datasets to verify the effectiveness of the hot event detection algorithm, which demonstrate promising results compared to kindred methods.

The remainder of this paper is organized as follows. Section 2 discusses the preliminary knowledge of our proposed method. Section 3 constructs the graph and acquires association relationships between words. Section 4 finds hot events based on the restart random walk and modularity. Experimental results are discussed in Sect. 5, and conclusions are drawn in Sect. 6.

2 Preliminary Knowledge

2.1 Hot Degree

Micro-blogs are sensitive to hot events and can therefore reflect them. A popular micro-blog, whose numbers of comments and reposts grow steadily, spreads very quickly, which is why we need a metric to measure how much attention a micro-blog receives [7].

Assume that user ui posted a micro-blog mb and that user uj reposted it within time ∆t. The repost value of uj for this micro-blog is then defined as ret(mb, uj):

$$ ret(mb,u_{j} ) = \begin{cases} 1 - sim(u_{i} ,u_{j} ) & \text{if } u_{i} \text{ is similar to } u_{j} \\ 1 & \text{otherwise} \end{cases} $$
(1)

Similarly, the comment value com(mb, uj) of uj for this micro-blog is defined as follows:

$$ com(mb,u_{j} ) = \begin{cases} 1 - sim(u_{i} ,u_{j} ) & \text{if } u_{i} \text{ is similar to } u_{j} \\ 1 & \text{otherwise} \end{cases} $$
(2)

Here sim(ui, uj) denotes the similarity between users, computed with the user similarity method of [8]: \( sim(u_{i} ,u_{j} ) = \frac{{|F(u_{i} ) \cap F(u_{j} )|}}{{|F(u_{i} ) \cup F(u_{j} )|}} \), where F(ui) denotes the set of users that ui follows.

Based on formulas (1) and (2), we define the hot degree as follows.

Definition:

The hot degree Hot(mb) of a micro-blog equals the weighted sum of its repost values ret(mb, uj) and comment values com(mb, uj). After normalization it is:

$$ Hot(mb) = \frac{{\lambda \sum\limits_{j = 1}^{l} {ret(mb,u_{j} )} + (1 - \lambda )\sum\limits_{j = 1}^{h} {com(mb,u_{j} )} }}{l + h} $$
(3)

Here λ is an adjustment parameter with 0 < λ < 1, l is the number of reposts, and h is the number of comments. By this definition, whether an event is hot is directly related to the hot degree rather than to the content of the micro-blog itself.
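For concreteness, the following is a minimal Python sketch of the computation in Eqs. (1)-(3). The follow-set representation and the similarity threshold `theta` deciding when two users count as "similar" are illustrative assumptions, since the paper does not fix them.

```python
# Sketch of Eqs. (1)-(3); data structures and the threshold theta are assumptions.

def sim(follows_i: set, follows_j: set) -> float:
    """Jaccard similarity of two users' follow sets (method of [8])."""
    union = follows_i | follows_j
    return len(follows_i & follows_j) / len(union) if union else 0.0

def interaction_value(follows_poster: set, follows_actor: set,
                      theta: float = 0.5) -> float:
    """ret(mb, u_j) or com(mb, u_j): 1 - sim if the users are similar, else 1."""
    s = sim(follows_poster, follows_actor)
    return 1.0 - s if s >= theta else 1.0

def hot_degree(repost_values: list, comment_values: list,
               lam: float = 0.5) -> float:
    """Hot(mb) per Eq. (3): weighted sum of repost and comment values over l + h."""
    l, h = len(repost_values), len(comment_values)
    if l + h == 0:
        return 0.0
    return (lam * sum(repost_values) + (1 - lam) * sum(comment_values)) / (l + h)
```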

2.2 Co-occurrence Degree Between Words

Given a micro-blog mb, the co-occurrence degree of terms ti and tj is denoted c(ti, tj) and defined as follows [9]:

$$ c(t_{i} ,t_{j} ) = e^{{ - dist(t_{i} ,t_{j} )}} \quad \text{if } t_{i} \in mb \text{ and } t_{j} \in mb $$
(4)

dist(ti, tj) is the co-occurrence distance between ti and tj, i.e., the number of words between ti and tj in micro-blog mb. The co-occurrence degree c(ti, tj) reflects that two words are correlated if they often appear in the same micro-blog.
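A minimal sketch of Eq. (4) follows; taking the closest pair of occurrences when a term appears several times is an assumption, as the paper does not specify that case.

```python
import math

def cooccurrence(tokens: list, ti: str, tj: str) -> float:
    """c(t_i, t_j) = exp(-dist(t_i, t_j)) per Eq. (4) for one tokenized micro-blog."""
    pos_i = [k for k, t in enumerate(tokens) if t == ti]
    pos_j = [k for k, t in enumerate(tokens) if t == tj]
    if not pos_i or not pos_j:
        return 0.0                              # both terms must occur in mb
    # number of words strictly between the closest occurrences (assumption)
    dist = min(abs(a - b) - 1 for a in pos_i for b in pos_j)
    return math.exp(-max(dist, 0))
```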

3 Acquire Association Relationship Between Words

\( MB = \{ mb_{1} ,mb_{2} , \ldots ,mb_{N} \} \) is the micro-blog set, \( mb_{i} = \left\{ {t_{i1} ,t_{i2} , \ldots ,t_{{i\left| {mb_{i} } \right|}} } \right\} \) is the i-th micro-blog, and the candidate term set is \( MT = \{ t_{1} ,t_{2} , \ldots ,t_{m} \} \), where m is the size of the dictionary.

3.1 Construct Graph Model

We construct an undirected weighted graph G = (V, E), where \( V = \left\{ {v_{1} ,v_{2} , \ldots ,v_{M} } \right\} \) is the vertex set, M is the number of vertices remaining after filtering, and each vertex vi corresponds to a candidate term in MT. We then connect any two vertices in V that come from the same micro-blog, so the edge set is \( E = \left\{ {\left( {v_{i} ,v_{j} } \right)|v_{i} \in mb\;{\text{and}}\;v_{j} \in mb} \right\} \). Note: in the rest of this paper, vi denotes the vertex corresponding to term ti.

First, we build the weight matrix A′ shown in Fig. 1. In A′, element \( w_{ij} = w(v_{i} ,v_{j} ) \) is the weight of edge \( (v_{i} ,v_{j} ) \), defined as the sum of the co-occurrence degrees of terms vi and vj over the micro-blog set, as Eq. (5) shows:

Fig. 1. The weighted adjacency matrix

Fig. 2. The weighted matrix after normalization

$$ w(v_{i} ,v_{j} ) = \begin{cases} \sum\limits_{mb \in MB} {c(v_{i} ,v_{j} )} & (v_{i} ,v_{j} ) \in E \\ 0 & \text{otherwise} \end{cases} $$
(5)

Afterwards, we apply a normalization (an asymmetric operation) to matrix A′ to obtain matrix A [10]. The value of element cij is calculated through formula (6).

$$ c_{ij} = \frac{{w_{ij} }}{{n_{j} + 0.01}} $$
(6)

Here 0 ≤ cij ≤ 1, each column of A approximately sums to one, and \( n_{j} = \sum\limits_{i} {w_{ij} } \) is the sum of the elements of the j-th column of A′. The rest of this paper is developed on graph G.
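As a sketch under the definitions above (reusing the hypothetical `cooccurrence` helper from Sect. 2.2), the weight matrix of Eq. (5) and the column normalization of Eq. (6) might be assembled as follows:

```python
import numpy as np

def build_transition_matrix(microblogs: list, vocab: list) -> np.ndarray:
    """Accumulate W per Eq. (5), then column-normalize per Eq. (6) to get A."""
    idx = {t: k for k, t in enumerate(vocab)}
    M = len(vocab)
    W = np.zeros((M, M))
    for tokens in microblogs:                    # each micro-blog as a token list
        present = [t for t in set(tokens) if t in idx]
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                c = cooccurrence(tokens, present[a], present[b])
                i, j = idx[present[a]], idx[present[b]]
                W[i, j] += c                     # undirected: symmetric weights
                W[j, i] += c
    n = W.sum(axis=0)                            # n_j: column sums of A' (= W)
    return W / (n + 0.01)                        # smoothing constant from Eq. (6)
```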

3.2 Restart Random Walk on Graph

The random walk model [11] traverses a graph beginning from one vertex or a series of vertices. At any vertex, the traverser randomly selects an edge incident to that vertex with a certain probability and jumps to the next vertex along it, or jumps back to the starting point with a certain probability. Its mathematical expression is:

$$ r^{(t + 1)} = (1 - \alpha ) * C * r^{(t)} + \alpha * d $$
(7)

Here C is the transition probability matrix, r(t) is the probability distribution at step t, d is the restart vector, i.e., the probability distribution over vertices when a jump back occurs, and α is an adjusting factor that controls the degree of reliance among terms.

First, assume the random walk starts from vertex v in graph G. The closer vj is to v, the more likely the walk moves to vj. Matrix A captures the co-occurrence relations between any two words, which is consistent with this walking tendency, so A is selected as the transition probability matrix, i.e., C = A.

Next, we determine the initial vector r(0); its values are given in formula (8). Assuming h = index(v) gives the index of vertex v in G, it can be seen from the formula that r(0) is in fact the transpose of the h-th row vector of matrix A.

$$ r^{(0)} (j) = \left\{ {\begin{array}{*{20}l} 0 \hfill & {(v,v_{j} ) \notin E} \hfill \\ {c_{hj} } \hfill & {(v,v_{j} ) \in E} \hfill \\ \end{array} } \right. $$
(8)

Finally, we hypothesize that the starting point is selected uniformly at random, so the initial probability distribution is \( d = \left[ {\frac{1}{m},\frac{1}{m}, \ldots ,\frac{1}{m}} \right]^{T} \) with m entries.

With all the parameters of formula (7) determined, we substitute them into formula (7) and iterate until r converges to a stable state. Ultimately, vector r describes the comprehensive semantic relations between vertex v and the other vertices. Taking each vertex of the graph as the start point in turn and repeating this process yields a matrix, denoted P, that reflects the semantic relations among all term pairs.
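A compact sketch of this procedure is shown below; the convergence tolerance and iteration cap are assumptions, since the paper leaves the stopping criterion open.

```python
import numpy as np

def restart_random_walk(A: np.ndarray, alpha: float = 0.15,
                        tol: float = 1e-8, max_iter: int = 1000) -> np.ndarray:
    """Iterate Eq. (7) from every start vertex; row h of the returned matrix P
    holds the converged distribution for the vertex with index h."""
    m = A.shape[0]
    d = np.full(m, 1.0 / m)                      # uniform restart vector
    P = np.zeros((m, m))
    for h in range(m):
        r = A[h, :].copy()                       # r^(0): h-th row of A, per Eq. (8)
        for _ in range(max_iter):
            r_next = (1 - alpha) * (A @ r) + alpha * d
            converged = np.abs(r_next - r).sum() < tol
            r = r_next
            if converged:                        # stop once r is stable (assumption)
                break
        P[h, :] = r
    return P
```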

4 Find Hot Event Using Modularity

In this section we introduce the idea of modularity to cluster words using matrix P; hot events are then found by filtering.

4.1 Modularity

Further research on the web has shown that many large complex networks are made up of communities: nodes are connected densely within each community, while connections among communities are relatively sparse. Modularity [12] is a common metric to evaluate the quality of a community partition in a complex network; its formula is as follows:

$$ Q = \sum\limits_{i = 1}^{k} {(e_{ii} - a_{i}^{2} } ) $$
(9)

eii is the fraction of edges that connect vertices within community Ci, and \( a_{i} = \sum\limits_{j} {e_{ij} } \), i ≠ j, is the fraction of edges that link vertices in community Ci to vertices in other communities Cj; k is the number of communities. Newman points out that it is difficult to recognize whether Q has reached its maximum. Therefore, this paper introduces the modularity increment to determine whether communities are partitioned properly and to decide when the partition terminates. The modularity increment is defined as:

$$ \Delta Q = 2(e_{ij} - a_{i} * a_{j} ) $$
(10)

4.2 Hot Event Detection Algorithm

In this paper we adopt the idea of modularity, take the graph as the partition object, and use the correlation matrix P of Sect. 3.2 as prior information. The results of the graph partitioning are the hot events. As a prerequisite, each node of the graph is initially treated as an independent cluster. First, we find the maximum element of matrix P; assuming max(P) = pij, the two vertices vi and vj are merged into the same cluster. We then continue searching for the maximum in P and calculate the modularity increment ΔQ: if ΔQ > 0, the clusters are merged; otherwise, the partition terminates.

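The original pseudocode is given as a figure; the following is a minimal Python sketch of the merging procedure just described, assuming the weighted adjacency matrix W of graph G supplies the e and a quantities of Eqs. (9) and (10).

```python
import numpy as np

def detect_hot_events(P: np.ndarray, W: np.ndarray) -> list:
    """Greedy merging: start from singleton clusters and merge pairs in
    decreasing order of P while the increment of Eq. (10) stays positive."""
    m = P.shape[0]
    cluster = list(range(m))                        # each node starts alone
    total = W.sum()                                 # total (doubled) edge weight
    pairs = sorted(((P[i, j], i, j) for i in range(m) for j in range(i + 1, m)),
                   reverse=True)                    # strongest relations first
    for _, i, j in pairs:
        ci, cj = cluster[i], cluster[j]
        if ci == cj:
            continue
        in_i = [v for v in range(m) if cluster[v] == ci]
        in_j = [v for v in range(m) if cluster[v] == cj]
        e_ij = W[np.ix_(in_i, in_j)].sum() / total  # fraction between C_i and C_j
        a_i = W[in_i, :].sum() / total              # fraction of edge ends in C_i
        a_j = W[in_j, :].sum() / total
        if 2 * (e_ij - a_i * a_j) <= 0:             # Delta Q of Eq. (10)
            break                                   # partition terminates
        for v in in_j:                              # merge C_j into C_i
            cluster[v] = ci
    return cluster                                  # cluster ids = detected events
```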

5 Experimental Results and Analysis

5.1 Experimental Data

Data Set 1: Messages posted on Sina Weibo from January to June 2016 were sampled manually as experimental data. To match real conditions as closely as possible, a large amount of noise data was added during manual sampling, yielding a noise-contaminated data set of 2541 Weibo posts covering 8 hot events in all. Of these micro-blogs, 1749 describe events and 792 are noise. Pre-processing steps including word segmentation and stop word removal were then applied, and isolated words were filtered according to the affinities among terms. Finally, 12,000 terms remain.

Data Set 2: In total, the titles of 3755 papers in 6 categories of data mining were drawn from DBLP for the experiments: text clustering (614), text classification (484), video processing (516), speech recognition (685), image processing (960), and graphical models (496). After pre-processing such as removal of stop words and HTML tags, we obtain the final experimental data set.

5.2 Comparative Analysis on the Results

Three experiments are designed to verify the effectiveness of the hot event detection algorithm. Experiment 1 uses dataset 1 to extract hot topics. Experiment 2 adjusts important parameters of our algorithm to observe their influence on the detected hot events. Experiment 3 compares our method with existing methods of the same kind. We adopt NMI and ARI [13, 14] as the evaluation criteria.

In Experiment 1 we use the same parameter for the restart random walk model as in work [10], i.e., α = 0.15, and construct the matrix representation of the micro-blogs from data set 1. Hot events are obtained with the proposed algorithm; selected results are shown in Table 1. Key terms with stronger correlations are selected to describe the hot events and are compared against hot events published by authoritative institutions, showing good agreement with actual network hot events.

Table 1. Comparison of real hot events and hot words detected by our method

Experiment 2: There are two parameters, λ and β. λ balances the contributions of the numbers of reposts and comments to hotness, and β controls how much the affinities among words contribute to the hot word extraction results. We study their influence on topic word extraction by setting different values: λ is set to 0.5, 0.55, and 0.6, while β ranges from 0.01 to 0.08.

The experimental results are illustrated in panels (a) and (b) of Fig. 3. The effect of the repost value on the results is slightly stronger than that of the comment value, and the worst performance occurs at λ = 0.6, so its comparison diagram is omitted. It can also be observed that NMI and ARI rise slowly before β = 0.03 and reach their maximum at β = 0.03. As β keeps rising toward the maximum value allowed by theory, effectiveness starts decreasing; in particular, NMI and ARI fall quickly after β = 0.05.

Fig. 3. Effect of parameters on the result of hot event detection

Experiment 3: We compare our method with the DPSO algorithm of literature [15] and the MCF method of work [16] on the two data sets. By mining mutual information between words and internal/external correlation information, DPSO finds micro-blog hot events from the most suitable angle. MCF uses a topic model to extract micro-blog themes and introduces a word activation force model to generate hot events. The comparative results of our method and the other two methods are illustrated in panels (a) and (b) of Fig. 4.

Fig. 4. Effect of the three methods on clustering results (a: dataset 1, b: dataset 2)

Figure 4 shows that our method achieves slightly higher NMI and ARI than the other two methods. A possible cause is that our method mines both surface and hidden semantic relations among terms as thoroughly as possible, which makes the semantic expression of micro-blogs clear, whereas drawbacks of the other methods, such as sensitivity to noise, result in many low-quality feature items and few thematic words. The results in Fig. 4 also show the superiority of our method. Meanwhile, since data set 2 is cleaner and introduces less disturbance, the results obtained on it are higher than those on dataset 1.

6 Conclusions

This paper proposes a hot event detection method based on the restart random walk model and community partition. The main design idea is to compute the explicit and hidden semantic relations among lexical items by running the restart random walk algorithm iteratively on a graph and constructing a semantic correlation matrix. Meanwhile, the idea of community partition is introduced, and an algorithm that performs word clustering with the semantic correlation matrix is designed to obtain the set of hot events. The experimental results show that the detected hot events are consistent with real-time events, so the detection is effective. In future work, research on reducing outliers in the feature word sets, on the initialization of the random walk model matrices, and on the convergence criteria of the community partition can be performed, and expert dictionaries or lexicons could be introduced, to further raise the accuracy of hot event detection.