Efficient temporal mining of micro-blog texts and its application to event discovery

Stilo, Giovanni; Velardi, Paola

doi:10.1007/s10618-015-0412-3

Published: 21 May 2015

Data Mining and Knowledge Discovery, volume 30, pages 372–402 (2016)

Abstract

In this paper we present a novel method for clustering words in micro-blogs, based on the similarity of the words' temporal series. Our technique, named SAX*, uses the Symbolic Aggregate ApproXimation algorithm to discretize the temporal series of terms into a small set of levels, yielding one string per term. We then define a subset of “interesting” strings, i.e. those representing patterns of collective attention. Sliding temporal windows are used to detect co-occurring clusters of tokens with the same or similar string. To assess the performance of the method, we first tune the model parameters on a 2-month 1% Twitter stream, during which a number of world-wide events of differing type and duration (sports, politics, disasters, health, and celebrities) occurred. Then, we evaluate the quality of all discovered events in a 1-year stream by “googling” the most frequent cluster n-grams and manually assessing how many clusters correspond to published news in the same temporal slot. Finally, we perform a complexity evaluation and compare SAX* with three alternative methods for event discovery. Our evaluation shows that SAX* is at least one order of magnitude less complex than other temporal and non-temporal approaches to micro-blog clustering.
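
The discretization step described in the abstract can be illustrated with a minimal sketch of standard SAX (z-normalization, piecewise aggregation, then mapping to symbols), followed by grouping tokens that share the same string within one window. This is not the authors' SAX* implementation: the function names `sax_string` and `group_by_string`, the two-letter alphabet, the window length, and the toy counts are all assumptions made for illustration, and the paper's filtering of “interesting” strings and its handling of similar (not only identical) strings are omitted.

```python
from collections import defaultdict
from statistics import NormalDist, mean, pstdev


def sax_string(series, n_segments, alphabet="ab"):
    """Discretize a numeric series into a SAX word of length n_segments."""
    if len(series) % n_segments != 0:
        raise ValueError("series length must be a multiple of n_segments")
    mu, sigma = mean(series), pstdev(series)
    # Z-normalize; a flat series (sigma == 0) maps to all zeros.
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    # Piecewise Aggregate Approximation: mean of each equal-width segment.
    step = len(z) // n_segments
    paa = [mean(z[i:i + step]) for i in range(0, len(z), step)]
    # Breakpoints splitting the standard normal into equiprobable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]
    # Symbol index = number of breakpoints at or below the segment mean.
    return "".join(alphabet[sum(v >= b for b in breakpoints)] for v in paa)


def group_by_string(counts_by_token, n_segments, alphabet="ab"):
    """Group tokens whose series in one window map to the same SAX word."""
    groups = defaultdict(list)
    for token, series in counts_by_token.items():
        groups[sax_string(series, n_segments, alphabet)].append(token)
    return dict(groups)


# Hypothetical per-slot counts for three tokens in one temporal window.
window = {
    "quake": [0, 0, 1, 9, 8, 1, 0, 0],
    "tsunami": [0, 1, 0, 7, 9, 2, 0, 0],
    "coffee": [5, 5, 5, 5, 5, 5, 5, 5],
}
groups = group_by_string(window, n_segments=4)
print(groups)
```

In this toy window, "quake" and "tsunami" both map to the word "abba" (quiet, burst, burst, quiet) and fall in the same group, while the flat "coffee" series does not. Sliding the window over the stream and repeating the grouping would give one candidate set of clusters per window.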

Notes

http://trec.nist.gov/data/tweets/ sampled from Jan 23rd to Feb 8th, 2011.
http://wordnet.princeton.edu.
In any case a limit is fixed a priori for the number of topics.
https://dev.twitter.com/docs/streaming-apis.
Words are stemmed to reduce sparseness, even though, as discussed in the paper, this might not be strictly necessary with denser Twitter streams. In what follows we will refer to clustered items interchangeably as words, stems, or tokens.
http://code.google.com/p/jmotif/wiki/ZNormalization.
http://en.wikipedia.org/wiki/2012.
https://www.google.it/trends/.
http://libalf.informatik.rwth-aachen.de/index.php?page=home.
See Fig. 8 of the mentioned paper, in which 6 shapes of attention of Twitter hashtags are shown.
For example, in many algorithms the number of clusters K is a parameter.
We use the Euclidean distance, but other measures, e.g. the edit distance, produce very similar results.
For example, there are many available implementations of LDA.
Some of the events shown in the related papers are world-wide, but several are local events, e.g. “Super Junior’s Yesung (@shfly3424) created his Twitter account”.
https://github.com/lintool/twitter-tools/wiki/TREC-2013-Track-Guidelines.
In what follows we omit the “big-O” notation for simplicity: complexity formulas are all to be interpreted as “order of”.
\([1 \ldots B]\) in the original paper (Xie et al. 2013).
Table I of (Xie et al. 2013).
http://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations.
In agreement with http://firstmonday.org/ojs/index.php/fm/article/view/4366/3654.
This is also confirmed by the fact that we noticed an increment of daily tweets from an average of 3.3M per day during May to 4.6M during August.
www.foreignpolicy.com/articles/2014/09/26/why_big_data_missed_the_early_warning_signs_of_ebola.
Here we show only one tweet for the sake of space, whereas in our testing dataset we retrieve 10–20 tweets.
These events can easily be filtered out by a classifier; however, teen events could be of interest.
http://jgibblda.sourceforge.net/.
http://code.google.com/p/btm/.
Requests must be addressed to the authors.

References

Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J. Mach. Learn. Res. 3:993–1022
Chae J, Thom D, Bosch H, Jang Y, Maciejewski R, Ebert D, Ertl T (2013) Spatiotemporal social media analytics for abnormal event detection and examination using seasonal-trend decomposition. IEEE symposium on visual analytics science and technology, Seattle
Cha M, Haddadi H, Benvenuto F, Gummadi K (2010) Measuring user influence in twitter: the million followers fallacy. In: Proceedings of conference on artificial intelligence AAAI
Cheng T, Wicks T (2014) Event detection using Twitter: a spatio-temporal approach. PLoS One 9(6):e97807. doi:10.1371/journal.pone.0097807
Dao Q, Jiang J, Zhu F, Lim WP (2012) Finding bursty topics from microblogs. In: Proceedings of conference association of computational linguistics ACL 2012
Dou W, Wang X, Ribarsky W, Zhou M (2012) Event detection in social media data. In: IEEE VisWeek workshop on interactive visual text analytics. Seattle, WA
Dredze M (2012) How social media will change public health. IEEE Intell Syst 27(4):81–84. doi:10.1109/MIS.2012.76
Hong L, Davison B (2010) Empirical study of topic modeling in twitter. In: Proceedings of the first workshop on social media analytics, pp. 80–88. ACM
Hong L, Dom B, Gurumurthy S, Tsioutsioulikis K (2011) Time-dependent topic model for multiple text streams. In: ACM conference on knowledge discovery and data mining KDD 2011, San Diego
Huang B, Yang Y, Mahmood A, Wang H (2012) Microblog topic detection based on LDA model and single-pass clustering RSCTC 2012, LNAI 7413, pp. 166–171
Ifrim G, Shi B, Brigadir I (2014) Event detection in Twitter using aggressive filtering and hierarchical tweet clustering proceedings of SNOW-WWW workshop, Korea
Jain A (2010) Data clustering: 50 years beyond K-means. Patt Recogn Lett 31:651–666
Keogh E, Chakrabarti K, Pazzani M (2001) Locally adaptive dimensionality reduction for indexing large time series databases. In: Proceedings of ACM special interest group on management of data SIGMOD, pp. 151–162
Kovacs F, Legany C, Babos A (2005) Cluster validity measurement techniques. In: Proceedings of 6th international symposium of Hungarian researchers on computational intelligence, Budapest
Lee R, Sumiya K (2010) Measuring geographical regularities of crowd behaviors for twitter-based geo-social event detection. Proceedings of the 2nd ACM international workshop on location based social networks SIGSPATIAL, LBSN ’10. ACM, New York, pp. 1–10
Lehmann J, Goncalves B, Ramasco JJ, Cattuto C (2012) Dynamical classes of collective attention in Twitter. Proceedings of World Wide Web Conference WWW2012
Lin J, Keogh E, Li W, Lonardi S (2007) Experiencing SAX: a novel symbolic representation of time series. Data Mining Knowl Discov 15(2):107–144
Lin J, Khade R, Li Y (2012) Rotation-invariant similarity in time series using bag-of-patterns representation. J Intell Inf Syst 39:287–315
Li C, Sun A, Datta A (2012) Twevent: segment-based event detection from tweets. In: Proceedings of ACM international conference on information and knowledge management CIKM
Maynard D, Funk A (2012) Challenges in developing opinion mining tools for social media. In: Proceedings of @NLP can u tag #usergeneratedcontent? Workshop at LREC 2012, Istanbul
McMinn A, Moshfeghi Y, Jose JM (2013) Building a large scale corpus for evaluating event detection in twitter, ACM international conference on information and knowledge management CIKM’13, San Francisco
Mei Q, Zhai C (2005) Discovering evolutionary theme patterns from text—an exploration of temporal text mining. In: Proceedings of conference of knowledge discovery and data mining KDD’05, Chicago
Oncina J, García P (1992) Inferring regular languages in polynomial updated time. In: 4th Spanish symposium on pattern recognition and image analysis, MPAI. vol. 1. World Scientific, pp. 49–61
Petrovic S, Osborne M, Lavrenko V (2010) Streaming first story detection with application to Twitter. In: Proceedings of national American conference of the association of computational linguistics NAACL
Petrovic S, Osborne M, Mc Creadie R (2013) Can Twitter replace Newswire for breaking news? In: Proceedings of the 7th international AAAI conference on weblogs and social media, ICWSM
Pohl D, Bouchachia A, Hellwagner H (2012) Automatic sub-event detection in emergency management using social media. WWW2012-SWDM’12 Workshop, Lyon
Popescu AM, Pennacchiotti M, Paranjpe D (2011) Extracting events and event descriptions from twitter. In: World Wide Web Conference WWW2011, pp. 105–106
Rui L, Kin L, Ravi K, Kevin C (2012) TEDAS: a Twitter-based event detection and analysis system. In: IEEE 28th international conference on data engineering (ICDE), pp. 1273–1276
Wang X, Zhu F, Jing J, Li S (2013) Real time event detection in Twitter, conference on web age information management WAIM, Springer
Weng J, Lim E, Jiang J, He Q (2010) Twitterrank: finding topic-sensitive influential twitterers. In: Proceedings of the third ACM international conference on Web Search and data mining WSDM, ACM, pp. 261–270
Weng J, Yao Y, Leonardi E, Lee B (2011) Event detection in Twitter. In: International AAAI conference on weblogs and social media ICWSM
Xie W, Zhu F, Jang J, Lim E, Wang K (2013) TopicSketch: real-time bursty topic detection from Twitter, IEEE 13th international conference on data mining (ICDM)
Yang J, Leskovec J (2011) Patterns of temporal variation in online media. In: Proceedings of the fourth ACM international conference on web search and data mining (WSDM), pp. 177–186
Yan X, Guo J, Lan Y, Cheng X (2013) A biterm topic model for short texts. In: World Wide Web conference WWW 2013, Rio de Janeiro

Author information

Authors and Affiliations

Department of Computer Science, Sapienza University of Roma, Via Salaria 113, Rome, Italy
Giovanni Stilo & Paola Velardi

Corresponding author

Correspondence to Paola Velardi.

Additional information

Responsible editor: Eamonn Keogh.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 196 KB)

Appendix

See Table 6.

Table 6 “Summer 2014” experiments

About this article

Cite this article

Stilo, G., Velardi, P. Efficient temporal mining of micro-blog texts and its application to event discovery. Data Min Knowl Disc 30, 372–402 (2016). https://doi.org/10.1007/s10618-015-0412-3

Received: 26 May 2014
Accepted: 04 March 2015
Published: 21 May 2015
Issue Date: March 2016
DOI: https://doi.org/10.1007/s10618-015-0412-3
