Abstract
In this chapter we study joint sequence complexity and introduce its applications to topic detection and text classification, in particular source discrimination. The complexity of a sequence is defined as the number of its distinct factors (contiguous subsequences). The Joint Complexity of two sequences is accordingly the number of distinct factors they have in common, so sequences that share many parts have a higher Joint Complexity. The factors of a sequence are extracted with suffix trees, a simple and fast (low-complexity) structure for storing them in memory and retrieving them. Joint Complexity is used to evaluate the similarity between sequences generated by different sources, and we predict its performance over Markov sources. Markov models describe the generation of natural text well, and their performance can be predicted via linear algebra, combinatorics and asymptotic analysis; this analysis is carried out in this chapter. We use datasets from different natural languages, with both short and long sequences, and obtain promising results on complexity and accuracy. We also performed automated online sequence analysis on information streams in Twitter.
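The two definitions in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the chapter: it enumerates factors by brute force with Python sets instead of building a suffix tree, it excludes the empty factor from the count, and the function names `factors`, `complexity` and `joint_complexity` are hypothetical. It is only practical for short sequences.

```python
def factors(s):
    """Return the set of distinct non-empty factors (contiguous substrings) of s.

    Brute-force enumeration: quadratic number of substrings, so this is an
    illustration only, not the suffix-tree method mentioned in the abstract.
    """
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def complexity(s):
    """Complexity of s: the number of its distinct non-empty factors."""
    return len(factors(s))


def joint_complexity(x, y):
    """Joint Complexity of x and y: the number of distinct factors they share."""
    return len(factors(x) & factors(y))


if __name__ == "__main__":
    # "abab" and "baba" share the factors a, b, ab, ba, aba, bab.
    print(complexity("abab"))                # 7
    print(joint_complexity("abab", "baba"))  # 6
    print(joint_complexity("aaa", "bbb"))    # 0
```

Sequences with no symbol in common score 0, while sequences built from the same repeated pattern score close to their individual complexity, which matches the intuition that a higher Joint Complexity indicates more shared parts.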
Copyright information
© 2018 Springer International Publishing AG
About this chapter
Cite this chapter
Milioris, D. (2018). Joint Sequence Complexity: Introduction and Theory. In: Topic Detection and Classification in Social Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-66414-9_3
DOI: https://doi.org/10.1007/978-3-319-66414-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66413-2
Online ISBN: 978-3-319-66414-9
eBook Packages: Engineering, Engineering (R0)