Abstract
Subword complexity is the function that, for each length k, counts the distinct substrings of length k occurring in a given string. In this paper, two estimators of block entropy based on the subword complexity profile are proposed. The first estimator works well only for IID processes with uniform probabilities. The second estimator provides a lower bound of block entropy for any strictly stationary process whose block distributions are skewed towards less probable values. Using this estimator, some estimates of block entropy for natural language are obtained, confirming earlier hypotheses.
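To make the notion concrete, the subword complexity profile described in the abstract can be computed directly: for each length k, count the distinct substrings of that length. The sketch below is illustrative only; the function name and the example string are not taken from the chapter, and this naive set-based computation (not the chapter's estimators) is merely one straightforward way to obtain the profile.

```python
def subword_complexity(s: str) -> list[int]:
    """Return the subword complexity profile of s.

    Entry k-1 is the number of distinct substrings of length k
    (k = 1, ..., len(s)). Computed naively with a set of slices;
    suffix-tree methods would be faster for long strings.
    """
    n = len(s)
    return [len({s[i:i + k] for i in range(n - k + 1)})
            for k in range(1, n + 1)]

# Example: "abracadabra" has 5 distinct letters, 7 distinct bigrams,
# and exactly one substring of full length 11.
profile = subword_complexity("abracadabra")
```

For a string drawn from a process with high block entropy, the profile grows quickly with k (up to the trivial bound min(|A|^k, n - k + 1) for alphabet A), which is the intuition behind estimating block entropy from this curve.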
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this chapter
Dębowski, Ł. (2016). Estimation of Entropy from Subword Complexity. In: Matwin, S., Mielniczuk, J. (eds) Challenges in Computational Statistics and Data Mining. Studies in Computational Intelligence, vol 605. Springer, Cham. https://doi.org/10.1007/978-3-319-18781-5_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18780-8
Online ISBN: 978-3-319-18781-5
eBook Packages: Engineering (R0)