Abstract
Subword complexity is the function that, for each length k, counts the distinct substrings of length k occurring in a given string. In this paper, two estimators of block entropy based on the subword complexity profile are proposed. The first estimator works well only for IID processes with uniform probabilities. The second estimator provides a lower bound of block entropy for any strictly stationary process whose block distributions are skewed towards less probable values. Using this estimator, some estimates of block entropy for natural language are obtained, confirming earlier hypotheses.
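To make the notion concrete, the subword complexity profile described in the abstract can be computed directly: for each length k, count the distinct substrings of that length. The sketch below is illustrative only; the function name and the example string are not taken from the chapter, and this naive set-based computation (not the chapter's estimators) is merely one straightforward way to obtain the profile.

```python
def subword_complexity(s: str) -> list[int]:
    """Return the subword complexity profile of s.

    Entry k-1 is the number of distinct substrings of length k
    (k = 1, ..., len(s)). Computed naively with a set of slices;
    suffix-tree methods would be faster for long strings.
    """
    n = len(s)
    return [len({s[i:i + k] for i in range(n - k + 1)})
            for k in range(1, n + 1)]

# Example: "abracadabra" has 5 distinct letters, 7 distinct bigrams,
# and exactly one substring of full length 11.
profile = subword_complexity("abracadabra")
```

For a string drawn from a process with high block entropy, the profile grows quickly with k (up to the trivial bound min(|A|^k, n - k + 1) for alphabet A), which is the intuition behind estimating block entropy from this curve.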
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this chapter
Dębowski, Ł. (2016). Estimation of Entropy from Subword Complexity. In: Matwin, S., Mielniczuk, J. (eds) Challenges in Computational Statistics and Data Mining. Studies in Computational Intelligence, vol 605. Springer, Cham. https://doi.org/10.1007/978-3-319-18781-5_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18780-8
Online ISBN: 978-3-319-18781-5
eBook Packages: Engineering (R0)