On the Approximation Ratio of Lempel-Ziv Parsing

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10807)


Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is b, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing b is NP-complete, a popular gold standard is z, the number of phrases in the Lempel-Ziv parse of the text, where phrases can be copied only from the left. While z can be computed in linear time, almost nothing has been known for decades about its approximation ratio with respect to b. In this paper we prove that \(z = O(b \log (n/b))\), where n is the text length. We also show that the bound is tight as a function of n, by exhibiting a string family where \(z = \Omega (b \log n)\). Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating b to r, the number of equal-letter runs in the Burrows-Wheeler transform of the text. Along the way, we prove other relevant bounds between compressibility measures.
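
To make the two computable measures above concrete, the following is a minimal Python sketch that counts z and r by brute force. It is not the paper's method and not the linear-time algorithm mentioned above. The function names `lz_phrase_count` and `bwt_run_count` are hypothetical, and the sketch assumes a parse variant in which each phrase is the longest prefix of the remaining text that also starts at an earlier position (the copy may overlap the phrase itself), or a single symbol when no such prefix exists. The Burrows-Wheeler transform is assumed to be taken over the text with a unique smallest end marker appended.

```python
def lz_phrase_count(text: str) -> int:
    """Count the phrases z of a greedy left-to-right Lempel-Ziv parse.

    Naive illustration only: it tries every earlier position as a source.
    """
    n = len(text)
    i = 0  # start of the phrase currently being formed
    z = 0
    while i < n:
        longest = 0
        # A source must start strictly to the left of i, but the copied
        # region is allowed to run into the phrase itself.
        for j in range(i):
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            longest = max(longest, k)
        # A symbol with no earlier occurrence becomes a phrase of length 1.
        i += max(longest, 1)
        z += 1
    return z


def bwt_run_count(text: str, sentinel: str = "\0") -> int:
    """Count the equal-letter runs r in the Burrows-Wheeler transform.

    Assumes `sentinel` does not occur in `text` and sorts before every
    symbol of it. Builds all rotations explicitly, so it only suits
    short inputs.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(rotation[-1] for rotation in rotations)
    # A new run starts at position 0 and wherever the symbol changes.
    return sum(
        1
        for i, c in enumerate(last_column)
        if i == 0 or c != last_column[i - 1]
    )
```

For example, under these assumptions `lz_phrase_count("abab")` yields 3 (phrases a, b, ab) and `bwt_run_count("banana")` yields 5. Computing b is not sketched here, since the text above notes that it is NP-complete.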



We thank the reviewers for their insightful comments, which helped us improve the presentation significantly.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Travis Gagie (1, 2)
  • Gonzalo Navarro (2, 3)
  • Nicola Prezza (4)

  1. EIT, Diego Portales University, Santiago, Chile
  2. Center for Biotechnology and Bioengineering (CeBiB), Santiago, Chile
  3. Department of Computer Science, University of Chile, Santiago, Chile
  4. DTU Compute, Technical University of Denmark, Kongens Lyngby, Denmark
