On the Approximation Ratio of Lempel-Ziv Parsing

  • Travis Gagie
  • Gonzalo Navarro
  • Nicola Prezza
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10807)

Abstract

Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is b, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing b is NP-complete, a popular gold standard is z, the number of phrases in the Lempel-Ziv parse of the text, where phrases can be copied only from the left. While z can be computed in linear time, almost nothing has been known for decades about its approximation ratio with respect to b. In this paper we prove that \(z=O(b\log (n/b))\), where n is the text length. We also show that the bound is tight as a function of n, by exhibiting a string family where \(z = \varOmega (b\log n)\). Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating b to r, the number of equal-letter runs in the Burrows-Wheeler transform of the text. Along the way, we prove other relevant bounds between compressibility measures.
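To make the measures z and r concrete, here is a minimal Python sketch that computes both on small strings. It is an illustration only, not the construction used in the paper: the function names `lz_parse` and `bwt_runs` are hypothetical, the parse assumes the variant in which a phrase's source may overlap the phrase itself and no literal character is appended to copied phrases, and the BWT assumes a sentinel `$` that sorts before every text character. Both routines are brute-force (far from the linear-time algorithms mentioned above) and are meant only for experimenting with short inputs.

```python
def lz_parse(text):
    """Greedy left-to-right parse (assumed variant).

    Each phrase is the longest prefix of the remaining text that also
    starts at some earlier position (the copy may overlap the phrase),
    or a single character if no such prefix exists.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best = 0
        for j in range(i):  # candidate source start, strictly left of i
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            best = max(best, k)
        step = max(best, 1)  # a fresh character forms a phrase of length 1
        phrases.append(text[i:i + step])
        i += step
    return phrases


def bwt_runs(text, sentinel="$"):
    """Return the BWT of text + sentinel and its number of equal-letter runs.

    Built by sorting all rotations, which is simple but quadratic in space.
    """
    s = text + sentinel
    rotations = sorted(s[k:] + s[:k] for k in range(len(s)))
    bwt = "".join(rot[-1] for rot in rotations)
    runs = 1 + sum(1 for a, b in zip(bwt, bwt[1:]) if a != b)
    return bwt, runs


if __name__ == "__main__":
    phrases = lz_parse("abababab")
    print(phrases, "z =", len(phrases))
    print(bwt_runs("banana"))
```

On `"abababab"` the third phrase copies from position 0 and overlaps itself, so the whole string is covered by z = 3 phrases. Computing b, by contrast, has no comparably simple procedure, since the abstract notes that the problem is NP-complete.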

Notes

Acknowledgements

We thank the reviewers for their insightful comments, which helped us improve the presentation significantly.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Travis Gagie (1, 2)
  • Gonzalo Navarro (2, 3)
  • Nicola Prezza (4)

  1. EIT, Diego Portales University, Santiago, Chile
  2. Center for Biotechnology and Bioengineering (CeBiB), Santiago, Chile
  3. Department of Computer Science, University of Chile, Santiago, Chile
  4. DTU Compute, Technical University of Denmark, Kongens Lyngby, Denmark
