On the Approximation Ratio of Lempel-Ziv Parsing

  • Conference paper
LATIN 2018: Theoretical Informatics (LATIN 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10807)

Abstract

Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is b, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing b is NP-complete, a popular gold standard is z, the number of phrases in the Lempel-Ziv parse of the text, where phrases can be copied only from the left. While z can be computed in linear time, almost nothing has been known for decades about its approximation ratio with respect to b. In this paper we prove that \(z=O(b\log (n/b))\), where n is the text length. We also show that the bound is tight as a function of n, by exhibiting a string family where \(z = \varOmega (b\log n)\). Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating b with r, the number of equal-letter runs in the Burrows-Wheeler transform of the text. On our way, we prove other relevant bounds between compressibility measures.
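To make two of the measures named in the abstract concrete, here is a minimal Python sketch that computes z and r by brute force. It is illustrative only and is not the paper's construction: the function names `lz_phrase_count` and `bwt_run_count` are hypothetical, the LZ variant is assumed to be the greedy one in which each phrase is either the longest prefix that also starts somewhere to the left (overlaps allowed) or a single explicit symbol, and the BWT is assumed to be taken over the text with an appended sentinel `$` that sorts before every text symbol. Both routines take polynomial time, whereas the abstract notes that z can be computed in linear time. The measure b is not sketched, since computing it is NP-complete.

```python
def lz_phrase_count(text: str) -> int:
    """Count the phrases z of a greedy left-to-right LZ parse (naive version)."""
    n = len(text)
    i = 0
    z = 0
    while i < n:
        # Longest prefix of text[i:] that also occurs starting at some j < i.
        # The source is allowed to overlap the phrase being formed.
        longest = 0
        for j in range(i):
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            longest = max(longest, k)
        # A phrase is a copy of length `longest`, or one explicit symbol
        # when nothing to the left matches.
        i += max(longest, 1)
        z += 1
    return z


def bwt_run_count(text: str, sentinel: str = "$") -> int:
    """Count the equal-letter runs r in the BWT of text + sentinel (naive version).

    Assumes `sentinel` does not occur in `text` and sorts before all its symbols.
    """
    s = text + sentinel
    # Sort all rotations and read off the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    bwt = "".join(rotation[-1] for rotation in rotations)
    # A new run starts at position 0 and wherever the symbol changes.
    return sum(1 for i, c in enumerate(bwt) if i == 0 or c != bwt[i - 1])
```

For example, under these assumptions `lz_phrase_count("abab")` yields 3 (phrases `a`, `b`, `ab`), and `bwt_run_count("banana")` yields 5, since the BWT of `banana$` is `annb$aa`.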

Partially funded by Basal Funds FB0001, Conicyt, by Fondecyt Grants 1-171058 and 1-170048, Chile, and by the Danish Research Council DFF-4005-00267.

Notes

  1. https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

  2. For this case, we could have defined bordering in a stricter way, as the first or last block of a chunk.

Acknowledgements

We thank the reviewers for their insightful comments, which helped us improve the presentation significantly.

Corresponding author

Correspondence to Gonzalo Navarro.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Gagie, T., Navarro, G., Prezza, N. (2018). On the Approximation Ratio of Lempel-Ziv Parsing. In: Bender, M., Farach-Colton, M., Mosteiro, M. (eds) LATIN 2018: Theoretical Informatics. LATIN 2018. Lecture Notes in Computer Science, vol 10807. Springer, Cham. https://doi.org/10.1007/978-3-319-77404-6_36

  • DOI: https://doi.org/10.1007/978-3-319-77404-6_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77403-9

  • Online ISBN: 978-3-319-77404-6

  • eBook Packages: Computer Science, Computer Science (R0)
