Linear-Time Off-Line Text Compression by Longest-First Substitution

  • Shunsuke Inenaga
  • Takashi Funamoto
  • Masayuki Takeda
  • Ayumi Shinohara
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2857)


Given a text, grammar-based compression constructs a grammar that generates the text. There are many text compression techniques of this type. Each compression scheme is categorized as either off-line or on-line, according to how the text is processed. One representative tactic for off-line compression is to substitute the longest repeated factors of a text with a production rule. In this paper, we present an algorithm that compresses a text based on this longest-first principle in linear time. The algorithm employs a suitable index structure for the text and involves technically efficient operations on that structure.
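The longest-first principle can be illustrated with a small sketch. This is not the linear-time algorithm of the paper, which relies on an index structure over the text; it is a naive, slow version that assumes a repeated factor must have at least two non-overlapping occurrences and length at least 2, and that substitution is applied only to the working sequence. All names here (`longest_repeated_factor`, `replace_all`, `longest_first_compress`, `expand`, and the nonterminal names `X1`, `X2`, ...) are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

Symbol = str  # terminals are single characters; nonterminals are names such as "X1"


def longest_repeated_factor(seq: List[Symbol]) -> Optional[Tuple[Symbol, ...]]:
    """Return a longest factor (length >= 2) with two non-overlapping occurrences."""
    n = len(seq)
    for length in range(n // 2, 1, -1):  # try the longest candidates first
        first_seen: Dict[Tuple[Symbol, ...], int] = {}
        for i in range(n - length + 1):
            factor = tuple(seq[i:i + length])
            j = first_seen.setdefault(factor, i)  # earliest start of this factor
            if i - j >= length:  # this occurrence does not overlap the earliest one
                return factor
    return None


def replace_all(seq: List[Symbol], factor: Tuple[Symbol, ...], name: Symbol) -> List[Symbol]:
    """Replace occurrences of factor greedily from left to right, without overlaps."""
    out: List[Symbol] = []
    k = len(factor)
    i = 0
    while i < len(seq):
        if tuple(seq[i:i + k]) == factor:
            out.append(name)
            i += k
        else:
            out.append(seq[i])
            i += 1
    return out


def longest_first_compress(text: str) -> Tuple[List[Symbol], Dict[Symbol, List[Symbol]]]:
    """Naive longest-first substitution: returns (start sequence, production rules)."""
    seq: List[Symbol] = list(text)
    rules: Dict[Symbol, List[Symbol]] = {}
    while True:
        factor = longest_repeated_factor(seq)
        if factor is None:
            break
        name = f"X{len(rules) + 1}"  # hypothetical naming scheme for nonterminals
        rules[name] = list(factor)
        seq = replace_all(seq, factor, name)
    return seq, rules


def expand(seq: List[Symbol], rules: Dict[Symbol, List[Symbol]]) -> str:
    """Expand the grammar back into the original text."""
    return "".join(expand(rules[s], rules) if s in rules else s for s in seq)
```

For example, under these assumptions `longest_first_compress("abracadabra")` yields the start sequence `X1 c a d X1` with the single rule `X1 → abra`, and `expand` recovers the input. Each round of this sketch rescans the whole sequence, so it is far from linear time; it only shows what is being computed, not how the paper computes it efficiently.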


Keywords: Leaf Node, Internal Node, Production Rule, Dead Zone, Input Text



Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Shunsuke Inenaga (1, 2)
  • Takashi Funamoto (1)
  • Masayuki Takeda (1, 2)
  • Ayumi Shinohara (1, 2)
  1. Department of Informatics, Kyushu University 33, Fukuoka, Japan
  2. PRESTO, Japan Science and Technology Corporation (JST)
