Choosing Word Occurrences for the Smallest Grammar Problem

  • Rafael Carrascosa
  • François Coste
  • Matthias Gallé
  • Gabriel Infante-Lopez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6031)


The smallest grammar problem, namely finding a smallest context-free grammar that generates exactly one sequence, is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression and pattern discovery. We propose to focus on the choice of the occurrences to be rewritten by non-terminals. We extend classical offline algorithms by introducing a global optimization of this choice at each step of the algorithm. This approach allows us to define the search space of a smallest grammar by separating the choice of the non-terminals from the choice of their occurrences. We propose a second algorithm that performs a broader exploration by allowing the removal of useless words that were chosen previously. Experiments on a classical benchmark show that our algorithms consistently find smaller grammars than state-of-the-art algorithms.
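To make the idea of choosing occurrences concrete, the following minimal Python sketch contrasts a greedy leftmost-longest replacement with a globally optimal choice of occurrences for a fixed set of words. It is an illustration under stated assumptions, not the algorithm of the paper: the function names `greedy_parse` and `minimal_parse`, the dictionary mapping each word to a non-terminal name, and the restriction to parsing only the input sequence (ignoring the right-hand sides of the rules) are all hypothetical simplifications introduced here.

```python
from typing import Dict, List, Optional, Tuple


def greedy_parse(seq: str, words: Dict[str, str]) -> List[str]:
    """Replace occurrences left to right, taking the longest word that matches."""
    out: List[str] = []
    i = 0
    while i < len(seq):
        # Non-empty words that occur at position i.
        matches = [w for w in words if w and seq.startswith(w, i)]
        if matches:
            w = max(matches, key=len)
            out.append(words[w])
            i += len(w)
        else:
            out.append(seq[i])
            i += 1
    return out


def minimal_parse(seq: str, words: Dict[str, str]) -> List[str]:
    """Choose occurrences so that the rewritten sequence has the fewest symbols.

    Dynamic programming from right to left: best[i] is the minimal number of
    symbols needed to write seq[i:], using either a single terminal or one
    non-terminal standing for a word that occurs at position i.
    """
    n = len(seq)
    best = [0] * (n + 1)
    choice: List[Optional[Tuple[str, int]]] = [None] * (n + 1)
    for i in range(n - 1, -1, -1):
        # Default: keep the terminal seq[i] as it is.
        best[i] = 1 + best[i + 1]
        choice[i] = (seq[i], 1)
        for w, nonterminal in words.items():
            if w and seq.startswith(w, i) and 1 + best[i + len(w)] < best[i]:
                best[i] = 1 + best[i + len(w)]
                choice[i] = (nonterminal, len(w))
    # Follow the recorded choices to rebuild the rewritten sequence.
    out: List[str] = []
    i = 0
    while i < n:
        symbol, step = choice[i]
        out.append(symbol)
        i += step
    return out


words = {"ab": "A", "bcd": "B"}
print(greedy_parse("abcd", words))   # ['A', 'c', 'd']
print(minimal_parse("abcd", words))  # ['a', 'B']
```

On this toy input the greedy pass commits to the occurrence of `ab` and ends with three symbols, while the global choice skips it in favour of `bcd` and ends with two. A full treatment would also have to account for the size of the rule bodies, which this sketch leaves out.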


Keywords: Production Rule, Data Compression, Pattern Discovery, Kolmogorov Complexity, Repeated Word





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Rafael Carrascosa (1)
  • François Coste (2)
  • Matthias Gallé (2)
  • Gabriel Infante-Lopez (1, 3)

  1. Grupo de Procesamiento de Lenguaje Natural, Universidad Nacional de Córdoba, Argentina
  2. Symbiose Project, IRISA/INRIA Rennes-Bretagne Atlantique, France
  3. Consejo Nacional de Investigaciones Científicas, Argentina
