Choosing Word Occurrences for the Smallest Grammar Problem
The smallest grammar problem, namely finding a smallest context-free grammar that generates exactly one sequence, is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression, and pattern discovery. We focus on the choice of the occurrences to be rewritten by non-terminals. We extend classical offline algorithms by introducing a global optimization of this choice at each step of the algorithm. This approach lets us define the search space of a smallest grammar by separating the choice of the non-terminals from the choice of their occurrences. We also propose a second algorithm that explores this space more broadly by allowing the removal of useless words that were chosen previously. Experiments on a classical benchmark show that our algorithms consistently find smaller grammars than state-of-the-art algorithms.
Keywords: Production Rule, Data Compression, Pattern Discovery, Kolmogorov Complexity, Repeated Word
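To make the notions in the abstract concrete, the following minimal sketch represents a grammar that generates exactly one sequence and applies a single offline rewriting step. It is an illustration under stated assumptions, not the algorithms proposed in the paper: grammar size is taken here as the total length of all right-hand sides (other conventions exist), the helper names `expand`, `grammar_size`, and `replace_word` are hypothetical, and occurrences are chosen by a naive leftmost non-overlapping scan, which is the kind of local choice the paper proposes to replace with a global optimization.

```python
# Illustrative sketch only; helper names and the size measure are assumptions.
# A grammar is a dict mapping a non-terminal to its right-hand side (a list of
# symbols). Any symbol that is not a key of the dict is treated as a terminal.

def expand(grammar, symbol="S"):
    """Return the unique string derived from `symbol`."""
    if symbol not in grammar:
        return symbol
    return "".join(expand(grammar, s) for s in grammar[symbol])

def grammar_size(grammar):
    """Assumed size measure: total length of all right-hand sides."""
    return sum(len(rhs) for rhs in grammar.values())

def replace_word(grammar, word, name):
    """Rewrite leftmost non-overlapping occurrences of `word` by `name`.

    `word` is a list of symbols and `name` a fresh non-terminal. This greedy
    occurrence choice is the naive baseline, not an optimized choice.
    """
    rewritten = {}
    for head, rhs in grammar.items():
        out, i = [], 0
        while i < len(rhs):
            if rhs[i:i + len(word)] == word:
                out.append(name)
                i += len(word)
            else:
                out.append(rhs[i])
                i += 1
        rewritten[head] = out
    rewritten[name] = list(word)
    return rewritten

seq = "abcabcabc"
g0 = {"S": list(seq)}                     # trivial grammar: S -> a b c a b c a b c
g1 = replace_word(g0, list("abc"), "A")   # S -> A A A, A -> a b c

print(grammar_size(g0), grammar_size(g1))
print(expand(g1) == seq)
```

Both grammars generate the same sequence, so comparing their sizes is meaningful. With overlapping or nested repeats, the leftmost scan is no longer guaranteed to give the smallest result for a fixed set of words, which is where the choice of occurrences matters.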