A string compression technique can compress well only if it has an accurate model of the data source. For a source with statistically independent characters, Huffman or arithmetic codes give optimal compression. In this case it is straightforward to use a fixed source model if the statistics are known in advance, or to adapt the model to unknown or changing statistics. For the many sources that produce dependent characters, a more sophisticated source model can provide much better compression, at the expense of the extra space and time needed to store and maintain the model. The space required by a straightforward implementation of a Markov model grows exponentially in the order of the model. The Directed Acyclic Word Graph (DAWG) can be built in linear time and space, and provides the information needed to obtain compression equal to that obtained using a Markov model of high order. This paper presents two algorithms for string compression using DAWGs. The first is a very simple idea that generalizes run-length coding. It obtains good compression in many cases, but is provably non-optimal. The second combines the main idea of the first with arithmetic coding, resulting in a great improvement in performance.
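The trade-off described above, between better modelling of dependent characters and the exponential space of a straightforward Markov model, can be illustrated with a small sketch. It is not the paper's algorithm and does not build a DAWG. It only measures the empirical order-k conditional entropy of a string and the size of a naive order-k count table. The function names `order_k_entropy` and `full_table_size` and the sample string are assumptions introduced for illustration.

```python
import math
from collections import Counter, defaultdict


def order_k_entropy(text: str, k: int) -> float:
    """Empirical entropy (bits per character) of text given the previous k characters."""
    # Map each length-k context to the counts of characters that follow it.
    contexts = defaultdict(Counter)
    for i in range(k, len(text)):
        contexts[text[i - k:i]][text[i]] += 1
    total = len(text) - k
    bits = 0.0
    for counts in contexts.values():
        n = sum(counts.values())
        for c in counts.values():
            bits -= c * math.log2(c / n)
    return bits / total


def full_table_size(alphabet_size: int, k: int) -> int:
    """Entries in a naive order-k table: one count per (context, next character) pair."""
    return alphabet_size ** (k + 1)


# Hypothetical sample with strongly dependent characters.
sample = "abracadabra" * 50
for k in range(4):
    print(k, round(order_k_entropy(sample, k), 3), full_table_size(len(set(sample)), k))
```

On a source like this sample, the entropy estimate drops as the order k rises, while the naive table size is multiplied by the alphabet size at each step. This is the gap that a linear-size structure such as the DAWG is meant to close.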
Keywords: Directed Acyclic Graph, Data Compression, Suffix Tree, Good Compression, Extra Space
- Blumer, “A generalization of run-length coding,” presented at the IEEE International Symposium on Information Theory, June 1985, Brighton, England.
- Blumer, Blumer, Haussler, McConnell, and Ehrenfeucht, “Complete inverted files for efficient text retrieval and analysis,” Journal of the ACM, July 1987, pp. 578–595.
- Storer, J., Data Compression: Methods and Theory, Computer Science Press, Rockville, MD, 1988.
- Welch, T.A., “A technique for high-performance data compression,” IEEE Computer, 17, no. 6, June 1984, pp. 8–19.