Sequences, pp. 303-311

Applications of DAWGs to Data Compression

  • Anselm Blumer
Conference paper

Abstract

A string compression technique can compress well only if it has an accurate model of the data source. For a source with statistically independent characters, Huffman or arithmetic codes give optimal compression [11]. In this case it is straightforward to use a fixed source model if the statistics are known in advance, or to adapt the model to unknown or changing statistics. For the many sources which produce dependent characters, a more sophisticated source model can provide much better compression at the expense of the extra space and time for storing and maintaining the model. The space required by a straightforward implementation of a Markov model grows exponentially in the order of the model. The Directed Acyclic Word Graph (DAWG) can be built in linear time and space, and provides the information needed to obtain compression equal to that obtained using a Markov model of high order. This paper presents two algorithms for string compression using DAWGs. The first is a very simple idea which generalizes run-length coding. It obtains good compression in many cases, but is provably non-optimal. The second combines the main idea of the first with arithmetic coding, resulting in a great improvement in performance.
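The space contrast drawn in the abstract can be illustrated with a short sketch. The code below is not taken from the paper: it is the standard online construction of a DAWG (also called a suffix automaton), and it does not implement either of the two compression algorithms. The names `build_dawg`, `is_substring`, and `markov_contexts` are hypothetical helpers chosen for illustration, and the dictionary-based state layout is an assumption made for readability.

```python
def build_dawg(text):
    """Build the DAWG (suffix automaton) of `text`, one character at a time.

    Returns three parallel lists indexed by state number:
    trans[s]  - dict mapping a character to the next state
    link[s]   - suffix link of state s (-1 for the initial state)
    length[s] - length of the longest string that reaches state s
    """
    trans, link, length = [{}], [-1], [0]
    last = 0  # state reached by the whole text read so far
    for ch in text:
        cur = len(trans)
        trans.append({})
        link.append(-1)
        length.append(length[last] + 1)
        # Walk suffix links, adding the missing transitions on ch.
        p = last
        while p != -1 and ch not in trans[p]:
            trans[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone takes over the shorter strings.
                clone = len(trans)
                trans.append(dict(trans[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and trans[p].get(ch) == q:
                    trans[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return trans, link, length


def is_substring(trans, word):
    """Follow transitions from the initial state; succeed iff none is missing."""
    state = 0
    for ch in word:
        if ch not in trans[state]:
            return False
        state = trans[state][ch]
    return True


def markov_contexts(alphabet_size, order):
    """Number of contexts a table-based Markov model of the given order may store."""
    return alphabet_size ** order
```

For a text of length n ≥ 2 the automaton has at most 2n − 1 states, whereas a table-based order-3 model over a 256-symbol alphabet may already need 256³ = 16,777,216 contexts. Every substring of the text, and therefore every context that actually occurs, corresponds to a path from the initial state.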

Keywords

Directed Acyclic Graph; Data Compression; Suffix Tree; Good Compression; Extra Space
These keywords were added by machine and not by the authors.

References

  [1] Blumer, “A generalization of run-length coding,” presented at the IEEE International Symposium on Information Theory, June 1985, Brighton, England.
  [2] Blumer, Blumer, Ehrenfeucht, Haussler, Chen and Seiferas, “The smallest automaton recognizing the subwords of a text,” Theoretical Computer Science, (40) 1985, pp. 31–55.
  [3] Blumer, Blumer, Haussler, McConnell, and Ehrenfeucht, “Complete inverted files for efficient text retrieval and analysis,” Journal of the ACM, July 1987, pp. 578–595.
  [4] Chung, K.L., Elementary Probability Theory, Springer-Verlag, New York, 1974.
  [5] Cleary, J.G., and Ian H. Witten, “Data compression using adaptive coding and partial string matching,” IEEE Transactions on Communication, COM-32, 4, April 1984, pp. 396–402.
  [6] Lempel, Abraham and Jacob Ziv, “On the complexity of finite sequences,” IEEE Transactions on Information Theory, IT-22, no. 1, Jan. 1976, pp. 75–81.
  [7] McCreight, E.M., “A space-economical suffix tree construction algorithm,” Journal of the ACM 23, 2, April 1976, pp. 262–272.
  [8] Rissanen, J., and G.G. Langdon, “Arithmetic coding,” IBM Journal of Research and Development, 23, 2, March 1979, pp. 149–162.
  [9] Rissanen, J., “A universal data compression system,” IEEE Transactions on Information Theory, IT-29, no. 5, September 1983, pp. 656–664.
  [10] Rodeh, M., V.R. Pratt, and S. Even, “Linear algorithm for data compression via string matching,” Journal of the ACM 28, 1, January 1981, pp. 16–24.
  [11] Storer, J., Data Compression: Methods and Theory, Computer Science Press, Rockville, MD, 1988.
  [12] Welch, T.A., “A technique for high-performance data compression,” IEEE Computer, 17, no. 6, June 1984, pp. 8–19.
  [13] Witten, Ian H., Radford M. Neal, and John G. Cleary, “Arithmetic coding for data compression,” Communications of the ACM, 30, no. 6, June 1987, pp. 520–540.
  [14] Ziv, Jacob and Abraham Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Transactions on Information Theory, IT-24, 5, September 1978, pp. 530–535.

Copyright information

© Springer-Verlag New York Inc. 1990

Authors and Affiliations

  • Anselm Blumer (Department of Computer Science, Tufts University, Medford, USA)
