Numbers, Information and Complexity, pp 421-442

# Universal Lossless Coding of Sources with Large and Unbounded Alphabets

## Abstract

A multilevel arithmetic coding algorithm is proposed to encode data sequences with large or unbounded source alphabets. The algorithm first converts the source alphabet into a dynamic tree, and then represents each symbol in the input sequence by its path in the tree and its index in the corresponding leaf. The input sequence is then encoded by encoding the path sequence and the index sequence conditionally. It is shown that the proposed algorithm is universal in the sense that it asymptotically achieves the entropy rate of any independent and identically distributed integer source with a finite or infinite alphabet, as long as the mean value is finite. The advantages of the proposed algorithm over the traditional adaptive arithmetic coding algorithm are twofold: (1) the proposed algorithm can encode any data sequence, whether the corresponding source alphabet is finite or infinite, while the traditional adaptive arithmetic coding algorithm works only for data sequences with bounded, small alphabets; (2) in situations where the traditional adaptive arithmetic coding algorithm does work, the proposed algorithm can reduce coding complexity and improve compression performance. The proposed algorithm is then used to implement the recent Multilevel Pattern Matching (MPM) algorithms. Simulation results show that, for a variety of files, combining the proposed algorithm with the MPM algorithms yields better compression performance than the UNIX Compress algorithm, which is based on the LZ78 algorithm. Other applications of the proposed algorithm are also discussed.
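The path-and-index representation described above can be illustrated with a minimal Python sketch. The abstract does not specify how the dynamic tree is built, so the sketch assumes a fixed dyadic partition of the non-negative integers as a stand-in: leaf `k` holds the `2**k` integers from `2**k - 1` to `2**(k+1) - 2`. The function names `to_path_and_index`, `from_path_and_index`, and `split_sequence` are hypothetical and chosen for illustration only; no arithmetic coder is included.

```python
from typing import List, Tuple


def to_path_and_index(n: int) -> Tuple[int, int]:
    """Split a non-negative integer into (leaf level, index inside the leaf).

    Assumption: a fixed dyadic tree, not the paper's dynamic tree.
    Leaf k holds the 2**k integers in [2**k - 1, 2**(k + 1) - 2].
    """
    if n < 0:
        raise ValueError("expected a non-negative integer")
    level = (n + 1).bit_length() - 1   # which leaf the symbol falls in
    index = n + 1 - (1 << level)       # offset of the symbol inside that leaf
    return level, index


def from_path_and_index(level: int, index: int) -> int:
    """Invert to_path_and_index."""
    return (1 << level) + index - 1


def split_sequence(xs: List[int]) -> Tuple[List[int], List[int]]:
    """Turn an integer sequence into a path sequence and an index sequence."""
    pairs = [to_path_and_index(x) for x in xs]
    return [p for p, _ in pairs], [i for _, i in pairs]


if __name__ == "__main__":
    paths, indices = split_sequence([0, 3, 1000])
    print(paths, indices)
```

Under this assumed partition, the path sequence takes only a handful of distinct values even when the symbols themselves are large, which is what makes it suitable for an adaptive coder with a small alphabet. The index would then be coded conditionally on the path, since the path fixes how many values the index can take.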

## Keywords

Input Sequence, Compression Rate, Binary Search Tree, Integer Sequence, Arithmetic Code
