Coders for the Burrows-Wheeler Transform
Like most transforms, the Burrows-Wheeler Transform (BWT) does not change the size of the file being transformed; it merely rearranges the data so that it is easier to represent compactly. The transformed text then needs to be coded in a second phase, which we will refer to as the “Local to Global Transform” (LGT). Figure 3.1 shows a section of the transformed text for Shakespeare's “Hamlet”, which reveals the kind of regularities that the BWT exposes. These characters are the ones that appear before the context nd: initially the nd is followed by a space, and hence a is very common (as in “and”); further on the context becomes ndeed, where i becomes common (as in “indeed”); and the last few characters precede nder.
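To make the rearrangement concrete, the following is a minimal sketch of a forward BWT built from sorted rotations. It is a quadratic-space illustration rather than a practical implementation (real coders use suffix sorting), and the function name `bwt` and the choice of `"\0"` as an end-of-text sentinel are assumptions made for this example, not taken from the chapter.

```python
def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive Burrows-Wheeler Transform via sorted rotations.

    Assumes `sentinel` does not occur in `text` and sorts before
    every other character (true for "\0" with ordinary text).
    """
    s = text + sentinel
    # Every rotation of s; sorting groups rotations by the context
    # that follows the character that ends up in the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The output is the last column: the character preceding each context.
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    # The output has the same length as the input plus the sentinel;
    # only the order of the characters changes.
    print(repr(bwt("banana")))
```

Because the rotations are sorted by context, the characters that precede similar contexts (such as nd followed by a space) end up next to each other in the output, which is the clustering described for Figure 3.1.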
Clearly the text in Figure 3.1 contains a lot of patterns, and therefore will be easy to compress. Many sophisticated techniques have been proposed to exploit the regularities of the BWT transformed text, and yet it has emerged that one of the simplest approaches (RleAc, based on run-length encoding and an order-zero arithmetic coder) gives the best compression and is also very fast compared with more complicated methods. We will begin this section by looking at this simple coder, but later we will also review various other approaches that have been proposed, including Burrows and Wheeler's original “Move to Front” (MTF) list, inversion frequencies, distance coding, frequency counting methods, wavelet trees, and alternative permutations. We will also consider the effect of the block size on compression performance.
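As a rough illustration of two of the ingredients named above, the sketch below shows a Move to Front list, a plain run-length pass, and an order-zero entropy estimate standing in for the arithmetic coder. This is not the RleAc coder itself: the function names are hypothetical, the run-length format is the simplest possible one, and the entropy figure only approximates what an ideal order-zero coder would spend.

```python
import math
from collections import Counter
from itertools import groupby


def mtf_encode(data: bytes) -> list[int]:
    """Replace each byte by its current position in a recency list."""
    table = list(range(256))
    ranks = []
    for byte in data:
        index = table.index(byte)
        ranks.append(index)
        # Move the symbol just seen to the front of the list.
        table.insert(0, table.pop(index))
    return ranks


def mtf_decode(ranks: list[int]) -> bytes:
    """Invert mtf_encode by replaying the same list updates."""
    table = list(range(256))
    out = bytearray()
    for index in ranks:
        byte = table.pop(index)
        out.append(byte)
        table.insert(0, byte)
    return bytes(out)


def run_lengths(data: bytes) -> list[tuple[int, int]]:
    """Collapse each run of equal bytes into a (symbol, length) pair."""
    return [(symbol, len(list(group))) for symbol, group in groupby(data)]


def order0_entropy_bits(symbols) -> float:
    """Total bits an ideal order-zero coder would need for `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = b"aaabbb"
    print(mtf_encode(sample))    # runs of equal bytes become runs of zeros
    print(run_lengths(sample))
    print(order0_entropy_bits(mtf_encode(sample)))
```

On BWT output, long runs of the same character turn into runs of zeros under Move to Front, and into short (symbol, length) lists under run-length encoding; either way the resulting symbol distribution is heavily skewed, which is what an order-zero entropy coder can exploit.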
Keywords: Block Size, Compression Performance, Inverted Index, Arithmetic Code, Entropy Coder