# Processing of Huffman Compressed Texts with a Super-Alphabet

• Kimmo Fredriksson
• Jorma Tarhio
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2857)

## Abstract

We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in $$O(n \frac{\log_2\sigma}{b})$$ time, where n is the size of the compressed text in bytes, σ is the size of the alphabet, and b is a user-specified parameter. The method uses a variable-size super-alphabet, with an average size of $$O(\frac{b}{H \log_2 \sigma})$$ symbols, where H is the entropy of the text. Each super-symbol is processed in O(1) time. The algorithm uses $$O(2^b)$$ space and $$O(b2^b)$$ preprocessing time. The method can be easily augmented by auxiliary functions, which can, for example, decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time $$O(n \frac{\log_2 \sigma}{Hw})$$, where w is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm, which works in time $$O(n \frac{\log_2 \sigma}{b}+t)$$, where t is the number of occurrences reported; and a shift-or string matching algorithm that works in time $$O(n \frac{\log_2 \sigma}{b}\left\lceil (m+s)/w \right\rceil +t)$$, where m is the length of the pattern and s depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable-length moves, i.e. it processes a variable number of states at each step. The shift-or algorithm makes variable-length shifts, effectively also processing a variable number of states at each step. The number of states processed in O(1) time is $$O(\frac{b}{H \log_2 \sigma})$$. The method can be applied to several other algorithms as well. We conclude with some experimental results.
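
The table-driven scanning idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `huffman_codes`, `build_table`, and `decode` are hypothetical, the compressed text is modeled as a string of '0'/'1' characters rather than packed bytes, and the sketch assumes that b is at least the length of the longest codeword. For each of the 2^b possible b-bit windows, the table stores the whole codewords that the window starts with (one variable-size super-symbol) and the number of bits they occupy, so that scanning needs one lookup per super-symbol.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Build a prefix code {symbol: bitstring} with the textbook Huffman construction."""
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {s: ""}) for s, f in Counter(text).items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate one-symbol alphabet
        return {s: "0" for s in heap[0][2]}
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]


def build_table(codes, b):
    """For every b-bit window, precompute (decoded symbols, bits consumed)."""
    if b < max(len(c) for c in codes.values()):
        raise ValueError("this sketch assumes b >= longest codeword length")
    inverse = {c: s for s, c in codes.items()}
    table = []
    for value in range(1 << b):
        bits = format(value, f"0{b}b")
        symbols, start, used = [], 0, 0
        for end in range(1, b + 1):
            sym = inverse.get(bits[start:end])
            if sym is not None:  # a whole codeword ends here
                symbols.append(sym)
                start = used = end
        table.append(("".join(symbols), used))
    return table


def decode(bits, codes, table, b):
    """Decode a '0'/'1' string, one table lookup per super-symbol."""
    out, pos = [], 0
    while pos + b <= len(bits):
        symbols, used = table[int(bits[pos:pos + b], 2)]
        out.append(symbols)
        pos += used  # always resume at a codeword boundary
    # Fewer than b bits remain: finish with plain codeword-by-codeword decoding.
    inverse = {c: s for s, c in codes.items()}
    start = pos
    for end in range(pos + 1, len(bits) + 1):
        sym = inverse.get(bits[start:end])
        if sym is not None:
            out.append(sym)
            start = end
    return "".join(out)


# Usage: round-trip a short text with b = 8.
text = "abracadabra"
codes = huffman_codes(text)
bits = "".join(codes[ch] for ch in text)
table = build_table(codes, b=8)
assert decode(bits, codes, table, b=8) == text
```

Advancing by the number of bits actually consumed, rather than by a fixed b, keeps the table at 2^b entries because every lookup starts at a codeword boundary. An auxiliary function for pattern matching could be attached at the point where `out.append(symbols)` handles the super-symbol.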

## Keywords

Compression Ratio, Pattern Match, String Match, Natural Language Text, Deterministic Automaton
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

## Authors and Affiliations

• Kimmo Fredriksson (1)
• Jorma Tarhio (2)

1. Department of CS, University of Joensuu, Joensuu, Finland
2. Department of CSE, Helsinki University of Technology, Finland