Processing of Huffman Compressed Texts with a Super-Alphabet

  • Kimmo Fredriksson
  • Jorma Tarhio
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2857)

Abstract

We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in \(O(n \frac{\log_2\sigma}{b})\) time, where n is the size of the compressed text in bytes, σ is the size of the alphabet, and b is a user-specified parameter. The method uses a variable-size super-alphabet, with an average size of \(O(\frac{b}{H \log_2 \sigma})\) symbols, where H is the entropy of the text. Each super-symbol is processed in O(1) time. The algorithm uses \(O(2^b)\) space and \(O(b2^b)\) preprocessing time. The method can be easily augmented by auxiliary functions, which can, for example, decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time \(O(n \frac{\log_2 \sigma}{Hw})\), where w is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm, which works in time \(O(n \frac{\log_2 \sigma}{b}+t)\), where t is the number of occurrences reported; and a shift-or string matching algorithm that works in time \(O(n \frac{\log_2 \sigma}{b}\left\lceil (m+s)/w \right\rceil +t)\), where m is the length of the pattern and s depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable-length moves, i.e. it processes a variable number of states at each step. The shift-or algorithm makes variable-length shifts, effectively also processing a variable number of states at each step. The number of states processed in O(1) time is \(O(\frac{b}{H \log_2 \sigma})\). The method can be applied to several other algorithms as well. We conclude with some experimental results.
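The idea of consuming b bits of the compressed text per step can be illustrated with a small table-driven decoder. The following is a minimal sketch under stated assumptions, not the construction from the paper: the names `build_codes`, `build_table`, and `decode_chunks` are hypothetical, the compressed text is kept as a string of '0'/'1' characters rather than packed bytes, and the table is indexed by a (pending code prefix, b-bit chunk) pair, so it uses more space than the \(O(2^b)\) bound stated above. Each table entry holds the symbols completed by that chunk (a variable-length super-symbol) and the prefix left over for the next chunk.

```python
import heapq
from collections import Counter
from itertools import count

def build_codes(text):
    """Build a Huffman code {symbol: bitstring}; assumes >= 2 distinct symbols."""
    tie = count()  # unique tie-breaker so the heap never compares tree nodes
    heap = [(freq, next(tie), sym) for sym, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left child, right child)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                        # leaf: a single-character symbol
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

def build_table(codes, b):
    """Precompute, for every pending prefix and b-bit chunk, the decoded
    symbols and the prefix that remains after the chunk is consumed."""
    decode = {code: sym for sym, code in codes.items()}
    # States are the proper prefixes of codewords (internal tree nodes).
    states = {code[:i] for code in codes.values() for i in range(len(code))}
    table = {}
    for state in states:
        for chunk in range(1 << b):
            cur, out = state, []
            for bit in format(chunk, f"0{b}b"):
                cur += bit
                if cur in decode:    # a codeword was completed
                    out.append(decode[cur])
                    cur = ""
            table[state, chunk] = ("".join(out), cur)
    return table

def decode_chunks(bits, table, b, n_symbols):
    """Decode a bit string (length a multiple of b) with one lookup per chunk."""
    state, out = "", []
    for i in range(0, len(bits), b):
        emitted, state = table[state, int(bits[i:i + b], 2)]
        out.append(emitted)
    # Zero padding may decode spurious symbols, so truncate to the known length.
    return "".join(out)[:n_symbols]

text = "abracadabra"
codes = build_codes(text)
bits = "".join(codes[c] for c in text)
b = 4
padded = bits + "0" * (-len(bits) % b)  # pad to a whole number of chunks
assert decode_chunks(padded, build_table(codes, b), b, len(text)) == text
```

The main loop performs one table lookup per b-bit chunk regardless of how many symbols that chunk completes, which is the property the abstract relies on when it says each super-symbol is processed in O(1) time.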

Keywords

Compression Ratio; Pattern Match; String Match; Natural Language Text; Deterministic Automaton
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6), 333–340 (1975)
  2. Amir, A., Benson, G.: Efficient two-dimensional compressed matching. In: Proceedings of 2nd IEEE Data Compression Conference (DCC 1992), pp. 279–288. IEEE Computer Society Press, Los Alamitos (1992)
  3. Amir, A., Benson, G., Farach, M.: Let sleeping files lie: Pattern matching in Z-compressed files. J. Comput. Syst. Sci. 52(2), 299–307 (1996)
  4. Baeza-Yates, R.A., Gonnet, G.H.: A new approach to text searching. Commun. ACM 35(10), 74–82 (1992)
  5. Baeza-Yates, R.A., Navarro, G.: Multiple approximate string matching. In: Rau-Chaplin, A., Dehne, F., Sack, J.-R., Tamassia, R. (eds.) WADS 1997. LNCS, vol. 1272, pp. 174–184. Springer, Heidelberg (1997)
  6. Choueka, Y., Klein, S.T., Perl, Y.: Efficient variants of Huffman codes in high-level languages. In: Proceedings of SIGIR 1985, 8th Annual International Conference of Research and Development in Information Retrieval, pp. 122–130. ACM, New York (1985)
  7. Fredriksson, K.: Faster string matching with super-alphabets. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002. LNCS, vol. 2476, pp. 44–57. Springer, Heidelberg (2002)
  8. Huffman, D.A.: A method for the construction of minimum redundancy codes. Proc. I.R.E. 40, 1098–1101 (1952)
  9. Klein, S.T., Shapira, D.: Pattern matching in Huffman encoded texts. In: Proceedings of 11th IEEE Data Compression Conference (DCC 2001), pp. 449–458. IEEE Computer Society Press, Los Alamitos (2001)
  10. Kytöjoki, J., Salmela, L., Tarhio, J.: Tuning string matching for huge pattern sets. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 211–224. Springer, Heidelberg (2003)
  11. Moura, E., Navarro, G., Ziviani, N., Baeza-Yates, R.: Fast and flexible word searching on compressed text. ACM Transactions on Information Systems 18(2), 113–139 (2000)
  12. Navarro, G., Kida, T., Takeda, M., Shinohara, A., Arikawa, S.: Faster approximate string matching over compressed text. In: Proceedings of 11th IEEE Data Compression Conference (DCC 2001), pp. 459–468. IEEE Computer Society Press, Los Alamitos (2001)
  13. Takeda, M., Miyamoto, S., Kida, T., Shinohara, A., Fukamachi, S., Shinohara, T., Arikawa, S.: Processing text files as is: Pattern matching over compressed texts, multi-byte character texts, and semi-structured texts. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002. LNCS, vol. 2476, pp. 170–186. Springer, Heidelberg (2002)
  14. Takeda, M., Shibata, Y., Matsumoto, T., Kida, T., Shinohara, A., Fukamachi, S., Shinohara, T., Arikawa, S.: Speeding up string pattern matching by text compression: The dawn of a new era. Transactions of Information Processing Society of Japan 42(3), 370–384 (2001)
  15. Wu, S., Manber, U.: Fast text searching allowing errors. Commun. ACM 35(10), 83–91 (1992)
  16. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 337–343 (1977)
  17. Ziv, J., Lempel, A.: Compression of individual sequences via variable length coding. IEEE Trans. Inf. Theory 24, 530–536 (1978)

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Kimmo Fredriksson (1)
  • Jorma Tarhio (2)
  1. Department of CS, University of Joensuu, Joensuu, Finland
  2. Department of CSE, Helsinki University of Technology, Finland
