Approximate String Matching over Ziv-Lempel Compressed Text

  • Juha Kärkkäinen
  • Gonzalo Navarro
  • Esko Ukkonen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1848)

Abstract

We present a solution to the problem of performing approximate pattern matching on compressed text. The format we choose is the Ziv-Lempel family, specifically the LZ78 and LZW variants. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k insertions, deletions and substitutions, in O(mkn + R) time. The existence problem needs O(mkn) time. We also show that the algorithm can be adapted to run in O(k^2 n + min(mkn, m^2 (mσ)^k) + R) average time, where σ is the alphabet size. The experimental results show a speedup over the basic approach for moderate m and small k.
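
To make the problem setting concrete, the following is a minimal sketch of a decompress-then-search baseline, not the algorithm of the paper. It assumes that the "basic approach" mentioned above means decompressing the text and running a classical dynamic-programming search over it. The function names (lz78_compress, lz78_decompress, approx_search) are hypothetical, and the search uses the well-known column-wise edit-distance recurrence in O(mu) time over the uncompressed text.

```python
def lz78_compress(text):
    """Parse text into LZ78 phrases (parent phrase id, next character)."""
    trie = {}      # (parent id, char) -> phrase id; id 0 is the empty phrase
    phrases = []
    node = 0
    for ch in text:
        nxt = trie.get((node, ch))
        if nxt is not None:
            node = nxt                       # keep extending the current phrase
        else:
            phrases.append((node, ch))       # emit a new phrase
            trie[(node, ch)] = len(phrases)
            node = 0
    if node != 0:
        phrases.append((node, ""))           # trailing phrase repeats an earlier one
    return phrases


def lz78_decompress(phrases):
    """Rebuild the text by expanding each phrase from its parent."""
    blocks = [""]
    for parent, ch in phrases:
        blocks.append(blocks[parent] + ch)
    return "".join(blocks[1:])


def approx_search(pattern, text, k):
    """Return 1-based end positions in text where pattern matches with <= k edits."""
    m = len(pattern)
    col = list(range(m + 1))   # col[i] = best edit distance of pattern[:i] so far
    ends = []
    for j, c in enumerate(text, 1):
        prev_diag = col[0]     # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,   # match or substitution
                         old + 1,            # extra text character
                         col[i - 1] + 1)     # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


# Baseline usage: decompress first, then search the plain text.
phrases = lz78_compress("abracadabra")
print(len(phrases), approx_search("abra", lz78_decompress(phrases), 0))
```

In this sketch u = 11 characters are parsed into n = 7 phrases. The baseline always costs time proportional to mu, whereas the bounds stated in the abstract depend on the compressed length n.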

Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Juha Kärkkäinen (1)
  • Gonzalo Navarro (2)
  • Esko Ukkonen (1)
  1. Dept. of Computer Science, University of Helsinki, Finland
  2. Dept. of Computer Science, University of Chile, Chile
