A fast bit-vector algorithm for approximate string matching based on dynamic programming

  • Gene Myers
Session I
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1448)

Abstract

The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in O(nmk/w) time where w is the word size of the machine (e.g. 32 or 64 in practice). Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m.
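The idea of encoding a dynamic programming column as bit-vectors can be sketched as follows. This is a minimal illustration, not the paper's code: the abstract does not list the update rules, so the ones below follow the form in which this kind of algorithm is commonly presented. The names `myers_search` and `sellers_search` are hypothetical. Python integers are unbounded, so the sketch masks every vector to m bits by hand; in a language with fixed-width words the pattern would have to satisfy m ≤ w. A plain O(nm) column-by-column DP is included only as a cross-check.

```python
def myers_search(pattern, text, k):
    """Return (end_index, differences) for each text position where the
    pattern matches a substring ending there with at most k differences."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # peq[c] has bit i set exactly when pattern[i] == c.
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    mask = (1 << m) - 1          # keeps vectors to m bits (assumption: emulates a word)
    high = 1 << (m - 1)          # bit for the last pattern row
    pv, mv = mask, 0             # column 0: every vertical delta is +1
    score = m                    # value in the last row of the current column
    hits = []
    for j, ch in enumerate(text):
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask    # rows whose horizontal delta is +1
        mh = pv & xh                     # rows whose horizontal delta is -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = (ph << 1) & mask            # row 0 is all zeros, so shift in 0
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            hits.append((j, score))
    return hits


def sellers_search(pattern, text, k):
    """Reference O(nm) dynamic programme, used only to check the sketch."""
    m = len(pattern)
    col = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text):
        diag = col[0]                    # D[i-1][j-1]
        col[0] = 0                       # a match may start anywhere in the text
        for i in range(1, m + 1):
            cur = min(col[i] + 1, col[i - 1] + 1, diag + (pattern[i - 1] != ch))
            diag = col[i]
            col[i] = cur
        if col[m] <= k:
            hits.append((j, col[m]))
    return hits
```

For example, `myers_search("abc", "abc", 1)` returns `[(1, 1), (2, 0)]`: the prefix "ab" ends a match with one difference, and the full text ends an exact match. The inner loop of `myers_search` does a constant number of word operations per text character regardless of k, which is the property the abstract emphasises.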

Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu, Manber, and Myers. This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large. In practice this new algorithm, which computes a region of the d.p. matrix in 1 × w blocks using the basic algorithm as a subroutine, is significantly faster than our previous 4-Russians algorithm, which computes the same region in 1 × 5 blocks using table lookup. This performance improvement yields a code which is superior to all existing algorithms except for some filtration algorithms that are superior when k/m is sufficiently small.
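A rough sketch of the block idea, under stated assumptions: the pattern is cut into chunks of at most w rows, each chunk keeps its own pair of delta vectors, and a horizontal delta (-1, 0 or +1) is carried from the bottom of one block into the top of the next. The names `advance_block` and `block_search` are hypothetical, and the carry-handling rules follow the form in which block versions of bit-vector matching are commonly written rather than anything stated in the abstract. The sketch computes every block of every column, so it runs in O(nm/w) word operations; the restriction to the region of the matrix with values at most k, which is what gives the O(kn/w) expected time, is deliberately left out.

```python
def advance_block(blk, ch, hin):
    """Advance one 1 x w block by one text character.
    hin is the horizontal delta entering from above; returns the delta
    leaving the bottom row of the block."""
    mask, high = blk["mask"], blk["high"]
    pv, mv = blk["pv"], blk["mv"]
    eq = blk["peq"].get(ch, 0)
    xv = eq | mv
    if hin < 0:
        eq |= 1                          # a -1 carry acts like a match in row 0
    xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
    ph = (mv | ~(xh | pv)) & mask
    mh = pv & xh
    hout = 1 if ph & high else (-1 if mh & high else 0)
    ph = (ph << 1) & mask
    mh = (mh << 1) & mask
    if hin < 0:
        mh |= 1                          # feed the carry into the lowest row
    elif hin > 0:
        ph |= 1
    blk["pv"] = (mh | ~(xv | ph)) & mask
    blk["mv"] = ph & xv
    return hout


def block_search(pattern, text, k, w=64):
    """Return (end_index, differences) for each text position where the
    pattern matches with at most k differences, using blocks of <= w rows."""
    m = len(pattern)
    if m == 0 or w < 1:
        raise ValueError("pattern must be non-empty and w must be positive")
    blocks = []
    for start in range(0, m, w):
        chunk = pattern[start:start + w]
        peq = {}
        for i, c in enumerate(chunk):
            peq[c] = peq.get(c, 0) | (1 << i)
        width = len(chunk)               # the last block may be narrower than w
        full = (1 << width) - 1
        blocks.append({"peq": peq, "mask": full, "high": 1 << (width - 1),
                       "pv": full, "mv": 0})
    score = m                            # value in the last row of the matrix
    hits = []
    for j, ch in enumerate(text):
        carry = 0                        # row 0 is all zeros in the search variant
        for blk in blocks:
            carry = advance_block(blk, ch, carry)
        score += carry
        if score <= k:
            hits.append((j, score))
    return hits
```

With the tiny block width `w=2`, `block_search("abc", "abc", 0, w=2)` returns `[(2, 0)]`; the result should not depend on the choice of `w`, which makes small widths a convenient way to exercise the carry logic.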

Keywords

Basic Algorithm; Table Lookup; Edit Distance; String Match; Approximate String Match
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. [BYG92]
    R.A. Baeza-Yates and G.H. Gonnet. A new approach to text searching. Communications of the ACM, 35:74–82, 1992.
  2. [BYN96]
    R.A. Baeza-Yates and G. Navarro. A faster algorithm for approximate string matching. In Proc. 7th Symp. on Combinatorial Pattern Matching. Springer LNCS 1075, pages 1–23, 1996.
  3. [CL92]
    W.I. Chang and J. Lampe. Theoretical and empirical comparisons of approximate string matching algorithms. In Proc. 3rd Symp. on Combinatorial Pattern Matching. Springer LNCS 644, pages 172–181, 1992.
  4. [CL94]
    W.I. Chang and E.L. Lawler. Sublinear expected time approximate matching and biological applications. Algorithmica, 12:327–344, 1994.
  5. [GP90]
    Z. Galil and K. Park. An improved algorithm for approximate string matching. SIAM J. on Computing, 19:989–999, 1990.
  6. [LV88]
    G.M. Landau and U. Vishkin. Fast string matching with k differences. J. of Computer and System Sciences, 37:63–78, 1988.
  7. [MP80]
    W.J. Masek and M.S. Paterson. A faster algorithm for computing string edit distances. J. of Computer and System Sciences, 20:18–31, 1980.
  8. [Mye94]
    E.W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12:345–374, 1994.
  9. [Sel80]
    P.H. Sellers. The theory and computation of evolutionary distances: Pattern recognition. J. of Algorithms, 1:359–373, 1980.
  10. [Ukk85]
    E. Ukkonen. Finding approximate patterns in strings. J. of Algorithms, 6:132–137, 1985.
  11. [WM92]
    S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35:83–91, 1992.
  12. [WMM96]
    S. Wu, U. Manber, and G. Myers. A subquadratic algorithm for approximate limited expression matching. Algorithmica, 15:50–67, 1996.

Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Gene Myers (1)
  1. Dept. of Computer Science, University of Arizona, Tucson