# A fast bit-vector algorithm for approximate string matching based on dynamic programming

## Abstract

The approximate string matching problem is to find all locations at which a query of length *m* matches a substring of a text of length *n* with *k*-or-fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the *k*-difference automaton for the query, and asymptotically run in *O*(*nmk/w*) time where *w* is the word size of the machine (e.g. 32 or 64 in practice). Here we present an algorithm of comparable simplicity that requires only *O*(*nm/w*) time by virtue of computing a bit representation of the *relocatable* dynamic programming matrix for the problem. Thus the algorithm's performance is independent of *k*, and it is found to be more efficient than the previous results for many choices of *k* and small *m*.
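The abstract does not spell out the recurrence, so the following is only a minimal sketch of one commonly published formulation of the bit-vector column update, assuming the query fits in a single machine word (*m* ≤ *w*). The names `myers_search` and `sellers_search` are illustrative, not taken from the paper. Python integers are unbounded, so the word-size limit is emulated with an explicit mask rather than enforced. The classical *O*(*nm*) column-wise dynamic programming is included as a baseline to compare against.

```python
def sellers_search(pattern, text, k):
    """Baseline: O(nm) column-wise dynamic programming for k-differences search."""
    m = len(pattern)
    col = list(range(m + 1))              # column before any text: D[i] = i
    hits = []
    for j, c in enumerate(text):
        prev_diag = col[0]                # D[i-1][j-1]
        col[0] = 0                        # a match may start anywhere in the text
        for i in range(1, m + 1):
            cur = min(prev_diag + (pattern[i - 1] != c),  # match or substitution
                      col[i] + 1,         # col[i] still holds D[i][j-1]
                      col[i - 1] + 1)     # D[i-1][j]
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            hits.append(j)
    return hits


def myers_search(pattern, text, k):
    """Bit-vector sketch: one column update costs O(1) word operations.

    Assumes len(pattern) >= 1 and, on real hardware, len(pattern) <= w.
    Returns the 0-based text positions where a match with <= k differences ends.
    """
    m = len(pattern)
    mask = (1 << m) - 1                   # emulates an m-bit word
    high = 1 << (m - 1)                   # bit for the last pattern row
    peq = {}                              # peq[c]: bit i set iff pattern[i] == c
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv = mask, 0                      # vertical deltas: all +1 initially
    score = m                             # value in the last row of the column
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask     # horizontal delta +1
        mh = pv & xh                      # horizontal delta -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph <<= 1                          # row 0 has horizontal delta 0 in search
        mh <<= 1
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv & mask
        if score <= k:
            hits.append(j)
    return hits
```

For example, `myers_search("abc", "xabcx", 1)` returns `[2, 3, 4]`, the same end positions the baseline reports. Note that *k* appears only in the final comparison, which is why the per-column cost does not depend on it.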

Moreover, because the algorithm is not dependent on *k*, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu, Manber, and Myers. This gives rise to an *O*(*kn/w*) expected-time algorithm for the case where *m* may be arbitrarily large. In practice this new algorithm, which computes a region of the dynamic programming matrix in 1 × *w* blocks using the basic algorithm as a subroutine, is significantly faster than our previous 4-Russians algorithm, which computes the same region in 1 × 5 blocks using table lookup. This performance improvement yields a code that outperforms all existing algorithms except for some filtration algorithms, which are superior when *k/m* is sufficiently small.
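To make the idea of chaining 1 × *w* blocks concrete, here is a hedged sketch in which each block of at most *w* pattern rows is advanced by one text character and passes a horizontal delta (−1, 0, or +1) down to the block below it. The function names, the dictionary-based block state, and the parameter `w` are assumptions made for illustration. The sketch updates every block in every column, so it runs in *O*(*n*⌈*m/w*⌉) time; the restriction to the active region of the matrix that yields the *O*(*kn/w*) expected-time bound is deliberately left out.

```python
def make_blocks(pattern, w):
    """Split the pattern into blocks of at most w rows, each with its own state."""
    blocks = []
    for start in range(0, len(pattern), w):
        chunk = pattern[start:start + w]
        peq = {}                          # peq[c]: bit i set iff chunk[i] == c
        for i, c in enumerate(chunk):
            peq[c] = peq.get(c, 0) | (1 << i)
        mask = (1 << len(chunk)) - 1
        blocks.append({"peq": peq, "mask": mask,
                       "high": 1 << (len(chunk) - 1),
                       "pv": mask, "mv": 0})   # vertical deltas start at +1
    return blocks


def advance_block(block, c, hin):
    """Advance one block by text character c.

    hin is the horizontal delta entering the top of the block;
    the return value is the horizontal delta leaving its bottom row.
    """
    mask, high = block["mask"], block["high"]
    pv, mv = block["pv"], block["mv"]
    eq = block["peq"].get(c, 0)
    xv = eq | mv
    if hin < 0:
        eq |= 1                           # a -1 carry-in acts like a match in row 0
    xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
    ph = (mv | ~(xh | pv)) & mask
    mh = pv & xh
    hout = (1 if ph & high else 0) - (1 if mh & high else 0)
    ph <<= 1
    mh <<= 1
    if hin < 0:
        mh |= 1
    elif hin > 0:
        ph |= 1
    block["pv"] = (mh | ~(xv | ph)) & mask
    block["mv"] = ph & xv & mask
    return hout


def blocked_search(pattern, text, k, w=64):
    """Return 0-based end positions of matches with <= k differences."""
    blocks = make_blocks(pattern, w)
    score = len(pattern)                  # value in the last row of the column
    hits = []
    for j, c in enumerate(text):
        h = 0                             # row 0 is all zeros when searching
        for block in blocks:              # top to bottom, chaining the carry
            h = advance_block(block, c, h)
        score += h
        if score <= k:
            hits.append(j)
    return hits
```

Passing a small `w`, as in `blocked_search("abc", "xabcx", 1, w=2)`, forces the pattern across several blocks and is a convenient way to exercise the carry logic; the result should not depend on `w`.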

## Keywords

Basic Algorithm, Table Lookup, Edit Distance, String Match, Approximate String Match

## References

- [BYG92] R.A. Baeza-Yates and G.H. Gonnet. A new approach to text searching. *Communications of the ACM*, 35:74–82, 1992.
- [BYN96] R.A. Baeza-Yates and G. Navarro. A faster algorithm for approximate string matching. In *Proc. 7th Symp. on Combinatorial Pattern Matching. Springer LNCS 1075*, pages 1–23, 1996.
- [CL92] W.I. Chang and J. Lampe. Theoretical and empirical comparisons of approximate string matching algorithms. In *Proc. 3rd Symp. on Combinatorial Pattern Matching. Springer LNCS 644*, pages 172–181, 1992.
- [CL94] W.I. Chang and E.L. Lawler. Sublinear expected time approximate matching and biological applications. *Algorithmica*, 12:327–344, 1994.
- [GP90] Z. Galil and K. Park. An improved algorithm for approximate string matching. *SIAM J. on Computing*, 19:989–999, 1990.
- [LV88] G.M. Landau and U. Vishkin. Fast string matching with k differences. *J. of Computer and System Sciences*, 37:63–78, 1988.
- [MP80] W.J. Masek and M.S. Paterson. A faster algorithm for computing string edit distances. *J. of Computer and System Sciences*, 20:18–31, 1980.
- [Mye94] E.W. Myers. A sublinear algorithm for approximate keywords searching. *Algorithmica*, 12:345–374, 1994.
- [Sel80] P.H. Sellers. The theory and computations of evolutionary distances: Pattern recognition. *J. of Algorithms*, 1:359–373, 1980.
- [Ukk85] E. Ukkonen. Finding approximate patterns in strings. *J. of Algorithms*, 6:132–137, 1985.
- [WM92] S. Wu and U. Manber. Fast text searching allowing errors. *Communications of the ACM*, 35:83–91, 1992.
- [WMM96] S. Wu, U. Manber, and G. Myers. A subquadratic algorithm for approximate limited expression matching. *Algorithmica*, 15:50–67, 1996.