Abstract
We summarize results on approximate string matching under a string distance function that is based on so-called q-grams (‘n-grams’) and is computable in linear time. An algorithm is given for the associated string matching problem that finds the locally best approximate occurrences of a pattern P, |P| = m, in a text T, |T| = n, in time O(n log(m - q)). The occurrences with distance ≤ k can be found in time O(n log k). This should be compared with the edit-distance-based k-differences problem, for which the best algorithm currently known needs time O(kn). The q-gram distance yields a lower bound for the unit-cost edit distance, which leads to a fast hybrid algorithm for the k-differences problem.
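To make the distance concrete, here is a minimal Python sketch. It assumes the usual definition of the q-gram distance as the L1 distance between the q-gram occurrence counts of two strings, and the commonly stated form of the lower bound, D_q(x, y) ≤ 2q · ed(x, y); the abstract itself does not spell out either formula. The function names (`qgram_profile`, `qgram_distance`, `edit_distance`) are illustrative only, and the code does not implement the chapter's O(n log(m - q)) matching algorithm.

```python
from collections import Counter


def qgram_profile(s, q):
    """Count every substring of length q in s (empty if len(s) < q)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x, y, q):
    """L1 distance between the q-gram profiles of x and y.

    With hashing, this takes expected time linear in len(x) + len(y)
    for fixed q (assumed definition, see the lead-in).
    """
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


def edit_distance(x, y):
    """Unit-cost edit distance by the classic dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # delete cx
                           cur[j - 1] + 1,             # insert cy
                           prev[j - 1] + (cx != cy)))  # match or substitute
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    x, y, q = "abcd", "abed", 2
    d = qgram_distance(x, y, q)
    e = edit_distance(x, y)
    # One edit operation touches at most q q-grams, so d <= 2 * q * e.
    print(d, e, d <= 2 * q * e)
```

Because the q-gram distance is cheap to compute, a hybrid method can use it as a filter: a text region whose q-gram distance to P exceeds 2qk cannot contain an occurrence with at most k differences, so the more expensive edit-distance computation is needed only for the regions that survive.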
This work was supported by the Academy of Finland and by the Alexander von Humboldt Foundation (Germany). The work was done during a visit to the Institut für Informatik, University of Freiburg.
© 1993 Springer-Verlag New York, Inc.
Cite this paper
Ukkonen, E. (1993). Approximate string-matching and the q-gram distance. In: Capocelli, R., De Santis, A., Vaccaro, U. (eds) Sequences II. Springer, New York, NY. https://doi.org/10.1007/978-1-4613-9323-8_22
DOI: https://doi.org/10.1007/978-1-4613-9323-8_22
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4613-9325-2
Online ISBN: 978-1-4613-9323-8
eBook Packages: Springer Book Archive