
Approximate string-matching and the q-gram distance

  • Conference paper
Sequences II

Abstract

We summarize results on approximate string-matching under a string distance function that is computable in linear time and is based on q-grams (also called n-grams). An algorithm is given for the associated string-matching problem that finds the locally best approximate occurrences of a pattern P, |P| = m, in a text T, |T| = n, in time O(n log(m − q)). The occurrences with distance ≤ k can be found in time O(n log k). By comparison, the best algorithm currently known for the edit-distance-based k-differences problem needs time O(kn). The q-gram distance yields a lower bound for the unit-cost edit distance, which leads to a fast hybrid algorithm for the k-differences problem.
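To make the distance concrete, the following is a minimal Python sketch, not the paper's algorithm. The abstract does not define the q-gram distance, so the sketch assumes the usual definition: the sum, over all q-grams, of the absolute difference between the number of occurrences in the two strings. The function names `qgram_profile`, `qgram_distance` and `edit_distance` are illustrative only. The dictionary-based profile runs in expected linear time for fixed q; the edit distance is the textbook quadratic dynamic program, included only to illustrate the lower-bound relation, which under this definition reads D_q(x, y) ≤ 2q · ed(x, y).

```python
from collections import Counter


def qgram_profile(s, q):
    """Count every substring of length q in s (empty profile if len(s) < q)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x, y, q):
    """Assumed definition: L1 distance between the q-gram profiles of x and y."""
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    # A Counter returns 0 for q-grams it has not seen.
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


def edit_distance(x, y):
    """Unit-cost edit distance via the standard row-by-row dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # delete cx
                           cur[j - 1] + 1,             # insert cy
                           prev[j - 1] + (cx != cy)))  # match or substitute
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    x, y, q = "kitten", "sitting", 2
    d_q, ed = qgram_distance(x, y, q), edit_distance(x, y)
    # Lower-bound relation: d_q / (2q) never exceeds the edit distance.
    print(d_q, ed, d_q <= 2 * q * ed)
```

A single edit operation can change at most q of the q-grams on each side, which is where the factor 2q comes from. A hybrid method in the spirit of the abstract could use the cheap q-gram distance to discard text regions before running an edit-distance computation on the rest.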

This work was supported by the Academy of Finland and by the Alexander von Humboldt Foundation (Germany). The work was done during a visit to the Institut für Informatik, University of Freiburg.




© 1993 Springer-Verlag New York, Inc.

About this paper

Cite this paper

Ukkonen, E. (1993). Approximate string-matching and the q-gram distance. In: Capocelli, R., De Santis, A., Vaccaro, U. (eds) Sequences II. Springer, New York, NY. https://doi.org/10.1007/978-1-4613-9323-8_22


  • DOI: https://doi.org/10.1007/978-1-4613-9323-8_22

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4613-9325-2

  • Online ISBN: 978-1-4613-9323-8

  • eBook Packages: Springer Book Archive
