Indexing Text with Approximate q-Grams

  • Gonzalo Navarro
  • Erkki Sutinen
  • Jani Tanninen
  • Jorma Tarhio
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1848)

Abstract

We present a new index for approximate string matching. The index collects text q-samples, i.e. disjoint text substrings of length q, at fixed intervals and stores their positions. At search time, part of the text is filtered out by noticing that any occurrence of the pattern must be reflected in the presence of some text q-samples that match approximately inside the pattern. We show experimentally that the parameterization mechanism of the related filtration scheme provides a compromise between the space requirement of the index and the error level for which the filtration is still efficient.
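The filtering idea in the abstract can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, the sampling step `h`, and the choice to compare each sample against the whole pattern are all hypothetical. It relies only on the observation that an occurrence with at most k errors has length at least m - k, so it fully contains at least s = floor((m - k - q + 1) / h) consecutive samples, and that the errors of those disjoint samples against the pattern cannot sum to more than k.

```python
from collections import defaultdict


def build_sample_index(text, q, h):
    """Map each q-sample to the sample numbers where it occurs.

    Samples start at text positions 0, h, 2h, ...; h >= q keeps them disjoint.
    """
    if h < q:
        raise ValueError("h must be at least q so that samples do not overlap")
    index = defaultdict(list)
    for number, start in enumerate(range(0, len(text) - q + 1, h)):
        index[text[start:start + q]].append(number)
    return index


def best_match_errors(sample, pattern):
    """Smallest edit distance between `sample` and any substring of `pattern`."""
    # Column-wise dynamic programming. Row 0 stays 0, so a match may start
    # anywhere in the pattern; the minimum over the last row lets it end anywhere.
    column = list(range(len(sample) + 1))
    best = column[-1]
    for pattern_char in pattern:
        previous_diagonal = column[0]
        for i in range(1, len(sample) + 1):
            cost = 0 if sample[i - 1] == pattern_char else 1
            current = min(previous_diagonal + cost, column[i] + 1, column[i - 1] + 1)
            previous_diagonal = column[i]
            column[i] = current
        best = min(best, column[-1])
    return best


def candidate_sample_runs(text, pattern, k, q, h):
    """Return start positions of sample runs that may lie inside an occurrence.

    A run of s consecutive samples survives the filter only if the sum of the
    samples' best-match errors against the pattern is at most k.
    """
    s = (len(pattern) - k - q + 1) // h  # samples guaranteed inside an occurrence
    if s < 1:
        raise ValueError("pattern too short for this q, h and k")
    index = build_sample_index(text, q, h)
    total = sum(len(numbers) for numbers in index.values())
    errors = [0] * total
    for gram, numbers in index.items():
        gram_errors = best_match_errors(gram, pattern)  # once per distinct q-gram
        for number in numbers:
            errors[number] = gram_errors
    return [first * h
            for first in range(total - s + 1)
            if sum(errors[first:first + s]) <= k]
```

Each returned position only marks a text area that still has to be verified with a full approximate matching algorithm; every other area is discarded without being read. In this sketch, a larger `h` stores fewer samples and so shrinks the index, but it also lowers s and makes the filter less selective, which is the kind of space versus error-level compromise the abstract refers to.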

Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Gonzalo Navarro (1)
  • Erkki Sutinen (2)
  • Jani Tanninen (2)
  • Jorma Tarhio (2)
  1. Dept. of Computer Science, University of Chile, Chile
  2. Dept. of Computer Science, University of Joensuu, Finland
