Text Searching: Theory and Practice

Formal Languages and Applications

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 148)

Summary

We present the state of the art of the main component of text retrieval systems: the search engine. We outline the main lines of research and issues involved. We survey the relevant techniques in use today for text searching and explore the gap between theoretical and practical algorithms. The main observation is that simpler ideas are better in practice.

In theory, there is no difference between theory and practice.

In practice, there is.

Jan L.A. van de Snepscheut

The best theory is inspired by practice.

The best practice is inspired by theory.

Donald E. Knuth

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Baeza-Yates, R., Navarro, G. (2004). Text Searching: Theory and Practice. In: Martín-Vide, C., Mitrana, V., Păun, G. (eds) Formal Languages and Applications. Studies in Fuzziness and Soft Computing, vol 148. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39886-8_30

  • DOI: https://doi.org/10.1007/978-3-540-39886-8_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-53554-3

  • Online ISBN: 978-3-540-39886-8

  • eBook Packages: Springer Book Archive
