Summary
We present the state of the art of the main component of text retrieval systems: the search engine. We outline the main lines of research and issues involved. We survey the relevant techniques in use today for text searching and explore the gap between theoretical and practical algorithms. The main observation is that simpler ideas are better in practice.
In theory, there is no difference between theory and practice.
In practice, there is.
Jan L.A. van de Snepscheut
The best theory is inspired by practice.
The best practice is inspired by theory.
Donald E. Knuth
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Baeza-Yates, R., Navarro, G. (2004). Text Searching: Theory and Practice. In: Martín-Vide, C., Mitrana, V., Păun, G. (eds) Formal Languages and Applications. Studies in Fuzziness and Soft Computing, vol 148. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39886-8_30
DOI: https://doi.org/10.1007/978-3-540-39886-8_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53554-3
Online ISBN: 978-3-540-39886-8
eBook Packages: Springer Book Archive