Given two words, a text T of length n and an episode P of length m, the episode matching problem is to find all minimal length substrings of T that contain P as a subsequence. The corresponding optimization problem is to find the smallest number w such that T has a subword of length w which contains P as a subsequence.
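To make the definition concrete, here is a minimal brute-force sketch in Python. It is only an illustration of the problem statement, not one of the algorithms of the paper. It assumes that "minimal" means inclusion-minimal (no proper substring of the window still contains P), and the function names `shortest_end`, `minimal_windows` and `smallest_window_length` are hypothetical.

```python
def shortest_end(text, episode, start):
    """End index (exclusive) of the shortest window beginning at `start`
    that contains `episode` as a subsequence, or None if there is none."""
    j = 0
    for i in range(start, len(text)):
        if text[i] == episode[j]:
            j += 1
            if j == len(episode):
                return i + 1
    return None


def minimal_windows(text, episode):
    """All inclusion-minimal windows as (start, end) pairs, end exclusive.

    Assumption: a window is minimal when no proper substring of it
    contains the episode. Brute force, quadratic in len(text) at worst.
    """
    if not episode:
        raise ValueError("episode must be non-empty")
    ends = [shortest_end(text, episode, s) for s in range(len(text))]
    result = []
    for s, e in enumerate(ends):
        if e is None:
            break  # no later start can contain the episode either
        nxt = ends[s + 1] if s + 1 < len(ends) else None
        # Dropping the first symbol must lose the episode or push the end right.
        if nxt is None or nxt > e:
            result.append((s, e))
    return result


def smallest_window_length(text, episode):
    """The optimization value w, or None if the episode never occurs."""
    return min((e - s for s, e in minimal_windows(text, episode)), default=None)
```

For example, with T = `axbxxab` and P = `ab`, the minimal windows are `axb` (positions 0 to 2) and `ab` (positions 5 to 6), so w = 2.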
In this paper, we introduce several efficient off-line and on-line algorithms for the entire problem, where by on-line algorithms we mean algorithms that scan the text from left to right and read each symbol only once. We present two alphabet-independent algorithms which work in time O(nm). The off-line algorithm operates in O(1) additional space, while the on-line algorithm pays for its property with O(m) additional space. Two other on-line algorithms have subquadratic time complexity. One of them works in time O(nm/log m) and O(m) additional space. The other one gives a time/space trade-off, i.e., it works in time O(n + s + nm log log s / log(s/m)) when additional space is limited to O(s). Finally, we present two approximation algorithms for the optimization problem. The off-line algorithm is alphabet-independent, has superlinear time complexity O(n/ε + n log log(n/m)) and uses only constant space. The on-line algorithm works in time O(n/ε + n) and uses space O(m). Both approximation algorithms achieve a 1+ε approximation ratio, for any ε > 0.
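As an illustration of what an on-line method within the stated O(nm) time and O(m) additional space bounds can look like, the following Python sketch uses a standard dynamic-programming idea. It is an assumption-laden sketch, not necessarily the algorithm of the paper: the function name `online_smallest_window` and the array `start` are hypothetical. For each prefix of P it keeps the latest start position of a window that ends at the current text symbol and contains that prefix as a subsequence.

```python
def online_smallest_window(text, episode):
    """Smallest w such that some length-w substring of `text` contains
    `episode` as a subsequence, or None if no such substring exists.

    Reads `text` once, left to right; O(n*m) time, O(m) extra space.
    """
    m = len(episode)
    if m == 0:
        raise ValueError("episode must be non-empty")
    # start[j]: latest start of a window ending at the current symbol
    # that contains episode[:j + 1] as a subsequence (None if none yet).
    start = [None] * m
    best = None
    for i, ch in enumerate(text):
        # Go from long prefixes to short ones so that start[j - 1]
        # still holds its value from before symbol i was read.
        for j in range(m - 1, -1, -1):
            if ch == episode[j]:
                start[j] = i if j == 0 else start[j - 1]
        if start[m - 1] is not None:
            length = i - start[m - 1] + 1
            if best is None or length < best:
                best = length
    return best
```

Because `text` is only iterated over once, the same function can be fed any iterable of symbols, such as a stream. Reporting every window instead of the smallest length would only require recording `(start[m - 1], i)` whenever `start[m - 1]` changes.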
Keywords: Regular Expression, Edit Operation, Text Character, Additional Space, Approximate String Match