A Fast Longest Common Subsequence Algorithm for Similar Strings

• Abdullah N. Arslan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6031)

Abstract

The longest common subsequence problem is a fundamental computational problem for which many algorithms exist. We present a new algorithm for this problem. Let X and Y be two given strings, each of length O(n). We observe that a longest common subsequence can be obtained by using longest common prefixes of suffixes (longest common extensions) of X and Y. The longest common extension problem asks for the longest common prefix of the suffixes starting at a given pair of positions in X and Y, respectively. Let e be the number of edit operations (insert, delete, and substitute) needed to change X into Y, i.e., let e be the edit distance between X and Y. Our algorithm visits $$O(\min\{en,(1+\sqrt{2})^{2e+1}\})$$ nodes in the edit graph and performs one longest common extension query for every visited node. Each of these queries can be answered in constant time if we represent the strings by a suffix tree or a suffix array. These data structures can be built in linear time. We do not assume that the edit distance e is known beforehand; therefore we try values for e starting with e = 1 (without loss of generality X ≠ Y) and double e until our algorithm finds a longest common subsequence. The total time complexity of our algorithm is $$O(\min\{en\log{n},n+e(1+\sqrt{2})^{2e+1}\})$$. When e is small, this improves on the time complexity of existing solutions for the problem. For example, when $$e\leq \frac{1}{3}((\log_{(1+\sqrt{2})}~{n})-1)$$ our algorithm finds an optimal solution in time O(n).
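To make the role of longest common extension queries concrete, the following is a minimal Python sketch and not the algorithm of the paper. It swaps in the classic furthest-reaching-diagonal scheme used by greedy diff algorithms, which issues one extension query per visited diagonal. The names `lce` and `lcs_length` are hypothetical. The query is answered here by a naive character scan, whereas the abstract assumes a suffix tree or suffix array for constant-time answers. The counter `d` counts only insertions and deletions, so it is at most 2e for the edit distance e defined above.

```python
def lce(x, y, i, j):
    """Longest common extension: length of the longest common prefix
    of x[i:] and y[j:]. Naive scan (assumption: stands in for a
    constant-time suffix-tree or suffix-array query)."""
    k = 0
    while i + k < len(x) and j + k < len(y) and x[i + k] == y[j + k]:
        k += 1
    return k


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y.

    Illustrative stand-in, not the paper's method: for each number d of
    insertions/deletions, track the furthest position reached on every
    diagonal k = i - j of the edit graph, extending each with one LCE query.
    """
    n, m = len(x), len(y)
    furthest = {1: 0}  # furthest[k] = largest i reached on diagonal k
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Arrive on diagonal k either from k + 1 (skip a character of y)
            # or from k - 1 (skip a character of x), whichever reaches further.
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                i = furthest[k + 1]
            else:
                i = furthest[k - 1] + 1
            j = i - k
            step = lce(x, y, i, j)  # one LCE query per visited node
            i, j = i + step, j + step
            furthest[k] = i
            if i >= n and j >= m:
                # d unmatched characters in total, the rest are matched pairs.
                return (n + m - d) // 2
    raise AssertionError("unreachable: d = n + m always suffices")


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))
```

When the two strings are similar, `d` stays small and only a narrow band of diagonals around the main diagonal is ever visited, which is the regime the abstract targets.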

Keywords

algorithm, string, edit distance, longest common subsequence, suffix tree, lowest common ancestor, suffix array, longest common extension, dynamic programming
