String Editing and Longest Common Subsequences

  • Alberto Apostolico
Chapter

Summary

The string editing problem for input strings x and y consists of transforming x into y by performing a series of weighted edit operations on x of overall minimum cost. An edit operation on x can be the deletion of a symbol from x, the insertion of a symbol in x or the substitution of a symbol of x with another symbol. String editing models a variety of problems arising in such diverse areas as text and speech processing, geology and, last but not least, molecular biology. Special cases of string editing include the longest common subsequence problem, local alignment and similarity searching in DNA and protein sequences, and approximate string searching. We describe serial and parallel algorithmic solutions for the problem and some of its basic variants.
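
As an illustration of the problem statement above, the following is a minimal sketch of the classical quadratic-time dynamic-programming approach to string editing, not the chapter's own algorithms. The function names `edit_distance` and `lcs_length` and the parameters `ins`, `dele` and `sub` are hypothetical. The sketch assumes one constant weight per operation type, whereas the general problem allows weights that depend on the symbols involved.

```python
def edit_distance(x, y, ins=1, dele=1, sub=1):
    """Minimum total cost of transforming x into y.

    Assumption: each operation type has a single constant weight
    (ins, dele, sub); matching symbols are copied at cost 0.
    """
    m, n = len(x), len(y)
    # prev[j] holds the cost of turning the current prefix of x into y[:j].
    # For the empty prefix of x, the only option is j insertions.
    prev = [j * ins for j in range(n + 1)]
    for i in range(1, m + 1):
        # Turning x[:i] into the empty string takes i deletions.
        curr = [i * dele] + [0] * n
        for j in range(1, n + 1):
            copy_or_sub = prev[j - 1] + (0 if x[i - 1] == y[j - 1] else sub)
            curr[j] = min(
                prev[j] + dele,     # delete x[i-1]
                curr[j - 1] + ins,  # insert y[j-1]
                copy_or_sub,        # keep or substitute x[i-1]
            )
        prev = curr
    return prev[n]


def lcs_length(x, y):
    """Length of a longest common subsequence, via string editing.

    With unit insertions and deletions and a substitution priced at 2
    (no cheaper than a deletion plus an insertion), the edit cost d
    satisfies d = len(x) + len(y) - 2 * LCS.
    """
    d = edit_distance(x, y, ins=1, dele=1, sub=2)
    return (len(x) + len(y) - d) // 2


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(lcs_length("ABCBDAB", "BDCABA"))
```

The sketch keeps only two rows of the table, so it uses space linear in the length of y, but it returns only the cost and not the edit script itself. The second function shows in what sense the longest common subsequence problem is a special case of string editing under a particular choice of weights.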

Copyright information

© Springer-Verlag Berlin Heidelberg 1997
