Greedy algorithms for the shortest common superstring that are asymptotically optimal

  • Alan Frieze
  • Wojciech Szpankowski
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1136)

Abstract

There has recently been a resurgence of interest in the shortest common superstring problem due to its important applications in molecular biology (e.g., recombination of DNA) and data compression. The problem is NP-hard, but it has been known for some time that greedy algorithms work well for this problem. More precisely, it was proved in a recent sequence of papers that in the worst case a greedy algorithm produces a superstring that is at most β times (2 ≤ β ≤ 4) worse than optimal. We analyze the problem in a probabilistic framework, and consider the optimal total overlap \(O_n^{opt}\) and the overlap \(O_n^{gr}\) produced by various greedy algorithms. These turn out to be asymptotically equivalent. We show that in several cases, with high probability, \(\lim_{n \to \infty} \frac{O_n^{opt}}{n \log n} = \lim_{n \to \infty} \frac{O_n^{gr}}{n \log n} = \frac{1}{H}\), where n is the number of original strings and H is the entropy of the underlying alphabet. Our results hold under a condition that the lengths of all strings are not too short. Finally, we provide several generalizations and extensions of our basic result.
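As an illustration of the quantities in the abstract, the following is a minimal sketch of one generic greedy strategy: repeatedly merge the pair of strings with the largest suffix-prefix overlap and accumulate the total overlap (the analogue of \(O_n^{gr}\)). The function names `overlap` and `greedy_superstring` are hypothetical, the quadratic search is chosen for readability, and the paper analyzes several greedy variants that may differ from this one.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Return (superstring, total_overlap) for a generic greedy merge.

    Illustrative assumption: ties are broken by first occurrence, and
    strings contained in other strings are discarded up front.
    """
    # Remove duplicates (keeping order) and strings contained in others.
    pool = list(dict.fromkeys(strings))
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]

    total = 0
    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Find the ordered pair (i, j) with the largest overlap.
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = pool[best_i] + pool[best_j][best_k:]
        total += best_k
        # Replace the two merged strings by their merge.
        pool = [s for idx, s in enumerate(pool)
                if idx not in (best_i, best_j)] + [merged]

    return (pool[0] if pool else ""), total


superstring, total = greedy_superstring(["abc", "bcd", "cde"])
print(superstring, total)
```

When no input string is a substring of another, the length of the resulting superstring equals the sum of the input lengths minus the total overlap, which is why maximizing overlap and minimizing superstring length are two views of the same objective.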

Keywords

Greedy Algorithm, Optimal Total, Suffix Tree, Bernoulli Model, Cycle Cover
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 1996

Authors and Affiliations

  • Alan Frieze (1)
  • Wojciech Szpankowski (2)
  1. Dept. of Mathematics, Carnegie Mellon University, Pittsburgh, USA
  2. Dept. of Computer Science, Purdue University, W. Lafayette, USA
