# Greedy algorithms for the shortest common superstring that are asymptotically optimal

## Abstract

There has recently been a resurgence of interest in the *shortest common superstring* problem due to its important applications in molecular biology (e.g., recombination of DNA) and data compression. The problem is NP-hard, but it has been known for some time that greedy algorithms work well for it. More precisely, a recent sequence of papers proved that, in the worst case, a greedy algorithm produces a superstring that is at most *β* times (2 ≤ *β* ≤ 4) worse than optimal. We analyze the problem in a probabilistic framework, and consider the optimal total overlap \(O_n^{opt}\) and the overlap \(O_n^{gr}\) produced by various greedy algorithms. These turn out to be asymptotically equivalent. We show that in several cases, with high probability, \(\lim_{n \to \infty} \frac{O_n^{opt}}{n \log n} = \lim_{n \to \infty} \frac{O_n^{gr}}{n \log n} = \frac{1}{H}\), where *n* is the number of original strings and *H* is the entropy of the underlying alphabet. Our results hold under the condition that the strings are not too short. Finally, we provide several generalizations and extensions of our basic result.
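To make the two overlap quantities concrete, here is a minimal Python sketch. It assumes one common greedy variant (repeatedly merge the pair of strings with the largest overlap) and that no input string is a substring of another. The abstract does not specify which greedy algorithms the paper analyzes, so the function names `overlap`, `greedy_superstring`, and `optimal_overlap` are illustrative and not taken from the paper. Under the substring-free assumption, the length of the resulting superstring equals the sum of the input lengths minus the total overlap.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest proper suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy sketch: repeatedly merge the pair with the largest overlap.

    Returns (superstring, total_overlap). Assumes no input string is a
    substring of another; ties go to the first pair found.
    """
    pool = list(dict.fromkeys(strings))  # drop exact duplicates, keep order
    if not pool:
        return "", 0
    total = 0
    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = pool[best_i] + pool[best_j][best_k:]
        total += best_k
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)
    return pool[0], total


def optimal_overlap(strings):
    """Brute-force maximum total overlap over all orderings (tiny inputs only)."""
    return max(
        sum(overlap(p[i], p[i + 1]) for i in range(len(p) - 1))
        for p in permutations(strings)
    )


if __name__ == "__main__":
    reads = ["abc", "bcd", "cde"]
    print(greedy_superstring(reads))  # ('abcde', 4)
    print(optimal_overlap(reads))     # 4
```

On a small adversarial input such as `["cabab", "baba", "ababc"]`, this greedy variant collects a total overlap of 4 while the brute-force optimum is 6. A gap of this kind is what the worst-case bounds address; the asymptotic equivalence stated in the abstract concerns random strings in the probabilistic model.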

## Keywords

Greedy Algorithm, Optimal Total, Suffix Tree, Bernoulli Model, Cycle Cover
