Abstract
We investigate the application of two trie-based data structures, suffix trees and suffix arrays, to the problem of overlap detection in fragment assembly. Both data structures are analyzed theoretically and experimentally with respect to speed and space. By using heuristics, we greatly reduce the number of calls to time-consuming dynamic programming, and have improved the speed of overlap detection by up to a factor of 1,000 with high accuracy in our collaborative DNA sequencing with Brookhaven National Laboratory. We also study the problem of approximating the maximum space savings in trie structures for unification factoring in logic programming, a problem that is proved to be hard.
Supported by ONR award 400x116yip01 and NSF Grant CCR-9625669.
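The idea of using a suffix array to cut down on dynamic programming calls can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the naive sorted-suffix construction, and the exact-seed filter with parameter `seed_len` are all assumptions made for illustration. The sketch indexes every suffix of every fragment, then looks up the first `seed_len` characters of each fragment to find places in other fragments where an overlap could begin. Only those candidate pairs would then be passed to an alignment routine.

```python
from bisect import bisect_left


def build_suffix_array(fragments):
    """Return sorted (suffix, fragment_index, offset) triples for all fragments.

    Naive construction chosen for clarity (hypothetical helper); a real
    assembler would use a space-efficient suffix array instead.
    """
    entries = []
    for idx, frag in enumerate(fragments):
        for off in range(len(frag)):
            entries.append((frag[off:], idx, off))
    entries.sort()
    return entries


def candidate_overlaps(fragments, seed_len):
    """Return sorted (a_index, offset_in_a, b_index) candidate triples.

    A triple means the first seed_len characters of fragment b occur
    exactly in fragment a at the given offset, so a suffix of a may
    overlap a prefix of b. Exact seeding is an assumed heuristic here.
    """
    sa = build_suffix_array(fragments)
    keys = [entry[0] for entry in sa]
    candidates = []
    for b_idx, b in enumerate(fragments):
        if len(b) < seed_len:
            continue
        seed = b[:seed_len]
        # All suffixes that start with the seed are contiguous in sorted order.
        pos = bisect_left(keys, seed)
        while pos < len(keys) and keys[pos].startswith(seed):
            _, a_idx, off = sa[pos]
            if a_idx != b_idx:
                candidates.append((a_idx, off, b_idx))
            pos += 1
    return sorted(candidates)


if __name__ == "__main__":
    reads = ["ACGTACGGA", "CGGATTTC", "TTTCAAAC"]
    for a_idx, off, b_idx in candidate_overlaps(reads, seed_len=4):
        print(f"read {a_idx} (from offset {off}) may overlap the start of read {b_idx}")
```

An exact seed misses overlaps whose first `seed_len` characters contain a sequencing error, so the choice of seed length trades sensitivity against the number of alignments that still have to be computed.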
© 1997 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, T., Skiena, S.S. (1997). Trie-based data structures for sequence assembly. In: Apostolico, A., Hein, J. (eds) Combinatorial Pattern Matching. CPM 1997. Lecture Notes in Computer Science, vol 1264. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-63220-4_61
DOI: https://doi.org/10.1007/3-540-63220-4_61
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63220-7
Online ISBN: 978-3-540-69214-0
eBook Packages: Springer Book Archive