Abstract
Despite recent advances, assembly of genomes from the high-throughput data generated by the next-generation sequencing (NGS) technologies remains one of the most challenging tasks in modern biology. Here we address the sequence reconstruction problem, whereby, for a given collection of subsequences or factors, one has to determine the set of sequences compliant with the collection. First, we give a brief review of sequencing technologies, along with an exposition of the advantages and shortcomings of the existing algorithmic approaches to sequence assembly. In addition, we enumerate some properties of subsequences, which have been overlooked in the existing heuristic solutions despite their effect on the quality of the assembly. We then give an overview of the sequence reconstruction problem from a language-theoretic perspective, and present a comprehensive review of theoretical results that may prove relevant to the genome assembly problem. Finally, we outline a new optimization-based formulation which casts the sequence reconstruction problem as a quadratic integer programming problem.
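To make the problem statement concrete, the following is a minimal brute-force sketch, not the quadratic integer programming formulation outlined in the chapter. It assumes that factors are contiguous substrings, that the target length is known in advance, and that the alphabet is small; the function name `compliant_sequences` and its parameters are hypothetical and chosen only for illustration.

```python
from itertools import product


def compliant_sequences(factors, length, alphabet="ACGT"):
    """Return every string of the given length over `alphabet` that
    contains each element of `factors` as a contiguous substring.

    Exhaustive enumeration: the cost grows as len(alphabet) ** length,
    so this is only usable for toy inputs.
    """
    results = []
    for letters in product(alphabet, repeat=length):
        candidate = "".join(letters)
        # A candidate is compliant when every factor occurs in it.
        if all(factor in candidate for factor in factors):
            results.append(candidate)
    return results


# A collection of factors may determine the sequence uniquely ...
unique = compliant_sequences(["ACG", "CGT"], 4)
# ... or leave several compliant sequences.
ambiguous = compliant_sequences(["AB", "BA"], 3, alphabet="AB")
```

In this sketch the first call yields a single compliant sequence, while the second yields two, which illustrates why the problem asks for the set of compliant sequences rather than a single answer.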
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Angeleska, A., Kleessen, S., Nikoloski, Z. (2014). The Sequence Reconstruction Problem. In: Jonoska, N., Saito, M. (eds) Discrete and Topological Models in Molecular Biology. Natural Computing Series. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40193-0_2
DOI: https://doi.org/10.1007/978-3-642-40193-0_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40192-3
Online ISBN: 978-3-642-40193-0
eBook Packages: Computer Science, Computer Science (R0)