The Sequence Reconstruction Problem

  • Chapter
Discrete and Topological Models in Molecular Biology

Part of the book series: Natural Computing Series ((NCS))

Abstract

Despite recent advances, assembling genomes from the high-throughput data generated by next-generation sequencing (NGS) technologies remains one of the most challenging tasks in modern biology. Here we address the sequence reconstruction problem: given a collection of subsequences or factors, determine the set of sequences compliant with the collection. First, we give a brief review of sequencing technologies, along with an exposition of the advantages and shortcomings of the existing algorithmic approaches to sequence assembly. In addition, we enumerate some properties of subsequences that have been overlooked in the existing heuristic solutions despite their effect on the quality of the assembly. We then give an overview of the sequence reconstruction problem from a language-theoretic perspective, and present a comprehensive review of theoretical results that may prove relevant to the genome assembly problem. Finally, we outline a new optimization-based formulation that casts the sequence reconstruction problem as a quadratic integer programming problem.
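
As a rough illustration of the problem statement in the abstract, the sketch below enumerates every sequence compliant with a given collection of factors by brute force. It is not the chapter's method (the abstract describes a quadratic integer programming formulation, which is not reproduced here). It assumes that "factor" means a contiguous substring, that the target length is known in advance, and that the alphabet is {A, C, G, T}; the function name `compliant_sequences` is hypothetical.

```python
from itertools import product


def compliant_sequences(factors, length, alphabet="ACGT"):
    """Return every sequence of the given length that contains each factor.

    Assumption: a factor is a contiguous substring. This is exhaustive
    search over len(alphabet) ** length candidates, so it is only usable
    for toy inputs.
    """
    result = []
    for chars in product(alphabet, repeat=length):
        candidate = "".join(chars)
        # Keep the candidate only if every factor occurs in it.
        if all(factor in candidate for factor in factors):
            result.append(candidate)
    return result


# One compliant sequence: the factors overlap in only one way.
print(compliant_sequences(["ACG", "CGT"], 4))

# Two compliant sequences: the collection does not determine the order.
print(compliant_sequences(["AC", "CA"], 3))
```

The second call shows why the problem asks for a set of sequences rather than a single answer: a collection of factors can be consistent with more than one reconstruction.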

Author information

Corresponding author

Correspondence to Angela Angeleska.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Angeleska, A., Kleessen, S., Nikoloski, Z. (2014). The Sequence Reconstruction Problem. In: Jonoska, N., Saito, M. (eds) Discrete and Topological Models in Molecular Biology. Natural Computing Series. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40193-0_2

  • DOI: https://doi.org/10.1007/978-3-642-40193-0_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40192-3

  • Online ISBN: 978-3-642-40193-0

  • eBook Packages: Computer Science, Computer Science (R0)
