Abstract
Reconstructing a genome by assembling large amounts of raw sequencing data into long DNA fragments is a computationally challenging process. This chapter illustrates current genome sequencing technologies and assembly algorithms, using the tomato genome sequencing project as an example. Over the last decade, “Next Generation Sequencing” technologies have placed great emphasis on efficient library preparation, high throughput and long read length. These developments have driven the evolution of genome assembly from the greedy overlap-layout-consensus approaches used to assemble Sanger sequences to the de Bruijn graph and string graph approaches currently used to assemble the large volumes of data that these newer technologies produce. Nonetheless, many species still lack a high-quality, gold-standard genome sequence, as genome assembly is still far from a solved problem. Several approaches have been developed to estimate the quality of assembled genome sequences and to perform so-called genome finishing, a complicated and costly procedure for completing the unresolved regions of a genome. We expect that within this decade sequencing technologies will undergo another dramatic improvement, resulting in “Third Generation Sequencing” technologies with which chromosomes and genomes can be sequenced in their entirety with high accuracy. Plant breeding will benefit enormously from this development, which will give breeders the tools, data and understanding to design new traits and varieties from natural and induced genetic variation in an entirely rationalized and economical manner, far beyond current capabilities. The tomato genome described here was sequenced by an international collaboration, and its completion spanned almost a decade. The novel sequencing technologies that were invented and commercialized during the course of this effort resulted in the generation of multiple types of sequence datasets. This in turn required the development and application of state-of-the-art bioinformatics approaches to process the vast and varied datasets and produce a near-complete, high-quality genome assembly.
Acknowledgments
The WGP™ technology is protected by patents and patent applications owned by Keygene N.V. WGP is a trademark of Keygene N.V.
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Tang, J., Datema, E., Janssen, A., van Ham, R.C.H.J. (2016). Assembly and Application to the Tomato Genome. In: Causse, M., Giovannoni, J., Bouzayen, M., Zouine, M. (eds) The Tomato Genome. Compendium of Plant Genomes. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-53389-5_8
DOI: https://doi.org/10.1007/978-3-662-53389-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-53387-1
Online ISBN: 978-3-662-53389-5
eBook Packages: Biomedical and Life Sciences (R0)