Assembly and Application to the Tomato Genome

Tang, Jifeng; Datema, Erwin; Janssen, Antoine; van Ham, Roeland C. H. J.

doi:10.1007/978-3-662-53389-5_8

Part of the book series: Compendium of Plant Genomes ((CPG))

Abstract

The computational process of reconstructing a genome by assembling large amounts of raw sequencing data into long DNA fragments poses great challenges. This chapter illustrates current genome sequencing technologies and assembly algorithms using the tomato genome sequencing project as an example. Over the last decade, “Next Generation Sequencing” technologies have placed great emphasis on efficient library preparation, high throughput and long read length. These developments have driven the evolution of genome assembly from the greedy overlap-layout-consensus approaches that were used to assemble Sanger sequences to the de Bruijn graph and string graph approaches that are currently used to assemble the large volumes of data produced by the new technologies. Nonetheless, many species still lack a high-quality, gold-standard genome sequence, because genome assembly is far from a solved problem. Several approaches have been developed to estimate the quality of assembled genome sequences and to perform so-called genome finishing, a complicated and costly procedure to complete the unresolved regions of the genome. We expect that within this decade sequencing technologies will undergo another dramatic improvement, resulting in “Third Generation Sequencing” technologies with which chromosomes and genomes can be sequenced in their entirety with high accuracy. Plant breeding will benefit enormously from this development, which will give breeders the tools, data and understanding to design new traits and varieties from natural and induced genetic variation in an entirely rationalized and economical manner, far beyond current capabilities. The tomato genome described here was sequenced within an international collaboration, and its completion spanned almost a decade. The novel sequencing technologies that were invented and commercialized during the course of this effort resulted in the generation of multiple types of sequence datasets. This in turn required the development and application of state-of-the-art bioinformatics approaches to process the vast and varied datasets in order to produce a near-complete, high-quality genome assembly.
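
The de Bruijn graph approach mentioned above can be illustrated with a minimal sketch in Python. It is not the method used for the tomato genome and is not taken from the chapter; the function names, the toy reads and the choice of k = 4 are assumptions made purely for illustration. Each read is cut into overlapping k-mers, every k-mer becomes an edge between its prefix and suffix (k-1)-mers, and a contig is read off by following nodes that have exactly one distinct successor.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one distinct successor."""
    contig = start
    node = start
    seen = {start}
    while True:
        successors = set(graph.get(node, ()))
        if len(successors) != 1:
            break  # dead end or branch (e.g. a repeat): stop extending
        node = successors.pop()
        if node in seen:
            break  # cycle: stop rather than loop forever
        seen.add(node)
        contig += node[-1]
    return contig


# Hypothetical, error-free toy reads that overlap by four bases.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_edges(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

Real assemblers additionally handle sequencing errors, reverse complements, uneven coverage and repeats, all of which this sketch deliberately ignores.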

Acknowledgments

The WGP™ technology is protected by patents and patent applications owned by Keygene N.V. WGP is a trademark of Keygene N.V.

Author information

Authors and Affiliations

Keygene N.V., Agro Business Park 90, 6708 PW, Wageningen, The Netherlands
Jifeng Tang, Erwin Datema, Antoine Janssen & Roeland C. H. J. van Ham

Corresponding author

Correspondence to Roeland C. H. J. van Ham.

Editor information

Editors and Affiliations

GAFL, INRA, Montfavet Cedex, France
Mathilde Causse
Boyce Thompson Institute for Plant Research, Cornell University, Ithaca, New York, USA
Jim Giovannoni
INRA-INP Toulouse, Castanet Tolosan, France
Mondher Bouzayen
INRA-INP Toulouse, Castanet Tolosan, France
Mohamed Zouine

About this chapter

Cite this chapter

Tang, J., Datema, E., Janssen, A., van Ham, R.C.H.J. (2016). Assembly and Application to the Tomato Genome. In: Causse, M., Giovannoni, J., Bouzayen, M., Zouine, M. (eds) The Tomato Genome. Compendium of Plant Genomes. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-53389-5_8

DOI: https://doi.org/10.1007/978-3-662-53389-5_8
Published: 24 November 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-53387-1
Online ISBN: 978-3-662-53389-5
eBook Packages: Biomedical and Life Sciences (R0)
