
Assembly and Data Quality

Phylogenomics

Abstract

This chapter describes methods for assembling sequence reads into larger pieces. In many cases, the raw data produced by sequencing machines are images, which are translated into sequence reads in a subsequent analysis step (base calling). Each position of a sequence read receives a quality score indicating the probability of a sequencing error. After quality filtering and trimming of adapter regions or barcoding indices, the reads can be assembled de novo into larger pieces. Three main types of assembly strategies are in use: greedy algorithms, overlap-layout-consensus assemblers and methods relying on k-mer graphs. Contiguous sequences built from overlapping reads are called contigs. Positional information from paired-end reads or mate pairs can be used to order contigs into scaffolds. In the ideal case of genome sequencing, the number of scaffolds would equal the expected number of chromosomes. Several statistics can be used to describe or compare sequence assemblies. In general, a range of programs and parameter settings should be explored to find the best assembly. Different strategies are used for genome, transcriptome and metagenome assemblies, and all of them benefit greatly from the inclusion of long reads. Assembly methods are becoming increasingly important for everybody working with sequence data, since the vast majority of published sequence data in NCBI GenBank is deposited as short reads in the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra/). These data are usually not directly searchable by methods like BLAST and need to be assembled for subsequent analysis.
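Two of the ideas in the abstract, per-base quality scores and summary statistics for comparing assemblies, can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not state: quality scores are taken to be on the Phred scale and encoded as ASCII with an offset of 33 (the Sanger FASTQ convention), and N50 is used as one commonly reported assembly statistic. The function names are hypothetical and chosen only for illustration.

```python
from typing import List


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q into an error probability.

    Assumes the Phred definition P = 10^(-Q/10).
    """
    return 10 ** (-q / 10)


def decode_fastq_qualities(quality_line: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into integer scores.

    Assumes Sanger-style encoding (ASCII code minus 33); other
    encodings would need a different offset.
    """
    return [ord(ch) - offset for ch in quality_line]


def n50(contig_lengths: List[int]) -> int:
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly
```

Under these assumptions, a quality score of 20 corresponds to an error probability of 1%, and calling `n50` on the contig or scaffold lengths of two assemblies of the same data gives one simple number for comparing their contiguity. It says nothing about correctness, which is one reason to consider several statistics together.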




Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Bleidorn, C. (2017). Assembly and Data Quality. In: Phylogenomics. Springer, Cham. https://doi.org/10.1007/978-3-319-54064-1_5
