DNA Sequence Assembly and Annotation of Genes

How to Generate the DNA Sequence and to Predict the Function of Genes
  • Henrik Christensen
  • Arshnee Moodley
Part of the Learning Materials in Biosciences book series (LMB)


This chapter describes the different DNA sequencing strategies and their pros and cons, to help you select the optimal strategy for your research question, and explains how to assemble and annotate DNA sequences. DNA sequencing is the determination of the order of nucleotides in parts of, or whole, chromosomes of organisms and viruses. It can be done for a single gene, a whole genome, or many genomes at a time, as in metagenomics. One of the most popular sequencing machines is the MiSeq from Illumina, which is capable of small whole-genome sequencing, transcriptomics, and 16S rRNA metagenomics. Samples can be multiplexed by using unique combinations of specific barcodes and indexes. Real-time, single-molecule sequencing allows sequencing of the native DNA, resulting in significantly longer read lengths, and the sequence information becomes available as the bases are incorporated, i.e., in real time. Base calling is the first step in sequencing, in which the electronic signal generated in the sequencing machine is separated from random noise and converted to nucleotide information. The nucleotide information then needs to be assembled into DNA sequences that resemble the original DNA as closely as possible. This can be done either de novo, without a reference, or with a reference if the genome of the organism or virus is well known. The most important quality parameter to consider is the coverage. Another important parameter is N50. Different assemblies can be compared with QUAST. The “minimum information about a genome sequence” (MIGS) specification provides an exhaustive list of the information required for genomic sequences, including requirements for metadata. Genome annotation is the identification and labeling of all the relevant features of the genomic sequence. At first, this includes the coordinates, given as nucleotide positions, where coding regions are predicted. It is mainly a prediction of coding genes; however, other structural genes such as rRNA are also identified.
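
The two assembly quality parameters named above can be illustrated with a minimal Python sketch. It assumes the commonly used definitions, which the abstract itself does not spell out: coverage as the total number of sequenced bases divided by the genome size, and N50 as the contig length at which contigs of that length or longer contain at least half of all assembled bases. The function names and the example numbers are hypothetical and chosen only for illustration.

```python
def mean_coverage(read_lengths, genome_size):
    """Average coverage: total sequenced bases divided by genome size (assumed definition)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def n50(contig_lengths):
    """Contig length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy data: seven contigs (in kb) and 1000 reads of 150 bases each.
contigs = [80, 70, 50, 40, 30, 20, 10]
reads = [150] * 1000

print(n50(contigs))                # 70: 80 + 70 = 150, half of the 300 kb total
print(mean_coverage(reads, 5000))  # 30.0: 150,000 bases over a 5,000 bp genome
```

In practice these values are reported by assembly evaluation tools such as QUAST; the sketch only shows what the numbers mean.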




Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Veterinary Animal Sciences, University of Copenhagen, Copenhagen, Denmark
