Abstract
Annotated sequence data are instrumental in nearly all realms of biology. However, the advent of next-generation sequencing has rapidly created an imbalance between accurate sequence data and accurate annotation data. To increase the annotation accuracy of the Caulobacter vibrioides CB13b1a (CB13) genome, we compared its PGAP and RAST annotations. A total of 64 unique genes were identified in the PGAP annotation that were either completely or partially absent from the RAST annotation, and 16 genes were identified in the RAST annotation that were not included in the PGAP annotation. Moreover, PGAP identified 73 frameshifted genes and 22 genes with an internal stop codon. In contrast, RAST annotated the larger segment of each of these frameshifted genes without indicating that a change in reading frame may have occurred. The RAST annotation did not include any genes with internal stop codons because it chose start codons located after the internal stop. To confirm the discrepancies between the two annotations and verify the accuracy of the CB13 genome sequence data, we re-sequenced and re-annotated the entire genome and obtained an identical sequence, except in a small number of homopolymer regions. Comparing the two versions of the genome sequence allowed us to determine the correct number of bases in each homopolymer region, which eliminated the frameshifts in 31 genes previously annotated as frameshifted and removed 24 pseudogenes from the PGAP annotation. Both annotation systems correctly identified genes that were missed by the other system. In addition, PGAP identified conserved gene fragments that represented the beginning of genes, but it employed no corrective method to adjust the reading frame of frameshifted genes or the start sites of genes harboring an internal stop codon. As a result, the PGAP annotation identified a large number of pseudogenes, which may reflect evolutionary history but likely do not produce gene products.
These results demonstrate that re-sequencing and annotation comparisons can be used to increase the accuracy of genomic data and the corresponding gene annotation.
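The kind of annotation comparison described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study. It assumes each annotation has already been reduced to gene coordinates, and it matches genes by stop-codon position and strand, so that two calls for the same gene with different start codons are reported separately from genes found by only one system. The `Gene` tuple, the function names, and the coordinates are all hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int   # leftmost coordinate (1-based), hypothetical data
    end: int     # rightmost coordinate (1-based)
    strand: str  # "+" or "-"


def stop_key(gene):
    """Identify a gene by the outer edge of its stop codon and its strand.

    On the forward strand the stop codon lies at the right end of the
    feature; on the reverse strand it lies at the left end.
    """
    position = gene.end if gene.strand == "+" else gene.start
    return (position, gene.strand)


def compare_annotations(first, second):
    """Return (only_in_first, only_in_second, same_stop_different_start)."""
    by_stop_first = {stop_key(g): g for g in first}
    by_stop_second = {stop_key(g): g for g in second}

    keys_first = by_stop_first.keys()
    keys_second = by_stop_second.keys()

    only_first = [by_stop_first[k] for k in sorted(keys_first - keys_second)]
    only_second = [by_stop_second[k] for k in sorted(keys_second - keys_first)]
    # Shared stop codon but different coordinates means the start sites differ.
    start_differs = [
        (by_stop_first[k], by_stop_second[k])
        for k in sorted(keys_first & keys_second)
        if by_stop_first[k] != by_stop_second[k]
    ]
    return only_first, only_second, start_differs


# Toy coordinates standing in for two annotations of the same genome.
pgap_genes = [Gene(100, 400, "+"), Gene(500, 900, "-"), Gene(1000, 1300, "+")]
rast_genes = [Gene(130, 400, "+"), Gene(500, 900, "-"), Gene(1500, 1800, "-")]

only_pgap, only_rast, start_differs = compare_annotations(pgap_genes, rast_genes)
print("Only in first annotation:", only_pgap)
print("Only in second annotation:", only_rast)
print("Same stop, different start:", start_differs)
```

Keying on the stop codon is a design choice for this sketch: annotation systems often disagree on start sites, so matching on the full coordinates would count every start-site disagreement as two unique genes.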
Funding
This work was funded in part by the National Institutes of Health Grant R25GM076277 to BE.
About this article
Cite this article
Berrios, L., Ely, B. Achieving Accurate Sequence and Annotation Data for Caulobacter vibrioides CB13. Curr Microbiol 75, 1642–1648 (2018). https://doi.org/10.1007/s00284-018-1572-3
DOI: https://doi.org/10.1007/s00284-018-1572-3