Comparative Genome Annotation

König, Stefanie; Romoth, Lars; Stanke, Mario

doi:10.1007/978-1-4939-7463-4_6

Part of the book series: Methods in Molecular Biology (MIMB, volume 1704)

Abstract

Newly sequenced genomes are being added to the tree of life at an unprecedented pace. Increasingly, such new genomes are phylogenetically close to previously sequenced and annotated genomes. In other cases, whole clades of closely related species or strains ought to be annotated simultaneously. In subsequent studies, the differences between the closely related species or strains are often the focus of research, while the shared gene structures prevail. Here we review methods for comparative structural genome annotation. The reviewed methods include classical approaches, such as the alignment of protein sequences or protein profiles against the genome, and comparative gene prediction methods that exploit a genome alignment to annotate a target genome. Newer approaches, such as the simultaneous annotation of multiple genomes, are also reviewed. We discuss how the methods depend on the phylogenetic placement of genomes, give advice on the choice of methods, and examine the consistency between gene structure annotations in an example. Further, we provide practical advice on genome annotation in general.

Acknowledgement

The research was supported by the German National Academic Foundation (to S.K.) and the German Research Foundation (DFG RTG 1870).

Author information

Authors and Affiliations

Institut für Mathematik und Informatik, Ernst Moritz Arndt Universität Greifswald, Greifswald, Germany
Stefanie König, Lars Romoth & Mario Stanke

Corresponding author

Correspondence to Mario Stanke.

Editor information

Editors and Affiliations

Department of Biochemistry, Institute of Chemistry, University of São Paulo, São Paulo, São Paulo, Brazil
João C. Setubal
Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
Jens Stoye
Bioinformatics Group, Department of Computer Science, Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany
Peter F. Stadler

About this protocol

Cite this protocol

König, S., Romoth, L., Stanke, M. (2018). Comparative Genome Annotation. In: Setubal, J., Stoye, J., Stadler, P. (eds) Comparative Genomics. Methods in Molecular Biology, vol 1704. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7463-4_6

DOI: https://doi.org/10.1007/978-1-4939-7463-4_6
Published: 26 December 2017
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-7461-0
Online ISBN: 978-1-4939-7463-4
eBook Packages: Springer Protocols
