Abstract
GeneValidator is a tool for determining whether the characteristics of newly predicted protein-coding genes are consistent with those of similar sequences in public databases. For this, it runs up to seven comparisons per gene. Results are shown in an HTML report containing summary statistics and graphical visualizations that aim to be useful for curators. Results are also presented in CSV and JSON formats for automated follow-up analysis.
Here, we describe common usage scenarios of GeneValidator that use the JSON output results together with standard UNIX tools. We demonstrate how GeneValidator’s textual output can be used to filter and subset large gene sets effectively. First, we explain how low-scoring gene models can be identified and extracted for manual curation—for example, as input for genome browsers or gene annotation tools. Second, we show how GeneValidator’s HTML report can be regenerated from a filtered subset of GeneValidator’s JSON output. Subsequently, we demonstrate how GeneValidator’s GUI can be used to complement manual curation efforts. Additionally, we explain how GeneValidator can be used to merge information from multiple annotations by automatically selecting the higher-scoring gene model at each common gene locus. Finally, we show how GeneValidator analyses can be optimized when using large BLAST databases.
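As an illustration of the filtering and merging scenarios described above, the following is a minimal Python sketch rather than the authors' own commands. It assumes the JSON output is an array of records that each carry an `overall_score` and a `definition` field; these field names, the sample data, the threshold of 75, and the `locus_of` helper are all assumptions made for this example and should be checked against real GeneValidator output.

```python
import json

# Assumed record layout: a JSON array of objects, each with an
# "overall_score" and a "definition" (sequence header). These field
# names are assumptions for illustration, not taken from the abstract.
SAMPLE = """[
  {"idx": 1, "definition": "gene_A", "overall_score": 90},
  {"idx": 2, "definition": "gene_B", "overall_score": 40},
  {"idx": 3, "definition": "gene_C", "overall_score": 70}
]"""


def low_scoring(records, threshold):
    """Return records whose overall score is strictly below the threshold."""
    return [r for r in records if r["overall_score"] < threshold]


def best_per_locus(records_a, records_b, locus_of):
    """For each locus, keep the higher-scoring model from two annotations.

    locus_of is a hypothetical caller-supplied function that maps a
    record to a locus key. On a tie, the model seen first is kept.
    """
    best = {}
    for record in list(records_a) + list(records_b):
        key = locus_of(record)
        if key not in best or record["overall_score"] > best[key]["overall_score"]:
            best[key] = record
    return best


records = json.loads(SAMPLE)
# The threshold is illustrative only; choose one suited to the gene set.
flagged = low_scoring(records, threshold=75)
print([r["definition"] for r in flagged])
```

The identifiers of the flagged records could then be used to pull the corresponding sequences for manual curation, and `best_per_locus` could be applied once models from two annotations have been assigned to shared loci by some external means.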
Change history
31 October 2019
This book was published with References 17 and 18 in the incorrect order.
References
Yandell M, Ence D (2012) A beginner’s guide to eukaryotic genome annotation. Nat Rev Genet 13:329–342
Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Ostell J, Pruitt KD et al (2018) GenBank. Nucleic Acids Res 46:D41–D47
Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12:491
Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M (2016) BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32:767–769
Keilwagen J, Hartung F, Paulini M, Twardziok SO, Grau J (2018) Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics 19:189
Schnoes AM, Brown SD, Dodevski I, Babbitt PC (2009) Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol 5:e1000605
Steijger T, Abril JF, Engström PG, Kokocinski F, RGASP Consortium, Hubbard TJ et al (2013) Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 10:1177–1184
Drăgan M-A, Moghul I, Priyam A, Bustos C, Wurm Y (2016) GeneValidator: identify problems with protein-coding gene predictions. Bioinformatics 32(10):1559–1561
The UniProt Consortium (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res 45:D158–D169
Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, The UniProt Consortium (2015) UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31:926–932
Buels R, Yao E, Diesh CM, Hayes RD, Munoz-Torres M, Helt G et al (2016) JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol 17:66
Lee E, Helt GA, Reese JT, Munoz-Torres MC, Childers CP, Buels RM et al (2013) Web Apollo: a web-based genomic annotation editing platform. Genome Biol 14:R93
Priyam A, Woodcroft BJ, Rai V, Munagala A, Moghul I, Ter F et al (2015) Sequenceserver: a modern graphical user interface for custom BLAST databases. bioRxiv. https://doi.org/10.1101/033142
Minoche AE, Dohm JC, Schneider J, Holtgräwe D, Viehöver P, Montfort M et al (2015) Exploiting single-molecule transcript sequencing for eukaryotic gene prediction. Genome Biol 16:549
Bethesda (MD): National Center for Biotechnology Information (2008) BLAST® Command Line Applications User Manual [Internet] - Limiting a Search with a List of Identifiers. https://www.ncbi.nlm.nih.gov/books/NBK279673. Accessed 13 Sept 2018
Wurm Y, Wang J, Riba-Grognuz O, Corona M, Nygaard S, Hunt BG et al (2011) The genome of the fire ant Solenopsis invicta. Proc Natl Acad Sci U S A 108(14):5679–5684
Shen W, Xiong J (2019) TaxonKit: a cross-platform and efficient NCBI taxonomy toolkit. bioRxiv. https://doi.org/10.1101/513523
Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods 12:59–60
Acknowledgments
This work was supported by the Natural Environment Research Council [grant NE/L00626X/1] and the Biotechnology and Biological Sciences Research Council [grants BB/K004204/1 and BB/M009513/1]. This research used Queen Mary’s Apocrita HPC facility, supported by QMUL Research-IT (https://doi.org/10.5281/zenodo.438045).
Copyright information
© 2019 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Moghul, I., Priyam, A., Wurm, Y. (2019). Choosing the Best Gene Predictions with GeneValidator. In: Kollmar, M. (eds) Gene Prediction. Methods in Molecular Biology, vol 1962. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9173-0_16
DOI: https://doi.org/10.1007/978-1-4939-9173-0_16
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-4939-9172-3
Online ISBN: 978-1-4939-9173-0
eBook Packages: Springer Protocols