Bioinformatics pp 121-136

Gene Annotation Methods

  • Laurens Wilming
  • Jennifer Harrow


Gene annotation used to refer to the prediction and annotation of a coding transcript on a region of the genome, but as the complexity of the functional features on the genome increases, users require prediction of noncoding RNAs, alternatively spliced transcripts, pseudogenes, and conserved elements. Eight years after the initial draft sequence of the human genome was published, the exact number of coding genes present on this sequence is still unclear. Since new sequencing technologies have reduced the cost of sequencing and dramatically increased the speed, we can expect an enormous expansion in the amount of available genomic and transcript sequence data. To gain insight into the functional information contained within these new sequences, the features within the sequence need to be accurately annotated.


Alternative Splice; Gene Annotation; Manual Annotation; Wellcome Trust Sanger Institute; Distribute Annotation System



This work was supported by the Wellcome Trust.



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Wellcome Trust Sanger Institute, Hinxton, UK
