
Gene Prediction

Evolutionary Genomics

Part of the book series: Methods in Molecular Biology (MIMB, volume 855)

Abstract

Evolutionary genomics is a field that relies heavily on comparing genomes, that is, comparing the full complement of genes of one species with that of another. However, given a genome sequence and little else, as is now often the case, genes must first be found and annotated before downstream analyses can be done. Because manual annotation is extremely time-consuming and costly, computational gene prediction techniques are brought to bear on the problem of constructing a genome annotation. This chapter reviews the methods by which the individual components of a typical gene structure are detected in genomic sequence and then discusses several popular statistical frameworks for integrated gene prediction on eukaryotic genome sequences.



Author information

Corresponding author

Correspondence to Tyler Alioto.

Copyright information

© 2012 Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Alioto, T. (2012). Gene Prediction. In: Anisimova, M. (eds) Evolutionary Genomics. Methods in Molecular Biology, vol 855. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-61779-582-4_6

  • DOI: https://doi.org/10.1007/978-1-61779-582-4_6

  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-61779-581-7

  • Online ISBN: 978-1-61779-582-4

  • eBook Packages: Springer Protocols
