Gene Prediction Methods



Most computational gene-finding methods in current use are derived from the fields of natural language processing and speech recognition. These fields are concerned with parsing spoken or written language into functional components such as nouns, verbs, and phrases of various types. The parsing task is governed by a set of syntax rules that dictate which linguistic elements may immediately follow each other in well-formed sentences – for example,
$$\text{subject} \rightarrow \text{verb},\quad \text{verb} \rightarrow \text{direct object},\quad \text{etc.}$$
The problem of gene-finding is similar to linguistic parsing in that we wish to partition a sequence of letters into elements of biological relevance, such as exons, introns, and the intergenic regions separating genes. That is, we wish not only to find the genes but also to predict their internal exon-intron structure, so that the encoded protein(s) may be deduced. Figure 5.1 illustrates this internal structure for a typical gene.
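
The parsing analogy can be made concrete with a small sketch. The Python below is a toy illustration only, not the method of any gene finder discussed in this chapter. It assumes a hypothetical three-letter label alphabet (`N` for intergenic, `E` for exon, `I` for intron), an assumed set of adjacency rules named `ALLOWED`, and an assumed requirement that a labeling begins and ends in intergenic sequence. Real gene finders also model codon phase, splice-site signals, and scoring, none of which appear here.

```python
from itertools import groupby

# Hypothetical label alphabet: N = intergenic, E = exon, I = intron.
# Assumed "syntax rules": which element may immediately follow which.
ALLOWED = {
    ("N", "E"),  # a gene begins with an exon
    ("E", "I"),  # an exon may be followed by an intron
    ("I", "E"),  # an intron must be followed by an exon
    ("E", "N"),  # a gene ends after an exon
}


def segments(labels):
    """Collapse a per-base label string into (label, start, end) runs.

    Coordinates are 0-based and the end is exclusive.
    """
    runs = []
    pos = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        runs.append((label, pos, pos + length))
        pos += length
    return runs


def is_well_formed(labels):
    """Check a labeling against the assumed adjacency rules.

    Assumption: the labeling must start and end in intergenic sequence.
    """
    runs = segments(labels)
    if not runs or runs[0][0] != "N" or runs[-1][0] != "N":
        return False
    return all((a[0], b[0]) in ALLOWED for a, b in zip(runs, runs[1:]))


def spliced_exons(sequence, labels):
    """Concatenate the exon-labeled parts of the sequence, in order."""
    return "".join(
        sequence[start:end]
        for label, start, end in segments(labels)
        if label == "E"
    )


# A made-up 19-base sequence with two exons separated by one intron.
sequence = "ccATGGTgtaagCCTAAtt"
labels = "NN" + "EEEEE" + "IIIII" + "EEEEE" + "NN"

print(is_well_formed(labels))           # True
print(is_well_formed("NNIIEENN"))       # False: N may not be followed by I
print(spliced_exons(sequence, labels))  # ATGGTCCTAA
```

Once a well-formed labeling is available, the exon segments can be joined to give the sequence from which the encoded protein would be deduced. Choosing the best labeling among the many well-formed ones is the actual prediction problem, which this sketch does not address.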


Keywords: Input Sequence, Conditional Random Field, Target Genome, Related Genome, Compositional Bias



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Institute for Genome Sciences & Policy, Duke University, Durham, USA
