Finding Genes in Genome Sequence

  • Protocol

Part of the book series: Methods in Molecular Biology™ (MIMB, volume 452)

Abstract

Gene finding is the identification of stretches of DNA in a genomic sequence that encode biologically active products, such as proteins or functional non-coding RNAs. It is usually the first step in the analysis of any novel piece of genomic sequence, and its accuracy matters because all downstream analyses depend on the results. This chapter focuses on the biological basis, computational approaches, and corresponding programs available for the automated identification of protein-coding genes. It describes the state of the art in automated gene finding for prokaryotic and eukaryotic genomes, as well as for the novel multi-species sequence data originating from environmental community studies.
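
As a rough illustration of what identifying protein-coding stretches can mean at the simplest level, the sketch below scans all six reading frames for open reading frames (ORFs). It is a naive baseline written for this summary, not a method taken from the chapter: it assumes the standard start codon ATG and the stop codons TAA, TAG, and TGA, ignores introns, alternative start codons, and codon-usage statistics, and the names `find_orfs`, `reverse_complement`, and `min_codons` are hypothetical.

```python
# Naive six-frame ORF scan (illustrative only; real gene finders use
# statistical models of coding sequence, not just start/stop codons).

STOP_CODONS = {"TAA", "TAG", "TGA"}   # assumption: standard genetic code
START_CODON = "ATG"                   # assumption: canonical start only
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) tuples for complete ORFs.

    Coordinates are 0-based, end-exclusive, and refer to the strand that
    was scanned ("+" is the input, "-" is its reverse complement).
    The length threshold counts codons including the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if (end - start) // 3 >= min_codons:
                        orfs.append((strand, start, end, s[start:end]))
                    start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGCC"):
        print(orf)
```

In practice `min_codons` would be set far higher than in this toy call, since short ORFs occur frequently by chance; telling such chance ORFs apart from real genes is the problem that the automated programs covered in the chapter are meant to address.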

Acknowledgments

The author thanks Lutz Krause, Alan Grossfield, and Augustine Tsai for their comments.

Copyright information

© 2008 Humana Press, a part of Springer Science+Business Media, LLC

About this protocol

Cite this protocol

McHardy, A.C. (2008). Finding Genes in Genome Sequence. In: Keith, J.M. (ed.) Bioinformatics. Methods in Molecular Biology™, vol 452. Humana Press. https://doi.org/10.1007/978-1-60327-159-2_8

  • DOI: https://doi.org/10.1007/978-1-60327-159-2_8

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-58829-707-5

  • Online ISBN: 978-1-60327-159-2

  • eBook Packages: Springer Protocols
