An Integrated Machine-Learning Model to Predict Prokaryotic Essential Genes


Part of the book series: Methods in Molecular Biology ((MIMB,volume 1279))

Abstract

Essential genes are indispensable for the target organism’s survival. Large-scale identification and characterization of essential genes has proven beneficial in both fundamental biology and medicine. Existing genome-scale experimental screens for essential genes are time-consuming and costly, and sometimes yield erroneous essential-gene annotations. To circumvent these difficulties, many research groups have turned to computational approaches as an alternative way to identify essential genes. Here, we developed an integrative machine-learning-based statistical framework to accurately predict essential genes in microorganisms. First, we extracted a variety of relevant features derived from different aspects of an organism’s genomic sequences. Then we selected a subset of features with high predictive power for gene essentiality through a carefully designed feature selection system. Using the selected features as input, we constructed an ensemble classifier and trained the model on a well-studied microorganism. After fine-tuning the model parameters in cross-validation, we tested the model on another microorganism. We found that tenfold cross-validation within the same organism achieves high predictive accuracy (AUC ~0.9), and that cross-organism predictions between distantly related organisms yield AUC scores from 0.69 to 0.89, significantly outperforming homology mapping.
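
The evaluation reported above rests on two generic pieces: the AUC score and a tenfold split of the genes. A minimal sketch of both follows. It is an illustration only, not the chapter's implementation: the function names `roc_auc` and `tenfold_indices`, the 1/0 label encoding, and the round-robin fold assignment are all assumptions made here, and the abstract does not specify the features or the ensemble classifier that would produce the scores.

```python
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for essential, 0 for nonessential (hypothetical encoding).
    scores: classifier outputs, higher meaning "more likely essential".
    Tied scores receive their average rank.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one gene of each class")

    # Rank all genes by score, ascending, averaging ranks within ties.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U statistic of the positive class, normalised to [0, 1].
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def tenfold_indices(n_items: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split item indices into k (train, test) pairs, assigned round-robin."""
    folds = [list(range(f, n_items, k)) for f in range(k)]
    return [
        ([i for g, fold in enumerate(folds) if g != f for i in fold], folds[f])
        for f in range(k)
    ]
```

In a within-organism setting, each of the ten test folds would be scored by a model trained on the other nine, and `roc_auc` applied to the pooled or per-fold scores. In a cross-organism setting, the model would be trained on all genes of one organism and `roc_auc` applied to its scores on the other. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of essential from nonessential genes.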

Author information

Correspondence to Jingyuan Deng.

Copyright information

© 2015 Springer Science+Business Media New York

About this protocol

Cite this protocol

Deng, J. (2015). An Integrated Machine-Learning Model to Predict Prokaryotic Essential Genes. In: Lu, L. (eds) Gene Essentiality. Methods in Molecular Biology, vol 1279. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-2398-4_9

  • DOI: https://doi.org/10.1007/978-1-4939-2398-4_9

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-2397-7

  • Online ISBN: 978-1-4939-2398-4

  • eBook Packages: Springer Protocols
