An Integrated Machine-Learning Model to Predict Prokaryotic Essential Genes

Deng, Jingyuan

doi:10.1007/978-1-4939-2398-4_9

Part of the book series: Methods in Molecular Biology (MIMB, volume 1279)

Abstract

Essential genes are indispensable for the target organism’s survival. Large-scale identification and characterization of essential genes has proven beneficial in both fundamental biology and medicine. Existing genome-scale experimental screens for essential genes are time-consuming and costly, and they sometimes yield erroneous essential-gene annotations. To circumvent these difficulties, many research groups have turned to computational approaches as an alternative way to identify essential genes. Here, we developed an integrative, machine-learning-based statistical framework to accurately predict essential genes in microorganisms. First, we extracted a variety of relevant features derived from different aspects of an organism’s genomic sequences. Then, through a carefully designed feature selection system, we selected a subset of features with high predictive power for gene essentiality. Using the selected features as input, we constructed an ensemble classifier and trained the model on a well-studied microorganism. After fine-tuning the model parameters in cross-validation, we tested the model on a different microorganism. We found that tenfold cross-validation within the same organism achieves high predictive accuracy (AUC ~0.9), and that cross-organism predictions between distantly related organisms yield AUC scores from 0.69 to 0.89, significantly outperforming homology mapping.
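
The workflow summarized in the abstract (select predictive features, build an ensemble, evaluate by tenfold cross-validation within one organism, then test on a second organism) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's implementation: the data are synthetic, every function name (make_genome, select_features, train_ensemble, predict, cross_val_auc) is hypothetical, feature selection is a simple ranking by single-feature AUC, and the "ensemble" is a plain average of standardized one-feature scorers standing in for whichever classifiers the chapter actually combines.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_genome(n, seed, effect=1.5):
    """Synthetic stand-in for one organism: 3 informative features whose mean
    shifts by `effect` for essential genes (label 1), plus 5 noise features."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.2 else 0
        informative = [rng.gauss(effect * label, 1.0) for _ in range(3)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(5)]
        X.append(informative + noise)
        y.append(label)
    return X, y

def select_features(X, y, k):
    """Assumed selection rule: keep the k features whose single-feature AUC
    lies farthest from 0.5."""
    n_feat = len(X[0])
    strength = [abs(auc(y, [row[j] for row in X]) - 0.5) for j in range(n_feat)]
    return sorted(range(n_feat), key=lambda j: -strength[j])[:k]

def train_ensemble(X, y, feats):
    """One weak scorer per selected feature: its direction (sign) and the
    training mean and standard deviation used to standardize it."""
    members = []
    for j in feats:
        col = [row[j] for row in X]
        sign = 1.0 if auc(y, col) >= 0.5 else -1.0
        mean = sum(col) / len(col)
        sd = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5 or 1.0
        members.append((j, sign, mean, sd))
    return members

def predict(members, X):
    """Ensemble score = average of the members' standardized, signed values."""
    return [sum(s * (row[j] - m) / sd for j, s, m, sd in members) / len(members)
            for row in X]

def cross_val_auc(X, y, k_feats, folds=10, seed=0):
    """k-fold cross-validation; features are re-selected inside each fold so
    the held-out genes never influence selection."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores = [0.0] * len(X)
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        model = train_ensemble(X_tr, y_tr, select_features(X_tr, y_tr, k_feats))
        for i, s in zip(test, predict(model, [X[i] for i in test])):
            scores[i] = s
    return auc(y, scores)

# Synthetic "training organism" and a second organism with a weaker signal.
X_a, y_a = make_genome(300, seed=1, effect=1.5)
X_b, y_b = make_genome(300, seed=2, effect=1.0)

print("tenfold CV AUC:", round(cross_val_auc(X_a, y_a, k_feats=3), 3))

model = train_ensemble(X_a, y_a, select_features(X_a, y_a, 3))
print("cross-organism AUC:", round(auc(y_b, predict(model, X_b)), 3))
```

Re-selecting features inside each fold is a deliberate choice in this sketch: selecting them once on all genes would leak information from the held-out fold and inflate the cross-validated AUC. The numbers printed here reflect only the synthetic data and say nothing about the results reported in the chapter.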

Author information

Authors and Affiliations

Division of Epidemiology and Biostatistics, Department of Environmental Health, Cincinnati Children’s Hospital, University of Cincinnati Medical Center, 3223 Eden Avenue, ML 56, Cincinnati, OH, 45267-0056, USA
Jingyuan Deng

Corresponding author

Correspondence to Jingyuan Deng.

Editor information

Editors and Affiliations

Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, USA
Long Jason Lu

About this protocol

Cite this protocol

Deng, J. (2015). An Integrated Machine-Learning Model to Predict Prokaryotic Essential Genes. In: Lu, L. (eds) Gene Essentiality. Methods in Molecular Biology, vol 1279. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-2398-4_9

DOI: https://doi.org/10.1007/978-1-4939-2398-4_9
Published: 10 January 2015
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-2397-7
Online ISBN: 978-1-4939-2398-4
eBook Packages: Springer Protocols
