Abstract
Essential genes are indispensable for the target organism’s survival. Large-scale identification and characterization of essential genes has proved beneficial in both fundamental biology and medicine. Existing genome-scale experimental screens for essential genes are time-consuming and costly, and they sometimes yield erroneous essential-gene annotations. To circumvent these difficulties, many research groups have turned to computational approaches as an alternative way to identify essential genes. Here, we developed an integrative, machine-learning-based statistical framework to accurately predict essential genes in microorganisms. First, we extracted a variety of relevant features derived from different aspects of an organism’s genomic sequences. Then, we selected a subset of features with high predictive power for gene essentiality through a carefully designed feature selection system. Using the selected features as input, we constructed an ensemble classifier and trained the model on a well-studied microorganism. After fine-tuning the model parameters in cross-validation, we tested the model on another microorganism. We found that tenfold cross-validation within the same organism achieved high predictive accuracy (AUC ~0.9), and that cross-organism predictions between distantly related organisms yielded AUC scores from 0.69 to 0.89, significantly outperforming homology mapping.
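The AUC values quoted above can be read as the probability that a randomly chosen essential gene receives a higher classifier score than a randomly chosen non-essential gene. The following is a minimal sketch of that calculation using the rank-sum (Mann-Whitney U) identity. It is not the chapter's own code: the function name `roc_auc`, the 1/0 label encoding, and the toy scores are assumptions made purely for illustration.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for essential, 0 for non-essential (hypothetical encoding).
    scores: classifier scores, higher meaning "more likely essential".
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    # Sort gene indices by score, then assign 1-based ranks,
    # giving tied scores the average of the ranks they span.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U statistic for the positive class, normalised by the number of
    # (positive, negative) pairs.
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy example: four genes, two labelled essential (made-up numbers).
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # 0.75
```

In the toy example, three of the four (essential, non-essential) pairs are ordered correctly, hence 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the reported range of 0.69 to 0.89.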
Copyright information
© 2015 Springer Science+Business Media New York
About this protocol
Cite this protocol
Deng, J. (2015). An Integrated Machine-Learning Model to Predict Prokaryotic Essential Genes. In: Lu, L. (eds) Gene Essentiality. Methods in Molecular Biology, vol 1279. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-2398-4_9
DOI: https://doi.org/10.1007/978-1-4939-2398-4_9
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-2397-7
Online ISBN: 978-1-4939-2398-4
eBook Packages: Springer Protocols