Abstract
It is increasingly evident that computational methods are needed to identify and map pathways in new genomes. We introduce an automatic annotation system (ARBA4Path: Association Rule-Based Annotator for Pathways) that uses rule mining techniques to predict metabolic pathways across a wide range of prokaryotes. It was demonstrated that specific combinations of protein domains (recorded in our rules) strongly determine the pathways in which proteins are involved, and thus provide information that lets us assign pathway membership very accurately (with a precision of 0.999 and a recall of 0.966) to proteins of a given prokaryotic taxon. Our system can be used to enhance the quality of automatically generated annotations as well as to annotate proteins of unknown function. The prediction models are represented as human-readable rules, and they can be used effectively to add missing pathway information to many proteins in the UniProtKB/TrEMBL database.
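To make the idea of domain-combination rules concrete, the following is a minimal sketch and not the ARBA4Path implementation. It enumerates small domain combinations by brute force (rather than with an Apriori-style miner), keeps those that meet a support and confidence threshold, applies the resulting rules to unseen proteins, and computes precision and recall. The domain labels (`DOM_A`, ...), pathway labels, thresholds, and function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (set of domain labels, pathway label) per protein.
TRAIN = [
    ({"DOM_A", "DOM_B"}, "pathway_1"),
    ({"DOM_A", "DOM_B"}, "pathway_1"),
    ({"DOM_A", "DOM_B", "DOM_C"}, "pathway_1"),
    ({"DOM_A"}, "pathway_2"),
    ({"DOM_B", "DOM_C"}, "pathway_2"),
    ({"DOM_C", "DOM_D"}, "pathway_2"),
]
TEST = [
    ({"DOM_A", "DOM_B"}, "pathway_1"),
    ({"DOM_A", "DOM_B", "DOM_D"}, "pathway_1"),
    ({"DOM_A"}, "pathway_1"),
    ({"DOM_C"}, "pathway_2"),
]


def mine_rules(records, min_support=2, min_confidence=1.0, max_size=2):
    """Return {frozenset(domains): pathway} for combinations meeting both thresholds."""
    body_counts = Counter()   # how many proteins contain the domain combination
    rule_counts = Counter()   # how many of those also carry a given pathway
    for domains, pathway in records:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(domains), size):
                body = frozenset(combo)
                body_counts[body] += 1
                rule_counts[(body, pathway)] += 1
    rules = {}
    for (body, pathway), hits in rule_counts.items():
        confidence = hits / body_counts[body]
        if hits >= min_support and confidence >= min_confidence:
            rules[body] = pathway
    return rules


def predict(domains, rules):
    """Return every pathway whose rule body is contained in the protein's domains."""
    return {pathway for body, pathway in rules.items() if body <= domains}


def precision_recall(records, rules):
    """Count true/false positives and false negatives over labeled proteins."""
    tp = fp = fn = 0
    for domains, true_pathway in records:
        predicted = predict(domains, rules)
        tp += true_pathway in predicted
        fp += len(predicted - {true_pathway})
        fn += true_pathway not in predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


rules = mine_rules(TRAIN)
for body, pathway in rules.items():
    print(sorted(body), "=>", pathway)
print(precision_recall(TEST, rules))
```

On this toy data neither `DOM_A` nor `DOM_B` alone predicts a pathway with full confidence, but their combination does, so the only rule retained is `{DOM_A, DOM_B} => pathway_1`. The rule never fires incorrectly on the test proteins (precision 1.0) but misses two of the four (recall 0.5), which illustrates the kind of precision/recall trade-off that strict confidence thresholds produce.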
*These authors contributed equally to this work.
Acknowledgments
The second author conducted this work as part of a research internship with the UniProt team at the European Bioinformatics Institute. Funding for this internship was provided by King Abdullah University of Science and Technology. The authors would also like to thank the UniProt Consortium for its valuable support and feedback on the development of this work.
© 2017 Springer Science+Business Media LLC
About this protocol
Cite this protocol
Saidi, R., Boudellioua, I., Martin, M.J., Solovyev, V. (2017). Rule Mining Techniques to Predict Prokaryotic Metabolic Pathways. In: Tatarinova, T., Nikolsky, Y. (eds) Biological Networks and Pathway Analysis. Methods in Molecular Biology, vol 1613. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7027-8_12
DOI: https://doi.org/10.1007/978-1-4939-7027-8_12
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-7025-4
Online ISBN: 978-1-4939-7027-8
eBook Packages: Springer Protocols