Abstract
The study of metabolic pathways is one of the most important fields in bioscience in the post-genomic era, but curating metabolic pathways requires considerable manual effort. As a result, databases contain few reliable, experimentally verified metabolic pathways and are forced to predict all but the most immediately useful ones.
Text mining has the potential to solve this problem. However, while sophisticated text-mining methods have been developed to assist the curation of many types of biomedical networks, such as protein–protein interaction networks, the mining of metabolic pathways from the literature has been largely neglected by the text-mining community. In this chapter, we describe a pipeline for the extraction of metabolic pathways built on freely available open-source components and a heuristic metabolic reaction extraction algorithm.
Abbreviations
- NER: Named entity recognition
- NLP: Natural language processing
- PPI: Protein–protein interaction
Copyright information
© 2017 Springer Science+Business Media New York
About this protocol
Cite this protocol
Czarnecki, J.M., Shepherd, A.J. (2017). Metabolic Pathway Mining. In: Keith, J. (eds) Bioinformatics. Methods in Molecular Biology, vol 1526. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6613-4_8
DOI: https://doi.org/10.1007/978-1-4939-6613-4_8
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-6611-0
Online ISBN: 978-1-4939-6613-4
eBook Packages: Springer Protocols