Metabolic Pathway Mining

Part of the book series: Methods in Molecular Biology ((MIMB,volume 1526))

Abstract

Understanding metabolic pathways is one of the most important fields in bioscience in the post-genomic era, but curating metabolic pathways requires considerable manpower. As a result, databases contain few reliable, experimentally verified metabolic pathways and are forced to predict all but the most immediately useful ones.

Text mining has the potential to solve this problem. However, while sophisticated text-mining methods have been developed to assist the curation of many types of biomedical networks, such as protein–protein interaction networks, the mining of metabolic pathways from the literature has been largely neglected by the text-mining community. In this chapter, we describe a pipeline for the extraction of metabolic pathways, built on freely available open-source components and a heuristic metabolic reaction extraction algorithm.

Abbreviations

NER: Named entity recognition

NLP: Natural language processing

PPI: Protein–protein interaction

Author information

Corresponding author

Correspondence to Adrian J. Shepherd.

Copyright information

© 2017 Springer Science+Business Media New York

About this protocol

Cite this protocol

Czarnecki, J.M., Shepherd, A.J. (2017). Metabolic Pathway Mining. In: Keith, J. (eds) Bioinformatics. Methods in Molecular Biology, vol 1526. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6613-4_8

  • DOI: https://doi.org/10.1007/978-1-4939-6613-4_8

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-6611-0

  • Online ISBN: 978-1-4939-6613-4

  • eBook Packages: Springer Protocols
