OREMPdb: a semantic dictionary of computational pathway models
The information coming from biomedical ontologies and computational pathway models is expanding continuously: research communities drive this growth, and their advances are generally shared through dedicated resources published on the web. Such models are shared to characterize molecular processes, while biomedical ontologies provide a semantic context for the majority of those pathways. Recent advances in both fields pave the way for scalable information integration based on aggregate knowledge repositories, but the lack of overall standard formats impedes this progress. Indeed, having different objectives and different abstraction levels, most of these resources "speak" different languages. Semantic web technologies are explored here as a means to address some of these problems.
Employing an extensible collection of interpreters, we developed OREMP (Ontology Reasoning Engine for Molecular Pathways), a system that abstracts the information from different resources and combines it into a coherent ontology. Continuing this effort, we present OREMPdb: once different pathways are fed into OREMP, species are linked to the external ontologies they refer to and to the reactions in which they participate. Exploiting these links, the system builds species-sets, which encapsulate species that operate together. Composing all of the reactions together, the system computes all of the reaction paths from and to all of the species-sets.
OREMP has been applied to the curated branch of BioModels (2011/04/15 release), which overall contains 326 models, 9244 reactions, and 5636 species. OREMPdb is the semantic dictionary created as a result, and it is made of 7360 species-sets. For each of these sets, OREMPdb provides a link to the original pathway and to the original paper where this information first appeared.
Keywords: Ordinary Differential Equation; Semantic Context; Systems Biology Markup Language; Biomedical Ontology; Model Repository
List of abbreviations used
API: Application Programming Interface
BioModels: European Bioinformatics Institute Biological Model Database
BioPortal: National Center for Biomedical Ontology Biology Portal
CellML: Cell Markup Language
ChEBI: Chemical Entities of Biological Interest
EcoCyc: Encyclopedia of Escherichia coli K-12 MG1655 Genes and Metabolism
EGF: Epidermal Growth Factor
EGFR: Epidermal Growth Factor Receptor
GAP: RAS p21 Protein Activator 1
HumanCyc: Encyclopedia of Homo sapiens Genes and Metabolism
KEGG: Kyoto Encyclopedia of Genes and Genomes
MAP kinase: Mitogen-Activated Protein kinase
MathML: Mathematical Markup Language
MIRIAM: Minimum Information Required In the Annotation of Models
MML: Mathematical Modeling Language
ODE: Ordinary Differential Equation
OREMPdb: Ontology Reasoning Engine for Molecular Pathways database
OREMP: Ontology Reasoning Engine for Molecular Pathways
OWL2: Web Ontology Language, version 2
RDF: Resource Description Framework
SBML: Systems Biology Markup Language
Shc: SHC Transforming Protein 2
UniProt: Universal Protein Resource
XML: Extensible Markup Language
The data access facility collects information about multiple pathways and existing biological databases.
A parser module reads different file formats (e.g., XML, RDF, SBML, CellML) and extracts relevant information.
The core module assembles the knowledge, parsed from different sources, into a coherent ontology (based on the meta-format, cf. Table 1).
The logic module annotates all of the species from a collection of reactions and performs automated comparisons, identification of common species, and duplicate reactions.
Table 1: Main components of the minimalistic quantitative ontology
Annotation: type:STRING, uri:STRING, information:STRING.
Species: name:STRING, internalId:STRING, initialValue:REAL, inPathway:PATHWAY, hooks:SET_OF_ANNOTATIONS.
Reaction: internalId:STRING, kinetics:FORMULA, kineticParameters:SET_OF_PARAMETERS, inPathway:PATHWAY, reactants:SET_OF_SPECIES, catalysts:SET_OF_SPECIES, products:SET_OF_SPECIES, hooks:SET_OF_ANNOTATIONS.
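To make the attribute lists of Table 1 more concrete, the following is a minimal sketch of how the three record types could be written as immutable Python data classes. The class names (Annotation, Species, Reaction), the Python types chosen for FORMULA, PATHWAY and SET_OF_PARAMETERS, and all example values are assumptions made for illustration; they are not taken from the OREMP implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Annotation:
    type: str          # kind of link to an external resource (assumed meaning)
    uri: str           # identifier of the external ontology term
    information: str = ""


@dataclass(frozen=True)
class Species:
    name: str
    internal_id: str
    initial_value: float
    in_pathway: str    # identifier of the originating model (PATHWAY)
    hooks: FrozenSet[Annotation] = frozenset()


@dataclass(frozen=True)
class Reaction:
    internal_id: str
    kinetics: str                                      # FORMULA kept as plain text here
    kinetic_parameters: Tuple[Tuple[str, float], ...]  # (name, value) pairs
    in_pathway: str
    reactants: FrozenSet[Species]
    catalysts: FrozenSet[Species]
    products: FrozenSet[Species]
    hooks: FrozenSet[Annotation] = frozenset()


# Hypothetical example: one binding reaction that remembers its source model.
egf_hook = Annotation(type="is", uri="urn:example:EGF", information="illustrative only")
egf = Species("EGF", "s1", 1.0, "model_A", frozenset({egf_hook}))
egfr = Species("EGFR", "s2", 2.0, "model_A")
cplx = Species("EGF-EGFR", "s3", 0.0, "model_A")
binding = Reaction(
    internal_id="r1",
    kinetics="k1*EGF*EGFR",
    kinetic_parameters=(("k1", 0.1),),
    in_pathway="model_A",
    reactants=frozenset({egf, egfr}),
    catalysts=frozenset(),
    products=frozenset({cplx}),
)
```

Because every record is frozen and hashable, species and reactions from many models can be placed in the same sets while each one still carries its `in_pathway` link.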
It is worth noting that different versions of each module can be used. Employing a plugin pattern, an internal algorithm chooses the proper component implementation according to the current task (e.g., to read an SBML file, the system invokes the SBML parser from its extensible list of parser modules). This means that whenever a new modeling format is introduced, a new parser (either provided as an API together with the format definition, or developed in house by the OREMP team) can be connected to OREMP to interface with it as well. Similarly, different users can define different versions of the core module, for example according to their understanding of how the knowledge coming from different pathways should be aggregated. This is of particular interest in domain-specific applications: different curators may consider different resources more valuable than others.
A key part of this approach is the designed meta-format. Around it, the information is collated and merged while preserving model identity: all reactions coming from all models are collated together but, despite this fusion, each reaction internally preserves the link to the original model file it belongs to (cf. the attribute inPathway in Table 1). This meta-format has been designed to embed the minimalistic and quantitative MIRIAM information derived from different pathways. Model annotations are preserved and extended with supplemental quantitative data (coming from model reaction kinetics, for instance, and exported onto the attribute kinetics in the meta-format) to achieve a common description that can be represented as a single ontology. The structure of this ontology is presented in Table 1.
In empirical models, as already noted for model repositories, the detection of duplicates is extremely important because, for instance, a duplicate reaction may lead to erroneous results. The duplicates are revealed to the user, allowing individuals to retain editorial power over their models. This also helps researchers understand how the models resulting from their work fit with models produced by others. The last computational step is the following:
The logic module computes the N-order species set-set reachability of all the reactions within the loaded and aligned models (connecting inter-model common species-sets and filtering, for instance, the duplicate reactions identified in the previous step, thus preventing the creation of a multi-graph).
The N-order reachability analysis (that is, N-order duplicate reaction detection) among species-sets performs a reaction composition analysis by constructing a matrix that represents a directed graph. Each vertex is a set of species and each edge is a reaction, so the graph abstracts the overall species-set connectivity. The graph does not become a multi-graph when duplicate reactions exist (first-order duplicates), because only one reaction is taken as the representative of each group of duplicates. Through this reachability computation, a dictionary of potentially equivalent reaction compositions is built: candidate paths that share the same starting and ending sets of species but involve alternative intermediate steps. Figure 3 presents a case where first-order (N = 1, R1 and R2) and N-order (R*) duplicate reaction paths overlap: the dashed arc means that R* traverses further species-sets in addition to X and Y.
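The construction of the species-set graph can be sketched as follows. This is a simplified illustration, not the OREMP algorithm: it assumes that two reactions are first-order duplicates when, after species alignment, they share exactly the same reactant set and product set, and it ignores catalysts and kinetics. The names `Rxn`, `group_by_endpoints` and `build_graph` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, NamedTuple, Tuple

SpeciesSet = FrozenSet[str]


class Rxn(NamedTuple):
    rid: str               # reaction identifier, prefixed with its source model
    reactants: SpeciesSet
    products: SpeciesSet


def group_by_endpoints(reactions: List[Rxn]) -> Dict[Tuple[SpeciesSet, SpeciesSet], List[str]]:
    """Group reaction ids that connect the same pair of species-sets."""
    groups: Dict[Tuple[SpeciesSet, SpeciesSet], List[str]] = defaultdict(list)
    for r in reactions:
        groups[(r.reactants, r.products)].append(r.rid)
    return dict(groups)


def build_graph(reactions: List[Rxn]) -> Dict[SpeciesSet, Dict[SpeciesSet, str]]:
    """Directed graph: vertex = species-set, edge = one representative reaction.

    Keeping a single representative per duplicate group means that two
    vertices are never joined by parallel edges (no multi-graph).
    """
    graph: Dict[SpeciesSet, Dict[SpeciesSet, str]] = {}
    for (src, dst), rids in group_by_endpoints(reactions).items():
        graph.setdefault(src, {})[dst] = rids[0]
        graph.setdefault(dst, {})
    return graph


# Toy input: R1 and R2 come from two models and both turn X into Y.
X, Y, Z = frozenset({"x"}), frozenset({"y"}), frozenset({"z"})
reactions = [
    Rxn("A:R1", X, Y),
    Rxn("B:R2", X, Y),
    Rxn("A:R3", X, Z),
    Rxn("A:R4", Z, Y),
]
duplicates = [rids for rids in group_by_endpoints(reactions).values() if len(rids) > 1]
graph = build_graph(reactions)
```

In this toy input, `duplicates` holds the group formed by A:R1 and B:R2, which could be reported to the user, while the path from X to Y through Z is the kind of N-order alternative that the reachability step would later record.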
The system has been tested in three real-world applications. (i) In a simple example, we demonstrated the system's ability to detect a first-order duplicate reaction in the EGFR model, which had been factored into parts that overlap in one reaction, producing differences in quantitative results. (ii) Cytosolve, a computational environment for parallel simulation of multiple pathways, embeds a version of the OREMP system, where it is assigned the task of identifying common molecular species and duplicated reactions with minimal human intervention. (iii) The last application is the combined analysis of the entire BioModels.net curated collection (currently 326 computational pathway models): OREMP presented an aggregated view of the collection and led to the identification of thousands of biologically equivalent reaction chains; in the same process, a dictionary of biological building blocks was extracted. It is worth noting that we chose the BioModels.net collection for the variety of models it includes and for its wide adoption; in principle, we could have imported two or more model databases, provided that the models they contain were properly annotated. By relying on automatic annotation of models, we might in the future be able to work even with models that are not annotated.
OREMP in combining pathways for parallel solution
This system is embedded in the latest release of Cytosolve, which can be accessed at http://cytosolve.mit.edu. The system's contribution to the integration of computational pathway models is the detection of duplicated reactions among different models (on the Cytosolve website, follow >Remote Solving, select or upload two or more models, then >Align Models). Whatever models are chosen for simulation, once the species are aligned, the system identifies duplication problems in the reaction models. From the user's point of view this process is transparent: the user receives a warning message that details the duplicated reactions and is prompted to confirm conflict elimination and to resolve any inconsistency in reaction kinetic rate constants. What follows is an example outline of the process, which starts at http://cytosolve.mit.edu and moves from isolated pathways to their coherent parallel solution, employing OREMP to detect reaction duplicates.
Cytosolve, step 1, Remote Solving: Multiple Simulation begins
Step 2, Select Models: models BIOMD..1 and BIOMD..2 are selected
Step 3, Align Models: OREMP points out the overlaps between the two models (Figure 4)
Step 4, Choose Initial Conditions: the user silences the reaction in conflict and possibly re-uploads BIOMD..1
Step 5, Simulate: the simulation takes place and the results are visualized (Figure 5)
The prototype can be executed on groups of arbitrary models from the Cytosolve homepage, repeating Steps 1-5 above and choosing Upload as the model source.
OREMP in querying large, independent sources of pathways
The system has been tested against the entire BioModels.net curated collection, which contains 326 computational pathway models (release of 2011/04/15, the latest official release at the time of writing). The result of the analysis is an overall view of the database and a list of about 500 groups of overlapping reactions, employing 7360 species-sets. This analysis took about one minute on a dual-core 2 GHz Intel CPU. The previously described knowledge-discovery step involving N-order reachability was applied to these resources as well: for each species configuration in the database, all alternative circuit paths were computed. This took about 2 hours on a quad-core 2 GHz AMD CPU and resulted in a dictionary of thousands of "biologically equivalent" circuits (i.e., equivalent reaction compositions). This dictionary, namely OREMPdb, is composed of:
An ordered dictionary of pathway building blocks
The list of equivalent reactions overall used
All of the potentially equivalent N-order reaction compositions
With this method, the observed edge/vertex ratio for the BioModels.net curated DB is 1.19, similar to that of other biological pathway databases: HumanCyc DB has a ratio of 1.01 and EcoCyc DB one of 1.25. These ratios suggest that the analysis we performed on BioModels.net reveals a connectivity density comparable with other biological pathway databases. A basic example of the pathway building blocks extracted from the BioModels DB follows; this example includes only one species in each species-set. In the context of another EGFR model (i.e., MAP kinase cascade activated by surface and internalized EGF receptors), as detailed in biomodel no. 19, the system detected that the EGF - EGFR 2 - GAP - Shc species can directly become EGF - EGFR 2 - GAP - Shc* or, alternatively, it can first become EGF - EGFRi 2 - GAP - Shc, then EGF - EGFRi 2 - GAP - Shc*, and finally EGF - EGFR 2 - GAP - Shc*. This is just one, very simple example of the N-order analysis. In general, given two species-sets, OREMPdb provides all of the alternative ways to traverse from one to the other; as stated in the coming section, Protégé is the ideal tool to browse and query OREMPdb and to perform this kind of query.
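The EGFR example above can be reproduced on a toy graph to show what an alternative-path query and the edge/vertex ratio look like. This is a sketch under assumptions: the four species-sets and four edges are transcribed from the sentence above (not from the full biomodel no. 19), the path search is a plain depth-first enumeration of simple paths with a length bound, and the function names are hypothetical.

```python
from typing import Dict, FrozenSet, List

SpeciesSet = FrozenSet[str]
Graph = Dict[SpeciesSet, List[SpeciesSet]]


def simple_paths(graph: Graph, start: SpeciesSet, goal: SpeciesSet, max_len: int) -> List[List[SpeciesSet]]:
    """Enumerate simple paths from start to goal that use at most max_len reactions."""
    found: List[List[SpeciesSet]] = []

    def walk(node: SpeciesSet, path: List[SpeciesSet]) -> None:
        if node == goal:
            found.append(path)
            return
        if len(path) - 1 >= max_len:      # number of edges already used
            return
        for nxt in graph.get(node, []):
            if nxt not in path:           # never revisit a species-set
                walk(nxt, path + [nxt])

    walk(start, [start])
    return found


def edge_vertex_ratio(graph: Graph) -> float:
    """Number of edges divided by number of distinct vertices."""
    vertices = set(graph) | {t for targets in graph.values() for t in targets}
    edges = sum(len(targets) for targets in graph.values())
    return edges / len(vertices)


# Each species-set holds a single species, as in the example in the text.
a = frozenset({"EGF-EGFR2-GAP-Shc"})
b = frozenset({"EGF-EGFR2-GAP-Shc*"})
c = frozenset({"EGF-EGFRi2-GAP-Shc"})
d = frozenset({"EGF-EGFRi2-GAP-Shc*"})
toy_graph: Graph = {a: [b, c], c: [d], d: [b], b: []}

paths = simple_paths(toy_graph, a, b, max_len=3)
ratio = edge_vertex_ratio(toy_graph)
```

Here `paths` contains the direct route and the three-step route through the internalized receptor, which is the pair of equivalent reaction compositions described above; the ratio of this toy graph has no relation to the 1.19 measured on the whole collection.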
Ontologies from pathways: practical advantages
From a logical point of view, the system is built on three layers. The bottom layer represents the original biochemical pathways, read in their primitive format (such as SBML and CellML). The second layer abstracts (through the workflow 1-6 detailed in §Methods) the pathways into a minimalistic and quantitative meta-format (sketched in Table 1) that includes MIRIAM components. Annotations are preserved and extended with additional quantitative data to achieve a common description that can be represented as a single ontology. It is at this level that the extended ontology is primarily created. Entities and relations created in this manner are homogeneous in the ontological sense. This implies that several collections of annotated pathways can be combined in OREMPdb while maintaining common semantics, meaning that the following advantages are achieved:
Despite disparate initial data formats, the biochemical information described in each pathway is now homogeneously represented in OWL2. This enables the direct reuse of components (such as species or reactions) coming from different sources.
The system ensures a consistent merging of the resources, automatically aligning the species and showing the end-user possible duplications among reactions in the different pathways.
Once the species alignment is done and duplicate reactions have been detected, the N-order reachability step is taken: for each reaction in each pathway, the set of "alternative circuits" is computed. This means that, given an arbitrary number of pathways, the system identifies all of the alternative ways to traverse from one species-set to another, employing all of the available reactions. In the last layer, all the information gathered is exported in OWL2, and Protégé is employed to visually edit, compare, and finalize the exported biochemical information. The Protégé query interface allows the user to formulate "semantically-enabled" queries that were impractical when dealing with previously heterogeneous, unaligned models (Figure 1).
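The species alignment that precedes the reachability step can be pictured with a small sketch. It assumes, for illustration only, that two species from different models are considered the same entity when they share at least one annotation URI; the function name `align_species`, the input layout, and the `urn:example:` identifiers are hypothetical and do not reflect the actual matching rules used by OREMP.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# models: model id -> species id -> set of annotation URIs attached to it
Models = Dict[str, Dict[str, Set[str]]]


def align_species(models: Models) -> Dict[str, List[Tuple[str, str]]]:
    """Return, per annotation URI, the species from different models that share it."""
    by_uri: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for model_id, species in models.items():
        for species_id, uris in species.items():
            for uri in uris:
                by_uri[uri].add((model_id, species_id))
    # Keep only URIs that link species coming from more than one model.
    return {
        uri: sorted(members)
        for uri, members in by_uri.items()
        if len({model_id for model_id, _ in members}) > 1
    }


models: Models = {
    "model_A": {"s1": {"urn:example:EGF"}, "s2": {"urn:example:EGFR"}},
    "model_B": {"egf": {"urn:example:EGF"}, "shc": {"urn:example:SHC"}},
}
aligned = align_species(models)
```

In this example only the EGF annotation is shared, so `aligned` links species s1 of model_A with species egf of model_B; species without a cross-model match are left untouched in their own models.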
We have described OREMP, but other tools are also available in the context of data integration. A major distinction has to be made between tools that filter out the dynamics of computational pathway models and those that are kinetics-oriented. This section details the comparison with a leading tool that is, like our work, kinetics-aware: SemanticSBML. SemanticSBML provides state-of-the-art tools to obtain a monolithic merged model starting from different molecular pathways. A key feature of the Cytosolve approach, by contrast, is that it does not produce a monolithic model. This preserves the curation process of independent models and allows independent research laboratories to continue investigating and improving their own models without being forced to prematurely publish an authoritative merged resource; the independent curation process is preserved by maintaining pathway identity, since the original element-pathway network is not destroyed by integration. In essence, this approach differs from SemanticSBML because it gives users the opportunity to apply their own understanding to define a consistent method of knowledge integration across ontologies. Another strength is that, once the system has read all 326 models from the BioModels.net curated collection and the pathway building block dictionary has been written (feeding step), end-users can accelerate their research by taking advantage of other modelers' efforts simply by consulting the OREMPdb dictionary. By specifying the initial and final sets of species, modelers can use the building block dictionary to see how others investigated and modeled a similar problem and how cross-pathway reactions could be composed to fit their needs.
The experiment detailed in the previous sections provided an overview of the BioModels.net collection and also led to the following result: from the perspective of those who curate collections of biochemical pathways, this framework can be used to find inconsistencies and redundancies within their repository, since the system highlights common building blocks shared among multiple models.
To our knowledge, this is the first time that the information coming from different biological data sources has been aggregated into a single quantitative ontology. The OREMP application can merge and combine several pathways, revert to the original pathways, inspect single-model details, and employ external biological annotations. The system is independent of the different file formats in which the pathways are written and contains an extensible collection of parser modules. These advantages are fully transferred to Cytosolve, which employs this system to ensure semantically correct parallel simulations. We selected OWL2 as the export format and adopted Protégé as the default "Data Warehouse" for information storage, retrieval and reasoning; there, OREMPdb provides a single semantic access point to the whole BioModels.net database of curated models (currently 326 models, release of 2011/04/15).
Funding and acknowledgments
This research was supported by a PhD student fellowship from the University of Calabria (RU), general support from the MIT-Singapore Alliance Program in Computational and Systems Biology, and by the University of Catania. We thank Beracah Yankama, Andrew Koo, Shiva Ayyadurai and Christina Pujol for their support and for their advice during the project development. We also thank the three reviewers who helped us improve the quality of our manuscript.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 4, 2012: Italian Society of Bioinformatics (BITS): Annual Meeting 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S4.
References
- 6. Croft D, O'Kelly G, Wu G, Haw R, Gillespie M, Matthews L, Caudy M, Garapati P, Gopinath G, Jassal B, Jupe S, Kalatskaya I, Mahajan S, May B, Ndegwa N, Schmidt E, Shamovsky V, Yung C, Birney E, Hermjakob H, D'Eustachio P, Stein L: Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res 2011, (39 Database):D691–7. [http://www.ncbi.nlm.nih.gov/pubmed/21067998]
- 7. Whetzel PL, Noy NF, Shah NH, Alexander PR, Nyulas C, Tudorache T, Musen MA: BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res 2011, (39 Web Server):W541–5. [http://www.ncbi.nlm.nih.gov/pubmed/21672956]
- 11. VCell The Virtual Cell: Virtual Cell repository. [http://www.nrcam.uchc.edu/vcell_models/published_models.html]
- 13. SBML Converters. [http://www.ebi.ac.uk/compneur-srv/sbml/converters/]
- 15. Krause F, Uhlendorf J, Lubitz T, Schulz M, Klipp E, Liebermeister W: Annotation and merging of SBML models with semanticSBML. Bioinformatics 2009, btp642. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp642v1]
- 17. Noy NF, Sintek M, Decker S, Crubezy M, Fergerson RW, Musen MA: Creating Semantic Web contents with Protege-2000. Intelligent Systems, IEEE [see also IEEE Intelligent Systems and Their Applications] 2001, 16(2):60–71. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=920601]
- 18. Mathematical Markup Language (MathML). [http://www.w3.org/TR/MathML/]
- 19. Le Novère N, Finney A, Hucka M, Bhalla US, Campagne F, Collado-Vides J, Crampin EJ, Halstead M, Klipp E, Mendes P, Nielsen P, Sauro H, Shapiro B, Snoep JL, Spence HD, Wanner BL: Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology 2005, 23(12):1509–1515. doi:10.1038/nbt1156
- 20. OWL 2 Web Ontology Language. [http://www.w3.org/TR/owl2-overview/]
- 21. Umeton R, Yankama B, Nicosia G, Dewey C: A Cross-format Framework for Consistent Information Integration among Molecular Pathways and Ontologies. In Proceedings of WCB 2010, 6th World Congress on Biomechanics, August 1 - 6, 2010, Singapore, Volume 31 of IFMBE Proceedings. Edited by: Magjarevic R, Lim CT, Goh JCH. Springer Berlin Heidelberg; 2010:1595–1598.
- 22. Biomodel no. 19. [http://www.ebi.ac.uk/biomodels-main/BIOMD0000000019]
- 24. Ayyadurai VAS, Dewey CF: CytoSolve: A Scalable Computational Method for Dynamic Integration of Multiple Molecular Pathway Models. Cellular and Molecular Bioengineering 2011, 4: 28–45. [http://www.springerlink.com/content/1t445r0h7jt77t83/] doi:10.1007/s12195-010-0143-x
- 25. Li C, Donizelli M, Rodriguez N, Dharuri H, Endler L, Chelliah V, Li L, He E, Henry A, Stefan MI, Snoep JL, Hucka M, Le Novère N, Laibe C: BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Syst Biol 2010, 4: 92. [http://www.ncbi.nlm.nih.gov/pubmed/20587024] doi:10.1186/1752-0509-4-92
- 26. HumanCyc Pathways DB. [http://www.humancyc.org]
- 27. EcoCyc Pathway DB. [http://www.ecocyc.org]
- 28. Wang H, He H, Yang J, Yu PS, Yu JX: Dual Labeling: Answering Graph Reachability Queries in Constant Time. In 22nd International Conference on Data Engineering. Los Alamitos, CA, USA: IEEE Computer Society; 2006:75.
- 29. Schoeberl B, Eichler-Jonsson C, Gilles ED, Müller G: Computational modeling of the dynamics of the MAP kinase cascade activated by surface and internalized EGF receptors. Nature Biotechnology 2002, 20(4):370–375. [http://www.ncbi.nlm.nih.gov/pubmed/11923843] doi:10.1038/nbt0402-370
- 31. Biomodel no. 49. [http://www.ebi.ac.uk/biomodels-main/BIOMD0000000049]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.