Systematic integration of experimental data and models in systems biology
The behaviour of biological systems can be deduced from their mathematical models. However, multiple sources of data in diverse forms are required in the construction of a model in order to define its components and their biochemical reactions, and corresponding parameters. Automating the assembly and use of systems biology models is dependent upon data integration processes involving the interoperation of data and analytical resources.
Taverna workflows have been developed for the automated assembly of quantitative parameterised metabolic networks in the Systems Biology Markup Language (SBML). The workflows build an SBML model systematically, starting with the construction of a qualitative network using data from a MIRIAM-compliant genome-scale model of yeast metabolism. This is followed by parameterisation of the SBML model with experimental data from two repositories, the SABIO-RK enzyme kinetics database and a database of quantitative experimental results. The models are then calibrated and simulated in workflows that call out to COPASIWS, the web service interface to the COPASI software application for analysing biochemical networks. These systems biology workflows were evaluated for their ability to construct a parameterised model of yeast glycolysis.
Distributed information about metabolic reactions that have been described to MIRIAM standards enables the automated assembly of quantitative systems biology models of metabolic networks based on user-defined criteria. Such data integration processes can be implemented as Taverna workflows to provide a rapid overview of the components and their relationships within a biochemical system.
Keywords: System Biology; Metabolic Network; System Biology Markup Language; Saccharomyces Genome Database; System Biology Model
Mathematical models are key in systems biology  where they typically describe the topology of biological networks, listing biochemical entities and their relationships with one another. The behaviour of a biological system can also be deduced from mathematical models. For example, simulations with a model of a metabolic network can predict how variables in the form of metabolite fluxes and concentrations are influenced by parameters such as an enzyme's maximum catalytic rate. Diverse types of data are required in the construction of mathematical models of biological systems and these are typically held in multiple sources. Information about the metabolites and enzymes involved in a reaction can be found in databases such as KEGG  and Reactome , as well as in spreadsheet files that have been used to disseminate re-constructed models of metabolism from a number of organisms [4, 5]. Curated information on metabolic enzymes and their kinetic properties can be found in various generic and model organism-specific databases including Uniprot , SABIO-RK  and the Saccharomyces Genome Database (SGD) . Details about metabolites such as their representation in SMILES or InChI formats are available from various databases including ChEBI  and PubChem .
The assembly of mathematical models of biological systems normally requires a combination of tools. For example, the process may begin by mapping the information for each biochemical reaction and its parameters from its source into a model design tool such as CellDesigner. Network analysis tools such as COPASI can then be used to calibrate parameters by fitting them to a set of experimental observations made from the biological system, so that simulations with the model respond more accurately. Like other data analysis processes in bioinformatics, the network construction, parameterisation and calibration of systems biology models are in silico experiments involving the interoperation of distributed information repositories and computational tools. In systems biology, these in silico experiments form an iterative series of model building and hypothesis-driven simulation processes which are employed to understand how biological systems function as a network of biochemical reactions. Such in silico experiments can be implemented as workflows consisting of a series of computational tasks performed on data, from access, integration and analysis through to the presentation and visualisation of the results. These data processes can be designed and enacted by workflow systems, such as Taverna, which manage the flow of data between computational resources [15, 16].
As with other sub-disciplines in the life sciences, a number of data standards in systems biology have been developed for exchanging information within the community. The Systems Biology Markup Language (SBML) is a format which is widely used to represent biochemical reactions in biological models . However, ambiguity in the use of identifiers and names signifying the same entities can impede the exchange and comparison of SBML models. This issue has been addressed by MIRIAM, a project to standardise the Minimal Information Requested In the Annotation of biochemical Models. The exchange of models is facilitated by following the guidelines set out by MIRIAM by annotating components with Uniform Resource Identifiers associated with recognised data types from controlled vocabularies and specific biochemical entities referenced by bioinformatics databases . The popularity of SBML has led to a need to communicate the results of operations performed on models. The Systems Biology Results Markup Language (SBRML) has been proposed as a format that complements SBML by specifying quantitative data in the context of a systems biology model . Several sets of SBRML data can be associated with a model each consisting of a series of values associated with model variables and their corresponding parameter values. SBRML provides a flexible way of indexing simulation results as well as experiment data that come in spreadsheet-like form or multidimensional data cubes to model parameter values according to a reference SBML model.
Adoption of, and compliance with, data standards in systems biology provide an opportunity for mathematical models to be constructed in an automated and systematic fashion. In this paper, we present a workflow strategy for systematically representing and managing the necessary data, and for automating data integration processes in the construction of mathematical models of metabolic networks which adhere to systems biology data standards. Yeast glycolysis is used as an example of a parameterised metabolic network that is constructed by these workflows. These workflows, which are also available for download for use with other biological systems, aggregate information from a number of online repositories which are used to disseminate data generated by the Manchester Centre for Integrative Systems Biology (MCISB). Furthermore, the calibration of parameterised models is undertaken by these workflows prior to their simulation using COPASIWS, a web service that provides a programmatic interface to COPASI.
Qualitative modelling of metabolic networks
A mathematical model of a biological system is dependent on information describing its components and their relationships with one another. Workflows were designed which, given a pathway term or a list of yeast enzymes identified by their open reading frame numbers, automatically retrieve the matching information from the yeast consensus network. This retrieved information includes reactant and product metabolites for reactions associated with a pathway term or a metabolic enzyme (Figure 3). The enactment of the workflow integrates this information and produces a qualitative model containing populated lists of compartments, species and reactions in an SBML file (Figure 2A). The procedure of integrating data into an SBML model uses methods from libSBML which have been exposed as workflow components in Taverna using the API consumer application (Figure 3). This workflow also retrieves various annotations for each compartment, species and reaction, which are incorporated into the SBML model so that its components are semantically annotated according to MIRIAM guidelines.
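The structure of such a qualitative model can be illustrated with a minimal sketch. It uses only Python's standard library instead of libSBML, and the function name, identifiers and the single aldolase reaction are hypothetical choices made for illustration; the output is an SBML-like skeleton that has not been validated against the SBML specification.

```python
import xml.etree.ElementTree as ET

# Namespace of SBML Level 2 Version 4 (the level/version is an assumption of this sketch).
SBML_NS = "http://www.sbml.org/sbml/level2/version4"

def build_qualitative_model(model_id, compartments, species, reactions):
    """Return an SBML-like XML string with populated component lists.

    compartments: list of compartment ids
    species:      {species_id: compartment_id}
    reactions:    {reaction_id: (reactant_ids, product_ids)}
    """
    sbml = ET.Element("sbml", {"xmlns": SBML_NS, "level": "2", "version": "4"})
    model = ET.SubElement(sbml, "model", {"id": model_id})

    comp_list = ET.SubElement(model, "listOfCompartments")
    for comp_id in compartments:
        ET.SubElement(comp_list, "compartment", {"id": comp_id})

    species_list = ET.SubElement(model, "listOfSpecies")
    for species_id, comp_id in species.items():
        ET.SubElement(species_list, "species",
                      {"id": species_id, "compartment": comp_id})

    reaction_list = ET.SubElement(model, "listOfReactions")
    for reaction_id, (reactants, products) in reactions.items():
        reaction = ET.SubElement(reaction_list, "reaction", {"id": reaction_id})
        for tag, members in (("listOfReactants", reactants),
                             ("listOfProducts", products)):
            group = ET.SubElement(reaction, tag)
            for species_id in members:
                ET.SubElement(group, "speciesReference", {"species": species_id})

    return ET.tostring(sbml, encoding="unicode")

# Hypothetical example: the aldolase reaction with illustrative identifiers.
xml_text = build_qualitative_model(
    "glycolysis_sketch",
    ["cytoplasm"],
    {"F16BP": "cytoplasm", "DHAP": "cytoplasm", "GAP": "cytoplasm"},
    {"FBA1": (["F16BP"], ["DHAP", "GAP"])},
)
print(xml_text)
```

No kinetic laws or concentrations appear at this stage; the model only records which components exist and how they are connected.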
Parameterisation of qualitative models
A qualitative SBML model has to be parameterised before it can be used in simulating the quantitative systems behaviour of the metabolic network. This requires quantification of the components in the model, as well as their relationships with one another, by parameterising the starting concentrations of metabolite and enzyme species, and their reaction kinetics. Since these data are stored in distributed databases, the process is reliant on integrating the model with quantitative data. To this end, a parameterisation workflow was developed to automate the mapping of proteomics and metabolomics measurements from the key results database onto the starting concentrations of the enzymes and source metabolites (Figures 2 and 4). The reactions catalysed by the enzymes are also parameterised in order to calculate the rate at which reactant metabolites are converted into products. Reactions are characterised by a kinetic law and associated parameters in SBML, and these are obtained by this workflow from the SABIO-RK database using its web service interface (Figures 2 and 4).
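The mapping of measurements onto starting concentrations can be sketched as a lookup by shared external identifier. The function name, identifier strings and concentration values below are hypothetical and stand in for annotations in the model and records in the key results database.

```python
def assign_initial_concentrations(species_annotations, measurements):
    """Map measured concentrations onto model species.

    species_annotations: {species_id: external_database_identifier}
    measurements:        {external_database_identifier: concentration}
    Returns (initial_concentrations, unmatched_species_ids).
    """
    initial, unmatched = {}, []
    for species_id, external_id in species_annotations.items():
        if external_id in measurements:
            initial[species_id] = measurements[external_id]
        else:
            # No measurement was found for this species; report it to the user.
            unmatched.append(species_id)
    return initial, unmatched

# Hypothetical identifiers and values, for illustration only.
annotations = {"F16BP": "db:ID-001", "DHAP": "db:ID-002", "GAP": "db:ID-003"}
measured = {"db:ID-001": 0.5, "db:ID-002": 0.25}

initial, unmatched = assign_initial_concentrations(annotations, measured)
print(initial)
print(unmatched)
```

Returning the unmatched species separately makes gaps in the experimental data visible instead of silently leaving concentrations unset.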
The key to integrating data between model and databases is the MIRIAM-compliant nature of the SBML model that was generated by the qualitative model construction workflow. Metabolite and enzyme species in the SBML model were labelled with identifiers from external databases such as Uniprot or ChEBI. This feature enabled the parameterisation workflow to integrate kinetics from SABIO-RK into the SBML model by querying the database with sets of reactant and product metabolites, and modifier enzymes as described by their database identifiers. In cases where there are multiple reaction instances associated with a given reaction, the parameterisation workflow allows the user to select which particular rate law and kinetics are inserted as part of the workflow. If kinetics could not be found, a mass action rate law is automatically inserted into the reaction in which its rate constants are set to one.
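The fallback to mass action kinetics can be sketched as follows. This is an illustration under stated assumptions, not the workflow's implementation: the lookup dictionary is a hypothetical stand-in for results returned by SABIO-RK, the reaction is treated as reversible, stoichiometric coefficients are ignored, and the parameter naming scheme is invented for the example.

```python
def mass_action_law(reaction_id, reactants, products, reversible=True):
    """Return (formula, parameters) for a mass action law with constants set to one."""
    forward = f"kf_{reaction_id}"
    parameters = {forward: 1.0}
    formula = " * ".join([forward] + list(reactants))
    if reversible:
        reverse = f"kr_{reaction_id}"
        parameters[reverse] = 1.0
        formula += " - " + " * ".join([reverse] + list(products))
    return formula, parameters

def choose_rate_law(reaction_id, reactants, products, kinetics_lookup):
    """Use stored kinetics when a matching reaction exists, else mass action."""
    key = (frozenset(reactants), frozenset(products))
    if key in kinetics_lookup:
        return kinetics_lookup[key]
    return mass_action_law(reaction_id, reactants, products)

# An empty lookup simulates the case in which no kinetics were found.
formula, parameters = choose_rate_law("FBA1", ["F16BP"], ["DHAP", "GAP"], {})
print(formula)      # kf_FBA1 * F16BP - kr_FBA1 * DHAP * GAP
print(parameters)   # {'kf_FBA1': 1.0, 'kr_FBA1': 1.0}
```

Keying the lookup on sets of reactant and product identifiers mirrors the idea of querying by database identifiers, so that the order in which metabolites are listed does not affect the match.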
Model calibration and simulation using COPASIWS
Prior to the use of the parameterised model in predictive studies, the accuracy of its simulations can be improved by calibration with measurements of variables obtained from real biological systems . This process of calibration modifies the parameters of a model until its output matches the given set of real biological measurements. To this end, a workflow was developed to calibrate an SBML model using the parameter estimation feature in COPASI. This feature, along with others in COPASI, has been exposed as web services by COPASIWS  (Figure 5A). Calibration of the model with this web service is an interactive process within the workflow, whereby the user defines which parameters in the model and within what range of values they are to be optimized. This was achieved in the workflow through the use of a pop-up window that guides the user through the calibration of the model. The experimental data used to fit the parameters in the SBML model were obtained from the database of key results. In order for parameter estimation to occur, there is a need for the COPASI web service to know how variables in the experimental data map onto entities in the SBML model. This was facilitated by transforming the experimental data into SBRML  using a utility web service (Figure 5A).
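The principle of calibration, adjusting a parameter within a user-defined range until the model output matches the measurements, can be illustrated with a small sketch. It assumes a one-parameter first-order decay model and a refined grid search over the sum of squared errors; both are simplifications chosen for illustration and do not represent the optimisation methods offered by COPASI.

```python
import math

def simulate(k, times, s0=1.0):
    """First-order decay S(t) = s0 * exp(-k * t), standing in for a model simulation."""
    return [s0 * math.exp(-k * t) for t in times]

def sum_squared_error(k, times, observed):
    """Objective function: squared distance between model output and measurements."""
    return sum((m - o) ** 2 for m, o in zip(simulate(k, times), observed))

def calibrate(times, observed, lower, upper, points=21, rounds=8):
    """Estimate k within [lower, upper] by repeatedly refining a grid search."""
    best = lower
    for _ in range(rounds):
        step = (upper - lower) / (points - 1)
        grid = [lower + i * step for i in range(points)]
        best = min(grid, key=lambda k: sum_squared_error(k, times, observed))
        # Narrow the search range around the best grid point, staying in bounds.
        lower, upper = max(lower, best - step), min(upper, best + step)
    return best

# Synthetic, noise-free "measurements" generated with k = 0.3.
times = [0.0, 1.0, 2.0, 4.0, 8.0]
observed = simulate(0.3, times)

estimate = calibrate(times, observed, lower=0.01, upper=2.0)
print(round(estimate, 4))
```

The lower and upper bounds play the role of the range of values that the user supplies for each parameter to be optimised.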
The resulting calibrated SBML model can be used in simulations for predicting the behaviour of metabolic networks. COPASIWS provides access to the simulation capabilities of COPASI. This was used in a workflow to derive and solve a series of coupled ordinary differential equations representing the reactions in an SBML model to predict the concentrations of metabolites at various time points (Figure 5B). The results are returned by COPASIWS in SBRML format and are presented graphically using R as part of the simulation workflow (Figures 2, 5B and 5C).
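How a time course follows from coupled ordinary differential equations can be sketched for a hypothetical two-step pathway A -> B -> C with mass action kinetics. The fixed-step fourth-order Runge-Kutta integrator below is an assumption made to keep the example self-contained; it is not the solver used by COPASI.

```python
def derivatives(state, k1=1.0, k2=1.0):
    """Rates of change for A -> B -> C with mass action kinetics."""
    a, b, c = state
    v1 = k1 * a  # flux of A -> B
    v2 = k2 * b  # flux of B -> C
    return (-v1, v1 - v2, v2)

def rk4_step(state, dt):
    """Advance the state by one fixed step of the classical Runge-Kutta method."""
    def shifted(base, slope, h):
        return tuple(x + h * s for x, s in zip(base, slope))

    s1 = derivatives(state)
    s2 = derivatives(shifted(state, s1, dt / 2))
    s3 = derivatives(shifted(state, s2, dt / 2))
    s4 = derivatives(shifted(state, s3, dt))
    return tuple(
        x + dt / 6 * (p + 2 * q + 2 * r + s)
        for x, p, q, r, s in zip(state, s1, s2, s3, s4)
    )

def simulate_time_course(initial, t_end, dt):
    """Return a list of (time, state) pairs from 0 to t_end."""
    n_steps = round(t_end / dt)
    course = [(0.0, initial)]
    state = initial
    for i in range(1, n_steps + 1):
        state = rk4_step(state, dt)
        course.append((i * dt, state))
    return course

course = simulate_time_course((1.0, 0.0, 0.0), t_end=1.0, dt=0.01)
final_time, (a, b, c) = course[-1]
print(final_time, a, b, c)
```

Because every unit of A that is consumed appears as B or C, the total concentration stays constant throughout the simulation, which is a convenient sanity check on the integration.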
The systems biology workflows shown in Figures 3, 4 and 5 were evaluated for their ability to generate a quantitative metabolic model of yeast glycolysis. This is a well-understood pathway [28, 29] which is being used within the MCISB to assess its different strategies for modelling metabolic systems. Proteomics and metabolomics measurements were made using coupled chromatography and mass spectrometry platforms from samples of Saccharomyces cerevisiae grown in continuous culture under turbidostat conditions in a defined minimal medium. The full data sets of proteomics and metabolomics measurements were stored in databases implementing PRIDE XML and MeMo, respectively. The final concentrations for the metabolites and enzymes in glycolysis were stored in the key results database (Figure 1). Kinetic measurements of two yeast glycolysis enzymes, aldolase (FBA1) and pyruvate decarboxylase (PDC1), were submitted to SABIO-RK for public dissemination.
List of enzymes used to generate a model of glycolysis using the qualitative model construction workflow.
Fructose 1,6-bisphosphate aldolase
Tetrameric phosphoglycerate mutase
Hexokinase isoenzyme 1
Major of three pyruvate decarboxylase isozymes
Alpha subunit of phosphofructokinase
Beta subunit of phosphofructokinase
3-phosphoglycerate kinase
Glyceraldehyde-3-phosphate dehydrogenase, isozyme 1
Triose phosphate isomerase
The construction of mathematical models of metabolic networks involving the integration of distributed data can be implemented as Taverna workflows. Automation of these processes provides systematic support for model creation, parameterisation, calibration and simulation, and thus reduces errors or inconsistencies occurring from the manual mapping and tracking of data between information repositories and models. These workflows rely on reaction data which were provided by a community effort to develop a consensus network of metabolism in yeast which met established systems biology standards in the form of SBML and MIRIAM .
The construction of models is normally a lengthy and labour-intensive process requiring the manual input of data for each biochemical reaction. This is also true when use is made of applications such as CellDesigner and COPASI which support the modelling of biological systems. Parameterised models can be semi-automatically created using online tools such as SYCAMORE (Systems biology's Computational Analysis and Modeling Research Environment), based on the selection of a set of reactions from SABIO-RK, and can then be used in simulations. The way models are constructed in these tools differs from our workflows, which remove the need for manual entry of data by automatically building an SBML model based on criteria provided by the user, such as a list of metabolic enzymes (Figure 3). The resulting SBML model is annotated according to MIRIAM guidelines and this makes it possible for kinetics from SABIO-RK to be systematically integrated into SBML models by the parameterisation workflow (Figure 4). These SBML models provide a starting point for the construction of mathematical models for biological systems, and adherence to standards means that the workflows can consume models developed using other approaches, and that the models produced can be consumed by existing tools.
Previously, the manual assembly of models in systems biology has been preferred due to issues with combining distributed data sources and tools. However, online and downloadable applications can integrate the use of tools and data. For example, the BioModels database can run simulations of the SBML models stored in it via an interface to JWS online. Models constructed using SYCAMORE can also be used in simulations by way of its interoperation with COPASI and ProMOT. A set of Java programs has also been developed by Radrich et al. (2010) to integrate data from KEGG and AraCyc to reconstruct qualitative genome-scale models of Arabidopsis thaliana. In addition, a Java application called MetaCrop has been developed by Weise et al. (2009) to reconstruct quantitative models of metabolic pathways for plants which can then be simulated using COPASI. Furthermore, a software tool called GRaPe can parameterise the kinetics of reactions and integrate gene expression and protein levels into models for simulation using the SBML ODE Solver in CellDesigner. The current work appears to be a novel application of computational workflows to the construction, parameterisation, calibration and simulation of metabolic models. The advantage of using workflows is the interoperability of tools and databases through the loose coupling offered by computational resources which have been deployed as web services. Moreover, workflows provide an explicit record of the steps involved in the construction and parameterisation of a model that can be shared with the systems biology community.
The enactment of a workflow by Taverna generates provenance, a record of the intermediate data that have been integrated into an SBML model, which is generally not captured during the manual construction of models. Using this provenance, we have examined the performance of our workflows. The execution times for both the qualitative modelling and parameterisation workflows were found to increase in a broadly linear fashion with increasing number of reactions (Additional file 3). Using glycolysis as a test case, the parameterisation workflow took the longest time to execute at 3 min 42 s, followed by the qualitative modelling workflow which took 44.9 s on average. The calibration workflow required approximately 22 s to complete, whilst the simulation workflow was the fastest to enact at 6 s. The parameterisation workflow is the bottleneck because a large number of queries have to be made to the SABIO-RK database in order to retrieve reaction identifiers for each metabolite and enzyme in every reaction in the qualitative SBML model. These reaction identifiers are then used to query SABIO-RK for reaction kinetics that can be mapped onto reactions in the qualitative SBML model.
Our system for implementing data integration processes as workflows highlighted various data integration issues in systems biology. For example, enzyme kinetics data were not available for every reaction even in a well-studied system such as yeast glycolysis. This required failsafe measures to be undertaken by the parameterisation workflow through the substitution of mass action kinetics in these reactions. Discrepancies were also found between the list of reactants and products in reactions from the consensus model of yeast metabolism compared with those in SABIO-RK. This appears to have arisen from the charge balancing of reactions in the consensus model which caused problems with integrating data from SABIO-RK in our workflows. Inconsistent referencing of metabolites with database identifiers between web services can also hinder the automatic assembly of models. This can lead to anomalous models being built which therefore requires the careful checking of results between each workflow enactment. Future work will enhance the current set of workflows. The criteria against which models can be constructed will be expanded to use, for example, terms from the Gene Ontology  so that models for specific biological processes can be generated. A set of workflows will also be developed for the validation of results from systems biology models by their comparison with experimental data.
Our computational resources and workflows are sufficiently generic that they can be applied to study the metabolic networks of other model organisms. Since there is a dependency of these workflows on reaction information described to MIRIAM standards , we have also been instrumental in promoting these efforts through the development of annotation tools  and the organisation of community annotation efforts . We are currently participating in ongoing work to deliver a consensus model of human metabolism by consolidation and curation of two existing models [5, 43]. It is hoped that the automation provided by these workflows can enable rapid construction and analysis of models in different organisms based on different sets of experimental data, thus enabling more comprehensive experimentation during model development, and more efficient reuse of experimental results.
Availability and Requirements
All workflows and accompanying documentation are available from http://www.mcisb.org/resources/taverna/sysbio and from myExperiment at http://www.myexperiment.org/packs/107. The Taverna workbench (version 2.2) can be downloaded from http://www.taverna.org.uk to run workflows which make use of a key results database available from http://beaconw.cs.manchester.ac.uk:8780/mcisbkrdb/ and SABIO-RK that is accessible at http://sabio.villa-bosch.de. The COPASI web service is available from http://www.comp-sys-bio.org/CopasiWS/.
PL and NS thank Wolfgang Mueller, Martin Golebiewski and Saqib Mir for providing technical support to SABIO-RK. PL and CAG would also like to acknowledge the support provided by Alan Williams, Alex Nenadic and Stian Soiland-Reyes on the Taverna workflow system. This work was funded by the Biotechnology and Biological Sciences Research Council, and the Engineering and Physical Sciences Research Council.
10. Bolton EE, Wang Y, Thiessen PA, Bryant SH: PubChem: Integrated Platform of Small Molecules and Biological Activities. In Annual Reports in Computational Chemistry. Volume 4. Edited by: Wheeler R, Spellmeyer D. Elsevier; 2008:217–241. doi:10.1016/S1574-1400(08)00012-1
16. Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T: Taverna: a tool for building and running workflows of services. Nucleic Acids Res 2006, (34 Web Server):W729-W732. doi:10.1093/nar/gkl320
17. Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A, et al.: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003, 19(4):524–531. doi:10.1093/bioinformatics/btg015
22. Herrgard M, Swainston N, Dobson P, Dunn W, Arga Y, Arvas M, Buthgen N, Borger S, Costenoble R, Heinemann M, et al.: A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nat Biotechnol 2008, 26(10):1155–1160. doi:10.1038/nbt1492
23. Dada JO, Mendes P: Design and Architecture of Web Services for Simulation of Biochemical Systems. In Data Integration in the Life Sciences 6th International Workshop, DILS 2009, Manchester, UK, July 20–22, 2009. Proceedings: 2009. Springer Berlin/Heidelberg; 2009:182–195.
24. Ihaka R, Gentleman R: R: A language for data analysis and graphics. J Comput Graph Stat 1996, 5: 299–314. doi:10.2307/1390807
28. Teusink B, Passarge J, Reijenga C, Esgalhado E, van der Weijden C, Schepper M, Walsh M, Bakker B, van Dam K, Westerhoff H, et al.: Can yeast glycolysis be understood in terms of in vitro kinetics of the constituent enzymes? Testing biochemistry. Eur J Biochem 2000, 267: 5313–5329. doi:10.1046/j.1432-1327.2000.01527.x
31. Hayes A, Zhang N, Wu J, Butler P, Hauser N, Hoheisel J, Lim F, Sharrocks A, Oliver S: Hybridization array technology coupled with chemostat culture: Tools to interrogate gene expression in Saccharomyces cerevisiae. Methods 2002, 26(3):281–290. doi:10.1016/S1046-2023(02)00032-4
34. Weidemann A, Richter S, Stein M, Sahle S, Gauges R, Gabdoulline R, Surovtsova I, Semmelrock N, Besson B, Rojas I, et al.: SYCAMORE - A SYstems biology Computational Analysis and MOdeling Research Environment. Bioinformatics 2008, 1463–1464. doi:10.1093/bioinformatics/btn207
35. Le Novere N, Bornstein B, Broicher A, Courtot M, Donizelli M, Dharuri H, Li L, Sauro H, Schilstra M, Shapiro B, et al.: BioModels Database: a free, centralized database of curated, published, quantitative kinetic models of biochemical and cellular systems. Nucleic Acids Res 2006, (34 Database):1362–4962.
38. Weise S, Colmsee C, Grafahrend-Belau E, Junker B, Klukas C, Lange M, Scholz U, Schreiber F: An Integration and Analysis Pipeline for Systems Biology in Crop Plant Metabolism. In Data Integration in the Life Sciences 6th International Workshop, DILS 2009, Manchester, UK, July 20–22, 2009. Proceedings. 2009. Springer-Verlag; 2009:196–203.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.