Performing statistical analyses on quantitative data in Taverna workflows: An example using R and maxdBrowse to identify differentially-expressed genes from microarray data
There has been a dramatic increase in the amount of quantitative data derived from the measurement of changes at different levels of biological complexity during the post-genomic era. However, there are a number of issues associated with the use of computational tools employed for the analysis of such data. For example, computational tools such as R and MATLAB require prior knowledge of their programming languages in order to implement statistical analyses on data. Combining two or more tools in an analysis may also be problematic since data may have to be manually copied and pasted between separate user interfaces for each tool. Furthermore, this transfer of data may require a reconciliation step in order for there to be interoperability between computational tools.
Developments in the Taverna workflow system have enabled pipelines to be constructed and enacted for generic and ad hoc analyses of quantitative data. Here, we present an example of such a workflow involving the statistical identification of differentially-expressed genes from microarray data followed by the annotation of their relationships to cellular processes. This workflow makes use of customised maxdBrowse web services, a system that allows Taverna to query and retrieve gene expression data from the maxdLoad2 microarray database. These data are then analysed by R to identify differentially-expressed genes using the Taverna RShell processor which has been developed for invoking this tool when it has been deployed as a service using the RServe library. In addition, the workflow uses Beanshell scripts to reconcile mismatches of data between services as well as to implement a form of user interaction for selecting subsets of microarray data for analysis as part of the workflow execution. A new plugin system in the Taverna software architecture is demonstrated by the use of renderers for displaying PDF files and CSV formatted data within the Taverna workbench.
Taverna can be used by data analysis experts as a generic tool for composing ad hoc analyses of quantitative data by combining the use of scripts written in the R programming language with tools exposed as services in workflows. When these workflows are shared with colleagues and the wider scientific community, they provide an approach for other scientists wanting to use tools such as R without having to learn the corresponding programming language to analyse their own data.
Keywords: Gene Ontology; Life Science Community; Taverna Workbench; Data Analysis Expert
The advent of the post-genomic era in biology has led to a dramatic increase in the amount of multi-dimensional, quantitative data that must be analysed by the bioinformatician. This is especially true in the case of genome-scale analyses of the transcriptome, proteome and metabolome, particularly when such measurements have been made in parallel using high-throughput technologies involving microarrays and mass spectrometry techniques [1, 2]. Analyses of these data rely on the performance of in silico experiments, involving the inductive detection of patterns in the data to which some phenotypic significance can be attributed. Such analyses usually rely on statistical testing and on linking the results of these tests with information stored in biological databases to summarise and develop conclusions. For example, the analysis of gene expression data generated from microarray experiments consists of a number of steps. The process begins with the normalization and standardization of transcript data, followed by statistical evaluation and, finally, interpretation of the statistical results via the annotation of genes with information relating to their biological function.
There are a number of issues associated with the use of computational tools in the analysis of quantitative data. Firstly, learning how to use such tools for statistical analyses can require significant time and effort. This is especially true for mathematical tools such as MATLAB and R, which require prior knowledge of their programming languages and the functions within them in order to implement statistical algorithms. Secondly, there is the overhead of transferring data between computational resources during each step of a data analysis pipeline, which is made more difficult by the inconsistent user interfaces of the tools. For example, a user may access R from the command line whilst querying online sequence databases through a web browser. Piping the output of one resource to another will therefore require intermediate staging of the data so that they may be passed manually between tools. Thirdly, the interoperability of computational tools can be awkward due to the heterogeneity of data in bioinformatics. The output data provided by a database service may be incompatible as input to the next analysis service both in terms of its structure and its semantics. In these cases, data have to be reconciled by a transformation step in order for them to be consumable by the next service.
In silico experiments on bioinformatics data may be realised as workflows consisting of a pre-defined series of tasks that are related to one another by the flow of data between them. Such workflows can be constructed and enacted using applications such as Kepler, Triana and Pegasus that automatically direct the flow of data between the information repositories and computational tools responsible for performing the tasks within an in silico experiment. These workflow systems enable the use of distributed resources which have been deployed using web services, a distributed computing architecture that uses existing Internet communication and data exchange standards to support interoperable application-to-application interaction over a network. Web service-enabled resources provide a web-based application programming interface (API) that is published in a machine-processable format such as Web Services Description Language (WSDL). Interaction of client applications with the web service is independent of the computing platform used to host the service resource. Other systems interact with the web service in a manner prescribed by its interface using messages which may be enclosed in a SOAP envelope and are typically conveyed on the web in the form of XML.
The myGrid project has developed a workflow system called Taverna, which is capable of invoking several types of local and online tools that can perform the various tasks of a constructed workflow. Different processor implementations are used to invoke applications depending on the type of invocation mechanism, including web services which are described in WSDL documents as well as those deployed using the Soaplab and BioMoby frameworks. Workflows consisting of these and other types of processors are composed in the Scufl workflow language using the Taverna workbench, typically by an expert user of analysis and data services. In this paper, we report on how the Taverna workflow system can be used for the statistical analysis of quantitative, post-genomic data. Using an example from the transcriptomics domain, we show a workflow which retrieves data using customised maxdBrowse web services from the maxdLoad2 microarray database. This workflow then performs statistical analysis on the gene expression data using R to generate a list of differentially-expressed genes, which is followed by the annotation of the genes with information stored in biological databases. Furthermore, we show how extra functionality can be incorporated into Taverna using a plugin mechanism that has been built into its new software architecture, thereby enabling it to be tailored for use in different scientific domains including transcriptomics.
Microarray data analysis workflow
Web service access to maxdBrowse
Use of Beanshell scripts for reconciling mismatches in data interoperability between services and implementing user interaction during workflow execution
It is often necessary to transform data during their transfer between services within workflows because of mismatches in their syntactic format. In such cases, manual intervention is required to convert the data into the correct form prior to their consumption by a service in order for the workflow to enact successfully. To this end, the ability to execute scripts written in the Beanshell programming language, an interpreted form of Java, was incorporated into a Taverna processor. Beanshell scripts can be written to transform data so that they are consumable by services within a workflow. Within the example microarray data analysis workflow, Beanshell scripts have been used to merge multiple sets of microarray data retrieved from the maxdLoad2 database prior to their analysis using R. A Beanshell script was also used to generate one of the final outputs of the workflow, which was in the form of a text file of comma-separated values (CSV) containing a list of the differentially-expressed genes identified from t-tests combined with functional information obtained from public databases (Fig. 1).
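The kind of merging shim described above can be sketched in plain Java, whose logic could be adapted to a Beanshell processor. This is an illustrative sketch only and is not the script used in the example workflow: the class name `MergeShim`, the tab-delimited layout with the gene identifier in the first column, and the sample values are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shim: merges two tab-delimited expression tables on the
// gene identifier held in the first column of each row.
final class MergeShim {

    // Parses "gene<TAB>v1<TAB>v2..." lines into gene -> values (kept as text).
    static Map<String, List<String>> parse(String table) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        for (String line : table.split("\n")) {
            if (line.trim().isEmpty()) continue;      // skip blank lines
            String[] cells = line.split("\t");
            List<String> values = new ArrayList<>();
            for (int i = 1; i < cells.length; i++) values.add(cells[i].trim());
            rows.put(cells[0].trim(), values);
        }
        return rows;
    }

    // Keeps only genes present in both tables and emits one CSV line per gene:
    // gene, values from the first table, values from the second table.
    static String mergeToCsv(String first, String second) {
        Map<String, List<String>> a = parse(first);
        Map<String, List<String>> b = parse(second);
        StringBuilder csv = new StringBuilder();
        for (Map.Entry<String, List<String>> row : a.entrySet()) {
            List<String> other = b.get(row.getKey());
            if (other == null) continue;              // gene missing from second table
            List<String> cells = new ArrayList<>();
            cells.add(row.getKey());
            cells.addAll(row.getValue());
            cells.addAll(other);
            csv.append(String.join(",", cells)).append("\n");
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        // Made-up identifiers and values, for illustration only.
        String slow = "YAL001C\t7.1\t7.3\nYAL002W\t5.0\t5.2\n";
        String fast = "YAL001C\t8.4\t8.6\nYAL003W\t6.1\t6.0\n";
        System.out.print(mergeToCsv(slow, fast));
    }
}
```

Dropping genes that are absent from either table is a design choice of this sketch; a real shim might instead pad missing values so that no rows are lost before the statistical step.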
An ad hoc, albeit basic, form of user interaction can also be implemented using Beanshell scripts through the use of Java Swing classes as part of a workflow enactment. This was used by the example workflow for the multiple selection of two sets of microarray data for t-test analysis by R (Fig. 1).
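A selection dialog of this kind might look like the following Java sketch, which uses standard Swing classes. It is not the script from the example workflow: the class name `PairChooser`, the dialog text and the data set labels are hypothetical, and the sketch simply skips the dialog when no display is available.

```java
import java.awt.GraphicsEnvironment;
import java.util.ArrayList;
import java.util.List;
import javax.swing.JList;
import javax.swing.JOptionPane;
import javax.swing.JScrollPane;
import javax.swing.ListSelectionModel;

// Hypothetical helper: asks the user to pick two data sets for comparison.
final class PairChooser {

    // Accepts a selection only if it names exactly two distinct data sets.
    static boolean isValidPair(List<String> selected) {
        return selected != null && selected.size() == 2
                && !selected.get(0).equals(selected.get(1));
    }

    // Shows a multiple-selection list in a modal dialog; returns the chosen
    // names, or an empty list if the dialog was cancelled.
    static List<String> ask(String[] dataSets) {
        JList<String> list = new JList<>(dataSets);
        list.setSelectionMode(ListSelectionModel.MULTIPLE_INTERVAL_SELECTION);
        int answer = JOptionPane.showConfirmDialog(null, new JScrollPane(list),
                "Select two data sets for the t-test", JOptionPane.OK_CANCEL_OPTION);
        if (answer != JOptionPane.OK_OPTION) return new ArrayList<>();
        return list.getSelectedValuesList();
    }

    public static void main(String[] args) {
        if (GraphicsEnvironment.isHeadless()) {
            System.out.println("No display available; skipping the dialog.");
            return;
        }
        // Made-up labels, for illustration only.
        String[] dataSets = {"C-limited, slow", "C-limited, fast", "N-limited, slow"};
        List<String> chosen = ask(dataSets);
        System.out.println(isValidPair(chosen)
                ? "Selected: " + chosen
                : "Please select exactly two data sets.");
    }
}
```

Keeping the validation in `isValidPair` separate from the dialog means the rule can be checked without opening a window.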
Invocation of R from Taverna workflows
Use of plugins for extending Taverna functionality
Both the Beanshell and RShell processors provide Taverna with access to generic tools for analysing different types of quantitative data. This is accomplished by these two processors without the need to develop and deploy web services, thereby enabling rapid prototyping without a web services infrastructure. Taverna can also be tailored for use in specific scientific domains by extending its functionality with additional source code through the use of software plugins. A mechanism for plugins has been incorporated into Taverna following a refactoring of its software architecture effected by the development of a tool called Raven. Originally inspired by Maven, Raven enables Taverna to be run according to information about the JAR libraries on which Taverna components depend; this information resides in an XML file called the Project Object Model (POM).
An organisation can deploy a plugin site for Taverna using two types of XML files. Firstly, an XML file is required which lists the available plugins. Secondly, each listed plugin requires an XML file containing information about its version, functionality and from where the POM file for the plugin can be downloaded. Based on the information in this POM, Raven will then download and install the associated and dependent JAR library files for the plugin for use in Taverna workflows. Further information about the development and deployment of plugins is available online in the developers guide on the Taverna web site http://www.mygrid.org.uk/usermanual1.7/dev_guide.html.
The example workflow shown in Figure 1 was evaluated using microarray data originating from a study into the effects of growth rate on the transcriptome of the yeast, Saccharomyces cerevisiae, when grown under different nutrient-limiting conditions (see Additional file 1: workflow.xml). The data from this study were normalized by GC Robust Multi-array Average background adjustment and then stored in the maxdLoad2 database. The enactment of the workflow started with the selection and retrieval of data from the microarray database using the maxdBrowse web service interface and Beanshell scripts. This was followed by the invocation of R with a script to perform a series of t-tests to identify the genes that are differentially-expressed between growth rates and culture conditions (Figs. 1 and 4). Common terms from the GO associated with these genes were then identified using the GOTermFinder web service. The GOTermFinder service returns PDF reports of these common GO terms which were browsed using the Taverna workbench with the PDFRenderer plugin (Fig. 5).
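To make the per-gene test concrete, the following Java sketch computes a two-sample t statistic and its approximate degrees of freedom for a single gene. It assumes the unequal-variance (Welch) form, which is the default of R's `t.test`; which variant the workflow's R script used is not stated here. The class name `WelchT` and the expression values are hypothetical, and the conversion of the statistic to a p-value, which R performs, is omitted.

```java
// Hypothetical sketch: Welch's two-sample t statistic for one gene.
final class WelchT {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // Unbiased sample variance (divides by n - 1).
    static double variance(double[] x) {
        double m = mean(x), ss = 0.0;
        for (double v : x) ss += (v - m) * (v - m);
        return ss / (x.length - 1);
    }

    // t = (mean(a) - mean(b)) / sqrt(var(a)/na + var(b)/nb)
    static double statistic(double[] a, double[] b) {
        double se2 = variance(a) / a.length + variance(b) / b.length;
        return (mean(a) - mean(b)) / Math.sqrt(se2);
    }

    // Welch-Satterthwaite approximation to the degrees of freedom.
    static double degreesOfFreedom(double[] a, double[] b) {
        double va = variance(a) / a.length, vb = variance(b) / b.length;
        return (va + vb) * (va + vb)
                / (va * va / (a.length - 1) + vb * vb / (b.length - 1));
    }

    public static void main(String[] args) {
        double[] slow = {1.0, 2.0, 3.0};   // made-up expression values
        double[] fast = {2.0, 4.0, 6.0};
        System.out.println("t  = " + statistic(slow, fast));
        System.out.println("df = " + degreesOfFreedom(slow, fast));
    }
}
```

In a workflow of the kind described, this calculation would be repeated for every gene, and the resulting p-values compared with the chosen significance threshold.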
Using the example workflow, the gene expression levels of yeast cultures growing under carbon limitation were selected for comparison between two different growth rates, in order to identify those genes whose expression differs at 0.07 h⁻¹ compared with 0.1 h⁻¹. When the genes identified from t-test analyses at a p-value of less than 0.05 were subjected to GO analysis by the GOTermFinder tool with a p-value at the 1% level, only broad GO categories relating to the regulation of metabolic processes were identified (see Additional file 2: carbon-ttest-0_01.zip). However, when the same set of genes was analysed by the GO tool with a less stringent threshold (p-value < 0.05), more GO terms appeared relevant, including those associated with gene expression (see Additional file 2: carbon-ttest-0_01.zip). The corresponding analysis of GO molecular functions for the same set of genes showed that some were involved in transcription regulation as well as encoding a number of protein kinases.
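The effect of relaxing the significance threshold can be illustrated with a short Java sketch that filters a set of terms by p-value at two cutoffs. The class name `ThresholdFilter`, the term labels and the p-values are invented for the example and are not results from this study.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep the terms whose p-value falls below a cutoff.
final class ThresholdFilter {

    static List<String> below(Map<String, Double> pValues, double cutoff) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Double> e : pValues.entrySet()) {
            if (e.getValue() < cutoff) kept.add(e.getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Double> p = new LinkedHashMap<>();   // made-up values
        p.put("term A", 0.004);
        p.put("term B", 0.03);
        p.put("term C", 0.2);
        System.out.println("cutoff 0.01: " + below(p, 0.01));
        System.out.println("cutoff 0.05: " + below(p, 0.05));
    }
}
```

Every term that passes the stricter cutoff also passes the looser one, so the 0.05 result is always a superset of the 0.01 result.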
The same analysis was applied to S. cerevisiae cultures growing at rates of 0.07 h⁻¹ and 0.10 h⁻¹ under nitrogen limitation by selecting the appropriate data stored in the maxdLoad2 database during the enactment of the example workflow. A different pattern from that of the carbon limitation was observed since t-test analysis (p-value < 0.05) and GO analyses (p-value < 0.05) found a higher number of genes whose products are associated with the metabolism of nitrogen compounds (see Additional file 3: nitrogen-ttest-0_05.zip). The most over-represented GO molecular function categories for this gene set corresponded to a variety of catalytic activities such as oxidoreductases, transmembrane transporter activities, and transferases.
The full set of results from the above analyses can be downloaded from http://www.mcisb.org/software/taverna/microarray/gcrma.zip.
Statistical analyses of quantitative data can be constructed and performed in a generic fashion using the Taverna workflow system. This was demonstrated by the t-test identification of differentially-expressed genes from microarray data using the workflow shown in Figure 1. Such generic analyses were made possible by a number of new features present in Taverna and in particular the development of the RShell processor, which acts as a client to R when deployed as a service using the RServe library. Standalone web services can offer specific types of statistical analysis such as clustering for data analysis. However, the development of the RShell processor means that any type of statistical analysis can be performed on data when implemented as a script in the R programming language within a Taverna workflow. This is of particular benefit to the transcriptomics domain since this processor provides Taverna with access to further tools based on R for analysing microarray data such as Bioconductor, which was used by Fisher and colleagues for the identification of candidate genes associated with trypanosomiasis in workflows. A number of tools such as oneChannelGUI and GenePublisher already provide users with access to R and Bioconductor from a graphical user interface. However, the advantage of coupling the R tool with an application like Taverna is that the workflow system provides an interoperability platform for data to be fetched from distributed services for subsequent analysis using R in a workflow. Conversely, the results of an R analysis can be sent for further processing by services downstream in the workflow. The use of distributed data and analysis services in conjunction with R was demonstrated in the example workflow (Fig. 1). Data analysed by R were provided by the maxdLoad2 microarray database using its web service interface provided by maxdBrowse, and the results of this were further analysed by the GOTermFinder tool.
The current work on analysing quantitative data extends that by Stevens and coworkers, who investigated the implementation of gene annotation pipelines as workflows in a previous version of Taverna. Since then, it has become possible to incorporate ad hoc functionality into Taverna workflows through the use of a processor which executes scripts written in the Beanshell language, an interpreted form of Java. Such scripts provide a method for implementing shim services which, in the example workflow, were used to reconcile data mismatches between services (Fig. 1). Beanshell scripts were also used to implement a basic form of user interaction using Java Swing classes to steer workflows during their enactment. In the example workflow, this technique was used to select different pair-wise combinations of classes for t-test analysis. This type of interaction is used by the workflow user to direct the workflow to analyse specific subsets of data which are not known until runtime.
Scripts written in the Beanshell and R languages provide generic functionality which can be used as tools to analyse specific types of quantitative data from different scientific domains. In addition, domain-specific functionality can be provided for use in scientific workflows through the plugin mechanism now present in Taverna's software architecture. The ability to incorporate plugins makes it easier for the scientific community to contribute and share functionality in Taverna without the need for their source code to be tightly coupled with the core. In the example workflow (Fig. 1), the plugin mechanism was demonstrated by the implementation of renderers for browsing PDF documents (Fig. 5) and the display of text files containing quantitative data in CSV format as tables. In the context of the current work, an opportunity for developing other plugins for analysing quantitative data might include the ability to invoke analyses in other applications such as MATLAB and Mathematica for those users who prefer these tools to perform their calculations.
An initial investment in time and effort is required to construct data analysis workflows. This was true in the case of the example microarray data analysis workflow since a working knowledge of the R programming language is required to devise the t-test analyses, as well as experience of Java programming for writing Beanshell scripts to implement user interaction and shim services. A lack of experience in these two languages can therefore prohibit the construction of complex data analysis workflows by entry-level users of Taverna. However, the fact that workflows can be saved as XML files in the Scufl language allows these analyses to be shared with colleagues and with the wider life sciences community. This is especially pertinent to multi-disciplinary research groups whose expert data analysts use R and other similar tools to develop data analysis protocols, which are then employed by colleagues who are laboratory scientists. These laboratory scientists understand the conceptual basis of the analysis performed by the R script but may not have the inclination (or the need) to learn the R programming language in order to understand how such scripts have implemented their analyses. Since it is often the case that there are many more producers of data than there are experts in data analysis within a research group, it is desirable for the laboratory scientists themselves to perform analyses of their own data by re-using R scripts written by the data analysis experts, which may have been incorporated into workflows. These workflows can therefore be regarded as standard operating procedures providing a best-practice solution for analysing specific types of data. The automation of analyses afforded by Taverna and other workflow software such as Kepler enables laboratory scientists to quickly test hypotheses on their data by guiding them through complex statistical analyses, especially when users can steer and interact with the enactment of a workflow. This enables a user to run multiple analyses with different parameterizations, which was required for the analysis of the microarray data in this study in order to extract information about what was biologically significant. This benefit provided by a workflow approach to users has been implemented by providers of commercial microarray data analysis software such as GeneSpring.
A number of issues associated with the use of web services and workflows, highlighted by the current microarray data analysis study, still need to be addressed. Firstly, there is the problem of securing analysis services such as that provided by the R server used in the example workflow. Such services may have been deployed on powerful compute clusters and so their use by unauthorised users within workflows is undesirable. Using the RServe library, the R server can be configured to allow access only for those users providing the appropriate username and password combination. However, this information is embedded within the Scufl file and so it is not suitable to share this file with anyone other than colleagues. Future work on the development of Taverna 2, involving the use of distributed agents to handle the security of services during workflow enactment, will address this problem. The sharing of workflows within the life sciences community is also an issue that is currently being addressed by the myExperiment project http://www.myexperiment.org/. It introduces the concept of social networking web sites for workflows and aims to provide a collaborative environment for scientists to safely publish their workflows with supporting documentation, and share them with a wider group of people, as well as addressing concerns relating to their attribution, credit and intellectual property on the web.
Statistical calculations implemented in the R programming language can be combined with the use of other distributed computational tools and databases in Taverna workflows for performing bespoke analyses on post-genomic data. In this fashion, the Taverna workflow system provides a generic tool for the analysis of quantitative data. Together with Beanshell scripts and specially designed plugins, workflows can be written for the analysis of data from different scientific domains by data analysis experts. These workflows act as standard operating procedures which guide users in the analysis of the data they have generated, enabling them to quickly test hypotheses against data.
Availability and requirements
Taverna (version 1.7.1) and its supporting documentation are available from http://taverna.sourceforge.net. Supplementary material including the example workflow, its accompanying documentation, and the results of the analyses described in this paper are available from http://www.mcisb.org/software/taverna/microarray/index.html. The example microarray data analysis workflow is also available from the myExperiment web site: http://www.myexperiment.org/workflows/181.
PL and DBK thank the BBSRC for financial support, and DBK and SGO acknowledge the financial support of the BBSRC and EPSRC in the Manchester Centre for Integrative Systems Biology http://www.mcisb.org. Systems Biology research in the Cambridge laboratory is supported by BBSRC grants to SGO. The work undertaken by IW was part of the BioRange programme of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI). The authors thank Duncan Hull and Leo Zeef for helpful discussions and suggestions on the improvement of the manuscript. We also thank Thomas Down for his work on Raven.
- 1. Castrillo JI, Zeef LA, Hoyle DC, Zhang N, Hayes A, Gardner DCJ, Cornell MJ, Petty J, Hakes L, Wardleworth L, Rash B, Brown M, Dunn WB, Broadhurst D, O'Donoghue K, Hester SS, Dunkley TPJ, Hart SR, Swainston N, Li P, Gaskell SJ, Paton NW, Lilley KS, Kell DB, Oliver SG: Growth control of the eukaryote cell: A systems biology study in yeast. Journal of Biology 2007, 6: 4.
- 2. Ishii N, Nakahigashi K, Baba T, Robert M, Soga T, Kanai A, Hirasawa T, Naba M, Hirai K, Hoque A, Ho PY, Kakazu Y, Sugawara K, Igarashi S, Harada S, Masuda T, Sugiyama N, Togashi T, Hasegawa M, Takai Y, Yugi K, Arakawa K, Iwata N, Toya Y, Nakayama Y, Nishioka T, Shimizu K, Mori H, Tomita M: Multiple high-throughput analyses monitor the response of E. coli to perturbations. Science 2007, 316: 593–597.
- 4. Hayes A, Castrillo JI, Oliver SG, Brass A, Zeef LAH: Transcript Analysis: A Microarray Approach. In Methods in Microbiology. Yeast Gene Analysis. Volume 36. Edited by: Stansfield I, Stark MJR. Elsevier; 2007:189–219.
- 6. Ihaka R, Gentleman R: R: A language for data analysis and graphics. J Comput Graph Stat 1996, 5: 299–314.
- 7. Goble C, Stevens R: State of the nation in data integration for bioinformatics. J Biomed Inform, in press.
- 11. Web Services Activity [http://www.w3.org/2002/ws/]
- 12. Web Services Description Language [http://www.w3.org/TR/wsdl]
- 15. Senger M, Rice P, Oinn T: Soaplab – a unified sesame door to analysis tools. In Proceedings of the UK e-Science All Hands Meeting: 2–4 September 2003; Nottingham, UK. Cox SJ: EPSRC; 2003. ISBN 1-904425-11-9.
- 19. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G: GO::TermFinder – open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 2004, 20: 3710–3715.
- 20. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M: Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nature Genetics 2001, 29: 365–371.
- 21. BeanShell – Lightweight Scripting for Java [http://www.beanshell.org/]
- 22. Wolstencroft K, Oinn T, Goble C, Ferris J, Wroe C, Lord P, Glover K, Stevens R: Panoply of utilities in Taverna. In Proceedings of the 1st International Conference on e-Science and Grid Computing: 5–8 December 2005; Washington, DC, USA. IEEE Computer Society, Washington, DC, USA.
- 23. Java Swing [http://java.sun.com/docs/books/tutorial/uiswing/]
- 24. Urbanek S: Rserve – A Fast Way to Provide R Functionality to Applications. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003); 20–22 March 2003; Vienna, Austria. Edited by: Hornik K, Leisch F, Zeileis A.
- 26. Taverna API [http://www.mygrid.org.uk/taverna/api/]
- 28. Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S, Stevens R, Brass A: A systematic strategy for large-scale analysis of genotype-phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Research 2007, 35: 5625–5633.
- 31. Stevens R, Glover K, Greenhalgh C, Jennings C, Pearce S, Li P, Radenkovic M, Wipat A: Performing in silico experiments on the Grid: a users perspective. In Proceedings of the UK e-Science All Hands Meeting: 2–4 September 2003; Nottingham, UK. Cox SJ: EPSRC; 2003. ISBN 1-904425-11-9.
- 32. Hull D, Stevens R, Lord P, Wroe C, Goble C: Treating shimantic web syndrome with ontologies. Proceedings of the First Advanced Knowledge Technologies workshop on Semantic Web Services (AKT-SWS04); 8 December 2004; Milton Keynes, UK.
- 35. Goble CA, De Roure DC: myExperiment: social networking for workflow-using e-scientists. In Proceedings of the 2nd workshop on Workflows in support of large-scale science: 25–29 June 2007; Monterey, California, USA. ACM, New York, USA.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.