AnaBench: a Web/CORBA-based workbench for biomolecular sequence analysis
Sequence data analyses such as gene identification, structure modeling or phylogenetic tree inference involve a variety of bioinformatics software tools. Due to the heterogeneity of bioinformatics tools in usage and data requirements, scientists spend much effort on technical issues including data format, storage and management of input and output, and memorization of numerous parameters and multi-step analysis procedures.
In this paper, we present the design and implementation of AnaBench, an interactive, Web-based bioinformatics Analysis workBench allowing streamlined data analysis. Our philosophy was to minimize the technical effort not only for the scientist who uses this environment to analyze data, but also for the administrator who manages and maintains the workbench. With new bioinformatics tools published daily, AnaBench permits easy incorporation of additional tools. This flexibility is achieved by employing a three-tier distributed architecture and recent technologies including CORBA middleware, Java, JDBC, and JSP. A CORBA server permits transparent access to a workbench management database, which stores information about the users, their data, as well as the description of all bioinformatics applications that can be launched from the workbench.
AnaBench is an efficient and intuitive interactive bioinformatics environment, which offers scientists application-driven, data-driven and protocol-driven analysis approaches. The prototype of AnaBench, managed by a team at the Université de Montréal, is accessible on-line at: http://malawimonas.bcm.umontreal.ca:8091/anabench. Please contact the authors for details about setting up a local-network AnaBench site elsewhere.
Keywords: Bioinformatics Tool; Common Object Request Broker Architecture; User Connection; Remote Application; Interface Definition Language
To conduct sequence analysis, biologists use a variety of bioinformatics software in a sequential fashion. For example, phylogenetic analysis of newly sequenced protein-coding genes involves translation of the nucleotide sequence to protein sequence in six frames (e.g., using FLIP), identification of ORFs that correspond to conserved proteins by similarity search (e.g., using FASTA), retrieval of protein sequences from GenBank using ENTREZ, multiple protein alignment (e.g., using CLUSTALW), extraction of well-aligned sequence stretches, as well as tree inference and testing (e.g., using PHYLIP). A major bottleneck is that most software applications are incompatible with one another because they use different file formats. As a consequence, the output of one tool cannot be used as input for another without data format conversion. A further complication is that the user has to define a multitude of parameters and options according to the particular data or aim of the analysis. Finally, as bioinformatics tools are written in various programming languages and for different operating systems, installation, configuration, and maintenance of these numerous software components are time-consuming and costly, and require IT expertise.
Current methods for bioinformatics software integration
In the past few years, several interactive environments have been developed to facilitate bioinformatics analyses. Some of these environments are commercial products (e.g., BIONAVIGATOR, ISYS [7, 8] and TURBOBENCH); others are freely accessible (e.g., NCSA biology workbench and GWFASTA), or open-source (e.g., EMBOSS, ERATO SBW [13, 14], and APPLAB). These environments may be classified into three main categories.
Bioinformatics tools are accessible through Web interfaces using HTML and various scripting languages (CGI, Perl, etc.). BIONAVIGATOR, NCSA biology workbench, and GWFASTA are examples of such environments. The advantage of these Web-based workbenches is that the user does not need to install bioinformatics tools on his/her computer, but only requires a Web browser to launch applications on the centralized server. However, the drawback of Web-based environments is that interactive bioinformatics tools cannot be easily integrated.
These environments do not have the above limitation, because the environment and bioinformatics software tools are installed and executed on the user's computer. The integration of tools is typically achieved by means of wrappers, and the interoperation between tools by a specific application programming interface (API) for the exchange of messages. ERATO SBW, ISYS, TURBOBENCH, EMBOSS, and APPLAB are examples of such environments. The drawback here is that the user requires IT expertise for the installation and configuration of the environment and tools.
These environments combine several advantages of the former two categories and resolve their major drawbacks. In this category, the environment and some bioinformatics applications are installed on the user's computer, and the environment also provides a gateway to a Web-based environment. BIONODE-BIONAVIGATOR is an example of such a system.
None of the current open-source bioinformatics analysis environments provides both accessibility from any platform and location and easy integration of biological databases and tools developed by us and others. Therefore, we set out to develop AnaBench, a new Web-based environment that combines these two features.
Our goal was to build a Web-based workbench bioinformatics infrastructure, which fulfills the following requirements:
Access to the workbench by employing only a Web browser, without the need to download and configure software on the user's computer;
Access to individual user workspace on our central servers to save biological data and analysis results, and organization of this workspace in terms of projects;
Execution of bioinformatics tools offered in the workbench, with input data selected from the user's projects, and saving of results into selected projects;
Straightforward description of the bioinformatics tools in terms of parameters, data types, and data formats they can handle;
Easy integration of in-house and public biological databases;
Launching of applications hosted on our servers as well as remote applications available from third parties.
Users interact with the workbench through a Web browser using HTML and JSP (Java Server Pages) screen forms. JSP is a technology for controlling the content and appearance of Web pages through the use of servlets, i.e., small programs that run on the Web server to dynamically generate the requested Web page before sending it to the user.
Application server level
The requests submitted by users at the presentation level are handled by a Web server equipped with a servlet and JSP container, and CORBA (Common Object Request Broker Architecture) middleware (Figure 1).
CORBA is the standard distributed object architecture that is characterized by an open software bus called Object Request Broker (ORB) through which heterogeneous object components can interoperate across networks. The interfaces to CORBA objects are specified by the Interface Definition Language (IDL). CORBA objects differ from typical programming language objects in three ways: (1) they may reside anywhere on a network; (2) they are able to interoperate with objects written on other platforms; and (3) they may be written in any programming language for which there is a mapping from IDL to that particular language.
The CORBA server is responsible for launching the analysis tools and interfacing with the back-end databases. The naming service is a standard CORBA object service, which provides the mapping between object names and object references.
In this tier, we find the management database, the biological databases and an analysis tools server, which contains bioinformatics tools. The management database stores information about users, their projects and data, and descriptions of bioinformatics tools as outlined above. Workbench users will be able to save the results of database queries and data analysis in their workspace for further processing.
The biological databases that are currently accessible through the workbench include: GOBASE, a taxonomically broad organelle genome database that organizes and integrates diverse data related to organelles [18, 19], and the Protist EST database (PEPdb), both developed in-house, and GenBank via remote BLAST.
In this section, we describe the design and functionalities of AnaBench, as well as sustainability issues.
User Registration (UC1): The main page gives access to the registration of users, who have to provide their name, email, login name and password.
User Connection (UC2): In order to access individual workspace projects and the analysis tools, the user connects to the main Web page and identifies him/herself by entering login name and password. The system validates these data, rejects the connection if the identification is incorrect, and provides access to the 'User Registration' use-case (UC1), if the user is not yet registered.
Project Management (UC3): Users manage their own workspace projects, i.e., create a project, list all projects, edit or delete selected projects. Users have access to this use-case only after they have been identified by the system through 'User Connection' (UC2).
Data Management (UC4): This use-case is generated by 'Project Management' (UC3). It allows users to manage their sequence data within their projects, i.e., add data, list all data of a project, edit or delete selected data. Data can be added to a project using copy and paste, by uploading local files from the user's computer, and by saving results of analyses and external database queries. Users have access to this use-case only if they have selected one of their projects through UC3.
Tools Management (UC5): With this use-case, the workbench administrator describes the parameters of bioinformatics applications to be integrated into the workbench.
Analysis (UC6): The user selects an analysis tool and provides the appropriate input data, an action that generates the 'Input' use-case (UC8). The results produced by a given tool may be saved in an existing or a new workspace project. Users have access to this use-case only if they have been identified by the system through 'User Connection' (UC2).
Deferred Execution (UC7): Users launch the execution of an application in the deferred mode, which means that they do not have to wait for the application to terminate before carrying out other tasks within the workbench. This mode is suited for tools that require intensive computation, as well as for remote applications managed by third parties. The system will inform the user by email once a deferred execution is completed.
Input (UC8): The user specifies the input data for a given analysis tool by selecting the data from the user's projects. This use-case is generated by 'Analysis' (UC6).
View Results (UC9): The results generated by an analysis tool are displayed on the user's Web browser. This use-case follows 'Analysis' (UC6) or 'Deferred Execution' (UC7).
Save (UC10): The user saves the results of an analysis in a selected workspace project, a use-case generated by 'View Results' (UC9).
Protocol Management (UC11): Protocols are fully interactive analysis pathways for addressing specific biological questions. Users manage their own protocols, i.e., create new protocols using the analysis tools provided by AnaBench, list all protocols, edit or delete selected protocols. Users have access to this use-case only after they have been identified by the system through 'User Connection' (UC2).
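The 'Deferred Execution' use-case (UC7) listed above can be illustrated with a minimal Java sketch. It is not AnaBench's actual code: the class `DeferredRunner`, the `Notifier` interface, and the thread-pool size are hypothetical names and choices made for illustration, and the email notification is replaced by a callback so that the example stays self-contained.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Hypothetical sketch of deferred execution (UC7); not AnaBench's actual code. */
class DeferredRunner {

    /** Stand-in for the email notification described in the text. */
    interface Notifier {
        void notifyUser(String email, String message);
    }

    // Assumption: a small fixed pool is enough for the illustration.
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final Notifier notifier;

    DeferredRunner(Notifier notifier) {
        this.notifier = notifier;
    }

    /**
     * Starts the job in the background and returns immediately, so the
     * caller can carry on with other tasks. The notifier is invoked once
     * the job has finished, whether it succeeded or failed.
     */
    CompletableFuture<String> submit(String email, String toolName, Supplier<String> job) {
        return CompletableFuture.supplyAsync(job, pool)
                .whenComplete((result, error) -> {
                    String status = (error == null) ? "completed" : "failed";
                    notifier.notifyUser(email, toolName + " " + status);
                });
    }

    /** Stops accepting new jobs; already submitted jobs still finish. */
    void shutdown() {
        pool.shutdown();
    }
}
```

A caller would submit a job, continue working, and later read the result from the returned future; in a real deployment the notifier would send the email mentioned in UC7.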
Most of these use-cases are divided into sub-use-cases. For example, 'Project Management' (UC3) consists of 'Add a project', 'Modify a project', 'Delete a project', and 'List all projects of a given user'. In the following subsection, we describe in more detail the 'Tools Management' use-case (UC5), which handles the installation of new bioinformatics software in AnaBench.
The Tools Management use-case
This use-case allows the administrator to save in the management database a description of the bioinformatics tools, including parameter descriptions, data types, and data formats. These descriptions are consulted for the design of screen forms that allow input of parameter values. Only the workbench administrator uses this use-case; it is invisible to the biologist. To facilitate the management of the tools, we classified them into several categories based on their functionality, for example:
Nucleotide sequence analysis (length, composition, molecular weight)
Nucleotide sequence comparison
Nucleotide sequence alignment
Search for promoters, transcription terminators and other motifs
RNA secondary structure prediction
Nucleotide to protein translation
Protein sequence analysis (composition, molecular weight, isoelectric point, length)
Protein sequence comparison
Protein sequence alignment
Search for protein domains, motifs, and signatures
Protein secondary structure prediction
Tools description entities
Analysis_Category (category_id, category_name, category_description)
Application (application_id, application_name, category_id, application_description, application_versionnumber)
Parameter (parameter_id, parameter_name, application_id, parameter_label, parameter_description, parameter_prefix, parameter_abbreviation, parameter_valuetype, parameter_required, parameter_requiredvalues, parameter_syntax, parameter_min, parameter_max, parameter_default)
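To make the role of the parameter attributes more concrete, the following Java sketch shows one way a command line could be assembled from stored parameter descriptions. It is an illustration only: the classes `ParameterDescription` and `CommandLineBuilder` are hypothetical, only four of the attributes above are mirrored, and we assume that parameter_prefix holds the command-line flag placed before the value, which the paper does not state explicitly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory mirror of a few attributes of the parameter entity. */
class ParameterDescription {
    final String name;          // parameter_name
    final String prefix;        // parameter_prefix (assumed to be the flag, e.g. "-i")
    final boolean required;     // parameter_required
    final String defaultValue;  // parameter_default (null if there is none)

    ParameterDescription(String name, String prefix, boolean required, String defaultValue) {
        this.name = name;
        this.prefix = prefix;
        this.required = required;
        this.defaultValue = defaultValue;
    }
}

/** Hypothetical helper that turns descriptions plus user input into a command line. */
class CommandLineBuilder {

    static List<String> build(String executable,
                              List<ParameterDescription> params,
                              Map<String, String> userValues) {
        List<String> command = new ArrayList<String>();
        command.add(executable);
        for (ParameterDescription p : params) {
            // A value entered by the user takes precedence over the stored default.
            String value = userValues.containsKey(p.name)
                    ? userValues.get(p.name)
                    : p.defaultValue;
            if (value == null) {
                if (p.required) {
                    throw new IllegalArgumentException(
                            "Missing required parameter: " + p.name);
                }
                continue; // optional parameter without a value: leave it out
            }
            command.add(p.prefix);
            command.add(value);
        }
        return command;
    }
}
```

Because the tool knowledge lives in data rather than in code, adding a tool under this scheme means adding rows, not writing a new wrapper class.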
From the analysis of the workbench use-cases, we draw the core entities of the system: User, Project, Data (biological data and analysis results), DataType, DataFormat, Analysis_Category, Application, Parameter, Application_datatype, Application_dataformat, Execution, Input, Output, and Protocol. These entities constitute the management database tables.
The prototype of AnaBench uses the following platform:
Hardware: Two double-processor Linux servers,
Programming language: Java,
CORBA environment: ORBACUS 4.0 for Java,
JSP engine: Jakarta Tomcat 3.2,
Web server: Apache advanced extranet server 1.3.23,
Relational Database Management System: MySQL with JDBC driver.
At the time of writing, the following programs had been integrated in AnaBench: CLUSTALW for multiple alignments; FLIP for translation of nucleotide sequences; remote NCBI BLAST; BLASTPEP for local BLAST searches in PEPdb; READSEQ for format conversion of sequences; RNAmot for finding secondary structures in nucleic acid sequences; TANDEMREPEATS for identifying tandem repeats in DNA sequences; and TRNASCAN_SE for finding tRNA genes. From the descriptions of the above tools, we generate, by using a JSP Generator program, the JSP screen forms, in which the user enters the parameter values of the tool, or, more precisely, modifies the default values we have set. As the output of most analysis tools is either in plain-text format or in HTML, it is easily displayed in the main frame of the workbench without further processing.
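The idea of deriving a screen form from tool descriptions can be sketched as follows. This is not the JSP Generator itself, whose code is not shown in the paper: the sketch emits plain HTML rather than JSP source, and the classes `FormField` and `FormGenerator`, the form action `run`, and the hidden field `application` are assumptions made for illustration.

```java
import java.util.List;

/** Hypothetical description of one form field, derived from a parameter description. */
class FormField {
    final String name;          // e.g. parameter_name
    final String label;         // e.g. parameter_label
    final String defaultValue;  // e.g. parameter_default

    FormField(String name, String label, String defaultValue) {
        this.name = name;
        this.label = label;
        this.defaultValue = defaultValue;
    }
}

/** Hypothetical generator: one text input per parameter, pre-filled with its default. */
class FormGenerator {

    /** Escapes the characters that are unsafe in HTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    static String generate(String applicationName, List<FormField> fields) {
        StringBuilder html = new StringBuilder();
        html.append("<form method=\"post\" action=\"run\">\n");
        html.append("<input type=\"hidden\" name=\"application\" value=\"")
            .append(escape(applicationName)).append("\">\n");
        for (FormField f : fields) {
            html.append("<label>").append(escape(f.label))
                .append(" <input type=\"text\" name=\"").append(escape(f.name))
                .append("\" value=\"").append(escape(f.defaultValue))
                .append("\"></label>\n");
        }
        html.append("<input type=\"submit\" value=\"Run\">\n</form>\n");
        return html.toString();
    }
}
```

The user then only edits the pre-filled defaults, which matches the behaviour described in the text.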
An example for tools description
We have incorporated into AnaBench a 'Tools Description' module that streamlines addition or upgrades of analysis tools. The workbench administrator describes tools by means of JSP pages in which application categories, applications, and their parameters are specified. These descriptions are saved in the management database. In the following, we present, as an example, the addition of FLIP into AnaBench. The steps are as follows:
First, we create a category of applications called "Sequence Analysis" provided it does not already exist in the management database. This is performed through the "addCategory.jsp" page, which interacts with the CategoryManager interface of the CORBA server in order to add a new record to the category table.
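The "create the category only if it does not already exist" step can be sketched in Java as below. The real CategoryManager interface is defined in IDL and is not reproduced in the paper, so the method name `addCategory`, its signature, and the in-memory storage are assumptions; an actual implementation would write to the category table through JDBC.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical Java view of a category-management interface (real IDL not shown). */
interface CategoryManagerSketch {
    /** Returns the id of the category, creating it first if necessary. */
    int addCategory(String name, String description);
}

/** In-memory stand-in for the category table of the management database. */
class InMemoryCategoryManager implements CategoryManagerSketch {
    private final Map<String, Integer> idsByName = new LinkedHashMap<String, Integer>();
    private final Map<Integer, String> descriptionsById = new LinkedHashMap<Integer, String>();
    private int nextId = 1;

    @Override
    public synchronized int addCategory(String name, String description) {
        Integer existing = idsByName.get(name);
        if (existing != null) {
            return existing; // category already present: do not add a duplicate
        }
        int id = nextId++;
        idsByName.put(name, id);
        descriptionsById.put(id, description);
        return id;
    }

    /** Number of distinct categories stored so far. */
    public synchronized int size() {
        return idsByName.size();
    }
}
```

Calling `addCategory("Sequence Analysis", ...)` twice therefore yields the same identifier and leaves a single record.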
Sequence Analysis approaches in AnaBench
We have built an interactive Web-based sequence analysis environment with a three-tier architecture including CORBA as a middleware to allow: (i) easy development and deployment of clients and servers, (ii) the ability to use heterogeneous languages and machines, and (iii) the implementation of light Internet clients in the presentation layer using HTML and JSP.
Automated integration of command-line bioinformatics tools in AnaBench has been achieved by two means: the tabular description of categories, applications, and parameters as shown in the previous section, and the utilization of the JSPGenerator. The programming effort to incorporate a new tool into AnaBench mainly consists of customizing the automatically generated JSP page for a given tool. As a future enhancement, we consider the integration of tools represented by Java classes that can be dynamically loaded into the system. We successfully tested this approach with the 'TVIEW' tool that displays the results of 'TRNASCAN_SE' and inserts feature annotations into nucleotide sequence files in MASTERFILE format. It should be noted that packages like EMBOSS and BioNavigator took a different approach by using configuration files in the ACD format to describe command-line tools and their parameters. In contrast to ACD configuration files, the tabular tool description we use is straightforward since it does not require a parser for the processing of configuration files.
The integration of local and remote applications in AnaBench is performed in the same manner except that remote applications require dedicated APIs offered by service providers. In the case of remote NCBI BLAST, we have implemented the access to this service using the QBLAST API provided by NCBI.
To avoid crashes at the application server level, the CORBA server could be replicated across the machines of a local network. If one of the replicas fails, the surviving ones can continue the processing of the business logic and thus provide continuous service to the clients. In such a scenario, the CORBA Naming Service will locate an appropriate available CORBA server. Likewise, to avoid crashes of the Web server and to handle a great number of simultaneous requests as demand increases, a cluster of Web servers in combination with a load balancer may be deployed. Several Web cluster solutions are now available.
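The replication scenario described above can be sketched in plain Java. This is a conceptual stand-in, not the CORBA Naming Service API: the names `AnalysisServer`, `StubServer`, and `ReplicaRegistry` are hypothetical, and the liveness check is an assumption about how an available replica might be selected.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical remote interface offered by each server replica. */
interface AnalysisServer {
    boolean isAlive();
    String run(String request);
}

/** Simple stub used to simulate healthy and failed replicas. */
class StubServer implements AnalysisServer {
    private final String name;
    private final boolean alive;

    StubServer(String name, boolean alive) {
        this.name = name;
        this.alive = alive;
    }

    @Override
    public boolean isAlive() {
        return alive;
    }

    @Override
    public String run(String request) {
        return name + " handled " + request;
    }
}

/** In-memory stand-in for a name lookup that skips replicas that have failed. */
class ReplicaRegistry {
    private final List<AnalysisServer> replicas = new ArrayList<AnalysisServer>();

    /** Registers one replica under the (single) service name. */
    void bind(AnalysisServer server) {
        replicas.add(server);
    }

    /** Returns the first replica that reports itself alive. */
    AnalysisServer resolve() {
        for (AnalysisServer server : replicas) {
            if (server.isAlive()) {
                return server;
            }
        }
        throw new IllegalStateException("No server replica available");
    }
}
```

Clients keep asking the registry for a server before each request, so a failed replica is bypassed without any change on the client side.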
In this paper, we present the design and implementation of a new Web-based biological workbench environment. AnaBench is innovative in three aspects: (i) its three-tier architecture allows easy integration of heterogeneous bioinformatics tools; (ii) the recent technologies that we employed for distributed server-side Web computing, such as CORBA, RMI, JDBC, and JSP, provide excellent performance with modest hardware; (iii) the user data are stored in a relational database rather than in flat files providing flexibility as to data extraction and interrogation. Another distinctive feature of AnaBench is that it offers three types of sequence analysis: application-driven, data-driven, and protocol-driven approaches.
This work is a part of the Protist EST program (PEP), funded by Genome-Canada and Genome-Quebec. Funding from CIHR (Canadian Institutes of Health Research, grant nr. GX-15331) and salary and interaction support from the CIAR (Canadian Institute for Advanced Research) to GB and BFL are acknowledged.
The authors thank Bruno Duclouet, Kun Wang, Sabrina Carpentier, Julien Rondeau, and Nicolas Moiroud for their contributions in the development and documentation of AnaBench, and Amy Hauth for her suggestions to the manuscript.
- 2. Pearson WR, Lipman DJ: Improved Tools for Biological Sequence Comparison. In Proc Natl Acad Sci USA 1988, 2444–2448.
- 5. Felsenstein J: PHYLIP Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5: 164–166.
- 6. Entigen inc.: BioNavigator, The Bioinformatics Workspace. 1997. [http://www.entigen.com/library/white_papers/BioNavigator.pdf]
- 9. TurboGenomics inc.: TurboBench overview. [http://www.turboworx.com/files/gen_news.pdf]
- 13. Hucka M, Sauro H, Finney A, Bolouri H: Introduction to the Systems Biology Workbench. 2001. [http://www.sbw.sourceforge.net/sbw/docs/intro/intro.pdf]
- 14. Hucka M, Sauro H, Finney A, Bolouri H, Doyle J, Kitano H: The ERATO Systems Biology Workbench: Enabling Interaction and Exchange between Software Tools for Computational Biology. In Proceedings of the Pacific Symposium on Biocomputing 2002, 450–461.
- 15. Senger M: AppLab – A CORBA-Java based Application Wrapper. 1999. [http://www.hgmp.mrc.ac.uk/CCP11/CCP11newletters/CP11newsletterIsuue8.pdf]
- 16. Hanna P: JSP: The Complete Reference. McGraw-Hill 2001.
- 17. Object Management Group: The Common Object Request Broker: Architecture and Specification. 1999.
- 20. Burger G, Lang BF, Golding GB: PEPdb: The Protist EST Database. 2002. [http://amoebidia.bcm.umontreal.ca/public/pepdb/welcome.php]
- 26. The Organelle Genome Megasequencing Program: The OGMP Masterfile Format. [http://megasun.bch.umontreal.ca/ogmp/masterfile/intro.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.