Snazer: the simulations and networks analyzer
 6.4k Downloads
 24 Citations
Abstract
Background
Networks are widely recognized as key determinants of structure and function in systems that span the biological, physical, and social sciences. They are static pictures of the interactions among the components of complex systems. Often, much effort is required to identify networks as part of particular patterns as well as to visualize and interpret them.
From a pure dynamical perspective, simulation represents a relevant wayout. Many simulator tools capitalized on the "noisy" behavior of some systems and used formal models to represent cellular activities as temporal trajectories. Statistical methods have been applied to a fairly large number of replicated trajectories in order to infer knowledge.
A tool which both graphically manipulates reactive models and deals with sets of simulation timecourse data by aggregation, interpretation and statistical analysis is missing and could add value to simulators.
Results
We designed and implemented Snazer, the simulations and networks analyzer. Its goal is to aid the processes of visualizing and manipulating reactive models, as well as to share and interpret timecourse data produced by stochastic simulators or by any other means.
Conclusions
Snazer is a solid prototype that integrates biological network and simulation timecourse data analysis techniques.
Keywords
Biological Network Internal Data Format Layout Algorithm Reaction Graph Propensity FunctionBackground
Proteins and genes play fundamental roles in most organic functions. They cooperate to keep the entire organism's machinery working and prevent breakdowns. Globally, their social relationships give rise to minutely organized networks, currently the target of meticulous studies [1]. Genetic Regulatory Networks (GRNs) [2, 3] and ProteintoProtein Interaction networks (PPIs) are the most representative classes of biological networks.
GRNs represent collections of DNA segments which functionally interact with each other as well as with the chemicals that govern the transcription of genes into RNA sequences. From another perspective, GRNs can be seen as inputoutput machineries which produce an output (i.e. the expression level of a gene) by a combined application of basic functions to input stimuli. PPIs are comprised of proteins that form chains in which each protein reacts to a stimulus of its predecessor to produce a signal directed to its successor. Reactions in such chains are usually seen as directionally oriented because often they are not reversible. A reaction chain acts as a connector of a triggering event to a final physiological response and then sets up a complex (and often cyclic) collaborative network.
Both kinds of networks have been collected into several data banks spread throughout the WWW [4] and modeled in a computerized fashion by means of some unambiguous and artificial formalisms. The most common one makes use of coupled Ordinary Differential Equations (ODEs) [5]. Alternatively, several other promising modeling techniques have been employed: (including) Boolean networks [6], Petri nets [7], Bayesian networks [8], graphical Gaussian models [9], Process Calculi [10, 11] and Automata Theory [12]. Standard languages, xml (SBML [13], CellML [14]) or graphical (SBGN [15], BlenX4Bio [16]) based, have been further proposed to allow knowledge sharing. All of them are (roughly) connected to graph theory, since a common simple principle holds: interacting agents (e.g. genes, proteins, enzymes, etc.) are represented as the graph vertices whereas interactions (e.g dimerization, phosphorylation, collision, etc.) constitute the graph edges. Moreover, strength of connections (if any) is usually modeled by weighing the edges (e.g. to quantitatively represent fluxes in metabolic networks). Proteintoprotein interaction networks, chemical structures, gene coexpression and contact graphs for protein structures are all examples of undirected graphs, whereas experimental protocols and taxonomies of species and traditional regulatory networks are examples of directed graphs. Other examples include: DNA, RNA or protein sequences (linear graphs), sequence fragment overlap graphs (interval graphs) for shotgun sequence assembly, genetic maps and multiple sequence alignments (partial orders). On top of them, a myriad of graphical and textual based tools has been developed with the aim to simultaneously make the process of networks design increasingly more intuitive and to speed up their functional description. In Sec. 2, we itemize the most representative tools and give some concise explanations of them.
Once a network is defined by its constituents and interactions, it is common practice to simulate the network by means of deterministic/stochastic solvers. The deterministic solvers cope with interlocking sets of differential or difference equations and require the specification of some initial conditions. Its response (y) is fixed, given the values of its input variables (x_{ i }), and implies that var(yx_{ i }) = 0. This property distinguishes deterministic models from reallife experiments [17]. Stochastic solvers use pseudorandom numbers (treated as if they were random numbers distributed uniformly and indpendently) to produce different y values, starting from the same initial parameters. Hence, the response y is a random variable with var(yx_{ i }) = g(x_{ i }), usually estimated through replication and fed with different random numbers. A userdefined accuracy level determines how many simulation runs are needed [18]. A stochastic trajectory or trace (corresponding to the output of a simulation run) is a sequence of temporal observations of the copynumber of the species in the state space.
Traces are curves generally made by double precision (64bit) floatingpoint numbers sampled over time. They sketch the trend of some system variables that are changing because of a predefined set of mathematical rules. Based on the continuous or discrete nature of such rules, the traces format changes accordingly. Continuous traces exhibit smooth trends of variables regularly observed at continuous timescales. The solution of an ODEbased biological system gives the trend of the concentration of their variables within a desired time interval. Therefore, variables change continuously according to a mathematical function which solves the ODE's set. Discrete traces only differ because of their nonhomogeneous temporal sampling. They are evaluated at uneven time instants that are calculated at runtime. Usually, biological discrete systems deal with chemicals populations (in place of chemicals concentrations) and, consequently, Markovian trajectories are made of streams of integer numbers. Sharing both kinds of simulation results is, to a great extent, hindered by the use of a variety of data formats by the existing simulation software packages. In fact, most of them output traces decorated by proprietary metainformation that contains private, not humanreadable simulator settings. Few standard formats exist, but although some of them seem to be very wellsuited and complete, their generalpurpose nature makes them little versatile for our needs. That is why Snazer has been equipped with an adhoc and lightweight internal data format.
The rest of the paper is organized as follows: Sec. 2 describes all the features of our tool compared to those of similar existing tools. Particularly, in Sec. 2.2 the graph layout algorithms and the analysis features of Snazer are presented, whereas in Sec. 2.3 we describe the implemented statistics routines. In Sec. 2.4, we present the internal data format together with the compression policy employed to enhance the storing of data. In Sec. 3 we test Snazer on a real casestudy with the aim of highlighting features and strength points. Then, we perform some benchmark tests on its compression capability and show the results. In Sec. 4 we conclude the paper and present future works.
Implementation
Snazer is a software prototype whose aim is threefold: (i) manipulating biological networks, (ii) providing advanced statistical analysis routines for simulated traces and (iii) improving the traces sharing and storing processes. Snazer can also be considered as a viewer package of Beta Workbench (BWB) [19]. By parsing its output files, Snazer imports two relevant pieces of information inherently bound to the modeled systems: the graphs of the simulated reactions and the simulated traces. It encloses them in a compact XML data structure, designed to work as an interchange data source/sink. Through this, networks and simulation data look tightly coupled.
2.1 State of the art
The rationale behind Snazer comes directly from the scientific community, which needs to manage together the biological models and their simulation results in a coherent manner. Several tools exist that unilaterally tackle such need. Pajek [20] is a standalone application for analysis and visualization of large networks having ten to hundreds of thousands of vertices. It is a highly interactive program and its main strength is the variety of layout routines (circular, energy, eigenvalues) that greatly facilitate exploration of patterns within large networks. It is equipped with several decomposition, connectivity, pattern searching algorithms. Pajek can detect clusters in a network, extract vertices that belong to the same clusters and show them separately, shrink vertices in clusters or show relationships among clusters. It loads several proprietary input files and writes network layouts as EPS, SVG, BMP files. Medusa [21] is a Java application (standalone or applet). It provides 2D representations of mediumsized networks, up to a few hundred nodes and edges. It shows multiedge connections and is optimized for proteinprotein interaction data. Medusa supports weighted graphs and represents the significance and importance of a connection by varying line thickness. Graphs can be drawn by means of some embedded springlike layout algorithms and exported to image or postscript files. Medusa is designed for accessing protein interaction data from STRING[22]. Cytoscape [23] is a standalone Java application for the visualization of largescale molecular interaction networks (directed, undirected and weighted) and their integration with gene expression profiles and other data. It also allows the manipulation and comparison of multiple networks. The great quantity of available plugins enriches Cytoscape with increasingly specialized analysis modules. The tool supports a variety of format files, including Gene Ontology (GO) [24], SBML[13] and KEGG[25]. BioLayout Express^{3D}[26] is another tool written in Java for the visualization and analysis of networks derived from biological systems. It supports both unweighted and weighted graphs together with edge annotation of pairwise relationships. It mainly employs the FruchtermanRheingold [27] layout algorithm for 2D and 3D graph positioning and display and makes use of a heavily optimized Cbased Markov Clustering algorithm for graph clustering. Its main goal is to offer different analytical approaches to microarray data analysis. It supports different file formats: Cytoscape Sif files, Reactome files [28], yEd GraphML files [29].
Osprey [30] visualizes complex interaction networks. It represents not only interactions in a flexible and rapidly expandable graphical format, but also provides options for functional comparisons between data sets. Osprey uses the General Repository for Interaction Datasets as a database (BioGRID) [31], from which the user can rapidly build interaction networks. Networks can be saved as tabdelimited text files for future manipulation or exported as JPEG, PNG, and SVG. ProViz [32] is a standalone open source application developed for the visualization of proteinprotein interaction networks. It provides facilities for navigating large graphs and exploring biologically relevant features. It has a plugin library with dozens of layout algorithms. Basically, they belong to three families: forcebased, hierarchical and geometric layout (only circular). ProViz adopts emerging standards such as GO and PSIMI[33]. BiologicalNetworks [34] models whole cell biochemical pathways and gene regulatory networks. It uses a specialized graph visualization engine to represent biological pathways, gene regulation networks and proteinprotein interaction maps for intuitive exploration and prediction. The tool can handle a variety of tasks, including graphic drawing and layout optimization, data filtering and pathway expansion, and classification and prioritization of proteins. BiologicalNetworks uses a proprietary file format (BNX) that stores information pertaining to the model and the corresponding simulation environment. It supports import and export of models from SBML, SIF and GML file formats. Finally, Tulip [35] is one of the forerunners of drawing packages for biological networks. It allows the visualization, drawing and editing of graphs up to a million elements. Such a visualization system allows navigation through geometric operations as well as extraction of subgraphs and enhancement of the results obtained by filtering. Its most interesting property is the underlying data structure used to inspect huge graph attributes. Tulip implements the wellknown "flyweight" and "chain of responsibility" patterns to access graphs through views. The real advantage is enabling a real sharing of the elements between graphs with a good memory management. All this software improve and obscure the firstgeneration tools from which they have drawn inspiration: Otter [36], a generalpurpose network visualization tool; Negopy [37], a discrete, linkagebased program for the analysis of networks; KrackPlot [38], a network visualization tool intended for social networks; MultiNet [39], a Windowsbased computer program designed for exploratory data analysis of social and other networks. Other tools exist that aim at providing advanced statistics routines for biological traces. Traviando [40] is a backend trace visualizer and analyzer. It interfaces the XML output file of Mö bius, a multiparadigm multisolution framework for the performance and dependability assessment of systems, to investigate the details of what happened in a simulation of a model. It verifies that certain events happen in a trace, that particular states are frequently reached or that certain conditions hold throughout a simulation. It further analyzes the cyclic behavior with graphics that show if states are repeatedly visited or how the length of the trace evolves if cycles are removed. Traviando also supports various statistics on traces as well as the model checking of a trace with respect to LTL formulas. SimWiz [41] is an old but still interesting project. It is a collection of Java tools that aims at visualizing data resulting from different kinds of biochemical simulation processes. It imports STODE[42] and COPASI[43] simulation output files as well as the relative reaction graphs as SBML files. Its main feature is animating the network graph through the information coming from the simulated traces. VANTED [44] loads and edits graphs, which may represent biological pathways or functional hierarchies. It allows the mapping of experimental data sets onto the graph elements and visualizes time series data or data of different genotypes or environmental conditions in the context of the underlying biological processes. Builtin statistic functions allow a fast evaluation of the data (e.g. tTest or correlation analysis). PopTools [45] is a versatile addin for Microsoft Excel that facilitates analysis of matrix population models and simulation of stochastic processes. Together with routines for iterating and resampling, this allows the calculation of bootstrap and other statistics for stochastic processes. Routines that facilitate calculation of some simple maximum likelihood and resampling statistics are supported as well.
Snazer addresses many of these issues, but additionally offers a solid framework that couples networks and traces analysis frameworks. It provides the user with the possibility both of browsing, dissecting and analyzing the networks components and of performing statistics on the inspected components. Snazer has been equipped with an efficient compression system to make huge amount of data manageable and treatable as well as to store and share them. Compared to the existing tools, it covers most of the network visualization layout routines. In addition, it implements TemporalCircuit (see Sec. 2.2), an original algorithm to layout networks according to the information coming from simulation. It is not as well equipped with decomposition and connectivity algorithms as Pajek and BioLayout Express^{3D}, because we mainly focused on original (e.g. ColorBlind filter and standard annotation inspection  see Sec. 2.2.2) rather than existing functionalities. Snazer is crossplatform, being implemented in Java (like Medusa, Cytoscape and BioLayout Express^{3D}). It structures input data in XML files (as BiologicalNetworks does) and makes use of an internal data structure to improve data access (like Tulip). At the best of our knowledge, Snazer is the only tool that provides data compression (Sec. 2.4). It supports some standard input data formats (as Medusa, Cytoscape, BioLayout Express^{3D}, ProViz and BiologicalNetworks) to be fully compatible with all the orbiting software packages. Moreover, Snazer offers some new trace analysis routines that deeply differ from those implemented in the presented softwares. In fact, VANTED fills network nodes with information coming from biological experiments (like microarray, proteomics, etc) to statistically analyze them. But it does not provide any means to analyze simulated traces. SimWiz, instead, makes use of traces to animate the related networks, but it does not provide any functionality in support of simulation. Traviando analyzes individual traces by using classical modelchecking methods, but lacks of any support for grouptraces. PopTools, offers some statistical facilities for wetdata (as VANTED) and stochastic processes, but disregards any grouptraces issue as well. Contrarily, Snazer covers these lacks by offering new analysis algorithms centered on multitraces analysis.
All these features are builtin in Snazer and make it a unique tool.
2.2 The reaction graph perspective
The focus of networklevel analysis in general is on properties of networks as a whole. These may reflect, e.g. typical or atypical traits relative to an application domain or similarities occurring in networks of entirely different origin. Both from a global and a local view, quantitative analyses have been conducted on networks to investigate these traits. Snazer provides the user with some clever layout algorithms to firstly give a global insight of a network, and with a centrality ranker index to highlight vertex importance.
2.2.1 Layout analysis
Graph layout methods are core techniques for applications based on graphlike network diagrams. Since such a perspective is becoming more and more pervasive, and since Snazer visualizes part of the data sets as reaction graphs, we equipped it with the fastest and most flexible layout algorithms. Most of them belong to the forcebased (or forcedirected) class of algorithms. They are easy to use, since they do not require special knowledge about the graph theory, and often their results look very good. They mainly assign forces to networks as if the nodes were electrically charged particles and the edges were springs. Thus, layouts are arranged according to real physical principles (e.g. Coulomb's law, Hooke's law, etc). As a result, forces repetitively applied to nodes push and pull them further apart in order to minimize the overall energy of the networks. When an equilibrium state is reached, the graph is drawn.
In particular, the task of bringing under control the problem of poor local minima raised the interest of many scientists. Such issue is based on the fact that the obtained minimum system energy could be considerably worse than a global minimum. Since such issue becomes more and more important as the number of vertices of the graph increases and, hence, that the overall quality of the drawing could result lower and lower, some scientists focused on the initial layout subproblem to solve it. They conjectured that any outcome of any forcebased algorithms results to be strongly influenced by the initial layout, that in most cases is randomly generated.
The KamadaKawai (KKLayout) algorithm represents the first attempt to solve this problem. It was designed with the aim to quickly generate advantageous initial configurations. Nodes are represented by steel rings and the edges are springs between them. The attractive force is analogous to the spring force and the repulsive one is analogous to the electrical force. The energy minimization in this algorithm is achieved by obtaining the derivative of the force equations. At the minimum energy, the derivatives of the force equations are zero. Since these equations are not independent, they cannot be independently brought to zero, and therefore, only the node that has the maximum gradient value is moved. This process is repeated until the total energy is minimized (fig. 2a).
A combined application of KKLayout with the FruchtermanReingold (FRLayout) algorithm represents today one of the most successful layout solutions. FRLayout contributes in meaningfully placing the neighbored nodes. It works with unweighted, undirected graphs, where attractive forces occur between adjacent nodes only, and repulsive forces occur between every pair of nodes. The movement of nodes is also function of the system temperature registered at each iteration. Generally, temperature decreases through successive iterations while nodes occupy their place (fig. 2f).
Some alternative ways of approaching this problem lie in searching more directly for the energy minimum, either instead of or in conjunction with physical simulation. Among such procedures, we consider a competitive learning method: the ISOM layout (ISOMLayout), which extends the Kohonen's selforganizing map. It evenly fills the space with vertices and lets them take place because of an heuristic function rather than of attracting forces. The algorithm selects a random point in the graph area and picks the closest vertex to that point. This vertex is moved toward that point as well as all vertices connected to that initial vertex by up to a set number of edge steps. The amount by which the vertices are moved decreases the greater the number of edges in the shortest path between the current and initial vertices. The initial number of edge steps is decreased during the layout process so that the later steps form local clusters of connected vertices (fig. 2c).
Unfortunately, such methodologies do not properly work with huge networks, where problems like nodenode or nodeedge clashes are often encountered. In these cases, simple algorithms like the Circle layout are employed with the aim to evenly space vertices on a geometric (circular, in this case) trajectory, irrespective of the network size. Circle attempts to minimize as many overlapping vertices as it can, by placing vertices next to each other that are adjacent in the graph. It is a fast algorithm, essentially because it is not optimal. That is, it does not resolve the problem of edge intersection, but propose a more familiar way to visualize the network. It does not undergo sequential arrangement since it is a static layout (fig. 2b). All the described layouts are general purpose and have been widely employed in heterogeneous scientific areas. We designed and implemented TemporalCircuit, an original and adhoc layout algorithm, aimed at drawing simulated biological networks. It locates network elements in an electric circuit. Nodes and reactions take progressively place into tabular spots from left to right. In particular, the species initially present into the system and the reactions which involve them are drawn on the leftmost. Afterward, if new species are created somewhere and somehow into the system (e.g. because of some synthesis reactions), they are depicted on the immediate right, together with the reactions which implicate them. This process progressively continues until any new species is placed. Therefore, the species that come last into the system occupy the rightmost side (fig. 2e). Since such a layout encodes and visualizes both the network and simulation information, it works only if both the graph of reactions and the simulated traces are provided by the user.
2.2.2 Quantitative analyses
The determination of important elements, group of elements or evident treats of networks is collectively known as quantitative analysis [46]. Since the 1950s, along with some global topological indices, many vertex centrality indices were introduced. These were to quantify an intuitive feeling that in most networks some vertices and edges are more central than other. In practice, they were to evaluate the 'reachability' of a vertex.
Given any network, these measures rank the vertices according to the number of neighbors or to the cost it takes to reach all the other vertices from it. These centralities are directly based on the notion of distances within a graph, or on the notion of neighborhood, as in the case of the degree centrality. Degree, eccentricity, closeness, centroid are only some among the most known and simple existing centrality indices. On their basis, several structural properties have been studied, e.g. the graph center, defined as the set of all the vertices of minimum eccentricity; the median graph, an undirected graph in which any three vertices a, b, and c have a unique median: a vertex m(a, b, c) that belongs to shortest paths between any two of a, b, and c.
Based on the set of shortest path in a graph, some other centrality indices are worth being mentioned: stress centrality, that is based on the enumeration of shortest paths; shortestpath betweenness centrality is a kind of stress centrality that accounts for the fraction of shortest paths between two nodes that contain a third node.
Vitality measures have been further used to determine the importance of vertices or edges in a graph. Given an arbitrary realvalued function on a graph, a vitality measure quantifies the difference between the value on the graph with or without the vertex or the edge. In particular, flow betweenness vitality is a measure for maxflow networks which is similar to the shortestpath betweenness, but that aims at measuring the degree that the maximum flow depends on a particular vertex; closeness vitality denotes instead how much the transport cost in an alltoall communication will increase if a vertex is removed from the graph.
These and all the other existing local methods try to answer to questions like: which are the most central members of a network and which the most peripheral? Which connections are crucial for the functioning of a subnetwork?
Snazer contributes to answer to these questions by providing a little set of interactive tools. Mainly, a degree ranker is implemented with the aim to highlight centrality of nodes within graphs. Ranks are calculated according to the corresponding nodes degree and, then, nodes color tonalities are tuned accordingly: the more a node is central, the darker will be its color tonality. Degree centrality is meant as the number of links incident upon a node (i.e., the number of ties that a node has). Degree is often interpreted in terms of the immediate risk of node for catching whatever is flowing through the network (such as a virus, or some information). If the network is directed (meaning that ties have direction), then we usually define two separate measures of degree centrality, namely indegree and outdegree. Indegree is a count of the number of ties directed to the node, and outdegree is the number of ties that the node directs to others. For positive relations such as friendship or advice, we normally interpret indegree as a form of popularity, and outdegree as gregariousness.
Another important feature is the support for colorblind users. Snazer takes care of the most common color vision deficiencies (protanopy and deuteranopy) that affect the 8% of the male and the 2% of the female populations [48]. The set of the graph colors can be tuned ondemand by means of our own procedure, referred to as ColorBlind filter [48]. It provides users with a sufficient color contrast between nodes and the foreground/background, by changing the hue, saturation and lightness values of each color in a proportional way.
In any case, nodes can be visited and highlighted by the intuitive relationships of neighborhood, or by user selection. Selected nodes can be sent to the analysis module to be further analyzed.
2.3 Statistical perspective
Many features have been ascribed to simulation traces over the years: (i) the trend component, namely the long term underlying direction (an upward or downward tendency) and rate of change; (ii) the irregular component (or 'noise'), namely the component that is left over when the other components of the series have been accounted for; (iii) the autocorrelation, namely the relationship between members of a time series of observations and the same values at a fixed time interval later. Some scientists make use of specialized algorithms to look for these and others properties in sets of traces, obtained by repetitive simulation of the same models. In their most basic form, multiple traces analysis algorithms treat all variables symmetrically without making reference to the issue of dependence versus independence and permit causality testing of all variables simultaneously. This is a major advantage of such algorithms compared to the multivariate time series algorithms. On the basis of these algorithms, we equipped Snazer with 7 analyzers, dealing with isolated as well as grouped time series, each with its own specific parameters.
They all tackle the common problem of dealing with differently sampled traces. This problem is essentially due to the mathematics behind the generation of the time vector. Indeed, whenever a Gillespieinspired algorithm for simulating a model is used, the time evolution of a wellstirred set {S_{1}, ..., S_{ N }} of biochemical species reacting through M ≥ 1 reaction channels (reactions the hereafter) {R_{1}, ..., R_{ M }} depends on:

the probability a_{ j }( Open image in new window )dt that, given Open image in new window , one reaction R_{ j }will occur in the next infinitesimal interval [t, t + dt). N.b. a_{ j }( Open image in new window ) is a function of the number of possible active instances of reaction R_{ j }. Consider a reaction R : S_{1} → S_{2}, then a( Open image in new window ) = cx_{1}, where x_{1} is the number of active R in the current state Open image in new window , and c is a constant that depends on the physical characteristics of S_{1};

the change v_{ ji }of the number of molecules of the specie S_{ i }produced or consumed by a reaction R_{ j }.
Our analysis routines overcome this issue. They all are equipped with an embedded preprocessing functionality that prepares traces for analysis. It takes a set of traces and applies an alignment procedure called binning [51, 52, 53, 54], which has a twofold effect. It resamples traces and meaningfully reduces their sizes. Binning is one of the most used preprocessing technique in the area of signal analysis. It groups adjacent points into bins and elects a representative member for each group. In particular, it takes a subset of N points from a generic signal, represented by the couples [(v_{1,}t_{1}), (v_{2,}t_{2}) ..., (v_{N,}t_{ N })], and substitutes them with an unique point (v*, t), whose value v* is an aggregate function of the N original values (e.g. their sum), and the time t is usually chosen among the original times (e.g. as the median time, or the time corresponding to the maximum value). Such basic operations are conducted by scanning all the signal through a sort of sliding window (of fixed width) chosen by the user (Here, it should be noted from fig. 4, a constant window could contain a variable number of points along a signal).
This technique has been successfully used in the area of Mass Spectrometry time course analysis, where the time warping problem equally exists.
2.3.1 Analysis routines

Series allows the application of a plethora of statistical calculations to traces. It elaborates an output data point for each timestamp present in any of the input traces and produces a time series as output, one for each selected statistical routine and for each selected chemical. The available statistical calculations are: mean, root mean square, variance, standard deviation, standard error, geometric mean, harmonic mean, skew and kurtosis.

The first hitting analyzer works like series. However, it samples traces only whenever a (userdefined) boolean condition becomes true. Indeed, for each run, it searches for the first timestamp where the condition is satisfied and performs the requested statistics on the selected species. Finally, it outputs the result as points on 2D charts, where time runs on the X axis.

Pointwise is a uniform analyzer which differs from series only because it performs whatever chosen statistics at regular, userdefined, intervals.

The steady state analyzer is applied to systems characterized by initial transient behaviors. Under the assumption that the chance of entering whatever possible state of the system after a transient period is timeindependent, this analyzer chooses a random timestamp (after the transient time) from each run, and performs there the requested statistics.

The cumulative analyzer calculates the maximum, minimum and mean (as the integral of a trace divided by the total time) values for each selected species within all the time intervals that verify a predefined boolean condition. In other words, it gives the possibility of performing statistics only within some limited slices of traces.

The time analyzer measures the duration of the time intervals that verify a userdefined condition. We call "front" the timestamp when a condition becomes true (or viceversa). Fronts can ascend or descend, whenever the related conditions flip from false to true (and viceversa). Thus, this analyzer performs statistics between any two consecutive fronts. Other than the usual statistics, it can further perform timeduration and timesum calculations.

The raw data analyzer displays data on the 2D chart, as they are.
Snazer gives the user the possibility to alternatively display results in the chart panel at runtime or to export them in SVG file format.
2.4 Data boxing
As quickly introduced above, Snazer makes use of an internal data representation. Before choosing it as our favorite data structure, we examined the three major alternatives:

HDF (Hierarchical Data Format [55]) is designed to assist users in the storage and manipulation of scientific data across diverse operating systems and machines. It is generally employed for managing very large (and complex) data sets with very fast access requirements. Its main aim is to standardize the format and descriptions of many types of commonly used data sets (such as computerized images and scientific data). It allows self description of data and accommodation for symbolic, numerical and graphical information. It is platform independent.

NetCDF (Network Common Data Form [56]) is a data format made for managing arrayoriented scientific data. As HDF, it is a standard, platformindependent format which allows self description of data and efficient information retrieving. It makes use of HDF to enhance managing of much larger files and multiple unlimited dimensions.

SBRML (Systems Biology Result Markup Language [57]) is a proposal for a new markup language whose aim is to give structure to the results of typical operations performed on SBML models. SBRML captures and describes operations, type of operations and the algorithms used to perform such operations. It is currently structured on three levels: (i) ontology terms, (ii) software, algorithm and result information and (iii) content of the result.
Unfortunately, all three data formats did not properly suite our aims. The first two are very generalpurpose and too complex for us. In fact, we only need to efficiently store/retrieve time course data and simultaneously keep them linked with the corresponding biological models. SBRML could work, but because it is still a proposal and it lacks of guidelines, it has never been adopted on concrete casestudies. Consequently, we discarded it as well.
2.4.1 The internal data format
The internal data format is a tradeoff between simplicity and functionality. It is adhoc made and it has not been meant to compete with the existing data formats. It makes the dialogue between both the visualization and the analysis modules possible. It is an intermediate and XMLencoded data structure (cf. Additional File 1). To be coherent with the content of both modules, it is logically structured in two tightly coupled subparts. The former takes inspiration from SBML in both inheriting and specializing some of its constructs. Thus, since Snazer agrees on the same policies of annotation and identification, the SBase and SId elements are inherited astheyare. Furthermore, whichever biological system is specified by a model entity and by a list of reactions, each containing a list of reactants, of products and of their chemical kinetics rates. Differently from SBML, each reaction is categorized by a type tag. Reactants and products reference three types of chemicals: entities, complexes and variables. The first two types allow representation of structured elementary or complex reactive chemicals. Variables are destructured elements which express scalar measurements (mass, temperature, density, volume, etc.) evolving over time.
The remaining part of the schema captures the information concerning the simulated traces and time. In particular, a (de)structured chemicals is here referenced and annotated by means of the tstart attribute, which specifies the time when it enters into the system, and of the structure attribute, which records its internal behavior (if any). Finally, streams of simulated timesteps are compressed and recorded together with the information about the first, last simulated timesteps and the overall number of timesteps. Traces are equally sampled over time.
2.4.2 The compression policy
Results
We successfully used Snazer to analyze many real casestudies (some of them are enclosed in the Additional File 2). Here we make use of a couple of them to show some of its functionalities. They are the cell cycle [58] and the circadian clock [59] mammalian models. The former is a complex network of biochemical phenomena that controls the duplication of cells. Such process is macroscopically subdivided into four phases which cyclically alternate because of some cyclindependent protein kinases (CDKs). When bound to specific cyclin partners, CDKs promote cellular progression along the cellular lifecycle. The latter model reproduces periodic critical triggering signals in charge to control the cellular size during the cell cycle. Merged together, they give raise to a very interesting model [60].
The statistics panel turns on whenever at least one trace (with chemicals labels matching the network nodes labels) is imported. We used three different software packages to perform multiple simulations of the model: BWB [19], CytoSim [61] and COPASI [43].
3.1 Benchmark tests
Table 1:
Xmlincapsulation routine benchmarks  

Name  BWB  XML (compression %)  Allocated memory 
CC(1)  55.619  7.974 (85.66%)  398.248 
UAT(1)  68.276  20.486 (70.00%)  403.136 
CC(2)  102.870  19.875 (80.68%)  503.344 
UAT(2)  106.271  21.100 (80.15%)  480.448 
MAPKc  577.962  50.618 (91.24%)  1.721.616 
Bycc(1)  909.064  172.266 (81.05%)  2.038.944 
GpcRS  8.293.036  190.131 (97.71%)  21.790.768 
Bycc(2)  130.821.452  3.735.872 (97.14%)  241.635.920 
Bycc(3)  136.219.582  3.913.680 (97.13%)  191.823.072 
Bycc(4)  148.322.373  4.318.744 (97.09%)  192.881.840 
The compression routine results to be more effective the larger the model and the data set. In particular, it works fine on stochastic and integer traces, essentially because any sequential and stochastic framework schedules always only one reaction to occur at a time, while all the others stay unchanged. Thus, since reactions occur because of their own propensity functions, some will fire more frequently than others and, hence, models with many reactions will generate some frequently changing traces and some rarely changing traces. Moreover, the probability for a reaction to stay unchanged generally increases with the number of the reactions, irrespective of its propensity function.
The semantic compressor defined in Sec. 2.4 works fine when reactions do not change frequently, since all the consecutive stationary values are embodied by just one persistence value. Moreover, the best case occurs when traces are made of integer numbers. In that case, the increment part of the tuple will be always made up of a few figures (stochastic frameworks based on SSA deal with molecular reactions which involve few species) and the resulting compressed string will be the shortest. Contrarily, traces of decimal numbers would see the increment value always made of as many figures as the numbers precision.
Conclusions
We implemented Snazer to help visualize biological networks and to handle and store their simulation traces. In particular, Snazer has been equipped with two main categories of tools. The visualization tools are in charge of drawing graphs according to some wellknown as well as new layout algorithms. The new Circuit layout has been presented in this context. Furthermore, some quantitative analyses routines have been implemented to allow the user to investigate some common traits of the biological graphs as (e.g.) node centrality and node importance. On the other hand, Snazer has been supplied with a set of statistics routines with the aim to analyze the time course files obtained by simulation. Finally, Snazer relies on a very efficient compression routine that allows for comfortable storing and sharing of the simulation results. We plan to extend Snazer with further capabilities, such as remote database interfacing for data collection, more sophisticated graph centrality calculations and analysis routines (e.g. Fast Fourier Transform). We finally aim at increasing compatibility with other standard languages (e.g. CellML) and I/O data formats.
Availability and requirements
Project name: Snazer; Project home page: https://www.cosbi.eu/index.php/research/prototypes/snazer; Operating system(s): Platform independent; Programming language: Java; Other requirements: Java 1.6 or higher, libSBML 3.3.1 or higher; Upon acceptance of the CoSBiSSLA license, Snazer is freely available for non commercial purposes.
Notes
Acknowledgements
The authors thank Alida Palmisano and Alessandro Romanel for their precious suggestions and the CoSBi team.
Supplementary material
References
 1.De Silva E, Stumpf M: Complex networks and simple models in biology. J R Soc Interface. 2005, 2: 419430.PubMedCentralCrossRefPubMedGoogle Scholar
 2.Kauffman S: The origins of Order. 1993, USA: Oxford University PressGoogle Scholar
 3.Dougherty E, Akutsu T, Cristea P: Genetic Regulatory Networks. 2007, Hindawi Publishing CorporationGoogle Scholar
 4.Schaefer C: Pathway Databases. Annals of New York Academy of Sciences. 2004, 1020: 7791.CrossRefGoogle Scholar
 5.Hartman P: Ordinary Differential Equations. 2002, New York: Society for Industrial Mathematics, 2CrossRefGoogle Scholar
 6.Kauffman SA: Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology. 1969, 22 (3): 437467.CrossRefPubMedGoogle Scholar
 7.Peterson J: Petri Nets. ACM Computing Surveys. 1977, 9 (3): 223252.CrossRefGoogle Scholar
 8.BenGal I: Bayesian Networks. Encyclopedia of Statistics in Quality and Reliability. Edited by: Sons JW. 2007Google Scholar
 9.Neumann A: Graphical Gaussian Shape Models and Their Application to Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2003, 25 (3): 316329.CrossRefGoogle Scholar
 10.Milner R: A Calculus of Communicating Systems. 1982, Springer VerlagGoogle Scholar
 11.Dematté L, Priami C, Romanel A: The BlenX Language: A Tutorial. SFM. 2008, 313365. LNCS, SpringerVerlagGoogle Scholar
 12.Hopcroft JE, Motwani R, Ullman JD: Introduction to Automata Theory, Languages, and Computation. 2006, AddisonWesley, 3Google Scholar
 13.Hucka M, Finney A, M SH, Bolouri H, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, CornishBowden A, Cuellar AA, Dronov S, Gilles ED, Ginkel M, Gor V, Goryanin II, Hedley WJ, Hodgman TC, Hofmeyr JH, Hunter PJ, Juty NS, Kasberger JL, Kremling A, Kummer U, Le Novére N, Loew LM, Lucio D, Mendes P, Minch E, Mjolsness ED, Nakayama Y, Nelson MR, Nielsen PF, Sakurada T, Schaff JC, Shapiro BE, Shimizu TS, Spence HD, Stelling J, Takahashi K, Tomita M, Wagner J, Wang J: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics. 2003, 19 (4): 524531.CrossRefPubMedGoogle Scholar
 14.Cuellar AA, Lloyd CM, Nielsen PF, Bullivant DP, Nickerson DP, Hunter P: An Overview of CellML 1.1, a Biological Model Description Language. SIMULATION: Transactions of The Society for Modeling and Simulation International. 2003, 79 (12): 740747.CrossRefGoogle Scholar
 15.SBGN website. http://sbgn.org/
 16.Priami C, Ballarini P, Quaglia P: BlenX4Bio  BlenX for Biologists. CMSB '09: Proceedings of the 7th International Conference on Computational Methods in Systems Biology. 2009, 2651. Berlin, Heidelberg: SpringerVerlagGoogle Scholar
 17.Raj A, van Oudenaarden A: Nature, Nurture, or Chance: Stochastic Gene Expression and Its Consequences. Cell. 2008, 135 (2): 216226.PubMedCentralCrossRefPubMedGoogle Scholar
 18.Ballarini P, Forlin M, Mazza T, Prandi D: Efficient Parallel Statistical Model Checking of Biochemical Networks. Parallel and Distributed Methods in verifiCation (PDMC 09), EPTCS. 2009, 4761.Google Scholar
 19.Dematté L, Priami C, Romanel A: The Beta Workbench: a computational tool to study the dynamics of biological systems. Briefings in Bioinformatics. 2008, 9 (5): 437449.CrossRefPubMedGoogle Scholar
 20.Douglas R, Batagelj V, Mrvar A: Analyzing Large Kinship and Marriage Networks With Pgraph and Pajek. Social Science Computer Review. 1999, 17 (3): 245274.CrossRefGoogle Scholar
 21.Hooper S, Bork P: Medusa: a simple tool for interaction graph analysis. Bioinformatics. 2005, 21 (24): 44324433.CrossRefPubMedGoogle Scholar
 22.STRING  Known and Predicted ProteinProtein Interactions. http://string.embl.de/
 23.Shannon P, Markiel A, Ozier O, Baliga N, Wang J, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research. 2003, 13 (11): 24982504.PubMedCentralCrossRefPubMedGoogle Scholar
 24., : Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 2529.CrossRefGoogle Scholar
 25.Pavlopoulos G, Wegener A, Schneider R: A survey of visualization tools for biological network analysis. BioData Mining. 2008, 1: 112.CrossRefGoogle Scholar
 26.Freeman T, Goldovsky L, Brosch M, van Dongen S, Maziere P, Grocock R, Freilich S, Thornton J, Enright A: Construction, visualisation, and clustering of transcription networks from microarray expression data. PLoS computational biology. 2007, 3 (10): 20322042.CrossRefPubMedGoogle Scholar
 27.Fruchterman T, Reingold E: Graph Drawing by ForceDirected Placement. Software, Practice and Experience. 1991, 21: 11291164.CrossRefGoogle Scholar
 28.Reactome  a curated knowledgebase of biological pathways. http://www.reactome.org/
 29.The GraphML File Format. http://graphml.graphdrawing.org/
 30.Breitkreutz B, Stark C, Tyers M: Osprey: a network visualization system. Genome Biology. 2003, 4 (3): 14.Google Scholar
 31.General Repository for Interaction Datasets. http://www.thebiogrid.org/
 32.Iragne F, Nikolski M, Mathieu B, Auber D, Sherman D: ProViz: protein interaction visualization and exploration. Bioinformatics. 2005, 21 (2): 272274.CrossRefPubMedGoogle Scholar
 33.Hermjakob H, MontecchiPalazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, Moore S, Orchard S, Sarkans U, von Mering C, Roechert B, Poux S, Jung E, Mersch H, Kersey P, Lappe M, Li Y, Zeng R, Rana D, Nikolski M, Husi H, Brun C, Shanker K, Grant S, Sander C, Bork P, Zhu W, Pandey A, Brazma A, Jacq B, Vidal M, Sherman D, Legrain P, Cesareni G, Xenarios I, Eisenberg D, Steipe B, Hogue C, Apweiler R: The HUPO PSI's molecular interaction formata community standard for the representation of protein interaction data. Nat Biotechnol. 2004, 22: 177183.CrossRefPubMedGoogle Scholar
 34.Baitaluk M, Sedova M, Ray A, Gupta A: BiologicalNetworks: visualization and analysis tool for systems biology. Nucleic Acids Research. 2006, 34: W466W471.PubMedCentralCrossRefPubMedGoogle Scholar
 35.Auber D: Tulip: A huge graph visualisation framework. Graph Drawing Softwares, Mathematics and Visualization. Edited by: Mutzel P, Jünger M. 2003, 105126. SpringerVerlagGoogle Scholar
 36.Claffy K, Huffaker B, Nemeth E: Otter: A generalpurpose network visualization tool. International Networking Conference (INET99). 1999Google Scholar
 37.Kim J, Palmore J: Personal networks and the adoption of family planning in rural Korea. Journal of population and health studies. 1984, 4 (2): 125145.PubMedGoogle Scholar
 38.Krackhardt D, Blythe J, McGrath C: KrackPlot 3.0: An Improved Network Drawing Program. Connections. 1994, 17 (2): 5355.Google Scholar
 39.Seary A: Network Analysis with MultiNet. International Network for Social Network Analysis. 2002, 1: 111.Google Scholar
 40.Lamprecht R, Kemper P: Möbius Trace Analysis with Traviando. Fifth International Conference on Quantitative Evaluation of Systems, IEEE Computer Society. 2008, 4142.Google Scholar
 41.
 42.Van Gend K, Kummer U: Stodeautomatic stochastic simulation of systems described by differential equations. Second International Conference on Systems Biology. Edited by: Yi TM, Hucka M, Morohasi M, Kitano H. 2001, Omnipress, MD, USAGoogle Scholar
 43.COPASI: Complex Pathway Simulator. http://www.copasi.org
 44.Junker B, Klukas C, Schreiber F: VANTED: A system for advanced data analysis and visualization in the context of biological networks. BMC Bioinformatics. 2006, 7 (109):Google Scholar
 45.Poptools. http://www.cse.csiro.au/poptools/
 46.Brandes U, Erlebach T: Network Analysis. Methodological Foundations. 2005, New York: Springer Berlin/HeidelbergCrossRefGoogle Scholar
 47.Laibe C, Le Novère N: MIRIAM Resources: tools to generate and resolve robust crossreferences in Systems Biology. BMC Systems Biology. 2007, 1 (58):Google Scholar
 48.Erra U, Iaccarino G, Malandrino D, Scarano V: Personalizable edge services for Web accessibility. Universal Access in the Information Society. 2007, 6 (3): 285306.CrossRefGoogle Scholar
 49.Ballarini P, Guido R, Mazza T, Prandi D: Taming the complexity of biological pathways through parallel computing. Briefings in Bioinformatics. 2009, 10 (3): 278288.CrossRefPubMedGoogle Scholar
 50.Dematté L, Mazza T: On Parallel Stochastic Simulation of Diffusive Systems. 6th International Conference on Computational Methods in Systems Biology (CMSB 2008). 2008, 5307: 191210.Google Scholar
 51.Cannataro M, Guzzi PH, Mazza T, Tradigo G, Veltri P: Preprocessing of Mass Spectrometry Proteomics Data on the Grid. IEEE International Symposium on ComputerBased Medical Systems (CBMS05). 2005, 549554.Google Scholar
 52.Cannataro M, Guzzi PH, Mazza T, Veltri P: Using Ontologies for Preprocessing and Mining Spectra Data on the Grid. Future Generation Computer System. 2007, 23: 5560.CrossRefGoogle Scholar
 53.Cannataro C, Guzzi PH, Mazza T, Tradigo G, Veltri P: On the Preprocessing of Mass Spectrometry Protemics Data. XV Italian Workshop on Neural Networks (WIRN 2005), LNCS, Volume 3931/2006. 2006, 127131. Springer Berlin/HeidelbergGoogle Scholar
 54.Baudi F, Cannataro M, Casadonte R, Costanzo FS, Cuda G, Faniello C, Gaspari M, Guzzi PH, Mazza T, Quaresima B, Tagliaferri P, Tradigo G, Veltri P, Venuta S: Mass Spectrometry Data Analysis for Early Detection of Inherited Breast Cancer. Biological and Artificial Intelligence Environments 15th Italian Workshop on Neural Nets, WIRN VIETRI 2004. Edited by: Apolloni B, Marinaro M, Tagliaferri R. 2004, 2128. Springer NetherlandsGoogle Scholar
 55.HDF website. http://www.hdfgroup.org/
 56.NetCDF website. http://www.unidata.ucar.edu/software/netcdf/
 57.SBRML website. http://www.compsysbio.org/tikiindex.php?page=SBRML/
 58.Novak B, Tyson J: A model for restriction point control of the mammalian cell cycle. J Theor Biol. 2004, 230 (4): 563579.CrossRefPubMedGoogle Scholar
 59.Zámborszky J, Hong C, Csikasz Nagy A: Computational Analysis of Mammalian Cell Division Gated by a Circadian Clock: Quantized Cell Cycles and Cell Size Control. Journal of Biological Rhythms. 2007, 22: 542CrossRefPubMedGoogle Scholar
 60.Romanel A, Ballarini P, Jordán F, Larcher R, Lecca P, Mazza T, Mura I, Palmisano A, Sedwards S, Zámborszky J, CsikászNagy A, Tradigo G: Analyzing the effect of noise on various models of Circadian Clock and Cell Cycle coupling. 2007, International Modeling Competition at LeibnizZentrum für Informatik, Germany's Schloss (Dagstuhl)Google Scholar
 61.Sedwards S, Mazza T: CytoSim: A Formal Language Model and Stochastic Simulator of MembraneEnclosed Biochemical Processes. Bioinformatics. 2007, 23 (20): 28002802.CrossRefPubMedGoogle Scholar
 62.SabouriGhomi M, Ciliberto A, Kar S, Novak B, Tyson JJ: Antagonism and bistability in protein interaction networks. J Theor Biol. 2008, 250: 209218.CrossRefPubMedGoogle Scholar
 63.Katanaev VL, Chornomorets M: Kinetic diversity in Gproteincoupled receptor signalling. Biochem J. 2007, 401 (2): 485495.PubMedCentralCrossRefPubMedGoogle Scholar
 64.Tyson J, Novak B: Cell Cycle Controls. Origins of Plastids. 2003, 261284. Springer VerlagGoogle Scholar
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.