1 Introduction

Common approaches to computational experimentation span a spectrum. On one side, we find quick, informative experiments intended for fast iteration. These often involve a single researcher, working on consumer-scale hardware, and can take as little as a few minutes to run. The aim is to get quick results to inform further experiments and to build towards larger results in an iterative manner. The environment used for this type of experimentation is usually designed around quick iteration and quick inspection of results: MATLAB, R, or a simple UNIX command line. More recently, this is often done within interactive notebook environments such as IPython notebooks [1], Knitr [2] or Mathematica [3].

On the other side of the spectrum we find large-scale experimentation: well-prepared, thoroughly designed experiments, intended to run for long periods of time on powerful hardware. These experiments are often implemented by scientific programmers, separate from the researchers designing the experiment. The chosen environment is often a workflow system [4], providing features like monitoring of execution, robustness against hardware failure and provenance tracking. The downside is that each experiment must be carefully prepared and purpose-written for the workflow system.

Experimentation usually starts with quick iterations in an interactive system and progresses towards more robust environments as the experiments become more involved, often at the expense of a re-implementation step as the code is ported. At the larger scales, iterations invariably become slower.

Finally, once results have been produced that are expected to be fit for publication, the researchers must translate and summarize their approach to allow for peer review, reproduction and reuse. The ideal is to publish the datasets and the code, and to provide instructions for reproducing the experiment. At the small-scale, iterative end of the spectrum, this can be very cumbersome: gathering unversioned code and unstructured datasets, and documenting all the idiosyncratic steps required to execute them. At the large-scale end, experiments tend to be more structured, as enforced by the workflow system, but the description of the workflow is still tied to the workflow platform. Even a provenance trace, which is intended to illustrate the source of the results, can be difficult to interpret in its raw form.

1.1 Main Idea

In this paper, we present a concept for generating notebook documentation for computational experiments from provenance information. Our approach aims to retain, at the large-scale end of the spectrum, some of the iteration speed of small-scale experimentation. This documentation generation process is built on three ideas.

  1. After a large-scale experiment has finished, many questions raised by its output can, theoretically, be answered without re-running the experiment. Unfortunately, these questions were not the ones which the experiment was originally designed to answer, so the required data was not collected during the run. Output representing as much information about the original run as possible can help to postpone the need for a new run of a (redesigned) experiment.

  2. While provenance is often seen as a kind of semantically annotated log file—helping to keep track of the origins of data, and to find answers in the case of unforeseen errors—a complete provenance trace will actually contain all information about a run of a computational experiment: all data produced, and the semantic links between them [5]. Any output required from the experiment, such as tables, graphs and statistical analyses, can be reconstructed from the provenance trace.

  3. A semantically annotated representation of a run of an experiment (such as a provenance trace) allows us to make intelligent guesses at default modes of reporting. Thus, we can automatically create reasonable scientific documentation; reporting not only the results of the experiment, but a human-readable representation of how the results emerged: which datasets were used, where they can be found, what code was used, in which versions and in what configuration. An interactive environment allows the researcher to tweak this documentation to filter out less relevant information.

In short, we propose to put provenance at the heart of computational experimentation, rather than on the sidelines, to combine the best of both worlds. A large-scale experiment is run on a workflow system, producing mainly a provenance trace. This trace is then loaded into an interactive environment, allowing a researcher to investigate the questions that inspired the experiment, and any further questions that these results raise. The researcher can filter, plot and analyze the results at length, with much greater depth than a non-semantic output, such as a CSV file, could offer. Only when all information produced by the original run is exhausted does a new experiment need to be started.

When the time comes for the results to be shared, e.g. via a publication, the provenance trace provides all required information. All that is needed is a means to convert it to human-readable form. The semantic annotations allow us to create reasonable default documentation, while anybody interested in the experiment can load the provenance trace into an interactive system and study the details.

1.2 Contributions

Interactive notebooks provide both a good format for presenting default documentation and an interactive environment to study experimental results. The proof-of-concept implementation presented in this paper uses provenance, in the W3C PROV-O [6] format, generated by our own workflow system Ducktape, to automatically create IPython notebooks. We chose IPython notebooks because this system is becoming widely used in data processing. Additionally, it provides a web-based environment that is independent of the underlying language, which means that future versions of our system could also support R, Julia and other programming languages. Our notebooks contain result tables and graphs, a visualization of the provenance and links to the software and datasets used. Furthermore, they are interactive and editable, so that the user can explore and analyze the results of the experiment without re-running the workflow. As a running example use case, we take the documentation web page that won the Open Science Award at the ECML/PKDD 2013 machine learning conference.

The rest of this paper is structured as follows. In the next section, we discuss related work. Section 3 describes our proof-of-concept implementation. The final section contains conclusions and directions for future research.

2 Related Work

A key part of related work is in the area of workflow systems. Often, these systems provide accessible documentation to the end user through graphical representations of the workflow. Additionally, they attach detailed provenance information to those workflows [7]. Our work is different in that we build a notebook-style representation directly from the provenance.

Other existing papers also explore and derive insight from scientific workflow provenance, with different goals than ours. Work by Biton et al. [8] lets users define views based on relevant workflow parts, which determine how a possibly large workflow provenance graph can be explored. The high-level provenance query languages QLP [9] and OPQL [10] can be used for interactive querying and visualization. Both approaches simplify provenance results and allow exploration of scientific workflow provenance at the graph level.

Close to our work is that of Gibson et al. [11] on creating an interactive environment in which provenance is stored. We see our work as complementary, as one can view the generation of their workflows as similar to generating a notebook. Deep [12], an executable document environment that generates scientific results dynamically and interactively, also records the provenance for these results in the document. In this system, provenance is exposed to users via an interface that provides them with an alternative way of navigating the executable document.

Burrito [13] is a system that uses a combination of provenance tracking and user interface constructs for notes to help generate a lab notebook. Our approach shares their motivation but focuses instead on documenting distributed computational workflows using provenance. Similarly, Scientific Application Middleware [14] combines information from both lab notebooks and distributed computational components to create documentation for experiments. Our work adds to this vision by connecting to widely used interactive (computational) notebook environments.

The idea of using provenance as a singular result of workflow execution shares some aims with the idea of Research Objects [15]. This is a construct that aims to replace the traditional paper article as the main unit of scientific publication. A research object is a package of not just the research results, but also all artifacts used to create them, such as datasets, code and provenance. Within the research object, the provenance is seen as a feature to facilitate auditing. In our approach, we see the provenance as the key entry point: it should not just be used to audit the experiments, but also to aggregate results and to perform statistical analyses. Our perspective does not change or replace the use of Research Objects, but suggests that the provenance could be used as its central component, tying together the other contents of the package.

3 Proof-of-Concept

The proof-of-concept implementation for our documentation generation approach consists of three components: a workflow system, workflow provenance, and the generation of notebooks from that provenance. We first introduce a running example that illustrates these three components, and then describe the components themselves.

3.1 Running Example

The webpage for the paper A fast approximation of the Weisfeiler Lehman graph kernel for RDF data [16] won one of the two Open Science Awards at ECML/PKDD 2013, the conference where it was published. On the page, links to software libraries, datasets and the original source code are provided, as well as instructions on how to run the experiments using the provided material. The datasets are available online, via figshare.com, and the code is stored in a git repository at github.com. We have recreated two partial experiments from the ECML/PKDD 2013 paper [16] for our proof-of-concept and use them as running examples below. Note that we do not recreate the full set of experiments in the paper. However, the recreated parts are a representative subset, since we cover both a classification experiment and a runtime experiment.

In the classification experiment, a number of graph kernels for RDF data are tested on an affiliation prediction task, where the goal is to predict affiliations for persons in the dataset. Three different kernels are tested, each for a number of parameter settings. These kernels are combined with a Support Vector Machine (SVM) to perform the prediction. To reduce the influence of randomness, the experiment is repeated 10 times with different random seeds.

The runtime experiment uses the same graph kernels and dataset, but this time the kernels are computed for different fractions of that dataset to investigate the runtime performance of the different kernels. The most computationally intensive settings for the kernels are used. For each dataset fraction, the computation is performed 10 times (on 10 random subsets).

3.2 Workflow System: Ducktape

Ducktape is a light-weight workflow system developed in the context of the Data2Semantics project. This project provides essential semantic infrastructure for e-science and focuses on how to share, publish, access, analyze, interpret and reuse scientific data. Ducktape is designed to compose experiments using components developed within the project. By using an annotation approach, we keep the system light-weight and impose little additional effort on a scientist who wants to use their existing code in our environment.

Ducktape uses computational modules, which are annotated pieces of code, typically classes. The annotations indicate what the inputs and outputs of the module are and what the main computation routine is. Currently, Java, Python and command line scripts are supported.
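To illustrate the idea of annotation-based modules, the following minimal Python sketch defines its own decorator to attach input/output metadata to a plain function. The decorator and attribute names here are hypothetical and do not reflect Ducktape's actual annotation syntax; they only show how a workflow engine can discover a module's interface without changing the computation code.

```python
# Minimal sketch of annotation-style modules (NOT Ducktape's actual API):
# a decorator attaches input/output metadata so a workflow engine can wire
# modules together without touching the computation code itself.

def module(inputs, outputs):
    """Hypothetical decorator recording a module's declared inputs and outputs."""
    def wrap(fn):
        fn.module_inputs = inputs    # names and types the engine must supply
        fn.module_outputs = outputs  # names the engine collects after the call
        return fn
    return wrap

@module(inputs={"depth": int, "seed": int}, outputs=["accuracy", "F1"])
def experiment(depth, seed):
    # Ordinary user code goes here; the engine only reads the metadata above.
    return {"accuracy": 0.0, "F1": 0.0}

print(experiment.module_inputs, experiment.module_outputs)
```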

A Ducktape workflow is described in a simple data flow format represented in YAML (YAML Ain’t Markup Language) [17], which contains a list of modules and specifications of each module’s input data. Figure 1 shows part of the workflow description for the affiliation prediction experiment. Module inputs can either be raw data type values, i.e. integers, doubles and strings, or data produced by other modules within the same workflow (e.g. Fig. 1, lines 17, 20 and 22).

Module input fields in the YAML workflow description can be supplied with lists of inputs of the same type, to allow for parameter sweeps (Fig. 1, line 23). Ducktape allows users to specify whether they want input lists to be consumed in a pair-wise manner or whether the full Cartesian product between the lists should be used in the parameter sweep. Furthermore, there are keywords to indicate whether certain inputs represent datasets (Fig. 1, line 10), which module outputs should be considered experimental results (Fig. 1, line 25) and for which input parameter we want to aggregate results (Fig. 1, line 26).
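The following sketch illustrates how such a list-valued input can be expanded into a Cartesian parameter sweep. The YAML keys (modules, inputs, the ref: prefix) are illustrative placeholders, not the exact Ducktape syntax shown in Fig. 1.

```python
# Sketch of expanding list-valued inputs into a Cartesian parameter sweep.
# The YAML structure below is illustrative, not the exact Ducktape format.
import itertools
import yaml  # requires PyYAML

workflow_yaml = """
modules:
  - name: Experiment
    inputs:
      depth: [1, 2, 3]              # list-valued input -> parameter sweep
      seed: [1, 2, 3, 4, 5]         # repeated runs with different random seeds
      dataset: ref:LoadData.graph   # placeholder for data from another module
"""

wf = yaml.safe_load(workflow_yaml)
inputs = wf["modules"][0]["inputs"]
sweep_keys = [k for k, v in inputs.items() if isinstance(v, list)]

# Full Cartesian product over the list-valued inputs; Ducktape's alternative
# mode would consume the lists pair-wise instead.
for combo in itertools.product(*(inputs[k] for k in sweep_keys)):
    print(dict(zip(sweep_keys, combo)))
```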

3.3 Provenance: W3C PROV

Whenever a workflow is executed, Ducktape automatically generates provenance that captures this execution in the W3C PROV-O [6] format. Table 1 shows how the different elements of a Ducktape workflow map to the concepts in W3C PROV. The main concepts from W3C PROV that we use are prov:Activity and prov:Entity and their connecting relations prov:used and prov:wasGeneratedBy. Essentially, a workflow leads to a bipartite graph with alternating nodes of prov:Activity and prov:Entity.

Modules are prov:Activity instances and inputs and outputs are prov:Entity instances. We model this by creating a class dt-rsc:ModuleName, with the name of the module, for every module. Each dt-rsc:ModuleName is an rdfs:subClassOf prov:Activity. Every instance of a module executed during the run of the workflow has rdf:type its corresponding dt-rsc:ModuleName. We do the same for the inputs and outputs, introducing a dt-rsc:InputName or dt-rsc:OutputName for each input and output, which are rdfs:subClassOf prov:Entity. Each input/output instance has rdf:type its corresponding dt-rsc:InputName/OutputName. Outputs that serve as inputs of another module are represented by a single, unique URI. For example, the specific instance of ‘seed’ with value ‘1’ in the module ‘Experiment’ in Fig. 1, line 23, would be of type dt-rsc:Experiment/seed/, which is an rdfs:subClassOf prov:Entity.
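The mapping above can be sketched with a few rdflib statements. The dt-rsc namespace URI and resource names below are illustrative placeholders; only the PROV and RDFS terms are fixed by the standards.

```python
# Sketch (rdflib) of the Ducktape-to-PROV mapping described above.
# The dt-rsc namespace URI is a placeholder; PROV/RDFS terms are standard.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

PROV = Namespace("http://www.w3.org/ns/prov#")
DTRSC = Namespace("http://example.org/dt-rsc/")  # placeholder namespace

g = Graph()
g.bind("prov", PROV)

# Class level: module and input/output classes specialize prov:Activity/Entity.
g.add((DTRSC["Experiment"], RDFS.subClassOf, PROV.Activity))
g.add((DTRSC["Experiment/seed"], RDFS.subClassOf, PROV.Entity))
g.add((DTRSC["Experiment/accuracy"], RDFS.subClassOf, PROV.Entity))

# Instance level: one execution of the Experiment module with seed = 1.
run = DTRSC["run/Experiment/0"]
seed = DTRSC["run/Experiment/0/seed"]
acc = DTRSC["run/Experiment/0/accuracy"]
g.add((run, RDF.type, DTRSC["Experiment"]))
g.add((seed, RDF.type, DTRSC["Experiment/seed"]))
g.add((seed, PROV.value, Literal(1)))
g.add((acc, RDF.type, DTRSC["Experiment/accuracy"]))
g.add((run, PROV.used, seed))            # activity used the input entity
g.add((acc, PROV.wasGeneratedBy, run))   # output entity generated by the run

print(g.serialize(format="turtle"))
```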

Fig. 1. Example of YAML workflow description from the affiliation prediction experiment. The full workflow is not shown.

Each module (dt-rsc:ModuleName) is associated with a prov:Agent, which represents the specific Ducktape engine used for execution (i.e. the machine(s) and version), and with a prov:Plan, the specific YAML workflow file.

Table 1. Mapping of Ducktape elements to W3C PROV

Optionally, inputs can also be a dt-voc:Dataset, if they refer to a dataset (e.g. by a URL), or a dt-voc:Aggregator, if they determine how experiment outputs should be aggregated. Outputs can have the dt-voc:resultOf predicate, linking them to the workflow (i.e. the prov:Plan), if they should be considered results of that workflow. These optional concepts are added when they are specified in the YAML workflow file.

Furthermore, we add the software artifact dependencies that we know are used during execution to the provenance. This is done by creating a URI for each artifact and adding it to the prov:Plan via a new property, dt-voc:usesArtifact. Currently, we manage our dependencies and execute our workflows using Maven, so each artifact additionally has the properties dt-voc:hasArtifactId, dt-voc:hasGroupId and dt-voc:hasVersion.
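A corresponding rdflib sketch is given below; the dt-voc namespace URI and the Maven coordinates are illustrative placeholders, not identifiers from the actual system.

```python
# Sketch of attaching a Maven artifact dependency to the workflow plan.
# Namespace URIs and artifact coordinates are placeholders.
from rdflib import Graph, Namespace, Literal, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
DTVOC = Namespace("http://example.org/dt-voc/")  # placeholder namespace
EX = Namespace("http://example.org/run/")

g = Graph()
plan = EX["workflow.yaml"]
artifact = EX["artifact/org.example/graph-kernels/1.0"]

g.add((plan, RDF.type, PROV.Plan))
g.add((plan, DTVOC.usesArtifact, artifact))
g.add((artifact, DTVOC.hasGroupId, Literal("org.example")))
g.add((artifact, DTVOC.hasArtifactId, Literal("graph-kernels")))
g.add((artifact, DTVOC.hasVersion, Literal("1.0")))
```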

3.4 Notebook Generation

Based on the generated provenance, draft IPython notebooks are created. There are two types of notebook drafts: an overview notebook with general workflow execution information and a more detailed notebook at the workflow module level.

The overview notebook contains general information about the workflow plan, software artifacts and datasets used. A summary of the Ducktape modules instantiated during the experiment and an inline provenance visualization generated using Prov-O-Viz [19] are also included in this overview notebook, to give intuitive insight into the overall workflow execution. This notebook is illustrated in Figs. 2 and 3.
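A minimal sketch of how such a draft notebook can be assembled programmatically is shown below, using the nbformat library. The cell contents are placeholders for information that the real generator extracts from the provenance graph.

```python
# Sketch of assembling a draft overview notebook with nbformat.
# Cell contents are placeholders for data extracted from the provenance.
import nbformat
from nbformat.v4 import new_notebook, new_markdown_cell, new_code_cell

cells = [
    new_markdown_cell("# Workflow overview\n\nPlan: workflow.yaml (placeholder)"),
    new_markdown_cell("## Datasets\n\n- http://example.org/dataset (placeholder)"),
    new_markdown_cell("## Software artifacts\n\n- org.example:graph-kernels:1.0 (placeholder)"),
    # An editable code cell the reader can re-run inside the notebook.
    new_code_cell("import pandas as pd\n# query results would be loaded here"),
]

nb = new_notebook(cells=cells)
with open("overview.ipynb", "w") as f:
    nbformat.write(nb, f)
```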

Fig. 2. Overview report for the runtime experiment, part 1.

Fig. 3. Overview report for the runtime experiment, part 2.

The detailed notebook draft describes individual module execution results. Users have access to the module input parameters and execution results through default Python code snippets injected into the notebook. The code snippets are generated by performing SPARQL queries on the workflow provenance graph. By using these snippets, users can manipulate how they view the module parameters and execution results.

We use the existing Python Data Analysis Library (Pandas) in the code snippets, to allow users to play with and change the view on their results. Essentially, what the user has here is a data analysis view of each individual module in the workflow execution. By default, we provide tables of relevant inputs and outputs for each individual module, which users can change by tweaking the injected Python code.
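The injected snippets are roughly of the following form: a SPARQL query over the provenance graph whose results are loaded into a pandas DataFrame. The file name and selected variables below are placeholders; only the prov: terms follow the standard vocabulary.

```python
# Sketch of an injected snippet: query the provenance graph and load the
# result into a pandas DataFrame. File name and variable choices are
# placeholders; the prov: terms follow W3C PROV-O.
import pandas as pd
from rdflib import Graph

g = Graph()
g.parse("provenance.ttl", format="turtle")  # provenance produced by the run

query = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?run ?input ?value WHERE {
    ?run prov:used ?input .
    ?input prov:value ?value .
}
"""
rows = [(str(r.run), str(r.input), r.value.toPython()) for r in g.query(query)]
df = pd.DataFrame(rows, columns=["run", "input", "value"])
print(df.head())
```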

For modules that have input data marked as dt-voc:Aggregator, we provide a pivot table, which aggregates the outputs linked to the workflow via dt-voc:resultOf, grouping by the other input parameters. The default form of aggregation is computing the mean value; however, this can easily be changed by editing the code snippet. An example of this aggregation is given in Fig. 4, where the results accuracy and F1 are aggregated over the seed input parameter.
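A pandas sketch of this default aggregation is given below. The parameter names and numbers are made-up illustration values, not results from the paper.

```python
# Sketch of the default pivot-table aggregation: accuracy and F1 averaged
# over the 'seed' parameter, grouped by the remaining inputs (here a
# hypothetical 'depth'). All numbers are made-up illustration values.
import pandas as pd

results = pd.DataFrame({
    "depth":    [1, 1, 2, 2],
    "seed":     [1, 2, 1, 2],
    "accuracy": [0.81, 0.79, 0.85, 0.87],
    "F1":       [0.78, 0.76, 0.83, 0.84],
})

# Mean over seeds; switching aggfunc (e.g. to "std") is the kind of one-line
# edit a user can make directly in the generated notebook.
table = pd.pivot_table(results, values=["accuracy", "F1"],
                       index=["depth"], aggfunc="mean")
print(table)
```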

Fig. 4. Part of the detailed notebook for the affiliation prediction experiment which shows a table for the Experiment module.

In summary, the notebooks for the classification and the runtime experiments contain the following information: a list of datasets, a list of software artifacts, a provenance visualization and detailed result tables. This is significantly more information than the original webpage provides, and the notebooks can easily be extended by hand, both by changing the tables and by adding more explanatory text. Currently, the notebooks lack instructions on how to re-execute the experiments; this can be partly addressed by adding instructions that explain how to use the datasets and artifacts. However, in future work we would like to add automatic re-execution of the workflow from the notebook; all the ingredients are already there.

4 Conclusions and Future Work

We have described an approach for the automatic generation of scientific documentation for computational experiments. This approach is based on the idea of placing provenance at the heart of such experiments, using it as the main output, not just as a way to trace the execution of a workflow. Interactive notebooks provide a way to explore the results and their provenance, and are an ideal starting point for creating documentation for the experiments.

We have created a proof-of-concept implementation to automatically generate IPython notebooks from provenance created by workflows run on our Ducktape platform. These notebooks aggregate the main results and components of an experiment. This automatically generated draft documentation provides more information and insight than the hand-crafted documentation page for a machine learning paper that won an Open Science Award.

While our proof-of-concept uses a specific workflow system and a specific interactive platform to load and analyze the provenance, the approach is transferable to other workflow systems and interactive environments. Indeed, most PROV serializations can be represented as a more human-friendly notebook. Central to this conception is the notion that provenance can be a true interface between the execution of an experiment and the analysis of its results.

Another outcome of this work is confirmation of the importance of connecting interactive notebook environments and provenance. By using the IPython Notebook environment, we were able to benefit significantly from the variety of tools within that community, including notebook visualization (using the nbviewer app) and analytics. We believe that the connection between notebooks in general and distributed provenance generation is an area that the community should look at in more detail, as there are a number of areas of interest. For instance, one may investigate the issue of maintaining the provenance of live results streamed to a notebook environment, encapsulating provenance within a notebook, or tracking the provenance of interactive sessions.

Beyond investigating these larger themes, there are a number of concrete extensions to the environment that we intend to make. First, the current configuration does not allow us to directly re-run the experiments from within the notebooks. We aim to implement such a feature to further improve reproducibility. Furthermore, while we can create links to the software artifacts that were used, it would be even better to link to the actual source code of these artifacts, if it is available. Therefore, we plan to investigate how to integrate with methods such as GIT2Prov [20] to connect the execution to the source code. Finally, we are also investigating what additional visualizations we can embed to make the documentation richer.