1 Introduction

Every piece of data ever produced, either manually or automatically, has a provenance. This is metadata that provides an account of how the data was created. Examples include a blog’s author; the history of a piece of software along with its contributors; the instruments used to take a measurement, and their settings; or a description of an experimental process used to produce a scientific result. The PROV data model for provenance [MMB+12], endorsed in 2013 by the W3C, provides a formal and domain-agnostic grounding for provenance, in the form of UML and OWL models, and RDF, XML, and relational (PROV-N [MMCSR12]) serializations. We refer to PROV instances as digraphs, where nodes are of three possible types: Entities (for data, documents, or anything that has provenance); Activities, which model the execution of a data consumption and production process; and Agents, to whom Entities can be attributed, and who hold responsibility for carrying out Activities. The edges represent instances of relationships amongst the nodes, which are documented in the PROV-DM specification [MMB+12].

The provenance traces associated with a homogeneous data collection (a scientific data repository, all the blogs hosted on a particular site, all the artifacts associated with a complex software project) also naturally form a collection. Such collections grow in size both with the number of underlying data products, and with the complexity of their production process. Figure 1 suggests how different collections can be placed into a space defined by volume, i.e., the number of traces in a collection, and by the typical size of a trace within a collection. For instance, many small traces (upper left) may be associated with a large repository of scientific data, while complex software with a long history may be represented by many large traces (upper right), as exemplified by the Git2Prov [DMV+13] tool.

Arguably, the value of provenance comes not only from querying the content of individual traces, but also from analytics, which can only be computed on whole collections. It is therefore important for practical applications to demonstrate the effectiveness of a data and service architecture to manage large bodies of provenance, with special focus on the upper quadrant of our size/volume space. Thus, we expect that the design of scalable repositories for provenance traces should be a natural concern in provenance management. A number of recent efforts have been documented on nascent provenance management infrastructure [CAB+13, CLFF10, LLCF11, MMW+12], and there is evidence of the emergence of applications that require provenance querying in a variety of settings (e.g. [MOnH+13, ddOOn+12]). However, unlike other “big data” domains such as Linked Data and more generally RDF triple stores, where performance benchmarking is established practice, to the best of our knowledge no community-made benchmarks or commonly accepted datasets that are specific to provenance are available (Footnote 1). This makes it difficult to benchmark and compare different implementations with regard to storage techniques, query models, and analysis algorithms.

This is somewhat counter-intuitive, given the amount of provenance that is generated in domains such as those alluded to above. In fact, only a handful of real datasets are currently available through a community process, i.e., the first ProvBench initiative in 2013 (http://bit.ly/1fBOswR; Footnote 2), and even fewer conform to the recent PROV standard and are therefore interoperable. Existing benchmarking datasets which apply to RDF triple stores (Footnote 3) are not adequate, because they fail to account for the specific data model and semantics of PROV, as well as for the specific requirements of provenance query and analysis.

Fig. 1. A simple space for homogeneous provenance collections

Fig. 2. The document revision provenance pattern in Wikipedia includes multiple derivation and editing activities by multiple user or bot agents.

1.1 Contributions

Our assumption is that synthetic PROV graphs can be a valuable complement to emerging natural provenance collections, provided that their structural properties reflect specific provenance patterns, with control over their repetition and variability, and at varying scales. Such graphs can be used both for benchmarking emerging provenance management systems and for testing analytics algorithms that operate naturally on large provenance collections.

Our main contribution (Sect. 2) is the design and implementation of provGen, a PROV generator that is designed to help populate the space described in Fig. 1. provGen “grows” collections of synthetic PROV graphs in a way that conforms to real-life provenance patterns. These patterns are currently user-defined and modelled after those found in specific domains, and they reflect the nature of the data generation process described by the provenance. For instance, the prevalent provenance pattern for a MediaWiki website, which we refer to as the “document revision” model, involves multiple revisions of articles, by multiple editors (Fig. 2). Git repositories exhibit similar patterns, which reflect the revision history of the code. These patterns are different, for instance, from those for the provenance of data generated using a workflow, which reflect the consumer-producer graph structure of the dataflow specification.

Users control the “shape” of the graph being generated by provGen by providing two main elements. The first is a seed graph, which determines the specific types of nodes and the relationships amongst them to be considered, in an otherwise random generation process. The second element is a set of constraints, expressed using a dedicated Domain Specific Language (DSL), which limit the possible ways in which nodes and relationships are added. These two elements ensure a predictable general shape for the generated graph, as well as its compliance with PROV.

As discussed later, provGen relies on a graph DBMS backend (Neo4J). In particular, the generation algorithm is based on graph rewrite rules that are implemented using a combination of Cypher queries and Create statements.

1.2 Related Work

A growing body of research is devoted to generating large bodies of synthetic graph data, either using purely random models [KN09, ER60], or by generating graphs that exhibit specific statistical properties [BA99, BB05, LCKF05]. One example is the preferential attachment model. Popularised by Barabási and Albert [BA99], this model states that as new vertices are added to a graph, the probability of creating a relationship with an existing node n is proportional to the degree of n, so that well-connected nodes tend to attract further connections. This model generates a graph with a degree distribution which follows a power law.
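To make the attachment rule concrete, the following is a minimal sketch of preferential attachment in which each new node creates a single edge. It is an illustration of the general model only, not code from provGen or from [BA99]; the function name and the choice of one edge per new node are assumptions made here for brevity.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, seed=0):
    """Grow an undirected graph one node at a time; each newcomer links to
    one existing node chosen with probability proportional to its degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]  # start from a single edge between nodes 0 and 1
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges

edges = preferential_attachment(1000)
degree = Counter(v for edge in edges for v in edge)
print(degree.most_common(3))  # the best-connected nodes and their degrees
```

Sampling from the list of edge endpoints avoids recomputing degree-based weights at every step, which is a common way to implement the rule.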

An issue common to these models, emphasised for instance in a comprehensive survey on graph generators [CF06], is their focus on enforcing global properties of the generated graph, such as degree distribution, clustering coefficient, etc. A potential reason for this focus is that these generators are aimed at simulating social networks [PBE13, BB05], the statistical properties of which are based on large sets of examples, and thus are fairly well understood [MMG+07]. In contrast, our generation strategy relies on user-specified patterns, rather than a large set of pre-existing examples (in the future, we hope to be able to use patterns that have been automatically discovered from existing graphs, by means of standard graph mining techniques [KK04]). This has the advantage that the overall topology of the graph can be made to reflect desired semantic properties of the data, such as the average number of usages for a certain type of entity, the average number of associations of an agent with activities, and so forth. Pham et al. [PBE13] are amongst the few to have addressed this problem. However, they focus on a loosely related issue, namely the correlation between node and relationship properties, such as an increased likelihood to be called “Joachim” if you live in Germany, and on generating realistic synthetic value dictionaries accordingly.

2 Graph Generation Model

Graph generation in provGen is an iterative process which starts from a single node. At each iteration, a collection of predefined atomic rewrite rules is used to add a set of new nodes or relationships to the current graph. These rules account for all possible relation types that are defined in the PROV-DM specification. As an example, consider the definition of the \( used (a,e)\) relation between an activity \(a\) and an entity \(e\). Three atomic graph rewrite rules are defined for this relation, namely (i) given an entity node \(e\), add a new activity node \(a\) and an edge \( used (a,e)\); (ii) given an activity node \(a\), add a new entity node \(e\) and an edge \( used (a,e)\); and (iii) given a pair of unrelated nodes \((a,e)\), add edge \( used (a,e)\). Since each single PROV relation type induces three atomic rewrites, and we consider 13 types of relations from PROV, at each iteration provGen can potentially fire any of 39 different rules.
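The rule count can be illustrated with a short sketch that expands each relation into its three atomic rewrites. The data layout and the names `RELATIONS`, `MODES` and `atomic_rules` are assumptions made for this illustration; only four relations are listed, with source and target types taken from PROV-DM, since the full set of 13 is not enumerated in the text.

```python
from itertools import product

# (relation, source type, target type), following PROV-DM.
# Only four relations are listed here, for illustration.
RELATIONS = [
    ("used", "Activity", "Entity"),
    ("wasGeneratedBy", "Entity", "Activity"),
    ("wasDerivedFrom", "Entity", "Entity"),
    ("wasAssociatedWith", "Activity", "Agent"),
]

# The three atomic rewrites available for every relation:
# add a new source node, add a new target node, or link two existing nodes.
MODES = ("new_source", "new_target", "link_existing")

def atomic_rules(relations):
    """Expand each relation into one rule per rewrite mode."""
    return [
        {"relation": rel, "source": src, "target": tgt, "mode": mode}
        for (rel, src, tgt), mode in product(relations, MODES)
    ]

rules = atomic_rules(RELATIONS)
print(len(rules))        # 4 relations x 3 modes = 12
print(13 * len(MODES))   # the 13 relations considered in the text give 39
```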

Users can control the execution of these rules and the overall effect of the generation process in three complementary ways, namely (i) by specifying a seed graph, (ii) by adding a set of constraints, and (iii) by specifying additional execution parameters. We now describe these in some detail.

1. Seed graphs. A seed graph specification restricts the set of rules to choose from, to only those corresponding to the relations that appear in the graph. As an example, the document revision pattern depicted in Fig. 2 may be expressed as follows, using PROV-N syntax (Footnote 4):

figure a

Using this graph, provGen determines that only \( wasGeneratedBy \), \( used \), \( wasDerivedFrom \) and \( wasAssociatedWith \) rules are to be used. Furthermore, it will associate the properties and values found in the seed graph, for instance prov:type="edit", to the new nodes and relations.
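The selection of active relations can be sketched as follows. The PROV-N fragment is a hypothetical seed written for this example (identifiers such as ex:v1 and ex:alice are invented), and the regular-expression scan is a simplification; as described later, provGen itself parses seed traces with third-party libraries.

```python
import re

# Hypothetical seed graph in PROV-N for a single document revision step.
SEED = '''
document
  prefix ex <http://example.org/>
  entity(ex:v1)
  entity(ex:v2)
  activity(ex:edit1, [prov:type="edit"])
  agent(ex:alice)
  used(ex:edit1, ex:v1, -)
  wasGeneratedBy(ex:v2, ex:edit1, -)
  wasDerivedFrom(ex:v2, ex:v1)
  wasAssociatedWith(ex:edit1, ex:alice)
endDocument
'''

NODE_KINDS = {"entity", "activity", "agent"}

def relations_in_seed(provn):
    """Return the relation names used by statements in a PROV-N document."""
    names = re.findall(r"^[ \t]*(\w+)\(", provn, flags=re.MULTILINE)
    return sorted(set(names) - NODE_KINDS)

active = relations_in_seed(SEED)
print(active)
```

Only rewrite rules whose relation appears in `active` would then be offered to the generator.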

2. Constraints. Even with this restriction, unconstrained generation would lead to a graph with arbitrarily high node degree and branching factor, which would bear little resemblance to the seed trace provided, except in its relationship makeup. To further control the generation process, the second user input consists of an additional set of constraints, specified using a natural and intuitive syntax. Constraints are syntactically similar to workflow control-flow patterns [VTKB03], expressing the required states of data being created.

Table 1. Examples of user-defined constraints for graph generation.

Constraints consist of three structural components, as shown in the examples of Table 1, namely a determiner, an imperative, and a condition. The determiner is either variable (an Agent) or invariable (the Agent, a1) and determines the elements to which a constraint applies. Requirements on these elements are specified by means of the imperative clause. For instance, has in degree (the requirement) at most 1 (a qualifier) allows a new incoming edge to be added to any Entity that has none. The qualifier may optionally include a probability distribution, as in the second example. This determines the likelihood that an action be taken in order to satisfy the requirement, namely the generation of a new \( wasAssociatedWith \) relation. Furthermore, a condition specifies the applicability of an imperative to a determined element, i.e. when (selective condition) or unless (greedy condition). Thus, the second constraint inhibits the creation of a new \( wasAssociatedWith \) relation for any Agent that already has an \( actedOnBehalfOf \) relation associated with it. Conditions admit the use of logical connectives, as in the third and last constraint examples, and may predicate on properties that are mentioned in the seed graph, such as prov:type (pre-defined) or ex:name (user-defined). Finally, the last constraint shows an example of variable usage (a1).
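One possible in-memory representation of the three components is sketched below. This is not the provGen DSL or its parser; the class, its field names and the dictionary-based node representation are all assumptions, chosen only to show how a determiner, an imperative with a qualifier, and an optional when/unless condition might fit together.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Constraint:
    determiner: str            # node type the constraint applies to
    feature: str               # requirement, e.g. "in_degree"
    bound: str                 # qualifier keyword: "at most" or "at least"
    value: int                 # qualifier value
    probability: float = 1.0   # optional likelihood of acting on the requirement
    # Optional condition; negate=False models "when", negate=True "unless".
    condition: Optional[Callable[[dict], bool]] = None
    negate: bool = False

    def applies_to(self, node: dict) -> bool:
        if node["type"] != self.determiner:
            return False
        if self.condition is None:
            return True
        return self.condition(node) != self.negate

    def allows_increment(self, node: dict) -> bool:
        """True if one more edge keeps an 'at most' bound satisfied."""
        if not self.applies_to(node) or self.bound != "at most":
            return True
        return node[self.feature] + 1 <= self.value

# "an Entity has in degree at most 1"
c = Constraint("Entity", "in_degree", "at most", 1)
fresh = {"type": "Entity", "in_degree": 0}
used_once = {"type": "Entity", "in_degree": 1}
print(c.allows_increment(fresh), c.allows_increment(used_once))
```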

Note that these constraints are in addition to those defined in the PROV-CONSTR document [CMM12]. For instance, provGen will not create a graph where entities are generated by multiple activities. The sketch in Fig. 3 shows the different patterns obtained when generating the graph with and without enforcing the constraints.

Fig. 3. Sketch of PROV graphs generated with and without enforcing user constraints

A more complete account of the constraint DSL can be found as part of the provGen documentation (Footnote 5).

3. Execution Parameters. Finally, users may specify additional execution parameters to control the number of distinct (unconnected) graphs to be generated, as well as the average number of nodes and edges per graph. More advanced parameters can be used to control the average height (maximum depth) and width (maximum breadth) for each graph generated.

The combination of seed graph, constraints, and execution parameters leads to collections of PROV graphs that approximate real traces from different domains, and which can be used to populate selected areas of our provenance space (Fig. 1). In Sect. 4 we briefly sketch the evaluation method we are using to test the quality of generated graphs, with respect to a large testbed of provenance graphs with known topological properties.

Overall, provGen’s generation process consists of a nested iteration loop. In the inner loop, provGen iterates over the set of active atomic rewrite rules. When a rule fires, any constraint that applies to the elements that it is operating upon is checked, and if any of those constraints is violated, the rule has no effect. This process is repeated in the outer loop, until a halting condition is satisfied, i.e., the desired size is reached, and the DSL constraints are satisfied.
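The nested loop can be sketched in memory as follows. This is a simplified illustration under several assumptions: the graph is held in Python lists rather than in a graph database, only rules that add a new node are modelled, the only halting condition is a node count, and the rule and constraint formats are invented for the example.

```python
import random

def generate(rules, constraints, max_nodes, max_rounds=1000, seed=0):
    """Grow a graph by repeatedly offering every active rule a chance to fire."""
    rng = random.Random(seed)
    nodes = [{"id": 0, "type": "Entity", "in": 0, "out": 0}]  # single start node
    edges = []
    for _ in range(max_rounds):                 # outer loop
        if len(nodes) >= max_nodes:             # halting condition: desired size
            break
        for rule in rules:                      # inner loop over active rules
            if len(nodes) >= max_nodes:
                break
            anchors = [n for n in nodes if n["type"] == rule["anchor"]]
            if not anchors:
                continue
            anchor = rng.choice(anchors)
            new = {"id": len(nodes), "type": rule["new"], "in": 0, "out": 0}
            # A rule that would violate any constraint has no effect.
            if not all(ok(rule, anchor, new) for ok in constraints):
                continue
            nodes.append(new)
            src, dst = (new, anchor) if rule["new_is_source"] else (anchor, new)
            src["out"] += 1
            dst["in"] += 1
            edges.append((src["id"], rule["relation"], dst["id"]))
    return nodes, edges

RULES = [
    # given an Entity, add a new Activity that used it: used(a, e)
    {"relation": "used", "anchor": "Entity", "new": "Activity", "new_is_source": True},
    # given an Activity, add a new Entity it generated: wasGeneratedBy(e, a)
    {"relation": "wasGeneratedBy", "anchor": "Activity", "new": "Entity", "new_is_source": True},
]

def entity_used_at_most_once(rule, anchor, new):
    # Only `used` edges point at entities here, so in-degree counts usages.
    return rule["relation"] != "used" or anchor["in"] < 1

nodes, edges = generate(RULES, [entity_used_at_most_once], max_nodes=50)
print(len(nodes), len(edges))
```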

3 Mapping the Model to Graph DBMS Queries

provGen is implemented using the Neo4J graph DBMS (Footnote 6) as a back end. In particular, both atomic rewrite rules and user constraints are transparently compiled into CREATE and MATCH statements expressed in Cypher, Neo4J’s declarative graph pattern language (Footnote 7). Queries (in addition to CREATE statements) are required at each iteration to test the requirements and conditions associated with user constraints (Table 1). This compilation step provides isolation from the data layer, delegating graph traversal to the underlying DBMS, and also provides flexibility for retargeting the graph generator to a different back end. A native graph DBMS also offers a more natural data model for PROV than a more traditional RDBMS solution.

Fig. 4. provGen system architecture.

The provGen architecture is shown in Fig. 4. Components are deployed on a server, which is reachable from a web based client application through a REST API. In the following sections, we focus on the steps involved in generating Cypher queries from rewrite rules and user constraints.

3.1 From Seed Traces to MATCH Query Clauses

The first step involves parsing the seed traces. Since these user-supplied samples of PROV data may be serialized into multiple formats, parsing relies upon several third party libraries, including the OWLAPI (Footnote 8) and ProvToolbox (Footnote 9). This step selects, for use in the generation step, a subset of the 39 pre-defined atomic graph rewrite rules mentioned in Sect. 2.

Rewrite rules are statically mapped to Cypher queries. As an example, below we show the queries responsible for creating the PROV \( used \) relationship. Note that multiple queries are required in order to account for the directed nature of PROV relationships and the ability to create an edge between two pre-existing nodes.

figure c

Query fragment (1) matches any node a of type Activity, creates a new Entity node, and connects it to a using a \( used \) relationship. Symmetrically, (2) adds a new Activity node to any existing Entity node. Finally, (3) takes a pair of existing nodes a (Activity) and b (Entity), and again creates a \( used \) relationship between them.

The examples above show empty brackets, to indicate that no properties are associated to the nodes and relationships. However, all properties associated to the elements of the seed trace are also associated to corresponding elements of the new graph. Thus, for example activities would have a property prov:type, inherited from the activity node in the seed graph above.
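As a rough illustration of this mapping, the sketch below builds three Cypher strings for the \( used \) relation, with property maps left empty by default and filled in from seed properties when supplied. The queries are written here in present-day Cypher and are not the paper's original listings; the node labels, the relationship type name `used`, the helper names, and the WHERE NOT guard in query (3) are assumptions. The property rendering is adequate only for simple string and numeric values.

```python
def props(p):
    """Render a property map as a Cypher literal; an empty dict yields {}."""
    body = ", ".join(f"`{k}`: {v!r}" for k, v in p.items())
    return "{" + body + "}"

def used_queries(activity=None, entity=None, rel=None):
    """Return one Cypher string per atomic rewrite of the `used` relation."""
    a, e, r = props(activity or {}), props(entity or {}), props(rel or {})
    return [
        # (1) given an Activity, add a new Entity that it used
        f"MATCH (a:Activity) CREATE (a)-[:used {r}]->(:Entity {e})",
        # (2) given an Entity, add a new Activity that used it
        f"MATCH (e:Entity) CREATE (:Activity {a})-[:used {r}]->(e)",
        # (3) given an unrelated Activity/Entity pair, connect them
        "MATCH (a:Activity), (e:Entity) WHERE NOT (a)-[:used]->(e) "
        f"CREATE (a)-[:used {r}]->(e)",
    ]

plain = used_queries()
typed = used_queries(activity={"prov:type": "edit"})
for query in plain:
    print(query)
```

Backticks are needed around property keys such as prov:type because the colon is not allowed in an unquoted Cypher identifier.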

3.2 Constraints as WHERE Clauses

The DSL parser (Footnote 10) separates the component elements of each constraint, namely determiner, imperative and condition. Requirements may be expressed on various graph features, such as node in/out degree, relationships, and properties. Each type of requirement is compiled into a Cypher query WHERE clause. These clauses are then added to the MATCH statements that represent the atomic rewrite rules, to form complete queries. Consider the following example:

figure d

These constraints are easily interpreted in the context of a document revision pattern, where activities are edits of document versions, which produce a new version. For these activities, we stipulate that they use only one entity (the original document). Activities that create new documents are exceptions, noted by the ex:name=create property, and these activities are allowed to use zero or more input documents. Additionally, we add an upper bound to an Activity node’s degree to illustrate a more complex constraint.

The constraint is compiled into query fragments (4) and (5) in the Cypher query below, where they are merged with the MATCH and CREATE clauses of atomic query (1) from the example above:

figure e

The query specifies at the same time the node and relationship generation, and the constraint. The MATCH clauses bind variables a and r to an Activity and to the set of its edges, respectively (either incoming or outgoing, as no direction is specified). The WHERE clause ensures that the CREATE statement (which creates a new \( used \) relationship) is only executed on a if the “ex:name” property is not “create”, and the number of edges in set \(r\) is at most 4.
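The merging step can be sketched as string assembly. The query below follows the description above (an Activity, the set of its edges in either direction, a check on ex:name, and a degree bound of 4) but is a reconstruction in present-day Cypher rather than the original listing; the function names, the use of OPTIONAL MATCH with count, and the coalesce guard for nodes lacking the property are assumptions.

```python
def compile_where(max_degree=None, excluded_name=None):
    """Turn two kinds of requirement into Cypher predicates on a and deg."""
    clauses = []
    if excluded_name is not None:
        # coalesce() makes a missing property count as "not excluded"
        clauses.append(f"coalesce(a.`ex:name`, '') <> '{excluded_name}'")
    if max_degree is not None:
        clauses.append(f"deg <= {int(max_degree)}")
    return " AND ".join(clauses)

def constrained_used_query(where):
    """Merge a compiled WHERE clause into atomic rewrite (1) for `used`."""
    parts = [
        "MATCH (a:Activity)",
        "OPTIONAL MATCH (a)-[r]-()",   # edges of a, in either direction
        "WITH a, count(r) AS deg",
    ]
    if where:
        parts.append(f"WHERE {where}")
    parts.append("CREATE (a)-[:used {}]->(:Entity {})")
    return "\n".join(parts)

query = constrained_used_query(compile_where(max_degree=4, excluded_name="create"))
print(query)
```

Keeping the constraint inside the same statement as the CREATE means that the check and the update are evaluated together by the query engine, which matches the single-query design described above.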

3.3 Generator Loop

The generator loop (Fig. 4) accepts a collection of atomic create operations, selected and constrained as described above, and repeatedly iterates over it, executing each associated Cypher query against the underlying graph database.

The generator loop has several halting conditions: both explicit, where execution parameters, detailed in Sect. 2, halt generation as the order \(|V|\) and size \(|E|\) of the graph reach their specified maxima; and implicit, where constraint rules may prevent the execution of individual operations in order to avoid violating specified range requirements. Note that limits in cardinality imposed by execution parameters may be met before the minimum requirements of a constraint rule are satisfied. When this is the case, provGen gives priority to the user constraints, to ensure that those are not violated.

4 Evaluation Methodology

The main purpose of provGen is to fulfill the need to generate a possibly large number of provenance graphs for data domains where provenance is not yet routinely collected, or is not abundant. Yet, our evaluation of the system’s effectiveness relies on precisely those domains where large provenance collections are available. Specifically, we evaluate provGen by comparing selected properties of existing “real-world” provenance graphs, which we call the control set, to those of generated graphs (the test set) intended to emulate them. Using this approach, we aim to empirically demonstrate that provGen may be configured to generate datasets that are “similar” to those produced by multiple different sources of provenance.

Our evaluation is ongoing. Here we illustrate the approach using one single control set, namely a set of Wikipedia provenance traces, representative of the document revision pattern, taken from the ProvBench repository and compliant with PROV (Footnote 11). The control graphs include about 4,000 nodes and 6,000 relationships. Our test set consists of two synthetic datasets of roughly the same size as the control, produced using provGen with a user-created seed trace for the document revision pattern, along with constraints and parameters.

In this initial evaluation we have considered three simple criteria. Firstly, we note that in the control set, which follows the linear Wikipedia pattern (Fig. 2), each Entity is used exactly once. Thanks to our user constraints, this is easily replicated exactly in the test set. Secondly, as example criteria we additionally consider the number of associations per Agent, and the average number of entities with distinct titles contributed to, per Agent. In the control, each Agent has 2.4 associations on average (std dev. 6.2), while in our test set it has 2.9. The average number of contributions per Agent is 1.1 in the control (std dev. 0.8), while in the test set it is 1.8. Encouraged by these preliminary results, we are now in the process of more extensively testing provGen using a variety of criteria that can be easily measured on both control and test graphs.
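A criterion such as the number of associations per Agent can be computed from an edge list as sketched below. The edge-triple format, the function name and the toy data are assumptions, and the population standard deviation is used here because the text does not state which estimator was applied.

```python
from collections import Counter
from statistics import mean, pstdev

def associations_per_agent(edges, agents):
    """Mean and population std dev of wasAssociatedWith edges per agent."""
    counts = Counter(dst for _, rel, dst in edges if rel == "wasAssociatedWith")
    per_agent = [counts.get(agent, 0) for agent in agents]
    return mean(per_agent), pstdev(per_agent)

# Toy edge list of (source, relation, target) triples.
toy_edges = [
    ("edit1", "wasAssociatedWith", "alice"),
    ("edit2", "wasAssociatedWith", "alice"),
    ("edit3", "wasAssociatedWith", "alice"),
    ("edit4", "wasAssociatedWith", "bob"),
    ("edit1", "used", "v1"),
]
avg, sd = associations_per_agent(toy_edges, ["alice", "bob"])
print(avg, sd)
```

Running the same function on a control graph and on a generated graph gives directly comparable figures.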

5 Conclusion

In this paper we have presented provGen, a PROV-specific graph generator driven by user-defined seed graphs, which represent provenance patterns, and additional user-defined constraints designed to enforce semantic properties of the generated graph. Constraints are expressed in a dedicated “plain English” constraint language.

One feature that sets provGen apart from existing approaches to graph generation is that it provides users with local control over topological features and statistical characteristics of the graph. Constraints are evaluated locally for each node created, thus avoiding the complexity of verifying them globally. provGen is implemented using a Neo4J graph database back end. Graph rewrite rules and user constraints are both mapped to Cypher queries. Rewrite rules are mapped to CREATE clauses, while constraints are compiled into WHERE clauses. The two are blended together into complete Cypher queries, so that graph generation relies entirely on Neo4J’s native query engine.

We have also briefly discussed our approach to evaluating the effectiveness of provGen in generating “real-world” provenance, i.e., by comparing some of its key statistical properties with those of real graphs within the same class. We are currently experimenting with a variety of seed graph patterns, and more extensively evaluating provGen’s capability to mimic real provenance. Currently seed patterns must be manually designed or discerned. In future, an attempt to collate a collection of patterns common to provenance data, as has been done with workflow specifications [VTKB03], could prove useful.

Graph generation performance is another concern we are currently addressing. Generating large scale graphs requires efficient execution of the MATCH-CREATE-WHERE queries shown above, on graphs of increasing size. We are finding that Neo4J may not be an optimal choice, as it is geared for OLTP workloads with consequent transaction management overhead. However, our architecture is flexible and allows for experimentation, as changing the back end simply requires retargeting the mapping of rules and constraints to a different query language.