Iterative Generation of Diverse Models for Testing Specifications of DSL Tools
Abstract
The validation of modeling tools of custom domain-specific languages (DSLs) frequently relies upon an automatically generated set of models as a test suite. While many software testing approaches recommend that this test suite should be diverse, model diversity has not been studied systematically for graph models. In this paper, we propose diversity metrics for models by exploiting neighborhood shapes as abstraction. Furthermore, we propose an iterative model generation technique to synthesize a diverse set of models where each model is taken from a different equivalence class as defined by neighborhood shapes. We evaluate our diversity metrics in the context of mutation testing for an industrial DSL and compare our model generation technique with the popular model generator Alloy.
1 Introduction
Motivation. Domain-Specific Language (DSL) based modeling tools are gaining an increasing role in software development processes. Advanced DSL frameworks such as Xtext or Sirius, built on top of model management frameworks such as the Eclipse Modeling Framework (EMF) [37], significantly improve the productivity of domain experts by automating the production of rich editor features.
Modeling environments may provide validation for the system under design from an early stage of development, with efficient tool support for checking well-formedness (WF) constraints and design rules over large model instances of the DSL using tools like Eclipse OCL [24] or graph queries [41]. Model generation techniques [16, 19, 35, 39] are able to automatically provide a range of solution candidates for allocation problems [19], model refactoring or context generation [21]. Finally, models can be processed by query-based transformations or code generators to automatically synthesize source code or other artifacts.
The design of complex DSL tools is a challenging task. As the complexity of DSL tools increases, special attention is needed to validate the modeling tools themselves (e.g. for tool qualification purposes) to ensure that WF constraints and the preconditions of model transformation and code generation functionality [4, 32, 35] are correctly implemented in the tool.
Problem Statement. There are many approaches aiming to address the testing of DSL tools (or transformations) [1, 6, 42] which necessitate the automated synthesis of graph models to serve as test inputs. Many best practices of testing (such as equivalence partitioning [26] and mutation testing [18]) recommend the synthesis of diverse graph models where any pair of models is structurally different from the others to achieve high coverage or a diverse solution space.
While software diversity is widely studied [5], existing diversity metrics for graph models are much less elaborated [43]. Model comparison techniques [38] frequently rely upon the existence of node identifiers, which can easily lead to many isomorphic models. Moreover, checking graph isomorphism is computationally very costly. Therefore, practical solutions tend to use approximate techniques to achieve a certain level of diversity by random sampling [17], incremental generation [19, 35], or symmetry breaking predicates [39]. While equivalence partitioning captures the diversity of inputs in a customizable way for testing traditional software, a similar diversity concept is still missing for graph models.
Contribution. In this paper, we propose diversity metrics to characterize a single model and a set of models. For that purpose, we reuse neighborhood graph shapes [28], which provide a fine-grained typing for each object based on the structure (e.g. incoming and outgoing edges) of its neighborhood. Moreover, we propose an iterative model generation technique to automatically synthesize a diverse set of models for a DSL where each model is taken from a different equivalence class wrt. graph shapes as an equivalence relation.
We evaluate our diversity metrics and model generator in the context of mutation-based testing [22] of WF constraints in an industrial DSL tool. We evaluate and compare the mutation score and our diversity metrics for test suites obtained from (1) an Alloy-based model generator (using symmetry breaking predicates to ensure diversity), (2) an iterative graph-solver-based generator using neighborhood shapes, and (3) real models created by humans. Our finding is that a diverse set of models derived along different neighborhood shapes has a better mutation score. Furthermore, based on a test suite with 4850 models, we found a high correlation between mutation score and our diversity metrics, which indicates that our metrics may be good predictors in practice for testing.
Added Value. To the best of our knowledge, our paper is one of the first studies on (software) model diversity. From a testing perspective, our diversity metrics provide a stronger characterization of a test suite of models than the traditional metamodel coverage used in many research papers. Furthermore, model generators using neighborhood graph shapes (which keep models only if they are surely non-isomorphic) provide increased diversity compared to symmetry breaking predicates (which exclude models only if they are surely isomorphic).
2 Preliminaries
Core modeling concepts and testing challenges of DSL tools will be illustrated in the context of Yakindu Statecharts [46], an industrial DSL for developing reactive, event-driven systems, which supports validation and code generation.
2.1 Metamodels and Instance Models
Formally [32, 34], a metamodel defines a vocabulary of type and relation symbols \(\{\mathsf {C}_1, \ldots , \mathsf {C}_n, \mathsf {R}_1, \ldots , \mathsf {R}_m\}\) where a unary predicate symbol \(\mathsf {C}_i\) is defined for each EClass, and a binary predicate symbol \(\mathsf {R}_j\) for each EReference. For space considerations, we omit the precise handling of attributes.

the interpretation of a unary predicate symbol \(\mathsf {C}_i\) is defined in accordance with the types of the EMF model: an object \(o \in Obj _{M}\) is an instance of a class \(\mathsf {C}_i\) in a model M if \({[\![ \mathsf {C}_i(o) ]\!]}^{M} = 1\).

the interpretation of a binary predicate symbol \(\mathsf {R}_j\) is defined in accordance with the links in the EMF model: there is a reference \(\mathsf {R}_j\) between \(o_1, o_2 \in Obj _{M}\) in model M if \({[\![ \mathsf {R}_j(o_1, o_2) ]\!]}^{M} = 1\).
A metamodel also specifies extra structural constraints (type hierarchy, multiplicities, etc.) that need to be satisfied in each valid instance model [32].
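The logic view of an instance model above can be sketched with a simple set-based encoding (an illustration only; the class and reference names follow the paper's statechart examples, not any concrete EMF API):

```python
# Hypothetical set-based encoding of an instance model: class predicates as
# object-to-type sets, reference predicates as a set of labeled triples.

class Model:
    def __init__(self, types, links):
        self.types = types      # object id -> set of class symbols
        self.links = links      # set of (source, reference, target) triples
        self.objects = set(types)

    def class_pred(self, cls, o):
        # [[C(o)]]^M = 1 iff object o is an instance of class C
        return cls in self.types.get(o, set())

    def ref_pred(self, ref, o1, o2):
        # [[R(o1, o2)]]^M = 1 iff there is an R-link from o1 to o2
        return (o1, ref, o2) in self.links

# A fragment resembling M_1 of Example 1: entry e1, states s1/s2 in a loop.
M1 = Model(
    types={"e1": {"Entry"}, "s1": {"State"}, "s2": {"State"},
           "t1": {"Transition"}, "t2": {"Transition"}, "t3": {"Transition"}},
    links={("t1", "source", "e1"), ("t1", "target", "s1"),
           ("t2", "source", "s1"), ("t2", "target", "s2"),
           ("t3", "source", "s2"), ("t3", "target", "s1")},
)

print(M1.class_pred("State", "s1"))       # True
print(M1.ref_pred("target", "t1", "s1"))  # True
```

This encoding deliberately ignores attributes and structural constraints (multiplicities, containment), matching the simplifications stated above.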
Example 1
Figure 2 shows graph representations of three (partial) instance models. For the sake of clarity, some relations (e.g. inverse references) are excluded from the diagram. In \(M_1\) there are two states (s1 and s2), which are connected in a loop via transitions t2 and t3. The initial state is marked by a transition t1 from an entry e1 to state s1. \(M_2\) describes a similar statechart with three states in a loop (s3, s4 and s5 connected via t5, t6 and t7). Finally, in \(M_3\) there are two main differences: there is an incoming transition t11 to an entry state (e3), and there is a state s7 that does not have an outgoing transition. While models \(M_1\) and \(M_2\) are non-isomorphic, later we illustrate why they are not diverse.
2.2 Well-Formedness Constraints as Logic Formulae
In many industrial modeling tools, WF constraints are captured either by OCL constraints [24] or graph patterns (GP) [41], where the latter capture structural conditions over an instance model as paths in a graph. To have a unified and precise handling of evaluating WF constraints, we use a tool-independent logic representation (which was influenced by [29, 32, 34]) that covers the key features of concrete graph pattern languages and a first-order fragment of OCL.
Syntax. A graph predicate is a first-order logic predicate \(\varphi (v_1, \ldots v_n)\) over (object) variables which can be inductively constructed by using class and relation predicates \(\mathsf {C}(v)\) and \(\mathsf {R}(v_1,v_2)\), equality check \(=\), standard first-order logic connectives \(\lnot \), \(\vee \), \(\wedge \), and quantifiers \(\exists \) and \(\forall \).
Semantics. A graph predicate \(\varphi (v_1,\ldots ,v_n)\) can be evaluated on model M along a variable binding \(Z:\{v_1,\ldots ,v_n\} \rightarrow Obj _{M}\) from variables to objects in M. The truth value of \(\varphi \) can be evaluated over model M along the mapping Z (denoted by \({[\![ \varphi (v_1,\ldots ,v_n) ]\!]}^{M}_Z\)) in accordance with the semantic rules defined in Fig. 3.
If there is a variable binding Z where the predicate \(\varphi \) evaluates to \(1\) over M, i.e. \({[\![ \varphi ]\!]}^{M}_{Z} = 1\), then Z is often called a \({pattern\, match}\). Otherwise, if there is no binding Z satisfying the predicate, i.e. \({[\![ \varphi ]\!]}^{M}_{Z} = 0\) for all Z, then the predicate \(\varphi \) evaluates to \(0\) over M. Graph query engines like [41] can retrieve (one or all) matches of a graph predicate over a model. When using graph patterns for validating WF constraints, a match of a pattern usually denotes a violation, thus the corresponding graph formula needs to capture the erroneous case.
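The match semantics can be sketched by a brute-force evaluator restricted to quantifier-free conjunctions of class and reference atoms (a simplification for illustration; the paper's full language also includes negation, disjunction, equality and quantifiers):

```python
from itertools import product

def matches(objects, types, links, atoms, variables):
    """Enumerate bindings Z for which every atom evaluates to 1."""
    for combo in product(sorted(objects), repeat=len(variables)):
        Z = dict(zip(variables, combo))
        ok = True
        for atom in atoms:
            if atom[0] == "class":            # ("class", C, v)
                _, cls, v = atom
                ok = cls in types.get(Z[v], set())
            else:                             # ("ref", R, v1, v2)
                _, ref, v1, v2 = atom
                ok = (Z[v1], ref, Z[v2]) in links
            if not ok:
                break
        if ok:
            yield Z

# A tiny model: transition t1 from entry e1 to state s1.
objects = {"e1", "s1", "t1"}
types = {"e1": {"Entry"}, "s1": {"State"}, "t1": {"Transition"}}
links = {("t1", "source", "e1"), ("t1", "target", "s1")}

# "a transition T leaving an entry E": Entry(E) ∧ Transition(T) ∧ source(T, E)
pattern = [("class", "Entry", "E"), ("class", "Transition", "T"),
           ("ref", "source", "T", "E")]
result = list(matches(objects, types, links, pattern, ["E", "T"]))
print(result)  # [{'E': 'e1', 'T': 't1'}]
```

Production query engines avoid this exponential enumeration via indexing and incremental techniques; the sketch only fixes the semantics.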
2.3 Motivation: Testing of DSL Tools
A code generator would normally assume that the input models are well-formed, i.e. all WF constraints are validated prior to calling the code generator. However, there is no guarantee that the WF constraints actually checked by the DSL tool are exactly the same as the ones required by the code generator. For instance, if the validation forgets to check a subclause of a WF constraint, then runtime errors may occur during code generation. Moreover, the precondition of a transformation rule may also contain errors. For that purpose, WF constraints and model transformations of DSL tools need to be systematically tested. Alternatively, model validation can be interpreted as a special case of model transformation, where the preconditions of the transformation rules are fault patterns, and the actions place error markers on the model [41].
A popular approach for testing DSL tools is mutation testing [22, 36], which aims to reveal missing or extra predicates by (1) deriving a set of mutants (e.g. WF constraints in our case) by applying a set of mutation operators. Then (2) the test suite is executed for both the original and the mutant programs, and (3) their outputs are compared. (4) A mutant is killed by a test if different output is produced for the two cases (i.e. a different match set). (5) The mutation score of a test suite is calculated as the ratio of mutants killed by some test wrt. the total number of mutants. A test suite with a better mutation score is preferred [18].
Fault Model and Detection. As a fault model, we consider omission faults in WF constraints of DSL tools where some subconstraints are not actually checked. In our fault model, a WF constraint is given in a conjunctive normal form \(\varphi _e = \varphi _1 \wedge \dots \wedge \varphi _k \), all unbound variables are quantified existentially (\(\exists \)), and it may refer to other predicates specified in the same form. Note that this format is equivalent to first-order logic, and does not reduce the range of supported graph predicates. We assume that in a faulty predicate (a mutant) the developer may forget to check one of the predicates \(\varphi _i\) (Constraint Omission, CO), i.e. \(\varphi _e = [\varphi _1\wedge \ldots \wedge \varphi _i\wedge \ldots \wedge \varphi _k]\) is rewritten to \(\varphi _f = [\varphi _1 \wedge \dots \wedge \varphi _{i-1} \wedge \varphi _{i+1} \wedge \dots \wedge \varphi _k]\), or may forget a negation (Negation Omission, NO), i.e. \(\varphi _e = [\varphi _1\wedge \ldots \wedge (\lnot \varphi _i)\wedge \ldots \wedge \varphi _k]\) is rewritten to \(\varphi _f = [\varphi _1\wedge \ldots \wedge \varphi _i\wedge \ldots \wedge \varphi _k]\). Given an instance model M, we assume that both \({[\![ \varphi _e ]\!]}^{M}_{}\) and the faulty \({[\![ \varphi _f ]\!]}^{M}_{}\) can be evaluated separately by the DSL tool. Now a test model M detects a fault if there is a variable binding Z where the two evaluations differ, i.e. \({[\![ \varphi _e ]\!]}^{M}_{Z} \ne {[\![ \varphi _f ]\!]}^{M}_{Z}\).
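The two omission operators can be sketched directly on a list of conjuncts (an illustration only; predicates are plain strings here, and a leading ¬ marks a negated conjunct):

```python
def co_mutants(conjuncts):
    """Constraint Omission: one mutant per conjunct, each dropping one φ_i."""
    return [conjuncts[:i] + conjuncts[i + 1:] for i in range(len(conjuncts))]

def no_mutants(conjuncts):
    """Negation Omission: one mutant per negated conjunct, stripping its ¬."""
    return [conjuncts[:i] + [c[1:]] + conjuncts[i + 1:]
            for i, c in enumerate(conjuncts) if c.startswith("¬")]

# A hypothetical 3-conjunct constraint in the style of the running example.
phi_e = ["Entry(E)", "Transition(T)", "target(T, E)"]
for phi_f in co_mutants(phi_e):
    print(" ∧ ".join(phi_f))          # three 2-conjunct mutants

print(no_mutants(["¬A(x)", "B(x)"]))  # [['A(x)', 'B(x)']]
```

Applying both operators to every applicable position of every constraint is exactly the "every possible way" enumeration used later in the evaluation.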
Example 2
According to our fault model, we can derive two mutants for \( incomingToEntry \) as predicates \(\varphi _{f_1}\) and \(\varphi _{f_2}\).
Constraints \(\varphi \) and \(\phi \) are satisfied in models \(M_1\) and \(M_2\) as the corresponding graph predicates have no matches, thus \({[\![ \varphi ]\!]}^{M_1}_{Z} = 0\) and \({[\![ \phi ]\!]}^{M_1}_{Z} = 0\). As test models, both \(M_1\) and \(M_2\) are able to detect the same omission fault for \(\varphi _{f_1}\) as \({[\![ \varphi _{f_1} ]\!]}^{M_1}_{} = 1\) (with \(E \mapsto e1\) and \(E \mapsto e2\)), and similarly for \(\varphi _{f_2}\) (with s1 and s3). However, \(M_3\) is unable to kill mutant \(\varphi _{f_1}\) (as \(\varphi \) had a match \(E\mapsto e3\) which remains in \(\varphi _{f_1}\)), but it is able to detect the others.
3 Model Diversity Metrics for Testing DSL Tools
As a general best practice in testing, a good test suite should be diverse, but the interpretation of diversity may differ. For example, equivalence partitioning [26] partitions the input space of a program into equivalence classes based on observable output, and then selects the test cases of a test suite from different equivalence classes to achieve a diverse test suite. However, while software diversity has been studied extensively [5], model diversity is much less covered.
In existing approaches [6, 7, 9, 10, 31, 42] for testing DSL and transformation tools, a test suite should provide full metamodel coverage [45], and it should also guarantee that any pair of models in the test suite is non-isomorphic [17, 39]. In [43], the diversity of a model \(M_i\) is defined as the number of (direct) types used from its metamodel \( MM \), i.e. \(M_i\) is more diverse than \(M_j\) if more types of \( MM \) are used in \(M_i\) than in \(M_j\). Furthermore, a model generator Gen deriving a set of models \(\{M_i\}\) is diverse if there is a designated distance between each pair of models \(M_i\) and \(M_j\): \( dist (M_i,M_j) > D\), but no concrete distance function is proposed.
Below, we propose diversity metrics for a single model, for pairs of models and for a set of models based on neighborhood shapes [28], a formal concept known from the state space exploration of graph transformation systems [27]. Our diversity metrics generalize both metamodel coverage and (graph) isomorphism tests, which are derived as two extremes of the proposed metrics, thus defining a finer-grained equivalence partitioning technique for graph models.
3.1 Neighborhood Shapes of Graphs

For range \(i=0\), \( Nbh _{0}\) is a subset of sets of class symbols: \( Nbh _{0} \subseteq 2^{\{\mathsf {C}_1, \ldots , \mathsf {C}_n\}}\).

A neighbor \( Ref _{i}\) for \(i>0\) is defined by a reference symbol and a neighborhood: \( Ref _{i} \subseteq \{\mathsf {R}_1, \ldots , \mathsf {R}_m\} \times Nbh _{i-1}\).

For a range \(i>0\), a neighborhood \( Nbh _{i}\) is defined by a previous neighborhood and two sets of neighbor descriptors (for incoming and outgoing references separately): \( Nbh _{i} \subseteq Nbh _{i-1} \times 2^{ Ref _{i}} \times 2^{ Ref _{i}}\).
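The inductive definition can be sketched as an iterated refinement (in the spirit of color refinement; a sketch assuming a set-based model encoding where `types` maps objects to class symbols and `links` holds (object, reference, object) triples):

```python
def neighborhoods(types, links, i):
    """Map each object to its range-i neighborhood descriptor: range 0 is
    the object's set of classes; range i pairs the range i-1 descriptor
    with the (reference, neighborhood) sets of incoming/outgoing edges."""
    nbh = {o: frozenset(cls) for o, cls in types.items()}
    for _ in range(i):
        nbh = {
            o: (nbh[o],
                frozenset((r, nbh[src]) for (src, r, trg) in links if trg == o),
                frozenset((r, nbh[trg]) for (src, r, trg) in links if src == o))
            for o in types
        }
    return nbh

def shape(types, links, i):
    """S_i(M): the set of range-i neighborhoods occurring in the model."""
    return set(neighborhoods(types, links, i).values())

# A fragment resembling M_1 of Example 1 (entry e1, states s1/s2 in a loop).
types = {"e1": {"Entry"}, "s1": {"State"}, "s2": {"State"},
         "t1": {"Transition"}, "t2": {"Transition"}, "t3": {"Transition"}}
links = {("t1", "source", "e1"), ("t1", "target", "s1"),
         ("t2", "source", "s1"), ("t2", "target", "s2"),
         ("t3", "source", "s2"), ("t3", "target", "s1")}

print(len(shape(types, links, 0)))  # 3: only the classes are distinguished
print(len(shape(types, links, 1)))  # 4: s1=s2 and t2=t3 remain merged
```

Note how the range-1 shape already merges the loop states s1 and s2 into one equivalence class, matching the 4/6 internal diversity reported for \(M_1\) in Example 4.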
Example 3
For range 2, each object of \(M_1\) would be mapped to a unique element. In Fig. 4, the neighborhood shapes of models \(M_1\), \(M_2\), and \(M_3\) for range 1 are represented in a visual notation adapted from [28, 29] (without additional annotations, e.g. multiplicities or predicates used for verification purposes). The mapping of the concrete graph nodes to neighborhoods is illustrated on the right. For instance, e1 in \(M_1\) and e2 in \(M_2\) are both mapped to the same neighborhood n1, while e3 can be distinguished from them as it has an incoming reference from a transition, thus creating a different neighborhood n5.
 P1
There are only a finite number of graph shapes for a certain range, and a smaller range reduces the number of graph shapes, i.e. \(|S_{i}(M)| \le |S_{i+1}(M)|\).
 P2
\(|S_{i}(M_j)| + |S_{i}(M_k)| \ge |S_{i}(M_j \cup M_k)| \ge |S_{i}(M_j)|\) and \(|S_{i}(M_k)|\).
3.2 Metrics for Model Diversity
We define two metrics for model diversity based upon neighborhood shapes. Internal diversity captures the diversity of a single model, i.e. it can be evaluated individually for each and every generated model. As neighborhood shapes introduce extra subtypes for objects, this model diversity metric measures the number of neighborhood types used in the model with respect to the size of the model. External diversity captures the distance between pairs of models. Informally, this diversity distance between two models will be proportional to the number of different neighborhoods covered in one model but not the other.
Definition 1
(Internal model diversity). For a range i of neighborhood shapes for model M, the internal diversity of M is the number of shapes wrt. the size of the model: \(d_{i}^{int}(M) = |S_{i}(M)| / |M|\).
The range of this internal diversity metric \(d_{i}^{int}(M)\) is [0..1], and a model M with \(d_{1}^{int}(M) = 1\) (and \(|M| \ge |MM|\)) guarantees full metamodel coverage [45], i.e. it surely contains all elements from a metamodel as types. As such, it is an appropriate diversity metric for a model in the sense of [43]. Furthermore, given a specific range i, the number of potential neighborhood shapes within that range is finite, but it grows super-exponentially. Therefore, for a small range i, one can derive a model \(M_j\) with \(d_{i}^{int}(M_j) = 1\), but for larger models \(M_k\) (with \(|M_k| > |M_j|\)) we will likely have \(d_{i}^{int}(M_j) \ge d_{i}^{int}(M_k)\). However, due to the rapid growth of the number of shapes for increasing range i, for most practical cases, \(d_{i}^{int}(M_j)\) will converge to 1 if \(M_j\) is sufficiently diverse.
Definition 2
(External model diversity). Given a range i of neighborhood shapes, the external diversity of models \(M_j\) and \(M_k\) is the number of shapes contained exclusively in \(M_j\) or \(M_k\) but not in the other, formally, \(d_{i}^{ext}(M_j, M_k) = |S_{i}(M_j) \oplus S_{i}(M_k)|\) where \( \oplus \) denotes the symmetric difference of two sets.
External model diversity allows us to compare two models. One can show that this metric is a (pseudo)distance in the mathematical sense [2], and thus it can serve as a diversity metric for a model generator in accordance with [43].
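Definitions 1 and 2 reduce to simple set arithmetic once the shape sets are available (a sketch; shapes are abstracted to opaque labels mirroring the n1–n7 neighborhoods of Example 4):

```python
def d_int(shapes, model_size):
    """d_i^int(M) = |S_i(M)| / |M|"""
    return len(shapes) / model_size

def d_ext(shapes_j, shapes_k):
    """d_i^ext(M_j, M_k) = |S_i(M_j) ⊕ S_i(M_k)| (symmetric difference)"""
    return len(shapes_j ^ shapes_k)

S1 = {"n1", "n2", "n3", "n4"}            # range-1 shapes of M_1
S2 = {"n1", "n2", "n3", "n4"}            # M_2 covers exactly the same shapes
S3 = {"n2", "n3", "n4", "n5", "n6", "n7"}

print(d_int(S1, 6))   # 4/6 ≈ 0.67
print(d_ext(S1, S2))  # 0: M_2 would be excluded during generation
print(d_ext(S1, S3))  # 4: n1 on one side, n5/n6/n7 on the other
```

The symmetric difference makes symmetry and the triangle inequality immediate, while \(d^{ext} = 0\) for shape-equal but non-isomorphic models is precisely why this is only a pseudo-distance.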
Definition 3

(Pseudo-distance). A function d over pairs of models is a (pseudo)distance if:

d is non-negative: \(d(M_j,M_k) \ge 0\)

d is symmetric: \(d(M_j,M_k) = d(M_k, M_j)\)

if \(M_j\) and \(M_k\) are isomorphic, then \(d(M_j,M_k) = 0\)

triangle inequality: \(d(M_j,M_l) \le d(M_j, M_k) + d(M_k, M_l)\)
Corollary 1
External model diversity \(d_{i}^{ext}(M_j, M_k)\) is a (pseudo)distance between models \(M_j\) and \(M_k\) for any i.
During model generation, we will exclude a model \(M_k\) if \(d_{i}^{ext}(M_j, M_k) = 0\) for a previously derived model \(M_j\), even though this does not imply that they are isomorphic. Thus our definition allows us to avoid graph isomorphism checks between \(M_j\) and \(M_k\), which have high computational complexity. Note that external diversity is a dual of the symmetry breaking predicates [39] used in the Alloy Analyzer, where \(d(M_j,M_k) = 0\) implies that \(M_j\) and \(M_k\) are isomorphic (and not vice versa).
Definition 4
(Coverage of model set). Given a range i of neighborhood shapes and a set of models \(MS = \{ M_1, \dots , M_k\}\), the coverage of this model set is defined as \(cov_{i}\langle MS \rangle = |S_{i}(M_1) \cup \dots \cup S_{i}(M_k)|\).
The coverage of a model set is not normalized, but its value grows monotonically for any range i as new models are added. Thus it corresponds to our expectation that adding a new test case to a test suite should increase its coverage.
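Definition 4 is likewise a set union over the shape sets (a sketch reusing the abstract n1–n7 shape labels from Example 4):

```python
def coverage(shape_sets):
    """cov_i⟨MS⟩ = |S_i(M_1) ∪ … ∪ S_i(M_k)|"""
    union = set()
    for shapes in shape_sets:
        union |= shapes   # union can only grow: coverage is monotonic
    return len(union)

S1 = {"n1", "n2", "n3", "n4"}
S2 = {"n1", "n2", "n3", "n4"}
S3 = {"n2", "n3", "n4", "n5", "n6", "n7"}

print(coverage([S1]))          # 4
print(coverage([S1, S2]))      # 4: M_2 contributes no new shape
print(coverage([S1, S2, S3]))  # 7, as in Example 4
```

The middle call illustrates why shape-equal models are discarded during generation: they never raise the coverage of the suite.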
Example 4
Let us calculate the different diversity metrics for \(M_1\), \(M_2\) and \(M_3\) of Fig. 2. For range 1, they have the shapes illustrated in Fig. 4. The internal diversities of these models are \(d_{1}^{int}(M_1)=4/6\), \(d_{1}^{int}(M_2)=4/8\) and \(d_{1}^{int}(M_3)=6/7\), thus \(M_3\) is the most diverse model among them. As \(M_1\) and \(M_2\) have the same shape, the distance between them is \(d_{1}^{ext}(M_1, M_2)=0\). The distance between \(M_1\) and \(M_3\) is \(d_{1}^{ext}(M_1, M_3)=4\), as \(M_1\) has one neighborhood not shared with \(M_3\) (n1), and \(M_3\) has three (n5, n6 and n7). The set coverage of \(M_1\), \(M_2\) and \(M_3\) is 7 altogether, as they cover 7 different neighborhoods (n1 to n7).
4 Iterative Generation of Diverse Models
As a key conceptual novelty, we enforce the structural diversity of models during the generation process using neighborhood shapes at different stages. Most importantly, if the shape \(S_{i}(M_n)\) of a new instance model \(M_n\) obtained as a candidate solution is identical to the shape \({S_{i}(M_j)}\) for a previously derived model \(M_j\) for a predefined (input) neighborhood range i, the solution candidate is discarded, and iterative generation continues towards a new candidate.
Internally, our tool operates over partial models [30, 34] where instance models are derived along a refinement calculus [43]. The shapes of intermediate (partial) models found during model generation are continuously being computed. As such, they may help guide the search process of model generation by giving preference to refine (partial) model candidates that likely result in a different graph shape. Furthermore, this extra bookkeeping also pays off once a model candidate is found since comparing two neighborhood shapes is fast (conceptually similar to lexicographical ordering). However, our concepts could be adapted to postprocess the output of other (blackbox) model generator tools.
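The discard loop can be sketched around any black-box generator as follows (`generate_candidates` and `compute_shape` are hypothetical placeholders for the actual solver and the range-i shape computation):

```python
def diverse_models(generate_candidates, compute_shape, limit):
    """Keep only candidates whose shape differs from all previous solutions."""
    seen_shapes, solutions = set(), []
    for candidate in generate_candidates():
        s = frozenset(compute_shape(candidate))
        if s in seen_shapes:
            continue              # same shape as an earlier solution: discard
        seen_shapes.add(s)
        solutions.append(candidate)
        if len(solutions) == limit:
            break
    return solutions

# Toy stand-ins: "models" are strings and a "shape" is the set of characters,
# so 'aab' and 'aba' fall into the same equivalence class.
models = diverse_models(lambda: iter(["aab", "aba", "abc", "bcd"]),
                        set, limit=3)
print(models)  # ['aab', 'abc', 'bcd']: 'aba' repeats the shape {'a', 'b'}
```

Because shapes are hashable values, the duplicate check is a set lookup rather than a pairwise isomorphism test, which is the efficiency argument made above.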
Example 5
As an illustration of the iterative generation of diverse models, let us imagine that model \(M_1\) (in Fig. 2) is retrieved first by a model generator. Shape \(S_{1}(M_1)\) is then calculated (see Fig. 4), and since there are no other models with the same shape, \(M_1\) is stored as a solution. If the model generator retrieves \(M_2\) as the next solution candidate, it turns out that \(S_{1}(M_2) = S_{1}(M_1)\), thus \(M_2\) is excluded. Next, if model \(M_3\) is generated, it will be stored as a solution since \(S_{1}(M_3) \ne S_{1}(M_2)\). Note that we intentionally omitted the internal search procedure of the model generator to focus on the use of neighborhood shapes.
Finally, it is worth highlighting that graph shapes are conceptually different from other approaches aiming to achieve diversity. Approaches relying upon object identifiers (like [38]) may classify two graphs which are isomorphic as different. Sampling-based approaches [17] attempt to derive non-isomorphic models on a statistical basis, but there is no formal guarantee that two models are non-isomorphic. The Alloy Analyzer [39] uses symmetry breaking predicates as sufficient conditions for isomorphism (i.e. the excluded models are surely isomorphic to a retained one). Graph shapes provide a necessary condition for isomorphism, i.e. if two models have identical shapes, one of them is discarded even though they may be non-isomorphic.
5 Evaluation

RQ1: How effective is our technique in creating diverse models for testing?

RQ2: How effective is our technique in creating diverse test suites?

RQ3: Is there correlation between diversity metrics and mutation score?
For mutation testing, we used the constraint and negation omission operators (CO and NO) to inject an error into the original WF constraints in every possible way, which yielded 51 mutants from the original 10 constraints (but some mutants may never have matches). We checked both the original and the mutated versions of the constraints for each instance model, and a model kills a mutant if there is a difference between the match sets of the two constraints. The mutation score for a test suite (i.e. a set of models) is the total number of mutants killed this way.
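The kill check and scoring can be sketched as follows (an illustration only; match sets and mutant identifiers are abstract here, and the score is expressed as a ratio per step (5) of the mutation testing process):

```python
def kills(match_original, match_mutant):
    """A model kills a mutant iff the two match sets differ."""
    return match_original != match_mutant

def mutation_score(kill_sets, num_mutants):
    """Ratio of mutants killed by at least one model of the suite;
    kill_sets holds one set of killed mutant ids per test model."""
    killed = set().union(*kill_sets)
    return len(killed) / num_mutants

# Mirroring Example 2: the original predicate has no match, the mutant has one,
# so the model kills it.
print(kills(set(), {("E", "e1")}))                   # True

# A toy suite of three models killing mutants {1,2}, {1,2} and {2,3} out of 51.
print(mutation_score([{1, 2}, {1, 2}, {2, 3}], 51))  # 3/51
```

Note that two models killing identical mutant sets (like the first two above) add nothing to the score, the testing-side analogue of shape-equal models adding nothing to coverage.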
Compared Approaches. Our test input models were taken from three different sources. First, we generated models with our iterative approach using a graph solver (GS) with different neighborhoods for ranges r = 1 to r = 3.
Next, we generated models for the same DSL using Alloy [39], a well-known SAT-based relational model finder. For representing EMF metamodels we used traditional encoding techniques [8, 32]. To enforce model diversity, Alloy was configured with three different setups for symmetry breaking predicates: s = 0, s = 10 and s = 20 (the default value). For greater values the tool produced the same set of models. We used the latest 4.2 build of Alloy with the default Sat4j [20] as the backend solver. All other configuration options were set to default.
Finally, we included 1250 manually created statechart models in our analysis (marked by Human). The models were created by students as solutions for similar (but not identical) statechart modeling homework assignments [43] representing real models which were not prepared for testing purposes.
Measurement Setup. To address RQ1–RQ3, we created a two-step measurement setup. In Step I. a set of instance models is generated with all GS and Alloy configurations. Each tool in each configuration generated a sequence of 30 instance models produced by subsequent solver calls, and each sequence was repeated 20 times (so 1800 models were generated for both GS and Alloy). In case of Alloy, we prevented the deterministic run of the solver to enable statistical analysis. The model generators were configured to create metamodel-compliant instances which satisfy the structural constraints of Subsect. 2.1 but ignore the WF constraints. The target model size was set to 30 objects as Alloy did not scale with increasing size (the scalability and the details of the backend solver are reported in [33]). The size of Human models ranges from 50 to 200 objects.
In Step II., we evaluated the mutation score for all the models (and for the entire sequences) by comparing results for the mutant and original predicates, and recorded which mutants were killed by each model. We also calculated our diversity metrics for a neighborhood range where no more equivalence classes are produced by shapes (which turned out to be \(r=7\) in our case study). We calculated the internal diversity of each model, the external diversity (distance) between pairs of models in each model sequence, and the coverage of each model sequence.
The right side of Fig. 6a presents the internal diversity of models measured as \(\text {shape nodes}/\text {graph nodes}\) (for fixpoint range 7). The results are similar: the diversity was high with low variance in GS, with slight differences between ranges. In case of Alloy, the diversity is similarly affected by the symmetry value: s = 0 produced low average diversity, but a high number of positive outliers. With s = 10, the average diversity increased with a decreasing number of positive outliers. And finally, with the default s = 20 value the average diversity was low. The internal diversity of Human models is between that of GS and Alloy.
Figure 6b illustrates the average distance between all model pairs generated in the same sequence (vertical axis) for range 7. The distribution of external diversity also shows similar characteristics as Fig. 6a: GS provided high diversity for all ranges (56 out of the maximum 60), while the diversity between models generated by Alloy varied based on the symmetry value.
As a summary, our model generation technique consistently outperformed Alloy wrt. both the diversity metrics and mutation score for individual models.
In Fig. 7b, the average coverage of the model sets is calculated (vertical axis) for increasing model sets (horizontal axis). The neighborhood shapes are calculated for \(r=0\) to 5, after which no significant difference was observed. Again, configurations of symmetry breaking predicates resulted in different characteristics for Alloy. However, the number of shape nodes covered by the test set was significantly higher in case of GS (791 vs. 200 equivalence classes) regardless of the range, and it was monotonically increasing as new models were added.
Altogether, both the mutation score and the equivalence class coverage of a model sequence were much better for our model generator approach compared to Alloy.
Our initial investigation suggests that a high internal diversity will provide good mutation score, thus our metrics can potentially be good predictors in a testing context, but we cannot generalize to full statistical correlation.
Threats to Validity and Limitations. We evaluated more than 4850 test inputs in our measurement, but all models were taken from a single domain of Yakindu statecharts with a dedicated set of WF constraints. However, our model generation approach did not use any special property of the metamodel or the WF constraints, thus we believe that similar results would be obtained for other domains. For mutation operations, we checked only omission of predicates, as extra constraints could easily yield infeasible predicates due to inconsistency with the metamodel, thus further reducing the number of mutants that can be killed. Finally, although we detected a strong correlation between diversity and mutation score with our test cases, this result cannot be generalized to statistical causality, because the generated models were not random samples taken from the universe of models. Thus additional investigations are needed to justify this correlation, and we only state that if a model is generated by either GS or Alloy, a higher diversity means a higher mutation score with high probability.
6 Related Work
Diverse model generation plays a key role in testing model transformations, code generators and complete development environments [25]. Mutation-based approaches [1, 11, 22] take existing models and make random changes on them by applying mutation rules. A similar random model generator is used for experimentation purposes in [3]. Other automated techniques [7, 12] generate models that only conform to the metamodel. While these techniques scale well for larger models, there is no guarantee that the mutated models are well-formed.
There is a wide set of model generation techniques which provide certain promises for test effectiveness. White-box approaches [1, 6, 14, 15, 31, 32] rely on the implementation of the transformation and predominantly use backend logic solvers, which lack scalability when deriving graph models.
Scalability and diversity of solverbased techniques can be improved by iteratively calling the underlying solver [19, 35]. In each step a partial model is extended with additional elements as a result of a solver call. Higher diversity is achieved by avoiding the same partial solutions. As a downside, generation steps need to be specified manually, and higher diversity can be achieved only if the models are decomposable into separate welldefined partitions.
Black-box approaches [8, 13, 15, 23] can only exploit the specification of the language or the transformation, so they frequently rely upon contracts or model fragments. As a common theme, these techniques may generate a set of simple models, and while a certain level of diversity can be achieved by using symmetry breaking predicates, they fail to scale for larger sizes. In fact, the effective diversity of models is also questionable since corresponding safety standards prescribe much stricter test coverage criteria for software certification and tool qualification than those currently offered by existing model transformation testing approaches.
Based on the logicbased Formula solver, the approach of [17] applies stochastic random sampling of output to achieve a diverse set of generated models by taking exactly one element from each equivalence class defined by graph isomorphism, which can be too restrictive for coverage purposes. Stochastic simulation is proposed for graph transformation systems in [40], where rule application is stochastic (and not the properties of models), but fulfillment of WF constraints can only be assured by a carefully constructed rule set.
7 Conclusion and Future Work
We proposed novel diversity metrics for models based on neighborhood shapes [28], which are true generalizations of the metamodel coverage and graph isomorphism criteria used in many research papers. Moreover, we presented a model generation technique that derives structurally diverse models by (i) calculating the shapes of the previous solutions, and (ii) feeding them back to an existing generator to avoid similar instances, thus ensuring high diversity among the models. The proposed generator is available as an open source tool [44].
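The abstraction underlying these metrics can be sketched as follows (an illustrative re-implementation assuming models given as typed nodes and edges, not the actual code of the tool [44]): every node is mapped to a descriptor of its depth-bounded neighborhood by iterated refinement, and a model is abstracted to the multiset of its node descriptors. Two models are equivalent iff their shapes coincide:

```python
from collections import Counter

def neighborhood_shape(nodes, edges, depth):
    """Abstract a model to the multiset of depth-bounded
    neighborhood descriptors of its nodes."""
    desc = dict(nodes)  # level 0: each node is described by its type
    for _ in range(depth):
        # Refine: pair the old descriptor with the sorted list of
        # (reference, target-descriptor) pairs of the outgoing edges.
        desc = {n: (desc[n],
                    tuple(sorted((ref, desc[t])
                                 for s, ref, t in edges if s == n)))
                for n in nodes}
    return frozenset(Counter(desc.values()).items())

nodes = {"a": "State", "b": "State", "c": "State"}
chain = {("a", "to", "b"), ("b", "to", "c")}  # a -> b -> c
fork  = {("a", "to", "b"), ("a", "to", "c")}  # a -> b, a -> c

# Both models use the same types, so plain metamodel coverage cannot
# tell them apart, but their 2-neighborhood shapes differ:
print(neighborhood_shape(nodes, chain, 2) == neighborhood_shape(nodes, fork, 2))
```

Conversely, any isomorphic renaming of a model yields the same shape, which is what makes the abstraction coarser than graph isomorphism yet finer than metamodel coverage.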
We evaluated our approach in a mutation testing scenario for Yakindu Statecharts, an industrial DSL tool. We compared the effectiveness (mutation score) and the diversity metrics of different test suites derived by our approach and by an Alloy-based model generator. Our approach consistently outperformed the Alloy-based generator both for single models and for entire test suites. Moreover, we found that high (internal) diversity values normally result in a high mutation score, thus highlighting the practical value of the proposed diversity metrics.
Conceptually, our approach could be adapted to an Alloy-based model generator by adding formulae obtained from previous shapes to the input specification. However, our initial investigations revealed that such an approach does not scale well with increasing model size. While Alloy has been used as a model generator in numerous testing scenarios for DSL tools and model transformations [6, 8, 35, 36, 42], our measurements strongly indicate that it is not a justified choice, as (1) Alloy is very sensitive to the configuration of its symmetry-breaking predicates, and (2) the diversity and mutation scores of the generated models are problematic.
Acknowledgement
This paper is partially supported by the MTA-BME Lendület Cyber-Physical Systems Research Group, the NSERC RGPIN-04573-16 project and the ÚNKP-17-3-III New National Excellence Program of the Ministry of Human Capacities.
References
1. Aranega, V., Mottu, J.M., Etien, A., Degueule, T., Baudry, B., Dekeyser, J.L.: Towards an automation of the mutation analysis dedicated to model transformation. Softw. Test. Verif. Reliab. 25(5–7), 653–683 (2015)
2. Arkhangel'skii, A., Fedorchuk, V.: General Topology I: Basic Concepts and Constructions, Dimension Theory, vol. 17. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-61265-7
3. Batot, E., Sahraoui, H.: A generic framework for model-set selection for the unification of testing and learning MDE tasks. In: MODELS, pp. 374–384 (2016)
4. Baudry, B., Dinh-Trong, T., Mottu, J.M., Simmonds, D., France, R., Ghosh, S., Fleurey, F., Le Traon, Y.: Model transformation testing challenges. In: Integration of Model Driven Development and Model Driven Testing (2006)
5. Baudry, B., Monperrus, M., Mony, C., Chauvel, F., Fleurey, F., Clarke, S.: Diversify: ecology-inspired software evolution for diversity emergence. In: Software Maintenance, Reengineering and Reverse Engineering, pp. 395–398 (2014)
6. Bordbar, B., Anastasakis, K.: UML2ALLOY: a tool for lightweight modeling of discrete event systems. In: IADIS AC, pp. 209–216 (2005)
7. Brottier, E., Fleurey, F., Steel, J., Baudry, B., Le Traon, Y.: Metamodel-based test generation for model transformations: an algorithm and a tool. In: 17th International Symposium on Software Reliability Engineering, pp. 85–94 (2006)
8. Büttner, F., Egea, M., Cabot, J., Gogolla, M.: Verification of ATL transformations using transformation models and model finders. In: Aoki, T., Taguchi, K. (eds.) ICFEM 2012. LNCS, vol. 7635, pp. 198–213. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34281-3_16
9. Cabot, J., Clarisó, R., Riera, D.: UMLtoCSP: a tool for the formal verification of UML/OCL models using constraint programming. In: ASE, pp. 547–548 (2007)
10. Cabot, J., Clarisó, R., Riera, D.: Verification of UML/OCL class diagrams using constraint programming. In: ICSTW, pp. 73–80 (2008)
11. Darabos, A., Pataricza, A., Varró, D.: Towards testing the implementation of graph transformations. In: GT-VMT, ENTCS. Elsevier (2006)
12. Ehrig, K., Küster, J.M., Taentzer, G.: Generating instance models from meta models. Softw. Syst. Model. 8(4), 479–500 (2009)
13. Fleurey, F., Baudry, B., Muller, P.A., Le Traon, Y.: Towards dependable model transformations: qualifying input test data. Softw. Syst. Model. 8 (2007)
14. González, C.A., Cabot, J.: Test data generation for model transformations combining partition and constraint analysis. In: Di Ruscio, D., Varró, D. (eds.) ICMT 2014. LNCS, vol. 8568, pp. 25–41. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08789-4_3
15. Guerra, E., Soeken, M.: Specification-driven model transformation testing. Softw. Syst. Model. 14(2), 623–644 (2015)
16. Jackson, D.: Alloy: a lightweight object modelling notation. ACM Trans. Softw. Eng. Methodol. 11(2), 256–290 (2002)
17. Jackson, E.K., Simko, G., Sztipanovits, J.: Diversely enumerating system-level architectures. In: International Conference on Embedded Software, p. 11 (2013)
18. Jia, Y., Harman, M.: An analysis and survey of the development of mutation testing. IEEE Trans. Softw. Eng. 37(5), 649–678 (2011)
19. Kang, E., Jackson, E., Schulte, W.: An approach for effective design space exploration. In: Calinescu, R., Jackson, E. (eds.) Monterey Workshop 2010. LNCS, vol. 6662, pp. 33–54. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21292-5_3
20. Le Berre, D., Parrain, A.: The Sat4j library. J. Satisf. Boolean Model. Comput. 7, 59–64 (2010)
21. Micskei, Z., Szatmári, Z., Oláh, J., Majzik, I.: A concept for testing robustness and safety of the context-aware behaviour of autonomous systems. In: Jezic, G., Kusek, M., Nguyen, N.T., Howlett, R.J., Jain, L.C. (eds.) KES-AMSTA 2012. LNCS (LNAI), vol. 7327, pp. 504–513. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30947-2_55
22. Mottu, J.M., Baudry, B., Le Traon, Y.: Mutation analysis testing for model transformations. In: Rensink, A., Warmer, J. (eds.) ECMDA-FA 2006. LNCS, vol. 4066, pp. 376–390. Springer, Heidelberg (2006). https://doi.org/10.1007/11787044_28
23. Mottu, J.M., Simula, S.S., Cadavid, J., Baudry, B.: Discovering model transformation preconditions using automatically generated test models. In: ISSRE, pp. 88–99. IEEE, November 2015
24. The Object Management Group: Object Constraint Language, v2.0, May 2006
25. Ratiu, D., Voelter, M.: Automated testing of DSL implementations: experiences from building mbeddr. In: AST@ICSE 2016, pp. 15–21 (2016)
26. Reid, S.C.: An empirical analysis of equivalence partitioning, boundary value analysis and random testing. In: Software Metrics Symposium, pp. 64–73 (1997)
27. Rensink, A.: Isomorphism checking in GROOVE. ECEASST 1 (2006)
28. Rensink, A., Distefano, D.: Abstract graph transformation. Electron. Notes Theor. Comput. Sci. 157(1), 39–59 (2006)
29. Reps, T.W., Sagiv, M., Wilhelm, R.: Static program analysis via 3-valued logic. In: Alur, R., Peled, D.A. (eds.) CAV 2004. LNCS, vol. 3114, pp. 15–30. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-27813-9_2
30. Salay, R., Famelis, M., Chechik, M.: Language independent refinement using partial modeling. In: de Lara, J., Zisman, A. (eds.) FASE 2012. LNCS, vol. 7212, pp. 224–239. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28872-2_16
31. Schönböck, J., Kappel, G., Wimmer, M., Kusel, A., Retschitzegger, W., Schwinger, W.: TETRABox - a generic white-box testing framework for model transformations. In: APSEC, pp. 75–82. IEEE, December 2013
32. Semeráth, O., Barta, Á., Horváth, Á., Szatmári, Z., Varró, D.: Formal validation of domain-specific languages with derived features and well-formedness constraints. Softw. Syst. Model. 16(2), 357–392 (2017)
33. Semeráth, O., Nagy, A.S., Varró, D.: A graph solver for the automated generation of consistent domain-specific models. In: 40th International Conference on Software Engineering (ICSE 2018), Gothenburg, Sweden. ACM (2018)
34. Semeráth, O., Varró, D.: Graph constraint evaluation over partial models by constraint rewriting. In: Guerra, E., van den Brand, M. (eds.) ICMT 2017. LNCS, vol. 10374, pp. 138–154. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-61473-1_10
35. Semeráth, O., Vörös, A., Varró, D.: Iterative and incremental model generation by logic solvers. In: Stevens, P., Wąsowski, A. (eds.) FASE 2016. LNCS, vol. 9633, pp. 87–103. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-49665-7_6
36. Sen, S., Baudry, B., Mottu, J.M.: Automatic model generation strategies for model transformation testing. In: Paige, R.F. (ed.) ICMT 2009. LNCS, vol. 5563, pp. 148–164. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02408-5_11
37. The Eclipse Project: Eclipse Modeling Framework. https://www.eclipse.org/modeling/emf/
38. The Eclipse Project: EMF DiffMerge. http://wiki.eclipse.org/EMF_DiffMerge
39. Torlak, E., Jackson, D.: Kodkod: a relational model finder. In: Grumberg, O., Huth, M. (eds.) TACAS 2007. LNCS, vol. 4424, pp. 632–647. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-71209-1_49
40. Torrini, P., Heckel, R., Ráth, I.: Stochastic simulation of graph transformation systems. In: Rosenblum, D.S., Taentzer, G. (eds.) FASE 2010. LNCS, vol. 6013, pp. 154–157. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12029-9_11
41. Ujhelyi, Z., Bergmann, G., Hegedüs, Á., Horváth, Á., Izsó, B., Ráth, I., Szatmári, Z., Varró, D.: EMF-IncQuery: an integrated development environment for live model queries. Sci. Comput. Program. 98, 80–99 (2015)
42. Vallecillo, A., Gogolla, M., Burgueño, L., Wimmer, M., Hamann, L.: Formal specification and testing of model transformations. In: Bernardo, M., Cortellessa, V., Pierantonio, A. (eds.) SFM 2012. LNCS, vol. 7320, pp. 399–437. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30982-3_11
43. Varró, D., Semeráth, O., Szárnyas, G., Horváth, Á.: Towards the automated generation of consistent, diverse, scalable and realistic graph models. In: Heckel, R., Taentzer, G. (eds.) Graph Transformation, Specifications, and Nets. LNCS, vol. 10800, pp. 285–312. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75396-6_16
44. Viatra Solver Project (2018). https://github.com/viatra/VIATRA-Generator
45. Wang, J., Kim, S.K., Carrington, D.: Verifying metamodel coverage of model transformations. In: Software Engineering Conference, p. 10 (2006)
46. Yakindu Statechart Tools: Yakindu. http://statecharts.org/
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.