1 Introduction

Automated test generation approaches aim at assisting developers with crucial software testing tasks [2, 22], such as automatically generating test cases or suites [6, 10, 18], and automatically finding and reporting failures [4, 12, 13, 19, 20, 23]. Many of these approaches involve random components: they forgo a systematic exploration of the space of behaviors in exchange for improved test generation efficiency [10, 19, 23]. While these approaches have been useful in finding a large number of bugs in software, their random nature may cause them to miss certain faulty software behaviors. Alternative approaches aim at systematically exploring a very large number of executions of the software under test (SUT), with the goal of providing stronger guarantees about the absence of bugs [4, 6, 12, 14, 18, 20]. Some of these approaches are based on bounded exhaustive generation (BEG) [4, 20], which consists of generating all feasible inputs that can be constructed using bounded data domains. Common targets of BEG approaches have been implementations of complex, dynamic data structures with rich structural constraints (e.g., linked lists, trees, etc.). The most widely used and efficient BEG approaches for testing software [4, 20] require the user to provide a formal specification of the constraints that the inputs must satisfy (often a representation invariant of the input, repOK), together with bounds on data domains (often called scopes). Thus, specification-based BEG approaches yield all inputs within the provided scopes that satisfy repOK.

Writing appropriate formal specifications for BEG is a challenging and time-consuming task. The specifications must precisely capture the intended constraints of the inputs. Overconstrained specifications miss the generation of valid inputs, which might make the subsequent testing stage miss faulty behaviors of the SUT. Underconstrained specifications may lead to the generation of invalid inputs, which might produce false alarms while testing the SUT. Furthermore, the user sometimes needs to take into account the way the generation approach operates, and write the specifications in a very specific way for the approach to achieve good performance [4] (see Section 4). Finally, such precise formal specifications are seldom available in software, hindering the usability of specification-based BEG approaches.

Several studies show that BEG approaches are effective in revealing software failures [4, 16, 20, 33]. Furthermore, the small scope hypothesis [3], which states that most software faults can be revealed by executing the SUT on “small inputs”, suggests that BEG approaches should discover most (if not all) faults in the SUT, provided large enough scopes are used. The challenge that BEG approaches face is how to efficiently explore a huge search space, which often grows exponentially with the scope. The search space often includes a very large number of invalid (not satisfying repOK) and isomorphic inputs [15, 28]. Thus, pruning the parts of the search space involving invalid and redundant inputs is key to making BEG approaches scale in practice [4].

In this paper, we propose a new approach for BEG, called BEAPI, that works by making calls to API methods of the SUT. Like API-based test generation approaches [10, 19, 23], BEAPI generates sequences of calls to methods from the API (i.e., test sequences). The execution of each test sequence yielded by BEAPI generates an input in the resulting BEG set of objects. As usual in BEG, BEAPI requires the user to provide scopes for generation, which for BEAPI include a maximum test sequence length. Brute-force BEG from a user-provided scope would attempt to generate all feasible test sequences of methods from the API up to the maximum sequence length. This is an intrinsically combinatorial process that exhausts computational resources before completion even for very small scopes (see Section 4). We propose several pruning techniques that are crucial for the efficiency of BEAPI, and allow it to scale to significantly larger scopes. First, BEAPI executes test sequences and discards those that correspond to violations of API usage rules (e.g., throwing exceptions that indicate incorrect API usage, such as IllegalArgumentException in Java [17, 23]). Thus, as opposed to specification-based BEG approaches, BEAPI does not require a repOK that precisely describes valid inputs. Instead, BEAPI requires minimal specification effort in most cases (including most of our case studies in Section 4): making API methods throw exceptions on invalid inputs, in the “defensive programming” style popularized by Liskov [17]. Second, BEAPI implements state matching [15, 28, 36] to discard test sequences that produce inputs already created by previously explored sequences. Third, BEAPI employs only a subset of the API methods to create test sequences: a set of methods automatically identified as builders [27]. Before test generation, BEAPI executes an automated builders identification approach [27] to find a smaller subset of the API that is sufficient to yield the resulting BEG set of inputs. Another advantage of BEAPI over specification-based approaches is that it produces test sequences that create the corresponding inputs using methods from the API, making it easier to derive tests from BEAPI’s output [5].

We experimentally assess BEAPI, and show that its efficiency and scalability are comparable to those of the fastest BEG approach (Korat), without the need for repOKs. We also show that BEAPI can help find flaws in repOKs, by comparing the sets of inputs generated by BEAPI using the API against the sets of inputs generated by Korat from a repOK. Using this procedure, we found several flaws in repOKs employed in the experimental assessment of related tools, thus providing evidence of the difficulty of writing repOKs for BEG.

2 A Motivating Example

Fig. 1. NodeCachingLinkedList’s repOK from ROOPS

To illustrate the difficulties of writing formal specifications for BEG, consider Apache’s NodeCachingLinkedList (NCL) representation invariant shown in Figure 1 (taken from the ROOPS benchmark). NCLs are composed of a main circular, doubly-linked list, used for data storage, and a cache of previously used nodes implemented as a singly-linked list. Nodes removed from the main list are moved to the cache, where they are saved for future use. When a node is required for an insertion operation, a cache node (if one exists) is reused instead of allocating a new node. As usual, repOK returns true iff the input structure satisfies the intended NCL properties [17]. Lines 1 to 20 check that the main list is a circular doubly-linked list with a dummy head; lines 21 to 33 check that the cache is a null-terminated singly-linked list (and the consistency of size fields is verified in the process). This repOK is written in the way recommended by the authors of Korat [4]: it returns false as soon as it finds a violation of an intended property in the current input, and returns true at the end otherwise. This allows Korat to prune large portions of the search space, and improves its performance [4]. However, this repOK suffers from underspecification: it does not state that the sentinel node and all cache nodes must have null values (lines 3-4 and 28-29, respectively). Mistakes like these are very common when writing specifications (see Section 4.3), and difficult to discover by manual inspection of repOK. They can have serious consequences for BEG: executing Korat with this repOK and a scope of up to 8 nodes produces 54.5 million NCL structures, while the actual number of valid NCL instances is 2.8 million. Clearly, this is a problem for Korat’s performance and for the subsequent testing of the SUT. In addition, the invalid instances generated might trigger false alarms in the SUT. We discovered these errors in repOK with the help of BEAPI: we automatically contrasted the structures generated using BEAPI and the NCL’s API with those generated using Korat with repOK, for the same scope.

This example shows that writing sound and precise repOKs for BEG is difficult and time consuming. Fine-tuning repOKs to improve the performance of BEG (e.g., for Korat) is even harder. The main advantage of BEAPI is that it requires minimal specification effort to perform BEG. If the API methods used for generation are correct, all generated structures are valid by construction. The programmer only needs to make sure that API methods throw exceptions when API usage rules are violated, in a defensive programming style [17]. In most cases, this requires checking very simple conditions on the inputs. In our example, the method that adds an element to an NCL throws an IllegalArgumentException when it is called with a null element (the implementation of the method takes care that the remaining NCL properties hold), as sketched below.
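
To make this concrete, the following is a minimal sketch of such a defensively programmed add method. It is only illustrative: names and structure are hypothetical, and the actual Apache Collections implementation differs in its details.

```java
// Hypothetical sketch of a defensively programmed add method for an NCL.
public class NodeCachingLinkedList<E> {
    private static class Node<E> { E value; Node<E> next, prev; }

    private final Node<E> header = new Node<>(); // dummy sentinel of the main list
    private Node<E> cacheHead;                   // singly-linked cache of spare nodes

    public NodeCachingLinkedList() { header.next = header.prev = header; }

    public void add(E element) {
        // The only specification effort BEAPI needs: reject invalid arguments.
        if (element == null)
            throw new IllegalArgumentException("element must not be null");
        Node<E> node;
        if (cacheHead != null) {            // reuse a cached node if one exists
            node = cacheHead;
            cacheHead = cacheHead.next;
        } else {
            node = new Node<>();
        }
        node.value = element;
        node.next = header;                 // link the node before the sentinel,
        node.prev = header.prev;            // preserving the circular
        header.prev.next = node;            // doubly-linked invariants
        header.prev = node;
    }
}
```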

3 Bounded Exhaustive Generation from Program APIs

We now describe BEAPI’s approach. We start with the definition of scope, then present BEAPI’s optimizations, and finally describe BEAPI’s algorithm.

3.1 Scope Definition

Fig. 2. BEAPI’s scope definition for NCL (max. nodes 3)

The definition of scope in Korat involves providing bounded data domains for the classes and fields of the SUT, since Korat explores the state space of feasible input candidates and yields the set of inputs satisfying repOK as a result. Instead, BEAPI explores the search space of (bounded) test sequences that can be formed by making calls to the SUT’s API. Thus, we have to provide data domains for the primitive types employed to make such calls, and a bound on the maximum size of the structures we want to keep among those generated by such API calls. An example configuration file defining BEAPI’s scope for the NCL case study is shown in Figure 2. The max.objects parameter specifies the maximum number of different objects (reachable from the root) that a structure is allowed to have. Test sequences that create a structure with more different objects (of any class) than max.objects are discarded (along with the structure). In our example, this implies that BEAPI will not create NCLs with more than 3 nodes. Next, one has to specify the values that BEAPI will employ to invoke API routines that take primitive-type parameters (e.g., elements to insert into the list). The int.range parameter allows one to specify a range of integers, which goes from 0 to 2 in Figure 2. One may also specify domains for other primitive types like floats, doubles and strings, by enumerating their values explicitly. For example, line 3 shows how to define str1, str2 and str3 as the feasible values for String-typed parameters. Also, we can instruct BEAPI which fields to take into account for structure canonicalization, or which fields to omit (omit.fields). This allows the user to control the state matching process (see Section 3.2). For example, uncommenting line 4 would make BEAPI omit the DEFAULT_MAXIMUM_CACHE_SIZE field in state matching, which in our example is a constant initialized to 20 in the class constructor. In this case, omitting the field does not change the set of structures generated by BEAPI, but in other cases omitting fields may have an impact. The configuration in Figure 2 is enough for BEAPI to generate NCLs with a maximum of 3 nodes, containing integers from 0 to 2 as values, which allowed us to mimic the structures generated by Korat for the same scope.
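
Since Figure 2 is not reproduced here, the following is a plausible reconstruction of such a configuration file. The keys max.objects, int.range and omit.fields are taken from the text; the exact syntax (separators, comment markers) is an assumption.

```
# Hypothetical reconstruction of the scope configuration of Figure 2
max.objects=3                 # at most 3 different objects reachable from the root
int.range=0:2                 # integer parameters drawn from {0, 1, 2}
strings=str1;str2;str3        # feasible values for String-typed parameters
# omit.fields=DEFAULT_MAXIMUM_CACHE_SIZE   # uncomment to ignore this field in state matching
```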

3.2 State Matching

In test generation with BEAPI, multiple test sequences often produce the same structure, e.g., inserting an element into a list and removing it afterwards. BEAPI assumes that method executions are deterministic: any execution of a method with the same inputs yields the same results. For the generation of a bounded exhaustive set of structures, for each distinct structure s in the set, BEAPI only needs to save the first test sequence that generates s. All test sequences generated subsequently that also create s can be discarded. As BEAPI works by extending previously generated test sequences (Section 3.4), if we saved many test sequences for the same structure, all of these sequences would have to be extended with new routines in subsequent iterations of BEAPI, resulting in unnecessary computations. Hence, we implement state matching in BEAPI as follows. We store all the structures produced so far by BEAPI in a canonical form (see below). After executing the last routine r(p\(_1\),..,p\(_k\)) of a newly generated test sequence T, we check whether any of r’s parameters holds a structure not seen before (not stored). If T does not create any new structure, it is discarded. Otherwise, T and the new structures it generates are stored by BEAPI.
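
A minimal Java sketch of this check follows, assuming a linearize routine as in Section 3.2 (sketched further below); all names here are illustrative, not BEAPI’s actual code.

```java
import java.util.*;

// Sketch of BEAPI's state matching: a test sequence is kept only if it
// produces at least one structure whose canonical form was not seen before.
class StateMatcher {
    interface Linearizer { List<Object> linearize(Object root); }

    private final Set<List<Object>> canonicalStrs = new HashSet<>();

    // lastCallResults: the structures held by the parameters (and result)
    // of the last routine of a newly generated test sequence.
    boolean producesNewStructure(List<Object> lastCallResults,
                                 Linearizer linearizer) {
        boolean isNew = false;
        for (Object structure : lastCallResults) {
            List<Object> canonical = linearizer.linearize(structure);
            if (canonicalStrs.add(canonical))  // add() returns false if present
                isNew = true;
        }
        return isNew;  // if false, the whole test sequence is discarded
    }
}
```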

We represent heap-allocated structures as labeled graphs. After the execution of a method, a (non-primitive typed) parameter p holds a reference to the root object r of a rooted heap (i.e., \(p=r\)), defined below.

Definition 1

Let O be a set of objects, and P a set of primitive values (including null). Let F be the fields of all objects in O.

  • A heap is a labeled graph \(H = \langle O,E\rangle \), where \(E \subseteq \{(o,f,v) \mid o \in O, f \in F, v \in O \cup P\}\).

  • A rooted heap is a pair \(RH = \langle r, H\rangle \) where \(r \in O\), \(H = \langle O,E\rangle \) is a heap, and for each \(v' \in O \cup P\), \(v'\) is reachable from r through fields in F.

The special case \(p=null\) can be represented by a rooted heap with a dummy node and a dummy field pointing to null. In languages without explicit memory management (like Java), each object is identified by the memory address where it is allocated. However, changing the memory addresses of objects (while keeping the same graph structure) has no effect on the execution of a program. Heaps obtained by permutations of the memory addresses of their component objects are called isomorphic heaps. We avoid the generation of isomorphic heaps by employing a canonical representation for heaps [4, 15]. Rooted heaps can be efficiently canonicalized by an approach called linearization [15, 36], which transforms a rooted heap into a unique sequence of values.

Fig. 3. Linearization algorithm

Figure 3 shows the linearization algorithm used by BEAPI, a customized version that reports when objects exceed the scopes and supports ignoring object fields (for the original version see [36]). linearize starts a depth-first traversal of the heap from the root, by invoking lin in line 3. To canonicalize the heap, lin assigns distinct identifiers to the different objects it visits. Map ids stores the mapping between objects and unique object identifiers. When an object is visited for the first time, it is assigned a new unique identifier (lines 10-11), and a singleton sequence with the identifier is created to represent the object (line 12). Then, the object’s fields, sorted in a predefined order (e.g., by name), are traversed; the linearization of each field value is constructed and appended to the sequence representing the current object (lines 13-19). A field storing a primitive value is represented by a singleton sequence with the primitive value (lines 15-16). If a field references an object, a recursive call to lin converts the object into a sequence, which is appended to the result (line 18). At the end of the loop, seq contains the canonical representation of the whole rooted heap starting at root, and is returned by lin (line 20). When an already visited object is traversed by a recursive call, the object must have an identifier already assigned in ids (line 6), and lin returns the singleton sequence with the object’s unique identifier (line 7). When more than scope objects are reachable from the rooted heap, lin throws an exception to report that the scope has been exceeded (lines 9-10). The exception is employed later on by BEAPI to discard test sequences that create objects larger than allowed by the scope. linearize also takes as a parameter a regular expression omitFields, matching the names of the fields that must be omitted during canonicalization (see Section 3.1). To omit such fields, we implemented sortByField (line 13) in such a way that it does not return the edges corresponding to fields whose names match omitFields. This in turn avoids saving the values of omitted fields in the sequence yielded by linearize. Finally, notice that linearization allows for efficient comparison of objects (rooted heaps): two objects are equal if and only if their corresponding sequences yielded by linearize are equal.
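
As Figure 3 is not reproduced here, the following Java sketch reconstructs the algorithm from the description above. The helper names (sortedFields, isPrimitiveLike, ScopeExceededException) are our own, and the code does not match the figure’s line numbers.

```java
import java.lang.reflect.Field;
import java.util.*;

// Sketch of the linearization of Figure 3, reconstructed from the text.
class Linearization {
    static class ScopeExceededException extends RuntimeException {}

    // Canonicalizes the rooted heap starting at root into a sequence of values.
    static List<Object> linearize(Object root, int scope, String omitFields)
            throws IllegalAccessException {
        return lin(root, new IdentityHashMap<>(), scope, omitFields);
    }

    private static List<Object> lin(Object obj, Map<Object, Integer> ids,
                                    int scope, String omitFields)
            throws IllegalAccessException {
        if (ids.containsKey(obj))              // already visited: singleton with
            return List.of(ids.get(obj));      // the object's unique identifier
        if (ids.size() == scope)               // more than `scope` reachable objects:
            throw new ScopeExceededException();// the test sequence will be discarded
        int id = ids.size() + 1;               // fresh identifier for a new object
        ids.put(obj, id);
        List<Object> seq = new ArrayList<>();
        seq.add(id);
        for (Field f : sortedFields(obj.getClass(), omitFields)) {
            f.setAccessible(true);
            Object v = f.get(obj);
            if (v == null || isPrimitiveLike(v))
                seq.add(v);                    // primitive value: stored as-is
            else
                seq.addAll(lin(v, ids, scope, omitFields)); // recurse on objects
        }
        return seq;                            // canonical sequence for obj's subheap
    }

    // Fields in a predefined order (by name), omitting those matching omitFields.
    private static List<Field> sortedFields(Class<?> c, String omitFields) {
        List<Field> fs = new ArrayList<>(Arrays.asList(c.getDeclaredFields()));
        fs.removeIf(f -> f.getName().matches(omitFields));
        fs.sort(Comparator.comparing(Field::getName));
        return fs;
    }

    private static boolean isPrimitiveLike(Object v) {
        return v instanceof Number || v instanceof Boolean
            || v instanceof Character || v instanceof String;
    }
}
```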

3.3 Builders Identification Approach

As the feasible combinations of methods grow exponentially with the number of methods, it is crucial to reduce the number of methods that BEAPI uses to produce test sequences. We employ an automated builders identification approach [27] to find a subset of API methods that is sufficient for the generation of bounded exhaustive structure sets. We call such routines builders. A previous approach for identifying a sufficient subset of builders from an API is based on a genetic algorithm, but it is computationally expensive [27]. Here, we consider a simpler hill climbing approach (HC) that achieves better performance. HC may of course be less precise, as it may include methods in the resulting set of builders that are not needed to produce a bounded exhaustive set of structures. However, HC worked very well and consistently computed minimal sets of builders in our experiments (we checked that the set of builders computed by HC matched the set of builders we manually identified for each case study). Our goal here is to assess the impact of using builders for BEG from an API; comparing the HC approach against existing techniques is left for future work.

Let API \(= \{m_1,m_2,\ldots ,m_n\}\) be the set of API methods. HC explores the search space of all subsets of methods from API. HC requires the user to provide a scope s (in the same way as BEAPI). The fitness f(sm) of a given set sm of methods is the number of distinct structures (after canonicalization) that BEAPI generates using the set, for the given scope s. We also give priority in the fitness to sets of methods with fewer and simpler parameter types (see [27] for further details). The successors succs(sm) of a candidate sm are the sets sm\(\cup \{m_i\}\), for each \(m_i \in \) API. HC starts by computing the fitness of all singletons \(\{c\}\) of constructor methods. The best of the singletons is set as the current candidate curr, and HC starts a typical iterative hill climbing process. At each iteration, HC computes f(succ) for each \(\texttt {succ} \in \texttt {succs(curr)}\). Let best be the successor with the highest fitness value. Notice that best has exactly one more method than the best candidate of the previous iteration, curr. If \(\texttt {f(best) > f(curr)}\), the methods in best can be used to create a larger set of structures than those in curr. Thus, HC assigns best to curr, and continues with the next iteration. Otherwise, \(\texttt {f(best) <= f(curr)}\), and curr already generates the largest possible set of structures (no method can be added that increases the number of generated structures from curr). At this point, curr is returned as the set of identified builders.
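
The following Java sketch illustrates this loop under our assumptions: fitness(sm) stands for a full BEAPI run with method set sm that counts distinct canonical structures (the tie-breaking on parameter simplicity is elided), and constructors are treated as ordinary methods for brevity.

```java
import java.lang.reflect.Method;
import java.util.*;
import java.util.function.ToLongFunction;

// Sketch of the hill-climbing builders identification (HC).
class BuildersHC {
    static Set<Method> identifyBuilders(List<Method> api,
                                        List<Method> constructors,
                                        ToLongFunction<Set<Method>> fitness) {
        // Start from the best constructor singleton.
        Set<Method> curr = null;
        long fCurr = Long.MIN_VALUE;
        for (Method c : constructors) {
            Set<Method> cand = new HashSet<>(Collections.singleton(c));
            long f = fitness.applyAsLong(cand);
            if (f > fCurr) { curr = cand; fCurr = f; }
        }
        if (curr == null) throw new IllegalArgumentException("no constructors");
        while (true) {
            // Successors: curr extended with one more API method.
            Set<Method> best = null;
            long fBest = Long.MIN_VALUE;
            for (Method m : api) {
                if (curr.contains(m)) continue;
                Set<Method> succ = new HashSet<>(curr);
                succ.add(m);
                long f = fitness.applyAsLong(succ);
                if (f > fBest) { best = succ; fBest = f; }
            }
            if (best == null || fBest <= fCurr)
                return curr;      // no successor improves: curr are the builders
            curr = best;          // adopt the best successor and iterate
            fCurr = fBest;
        }
    }
}
```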

Notice that HC performs many invocations of BEAPI for builders identification. The key insight that makes builders identification feasible is that, often, the builders identified for a relatively small scope are the same set of methods needed to create structures of any size. In other words, once the scope for builders computation is large enough, increasing the scope yields the same set of builders. This resembles the small scope hypothesis for bug detection [3] (and transcoping [31]). A scope of 5 was enough for builders computation in all our case studies (we manually checked that the computed builders were the right ones in all cases). After builders are identified efficiently using a small scope, we can run BEAPI with the identified builders using a larger scope, for example, to generate bigger objects to exercise the SUT. In most of our case studies, the builders comprise a constructor and a single method to add elements to the structure. However, our automated builders identification approach showed that, for Red-Black Trees, a remove method is also required (for scopes greater than 3), since there are trees with a particular balance configuration (red and black coloring of the nodes) that cannot be constructed by just adding elements to the tree. In contrast, AVL trees, which are also balanced, do not require the remove method as a builder: the class constructor and an add routine suffice. This shows that builders identification is non-trivial to perform manually, as it requires a very careful exploration of a very large number of structures and method combinations. Other structures that require more than two builders are binomial and Fibonacci heaps.

3.4 The BEAPI Approach

Fig. 4. BEAPI algorithm

Pseudocode for BEAPI is shown in Figure 4. BEAPI takes as inputs a list of methods from an API, methods (the whole API, or previously identified builders); the scope for generation, scope; a list of test sequences that create values for each primitive type provided in the scope description, primitives (automatically created from configuration options int.range, strings, etc., see Fig. 2); and a regular expression matching fields to be omitted in the canonicalization of structures, omitFields. Notice that methods from more than one class can be passed in methods if one wants to generate objects for several classes in the same execution of BEAPI, e.g., when methods from one class take objects of another class as parameters. BEAPI’s map currSeqs stores, for each type, the list of test sequences that are known to generate structures of that type. currSeqs starts with all the primitive-typed sequences in primitives (lines 2-3). At each iteration of the main loop (lines 5-34), BEAPI creates new sequences for each available method m (line 8), by exhaustively exploring all the possibilities for creating test sequences using m and inputs generated in previous iterations and stored in currSeqs (lines 9-30). The newly created test sequences that generate new structures in the current iteration are saved in map newSeqs (initialized empty in line 6); all the generated sequences are then added to currSeqs at the end of the iteration (line 33). If no new structures are produced in the current iteration (newStrs is false in line 32), BEAPI’s main loop terminates and the list of all sequences in currSeqs is returned (line 35).

Let us now discuss the details of the for loop in lines 9-30. First, all sequences that can be used to construct inputs for m are retrieved into seqsT\(_1\),...,seqsT\(_n\). BEAPI explores each tuple (s\(_1\),...,s\(_n\)) of feasible inputs for m. It then executes createNewSeq (line 13), which constructs a new test sequence newSeq by performing the sequential composition of test sequences s\(_1\),...,s\(_n\) and routine m, and replacing m’s formal parameters by the variables that create the required objects in s\(_1\),...,s\(_n\). newSeq is then executed (line 14), and either it produces a failure (failure is set to true), or it raises an exception that represents an invalid usage of the API (exception is set to true), or its execution is successful and it creates new objects o\(_1\),\(\ldots \),o\(_n\),o\(_r\). In case of a failure, an exception is thrown and newSeq is presented to the user as a witness of the failure (line 15). If a different kind of exception is thrown, BEAPI assumes it corresponds to an API misuse (see below), discards the test sequence (line 16), and continues with the next candidate sequence. Otherwise, the execution of newSeq builds new objects o\(_1\),\(\ldots \),o\(_n\),o\(_r\) (or values of primitive types) that are canonicalized by makeCanonical (line 17), by executing linearize from Figure 3 on each structure. If any of the structures produced by newSeq exceeds the scope, makeCanonical sets outOfScope to true, and BEAPI discards newSeq and continues with the next one (line 18). If none of the above happens, makeCanonical returns canonical versions of o\(_1\),\(\ldots \),o\(_n\),o\(_r\) in variables c\(_1\),\(\ldots \),c\(_n\),c\(_r\), respectively. Afterwards, BEAPI performs state matching by checking that the canonical structure c\(_1\) is of reference type and that it has not been created by any previous test sequence (line 19). Notice that canonicalStrs stores all of the already visited structures. If c\(_1\) is a new structure, it is added to canonicalStrs (line 27), and the sequence that creates c\(_1\), newSeq, is added to the set of test sequences producing structures of type T\(_1\) (newSeqs in line 27). Also, newStrs is set to true to indicate that at least one new object has been created in the current iteration (line 22). This process is repeated for canonical objects c\(_2\),\(\ldots \),c\(_n\),c\(_r\) (lines 24-29).
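
Since Figure 4 is not reproduced here, the following condensed Java sketch reconstructs the main loop from the description above; Seq, Result, Canon and the abstract helpers are hypothetical stand-ins for BEAPI internals, and error handling is simplified.

```java
import java.lang.reflect.Method;
import java.util.*;

// Condensed, hypothetical sketch of BEAPI's main loop (Figure 4).
abstract class BeapiSketch {
    record Seq(List<String> statements) {}
    record Result(boolean failure, boolean apiMisuse, List<Object> objects) {}
    record Canon(boolean outOfScope, List<List<Object>> structs) {}

    abstract Iterable<List<Seq>> tuples(Map<Class<?>, List<Seq>> seqs, Method m);
    abstract Seq createNewSeq(List<Seq> tuple, Method m);
    abstract Result executeSeq(Seq s);
    abstract Canon makeCanonical(List<Object> objs, int scope, String omitFields);

    List<Seq> beapi(List<Method> methods, int scope,
                    Map<Class<?>, List<Seq>> primitives, String omitFields) {
        Map<Class<?>, List<Seq>> currSeqs = new HashMap<>(primitives);
        Set<List<Object>> canonicalStrs = new HashSet<>();
        boolean newStrs = true;
        while (newStrs) {
            newStrs = false;
            Map<Class<?>, List<Seq>> newSeqs = new HashMap<>();
            for (Method m : methods) {
                // Every tuple (s1,...,sn) of sequences producing m's parameters.
                for (List<Seq> tuple : tuples(currSeqs, m)) {
                    Seq newSeq = createNewSeq(tuple, m);
                    Result r = executeSeq(newSeq);
                    if (r.failure()) throw new AssertionError("failure: " + newSeq);
                    if (r.apiMisuse()) continue;      // invalid API usage: discard
                    Canon c = makeCanonical(r.objects(), scope, omitFields);
                    if (c.outOfScope()) continue;     // exceeds scope: discard
                    for (int i = 0; i < c.structs().size(); i++) {
                        if (canonicalStrs.add(c.structs().get(i))) { // state matching
                            Class<?> t = r.objects().get(i).getClass();
                            newSeqs.computeIfAbsent(t, k -> new ArrayList<>())
                                   .add(newSeq);
                            newStrs = true;           // new structure this iteration
                        }
                    }
                }
            }
            // Sequences producing new structures become extendable next iteration.
            newSeqs.forEach((t, s) ->
                currSeqs.computeIfAbsent(t, k -> new ArrayList<>()).addAll(s));
        }
        return currSeqs.values().stream().flatMap(List::stream).toList();
    }
}
```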

BEAPI distinguishes failures from bad API usage based on the type of the exception (similarly to previous API-based test generation techniques [23]). For example, IllegalArgumentException and IllegalStateException correspond to API misuses, and the remaining exceptions are considered failures by default. BEAPI’s implementation allows the user to select the exceptions that correspond to failures and those that do not, by setting the corresponding configuration parameters. As mentioned in Section 2, BEAPI assumes that API methods throw exceptions when they fail to execute on invalid inputs. We argue that this is a common practice, called defensive programming [17], that should be followed by all programmers, as it results in more robust code and improves software testing in general [2] (besides helping automated test generation tools). We also argued in Section 2 that the specification effort required for defensive programming is much smaller than that of writing precise (and efficient) repOKs for BEG; we confirmed this by manually inspecting the source code of our case studies. On the other hand, note that BEAPI can employ formal specifications to reveal bugs in the API, e.g., by executing repOK and checking that it returns true on every generated object of the corresponding type (as in Randoop [23]). However, the specifications used for bug finding do not need to be very precise (e.g., the underspecified NCL repOK from Section 2 is fine for bug finding), or written in a particular way (as required by Korat). Other kinds of specifications that are weaker and simpler to write can also be used by BEAPI to reveal bugs, like violations of language-specific contracts (e.g., equals must be an equivalence relation in Java), metamorphic properties [7], user-provided assertions (assert), etc.
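
As an illustration, the following sketch shows how even an imprecise repOK can serve as an oracle over BEAPI’s output; generatedNCLs is a hypothetical collection of the structures BEAPI produced, and repOK() is assumed to be available on the class.

```java
// Sketch: using repOK as an oracle over BEAPI-generated structures (as in
// Randoop-style contract checking). Any API-built structure on which repOK
// returns false witnesses a bug in the API or a flaw in repOK itself.
static void checkInvariants(List<NodeCachingLinkedList<Integer>> generatedNCLs) {
    for (NodeCachingLinkedList<Integer> list : generatedNCLs) {
        if (!list.repOK())
            System.out.println("invariant violated by API-built structure: " + list);
    }
}
```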

Another advantage of BEAPI is that, for each generated object, it yields a test sequence that can be executed to create the object. This is in contrast with specification-based approaches (which generate a set of objects from repOK). Finding a sequence of invocations of API methods that creates a specific structure is a difficult problem on its own, which can be rather costly computationally [5], or require significant effort to perform manually. Thus, objects generated by specification-based approaches are often “hardwired” when used for testing a SUT (e.g., by using Java reflection), making tests very hard to understand and maintain, as they depend on the low-level implementation details of the structures [5].

4 Evaluation

In this section, we experimentally assess BEAPI against related approaches. The evaluation is organized around the following research questions:

  • RQ1 Can BEG be performed efficiently using API routines?

  • RQ2 How much do the proposed optimizations impact the performance of BEG from the API?

  • RQ3 Can BEAPI help in finding discrepancies between repOK specifications and the API’s object generation ability?

As case studies, we employ data structure implementations from four benchmarks: three employed in the assessment of existing testing tools (Korat [4], Kiasan [9], FAJITA [1]), and ROOPS. These benchmarks cover diverse implementations of complex data structures, which are a good target for BEG. We chose them as case studies because the implementations come equipped with repOKs, written by the authors of the benchmarks. The experiments were run on a workstation with an Intel Core i7-8700 CPU (3.2 GHz) and 16 GB of RAM. We set a timeout of 60 minutes for each individual run. To replicate the experiments, we refer the reader to the paper’s artifact [25].

Table 1. Efficiency assessment of BEAPI against Korat

4.1 RQ1: Efficiency of Bounded Exhaustive Generation from APIs

For RQ1, we assess whether BEAPI is fast enough to be a useful BEG approach, by comparing it against the fastest existing BEG approach, Korat [32]. The results of the comparison are summarized in Table 1. For each technique, we report generation times (in seconds) and the numbers of generated and explored structures, for increasingly large scopes. For space reasons, we show a representative sample of the results (we maintain roughly the same proportion of good and bad cases for each technique in the data we report). We include the largest successful scope for each technique; the execution times for the largest scopes are shown in boldface in the table. In this way, should scalability issues arise, they can be easily identified. For the complete results, visit the paper’s website [26]. To obtain proper performance results for BEAPI, we extensively tested the API methods of the classes to ensure they were correct for this experiment. We did not change the repOKs in any way, because that would change the performance of Korat, and one of our goals here is to evaluate the performance of Korat with repOKs written by different programmers. Differences in explored structures are expected, since the corresponding search spaces of Korat and BEAPI are different. However, for the same case study and scope, one would expect both approaches to generate the same number of valid structures. This is indeed the case in most experiments, with notable exceptions of two kinds. First, there are cases where repOK has errors; these cases are grayed out in the tables. Second, the slightly different notion of scope in each technique can cause discrepancies. This only happens for Red-Black Trees (RBT) and Fibonacci heaps (FibHeap), which are shown in boldface. In these cases, certain structures of size n can only be generated from larger structures, by insertions followed by removals and further insertions that trigger specific balance rearrangements. BEAPI discards generated sequences as soon as they exceed the maximum structure size, and hence cannot generate these structures.

In terms of performance, the results are mixed. In the Korat benchmark, Korat shows better performance in 4 out of 6 cases. In the FAJITA benchmark, BEAPI is better in 3 out of 4 cases. In the ROOPS benchmark, BEAPI is better in 5 out of 7 cases. In the Kiasan benchmark, Korat is faster in 6 out of 7 cases. We observe that BEAPI shows better performance on structures with more restrictive constraints, such as RBT and Binary Search Trees (BST); these cases often have a smaller number of valid structures. Cases where the number of valid structures grows faster with the scope, such as doubly-linked lists (DLList), are better suited for Korat. More structures mean that BEAPI has to create more test sequences in each successive iteration, which hurts its performance in such cases. As expected, the way repOKs are written has a significant impact on Korat’s performance. For example, for binomial heaps (BinHeap), Korat reaches scope 8 with ROOPS’ repOK, scope 10 with FAJITA’s repOK, and scope 11 with Korat’s repOK (all equivalent in terms of generated structures). In most cases, repOKs from the Korat benchmark result in better performance, as these are fine-tuned for usage with Korat. Case studies with errors in repOKs are grayed out in the table, and discussed further in Section 4.3. Notice that errors in repOKs can severely affect Korat’s performance.

Table 2. Execution times (sec) of BEAPI under different configurations.

4.2 RQ2: Impact of BEAPI’s Optimizations

In RQ2, we assess the impact of each of BEAPI’s proposed optimizations on BEG. For this, we assess the performance of four BEAPI configurations: SM/BLD, with both state matching (SM) and builders identification (BLD) enabled; SM, with only state matching enabled; BLD, with only builders identification enabled; and NoOPT, with both optimizations disabled. The left part of Table 2 summarizes the results of this experiment for the ROOPS benchmark; the right part reports preliminary results on five “real-world” implementations of data structures: LinkedList (21 API methods), TreeSet (22 API methods), TreeMap (32 methods) and HashMap (29 methods) from java.util, and NCL from Apache Collections (20 methods). As with most real-world implementations, these data structures do not come equipped with repOKs; hence we only employ them in this RQ.

The brute-force approach (NoOPT) performs poorly even for the easiest case studies and very small scopes; such scopes are often too small to generate high-quality test suites. State matching is the most impactful optimization, by itself greatly improving performance and scalability across the board (compare the NoOPT and SM results). As expected, builders identification is much more relevant in cases where the number of methods in the API is large (more than 10), and especially in the real-world data structures (with 20 or more API methods). SM/BLD is more than an order of magnitude faster than SM in AVL and RBT, and it reaches one more scope in NCL and LList. The remaining classes of ROOPS have just a few methods, and the impact of using builders is relatively small. The conclusions drawn from ROOPS apply to the other three benchmarks (we omit their results here for space reasons; visit the paper’s website for the complete report [26]). In the real-world data structures, using precomputed builders allowed SM/BLD to scale to significantly larger scopes in all cases but TreeMap and TreeSet, where it still significantly improves running times. Overall, the proposed optimizations have a crucial impact on BEAPI’s performance and scalability, and both should be enabled to obtain good results.

On the cost of builders identification. For space reasons, we report builders identification times on the paper’s website [26]. For the conclusions of this section, it suffices to say that scope 5 was employed for builders identification in all cases, and that the maximum runtime of the approach was 65 seconds in the four benchmarks (ROOPS’ SLL, 11 methods) and 132 seconds in the real-world data structures (TreeMap, 32 methods). We manually checked that the identified methods included a sufficient set of builders in all cases. Notice that BEG is often performed for increasingly larger scopes, and the identified builders can be reused across executions. Builders identification times are thus amortized across different executions, which makes it difficult to attribute them to individual runs; hence we do not include builders identification times in BEAPI’s running times in any of the experiments. Notice that, for the larger scopes, which arguably are the most important, builders identification time is negligible in relation to generation times.

Table 3. Summary of flaws found in repOKs using BEAPI

4.3 RQ3: Analysis of Specifications using BEAPI

RQ3 addresses whether BEAPI can assist the user in finding flaws in repOKs, by comparing the set of objects that can be generated using the API against the set of objects generated from repOK. We devised the following automated procedure. First, we run BEAPI to generate a set SA of structures from the API, and Korat to generate a set SR from repOK, using the same scope for both tools. Second, we canonicalize the structures in both SA and SR using linearization (Section 3.2). Third, we compare SA and SR for equality. Differences in this comparison point out a mismatch between repOK and the API. There are three possible outcomes of this automated procedure. If SA \(\subset \) SR, it may be that the API generates only a subset of the valid structures, that repOK suffers from underspecification (missing constraints), or both. In this case, the structures in SR that do not belong to SA are witnesses of the problem, and the user has to analyze them manually to find out where the error is. Here, we report the (manually confirmed) underspecification errors in repOKs that are witnessed by such structures. Conversely, when SR \(\subset \) SA, it may be that the API generates a superset of the valid structures, that repOK suffers from overspecification (repOK is too strong), or both. The structures in SA that do not belong to SR might point to the root of the error, and again have to be manually analyzed by the user. We report the (manually confirmed) overspecification errors in repOKs that are witnessed by these structures. Finally, it can be the case that some structures in SR do not belong to SA, and other, distinct structures in SA do not belong to SR. These differences might be due to faults in the API, flaws in repOK, or both. We report the manually confirmed flaws in repOKs witnessed by such structures simply as errors (repOK describes a different set of structures than the one it should). Notice that differences in the scope definitions of the two approaches might also make SA and SR differ. This was only the case for the RBT and FibHeap structures, where BEAPI generated a smaller set of structures than Korat for the same scope, due to balance constraints (as explained in Section 4.1). However, these “false positives” are easy to identify, since all the structures generated by Korat were always included in the structures generated by BEAPI when a larger scope was used for the latter approach. Using this insight, we manually discarded the “false positives” due to scope differences in RBT and FibHeap.
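
A minimal sketch of the set comparison at the heart of this procedure follows; SA and SR hold the linearized (canonical) structures produced by BEAPI and Korat, respectively, and the classification of the witnesses is left to manual analysis as described above.

```java
import java.util.*;

// Sketch of the automated repOK-vs-API comparison described above.
static void compare(Set<List<Object>> SA, Set<List<Object>> SR) {
    Set<List<Object>> onlyInRepOK = new HashSet<>(SR);
    onlyInRepOK.removeAll(SA);   // witnesses of underspecification (or API gaps)
    Set<List<Object>> onlyInApi = new HashSet<>(SA);
    onlyInApi.removeAll(SR);     // witnesses of overspecification (or API faults)
    if (onlyInRepOK.isEmpty() && onlyInApi.isEmpty())
        System.out.println("repOK and the API agree for this scope");
    else
        System.out.printf("%d structures only from repOK, %d only from the API%n",
                          onlyInRepOK.size(), onlyInApi.size());
}
```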

The results of this experiment are summarized in Table 3. We found flaws in 9 out of 26 repOKs using the approach described above. The high number of flaws discovered is evidence that problems in repOKs are hard to find manually, and that BEAPI can be of great help for this task.

5 Related Work

BEG approaches have been shown effective in achieving high code coverage and finding faults, as reported in various research papers [4, 16, 20, 33]. Our goal here is not to assess yet again the effectiveness of BEG suites, but to introduce an approach that is straightforward to use in today’s software because it does not require the manual work of writing formal specifications of the properties of the inputs (e.g., repOKs). Different languages have been proposed to formally describe structural constraints for BEG, including Alloy’s relational logic (the so-called declarative style), employed by the TestEra tool [20], and source code in an imperative programming language (the so-called operational style), as used by Korat [4]. The declarative style has the advantage of being more concise and simpler for people familiar with it; however, such familiarity is not common among developers. The operational style can be more verbose, but since specifications and source code are written in the same language, it is most often preferred by developers. UDITA [11] and HyTeK [29] propose to employ a mix of the operational and declarative styles to write specifications, as parts of the constraints are often easier to write in one style or the other. With precise specifications, both approaches can be used for BEG. Still, to use these approaches developers have to be familiar with both specification styles, and take the time and effort required to write the specifications. Model checkers like Java Pathfinder [34] (JPF) can also perform BEG, but the user has to manually provide a “driver” for the generation: a program that the model checker can use to generate the structures that will be fed to the SUT afterwards. Writing a BEG driver often involves invoking API routines in combination with JPF’s nondeterministic operators; hence the developer must become familiar with such operators and put in some manual effort to use this approach. Furthermore, JPF runs on a customized virtual machine in place of Java’s standard JVM, so there is a significant overhead in running JPF compared to the standard JVM (employed by BEAPI). The results of a previous study [32] show that JPF is significantly slower than Korat for BEG; therein, Korat was shown to be the fastest and most scalable BEG approach at the time of publication. This can in part be explained by its smart pruning of the search space of invalid structures and its elimination of isomorphic structures. In contrast, BEAPI does not require a repOK and works by making calls to the API.

An alternative kind of BEG consists of generating all inputs that cover all feasible (bounded) program paths, instead of generating all feasible bounded inputs. This is the approach of systematic dynamic test generation, a variant of symbolic execution [14]. It is implemented by many tools [8, 12, 13, 24], and has been successfully used to produce test suites with high code coverage, to reveal real program faults, and to prove memory safety of programs. Kiasan [9] and FAJITA [1] are also white-box test case generation approaches that require formal specifications and aim for coverage of the SUT.

Linearization has been employed to eliminate isomorphic structures in traditional model checkers [15, 28], and also in software model checkers [35]. A previous study experimented with state matching in JPF and proposed several approaches for pruning the search space of program inputs using linearization, for both concrete and symbolic execution [35]. As stated before, concrete execution in JPF requires the user to provide a driver. The symbolic approach attempts to find inputs to cover paths of the SUT; we perform BEG instead. Linearization has also been employed for test suite minimization [36].

6 Conclusions

Software quality assurance can be greatly improved thanks to modern software analysis techniques, among which automated test generation techniques play an outstanding role [4, 6, 10, 12, 13, 18, 19, 20, 23]. Random and search-based approaches have shown great success in automatically generating test suites with very good coverage and mutation metrics, but their random nature does not allow these techniques to precisely characterize the families of software behaviors that the generated tests cover. Systematic techniques, such as those based on model checking, symbolic execution or bounded exhaustive generation, cover a precise set of behaviors, and thus can provide specific correctness guarantees.

In this paper, we presented BEAPI, a technique that facilitates the application of a systematic technique, bounded exhaustive input generation, by producing structures solely from a component’s API, without the need for a formal specification of the properties of the structures. BEAPI can generate bounded exhaustive suites from components with implicit invariants, and it reduces the burden of providing formal specifications and of tailoring those specifications for improved generation. Thanks to a number of optimizations, including an automated identification of builder routines and a canonicalization/state matching mechanism, BEAPI can generate bounded exhaustive suites with a performance comparable to that of the fastest specification-based technique, Korat [4]. We have also identified characteristics of a component that may make it more suitable for specification-based generation, or for API-based generation.

Finally, we have shown how specification-based approaches and BEAPI can complement each other, by showing how BEAPI can be used to assess repOK implementations. Using this approach, we found a number of subtle errors in repOK specifications taken from the literature. Thus, techniques that require repOK specifications (e.g., [30]), as well as techniques that require bounded exhaustive suites (e.g., [21]), can benefit from the generation technique presented here.