1 Introduction

Thanks to recent advances, saturation-based theorem provers are increasingly used to reason about problems requiring quantified theory-reasoning [4, 6]. One of the standard techniques to enable such reasoning is to automatically add first-order axiomatisations of theories detected in the input [14, 18]. For example, (incomplete) axiomatisations of integer and real arithmetic or McCarthy’s axioms of the theory of arrays [15] are routinely used. While this simple technique is often effective, we observed (see also [21]) two problems inherent to the solution: First, explicit axioms blow up the search space in the sense that a huge amount of consequences can additionally be generated. This happens since theory axioms are often repeatedly combined with certain clauses or among themselves, effectively creating cyclic patterns in the derivation. Most of these consequences would immediately be classified as practically useless by humans. Second, many of the resulting consequences have small weight. This has the unfortunate effect that the age-weight clause selection heuristic [16], predominantly used by saturation-based theorem provers for guiding the exploration of the search-space, often selects these theory-focused consequences. This way the prover is getting lost in parts of the search space where the chance of finding a proof is low.

In this paper, we propose to limit the exploration of theory-focused consequences by extending clause selection to take into account the amount of theory reasoning in the derivation of a clause. Our solution consists of two parts. First, we propose an efficiently computable feature of clauses, which we call th-distance, that measures the amount of theory reasoning in the derivation of a clause (Sect. 3). Second, we turn to the general problem of incorporating a feature to a clause selection strategy. There has been an ongoing interest in this problem [24, 25, 28]. We take inspiration from the layered clause selection approach presented in [28] and introduce the refined notion of multi-split queues, which present a principled solution to the incorporation problem (Sect. 2). We finally obtain a clause selection strategy for theory reasoning by instantiating multi-split queues with the feature th-distance. We implemented the resulting clause selection in the state-of-the-art saturation-based theorem prover Vampire [14], and evaluate its benefits on a relevant subset of the smt-lib benchmark (Sect. 4).

Related Work. There are different approaches to adding support for theory reasoning to saturation-based theorem provers, either by extending the prover’s inference system with dedicated inference rules [2, 10, 12, 13] or using even more fundamental design changes [1, 7, 20, 22]. While such solutions can result in very efficient reasoning procedures, their development is incredibly challenging and their implementation is a huge effort. As a result, only a few theories are covered by such approaches, in contrast to our technique, which applies to arbitrary theories. In particular, our technique can be used by non-experts on custom theory-domains coming from applications for which no dedicated solution exists. Our work has similar motivation to [21], where the authors use the set-of-support strategy [30] to limit the amount of reasoning performed with pure theory consequences. However, unlike our technique, they do not impose any limit on clauses whose derivation contains at least one non-theory-axiom.

Contributions. The summarized contributions of this paper are:

  • A new approach for building clause selection strategies from clause features, based on multi-split queues.

  • A new clause selection strategy for theory reasoning based on the instantiation of multi-split queues with the th-distance-feature measuring the amount of theory reasoning in the derivation of a clause. Our solution applies to arbitrary theories and does not require fundamental changes to the implementation of clause selection.

  • An implementation of the introduced clause selection strategy in the state-of-the-art theorem prover Vampire.

  • An experimental evaluation confirming the effectiveness of the technique, by improving on the existing heuristics by up to 37 % on a relevant set of benchmarks.

2 Layered Clause Selection

We assume the reader to be familiar with the saturation-based theorem proving technology (see, e.g. [3, 17]) and, in particular, with clause selection, the procedure for deciding, at each iteration of a saturation algorithm, which of the currently passive clauses to next select for activation, i.e. for participation in inferences with the previously activated clauses. To agree on terminology, we start this section by recalling clause selection by age and weight. We then move on to explaining layered clause selection.

The two most important features of a clause for clause selection are 1) its age, typically implemented using an ever-increasing “date of birth” timestamp, and 2) weight, which refers to the number of symbols occurring in the clause. A theorem prover prefers to select clauses that are old, which implicitly corresponds to a breadth-first search strategy, and clauses that are light, which is a form of best-first search (clauses with few symbols are cheaper to process, tend to be stronger simplifiers, and are intuitively closer to the ultimate target, the empty clause). In practice, the best performance is achieved by combining these two criteria [16, 25]. This is achieved by storing the passive clauses in two queues, one sorted by age and the other by weight, and setting a ratio to specify how the selection alternates between picking from these two queues.

Layered Selection. In the system description of GKC [28], Tammet describes an idea of using two layers of queues to organise clause selection. The first layer relies on the just-described combination of selection by age and weight. In the second layer, clauses are split into disjoint groups using a certain property (e.g., “being derived from the goal or not” could define two groups), each group is represented by two sub-queues of the first layer, and the decision from which group to select the next clause is dictated by a new second-layer ratio. Although Tammet does not expand much on the insights behind using the layered approach, he reports it highly beneficial for the performance of GKC. In our understanding, the additional layer (in principle, there could be more than two) provides a clean way of incorporating into clause selection a new notion of what a preferred clause should be, without a priori disturbing the already established and tuned primary approach, such as selection by age and weight.Footnote 1

Our preliminary experiments with the idea (instantiated with the derived-from-the-goal property) found it useful, but not as powerful as other goal-directed heuristics in Vampire. In particular, finding a universally good ratio between the “good” clauses and the “bad” ones seemed hard. What we propose here instead (and what also led in our experiment to a greater performance gain) is to instead organise the clauses into groups with “good” ones and “all”. Here the second group contains all the passive clauses and essentially represents a fallback to the original single-layer strategy. The advantage of this new take on layered selection is that a bad clause is only selected if 1) it is time to try a bad clause according to the second-layer ratio and 2) the best bad clause is also the current overall best according to the age-weight perspective. This makes picking a good second-layer ratio much easier. In particular, one can “smoothly” move (by changing the second-layer ratio) from a high preference for the “all” second-layer queue towards selecting more “good” clauses without necessarily having to select any “bad” ones.

Multi-split Queues. We propose multi-split queues to realize layered selection with second layer groups defined by a real-valued clause feature.

Definition 1

Let \(\mu \) be a real-valued clause evaluation feature such that preferable clauses have low value of \(\mu (C)\). Let the cutoffs \(c_1,\ldots ,c_k\) be monotonically increasing real numbers with \(c_k = \infty \), and let the ratio \(r_1:\ldots :r_k\) be a list of positive integer values. These together determine a layered selection scheme with k groups \(\mathcal {C}_i = \{C | \mu (C) \le c_i \}\) for \(i = 1,\ldots ,k\), such that we select from the i-th group with a frequency \(r_i / (\varSigma _{j=1}^k r_j)\).

It is easy to see that multi-split queues generalise the binary “good” vs “all” arrangement, since, thanks to monotonicity of the cutoffs, we have \(\mathcal {C}_i \subseteq \mathcal {C}_{i+1}\). Moreover, since \(c_k=\infty \), \(\mathcal {C}_k\) will contain all the passive clauses.

3 Theory Part

In this section, we instantiate the idea of multi-split queues from Sect. 2 with a concrete clause evaluation feature, which measures the amount of theory reasoning in the derivation of a clause. We assume that the initial clauses given to the saturation algorithm, which we simply refer to as axioms, consists of non-theory axioms obtained by classifying the input problem and theory axioms added to facilitate theory reasoning.

We start by defining the fraction of theory reasoning in the derivation of a general clause. This relies on counting the number of theory axioms, resp. the number of all axioms, in the derivation-tree using running sums.

Definition 2

For a theory axiom C, define both \( thAx (C)\) and \( allAx (C)\) as 1. For a non-theory axiom C, define \( thAx (C)\) as 0 and \( allAx (C)\) as 1. For a derived clause C with parent clauses \(C_1, \dots , C_n\), define \( thAx (C)\) as \(\sum _i thAx (C_i)\) and \( allAx (C)\) as \(\sum _i allAx (C_i)\). Finally, we set \( frac (C) := thAx (C) / allAx (C)\).

Assume now that for a given problem we expect (based on domain knowledge and experience) the fraction of theory reasoning in the final refutation \( frac (\bot )\) to be at most 1/d, for a positive integer d. Our clause evaluation feature th-distance measures how much \( frac (C)\) exceeds the expected “maximally allowed” fraction 1/d. More precisely, th-distance counts the number of non-theory axioms which the derivation of C would additionally need to contain to achieve a ratio 1/d.

Definition 3

The th-distance\(\,: Clauses \rightarrow \mathbb {N}\) is defined as

$$\begin{aligned} {th-distance}(C) := {max}( thAx (C) \cdot d - allAx (C), 0). \end{aligned}$$

Our heuristic is based on the idea that a clause with small th-distance is more likely to contribute to the refutation than a clause with high th-distance. We therefore want to ensure that clause selection focuses on selecting clauses C with a low value th-distance(C). We realize this with the multi-split queues (see Sect. 2), instantiating the clause evaluation feature \(\mu \) by th-distance, resulting in a second layer clause selection strategy with parameters d\(c_1,\ldots ,c_k\) and \(r_1:\ldots :r_k\).

4 Experiments

We implemented the heuristic described in Sect. 3 in Vampire (version 4.4). Our newly added implementation consists of about 900 lines of C++ code and is compatible with both the LRS saturation algorithm [23] and Avatar [29].

For evaluation, we used the following subset of the most recent version (as of January 2020) of SMTLIB [5]: We took all the problems from the sub-logics that contain quantification and theories, such as LIA, LRA, NRA, ALIA, UFDT, ... except for those requiring bit-vector (BV) or floating-point (FP) reasoning, currently not supported by Vampire. Subsequently, we excluded problems known to be satisfiable and those that were provable using Vampire’s default strategy in 10 s either without adding theory axioms or while performing clause selection by age only. This way, we obtained 20795 problems.Footnote 2

Table 1. Comparing clause selection strategies on Vampire’s default configuration.

As a first experiment, we compared the number of problems solved in 10 s by the default strategyFootnote 3 and its various extensions by multi-split queues defined in Sect. 3.Footnote 4 The d-value, cutoffs and ratio values for the heuristic were selected by educated guessing and randomised hill-climbing. Table 1 lists results of the best obtained configurations. It can be seen that already with two second layer queues a substantial improvement of 25.5% over the default is achieved. Moreover, while it is increasingly more difficult to choose good values for the many parameters defining a configuration with multiple queues, their use further significantly improves the number of problems solved.

Table 2. Comparing clause selection strategies on Vampire’s portfolio configuration.

In a second experiment,Footnote 5 we ran Vampire’s strategy schedule for SMTCOMP 2019 [11] on our problems and also the same schedule additionally imposing the most successful second-layer clause selection scheme layered4 from the first experiment. The time limit was 500 s per problem. Table 2 shows the results.

We can see that the version with second-layer queues improved over the standard schedule by 150 solved problems. This is a very significant result, suggesting the achieved control of theory reasoning is incredibly helpful. Moreover, one should keep in mind that strategies in a schedule are carefully selected to complement each other and even locally good changes in the strategies often destroy this complementarity (cf., e.g., [19, 21]). In our case, however, we achieve an improvement despite this looming negative effect. Finally, it is very likely that a new schedule, constructed while taking our new technique into account, will be able to additionally cover some of the 194 problems currently only solved by the unaltered schedule.

5 Conclusion

We introduced a new clause selection heuristic for reasoning in the presence of explicit theory axioms. The heuristic is based on the combination of multi-split queues and a new clause-feature measuring the amount of theory reasoning in the derivation of a clause. Our experiments show that the new heuristic significantly improves the existing state-of-the-art clause selection strategy. As future work, we want to extend layered clause selection with new clause-features and combine it with the machine-learning-based approach in the style of ENIGMA [8].