
1 Introduction

The satisfiability problem (SAT) asks whether a given propositional formula F has a satisfying assignment. Since Cook’s NP-completeness proof of the problem [21], SAT has been believed to be computationally intractable in the worst case. In the field of applied SAT solving, however, there have been enormous improvements in the performance of SAT solvers over the last 20 years. Motivated by these significant improvements, SAT solvers have been applied to an increasing number of areas, including bounded model checking [15, 19], cryptology [23], and even bioinformatics [35], to name just a few. Two algorithmic paradigms turned out to be especially promising for constructing solvers. The first is conflict-driven clause learning (CDCL) [13, 36, 39]. The CDCL procedure systematically explores the search space of all possible assignments for the formula F in question by constructing partial assignments in an exhaustive branching and backtracking search. Whenever a conflict occurs, a reason (a conflict clause) is learned and added to the original clause set [16, 42]. The other successful paradigm is stochastic local search (see [16, Chapter 6] for an overview). Starting from a complete initial assignment for the formula, the neighborhood of this assignment is explored by moving from assignment to assignment in the space of solution candidates, trying to minimize the number of clauses unsatisfied by the assignment (or some other criterion). This movement is usually performed by flipping one variable of such a complete assignment. Both paradigms are described in more detail in Sect. 2.

Besides the difference in the completeness of assignments considered during a run of those algorithms, another major difference between both paradigms is the completeness or incompleteness of the solvers (i. e., being able to certify both satisfiability and unsatisfiability, or not) [42]. CDCL solvers are complete, while SLS algorithms are incomplete. Perhaps more interesting for practitioners is the complementary behavior these paradigms exhibit in terms of performance: CDCL seems well suited for application instances, whereas SLS excels on random formulas. An interesting question thus is whether it is possible to combine the strengths of both solvers or to eliminate the weaknesses of one paradigm by an oracle-assistance of the other. This challenge was posed by Selman et al. in [43, Challenge 7] (and again later in [30]):

“Demonstrate the successful combination of stochastic search and systematic search techniques, by the creation of a new algorithm that outperforms the best previous examples of both approaches.”

This easy-to-state question turns out to be surprisingly challenging. There have been some advances towards this goal, as we survey below. However, the performance of most algorithms that try to combine the strengths of both paradigms, so-called hybrid solvers, is far from that of CDCL solvers (or other non-hybrids), especially on application instances [2] or even any wider range of benchmark problems [30]. In [27], the DPLL algorithm Satz [34] was used to derive implication dependencies and equivalencies between literals in WalkSAT [37].

The effect of a restricted form of resolution on clause weighting solvers was investigated in [1]. A similar approach was previously studied in [18], where new resolvent clauses, based on unsatisfied clauses at local minima and randomly selected neighboring clauses, are added.

Local search over partial assignments instead of complete assignments, extended with constraint propagation, was studied in [29]. In case of a conflict, a conflict clause is learned, and local search is used to repair this conflict. A similar approach to constructing a hybrid solver, using SLS as the main solver and a complete CDCL solver as a sub-solver, was also studied in [4, 7], where the performance of the solvers hybridGM, hybridGP, and hybridPP was empirically analyzed. The idea of these solvers is to build a partial assignment around one complete assignment from the search trajectory of the SLS solver. This partial assignment can then be applied to the formula, resulting in a simpler one, which is solved by a complete CDCL solver.

A shared memory approach for multi-core processor architectures was proposed in [32]. In this case, DPLL can provide guidance for an SLS solver, being run simultaneously on a different core.

The solver hbisat, introduced in [24], uses the partial assignments calculated by CDCL to initialize the SLS solver. The SLS solver sends unsatisfied clauses to the CDCL solver to either identify an unsatisfiable subformula of those clauses or satisfy them. This approach was later significantly improved by Letombe and Marques-Silva in [33].

Audemard et al. [2] introduced SatHys, where both components cooperate by alternating between them. E. g., when CDCL chooses a variable to branch, its polarity is extracted from the best complete assignment found by the SLS solver. On the other hand, CDCL helps SLS out of local minima (i. e., CDCL is invoked conditionally in this solver).

Our Contribution. Our hybrid solver GapSAT differs from the approaches described above in that CDCL is used as a preprocessor for probSAT [11] (an SLS solver), terraforming the landscape in advance. In many cases, this approach eliminates the possibility for probSAT to get stuck in local minima, obviating the need for further, more complicated interactions between both paradigms.

We examine the question of whether it is beneficial for SLS to add new clauses to the original formula in a preprocessing step by invoking a complete CDCL solver. As it turns out, not all additional clauses are created equal. Experimentally, we demonstrate that adding clauses that contain a larger number of correct literals w. r. t. a fixed solution drastically improves the performance of SLS solvers. However, clauses that contain only few correct literals w. r. t. the fixed solution can be deceptive for SLS. This effect can exponentially increase the runtime as measured in the number of flips of probSAT, a very simple SLS solver. In practice, one has to resort to known complete algorithms or proof systems to generate helpful clauses. In particular, we investigate the effect of new clauses learned by CDCL and by depth-limited resolution (Sect. 3). With the help of experiments, we conclude that CDCL (or resolution limited to depth 2) produces distinctly more helpful clauses for probSAT than resolution limited to depth 1, as was studied in the past [1]. We therefore focus our effort on CDCL as a clause learning mechanism for probSAT.

Motivated by these insights, we study in more detail the quality that CDCL-learned clauses have for SLS. In training experiments, described in Sect. 4, we systematically deduce parameter settings by statistical analysis that increase this quality. For example, shorter clauses learned by CDCL are more beneficial; as it turns out, however, the specific width depends on the underlying formula. Another interesting observation is that the number of added clauses has to be carefully restricted. Again, the specific restriction to use depends on the underlying formula.

To finally test the concrete effect of these ideas, we compared the performance of a newly designed solver with the winner of the random track of the SAT Competition 2018, Sparrow2Riss [10]. Our observations were implemented in GapSAT, which is a combination of Glucose and probSAT. A comprehensive experimental evaluation on 255 instances provides statistical evidence that the performance of our proposed solver GapSAT substantially exceeds that of Sparrow2Riss. In particular, GapSAT was able to solve more instances in just 30 s than Sparrow2Riss in 5000 s. We present a summary of our experimental evaluation in Sect. 5.

2 Preliminaries

We briefly reiterate the notions necessary for this work. For a thorough introduction to the field, we refer the reader to [42]. A literal over a Boolean variable x is either x itself or its negation \(\overline{x}\). A clause \(C = a_1 \vee \dots \vee a_\ell \) is a (possibly empty) disjunction of literals \(a_i\) over pairwise disjoint variables. A CNF formula \(F = C_1 \wedge \dots \wedge C_m\) is a conjunction of clauses. A CNF formula is a k-CNF if all clauses in it have at most k variables. An assignment \({\alpha }\) for a CNF formula F is a function that maps some subset of \(\mathrm {Vars}({F})\) to \(\{0,1\}\). Given a complete assignment \({\alpha }\), the act of changing the truth value of precisely one variable of \({\alpha }\) is called a flip. Resolution is the proof system with the single derivation rule \(\frac{B \vee x \quad C \vee \overline{x}}{B \vee C}\), where B and C are clauses.
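To make these notions concrete, the following minimal Python sketch (our illustration, not part of the paper's tooling) uses the common DIMACS convention of encoding a literal over variable x as the integer x or \(-x\), and implements the resolution rule and the flip operation:

```python
# Illustrative sketch: literals are nonzero integers in the DIMACS style,
# i.e., the literal x is the positive integer x and its negation is -x;
# a clause is a frozenset of literals.

def resolve(b, c, x):
    """Resolvent of clauses b (containing x) and c (containing -x)."""
    assert x in b and -x in c
    return (b - {x}) | (c - {-x})

def flip(alpha, x):
    """Flip the truth value of variable x in the complete assignment alpha."""
    return {**alpha, x: not alpha[x]}

# Example: resolving (x1 v x2) with (-x1 v x3) yields (x2 v x3).
print(sorted(resolve(frozenset({1, 2}), frozenset({-1, 3}), 1)))  # [2, 3]
```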

CDCL. CDCL solvers, introduced in [13, 36, 39], construct a partial assignment. When some clause is falsified by the constructed assignment, the CDCL solver adds a new clause to the original formula F. This clause is a logical consequence of F. A more detailed description of CDCL can be found in [16, 40, 42]. Modern SAT solvers are additionally equipped with incremental data structures, restart policies [26], and activity-based variable selection heuristics (VSIDS) [39]. In this work, we use the CDCL solver Glucose [3] (based on MiniSat [22]).

probSAT. Contrary to CDCL-like algorithms, algorithms based on stochastic local search (SLS) operate on complete assignments for a formula F. These solvers start with a randomly generated complete initial assignment \({\alpha }\). If \({\alpha }\) satisfies F, a solution is found. Otherwise, the SLS solver tries to find a solution by repeatedly flipping the assignment of variables according to some underlying heuristic. That is, it performs a random walk over the set of complete assignments for the underlying formula.

In [11], the probSAT class of solvers was introduced. Over the last few years, probSAT-based solvers have performed excellently on random instances: probSAT won the random track of the SAT Competition 2013, dimetheus [9] won in 2014 and 2016, and YalSAT [14] won in 2017. Only recently, in 2018, did other types of solvers significantly exceed probSAT-based algorithms. This performance is the reason for choosing probSAT in this study.

The idea behind the solver is to use a function f that assigns a high probability to a variable if flipping it is deemed advantageous. A description of probSAT is given in Algorithm 1. This class of solvers is related to Schöning’s random walk algorithm introduced in [41].

Algorithm 1. The probSAT algorithm.

In [11], the break-only-poly-algorithm with \(f(x,{\alpha }) := \big ( \varepsilon + {\text {break}}(x,{\alpha }) \big )^{-b}\) was considered for 3-SAT, where \({\text {break}}(x,{\alpha })\) is the number of clauses that are satisfied under \({\alpha }\) but will be falsified when the assignment of x is flipped. For \(k \ne 3\), the break-only-exp-algorithm \(f(x,{\alpha }) := b^{-{\text {break}}(x,{\alpha })}\) was studied. Balint and Schöning [11] found good choices for the parameters of these two functions. In this work, we have adopted these parameter settings.
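For illustration, the following Python sketch paraphrases Algorithm 1 together with the two scoring functions, using the clause representation from the sketch above. It is a minimal reimplementation for exposition only; the default parameter values shown are placeholders, not the tuned settings of [11]:

```python
import random

def break_count(formula, alpha, x):
    """Number of clauses satisfied under alpha that become falsified
    when the assignment of x is flipped."""
    def sat(clause, a):
        return any((lit > 0) == a[abs(lit)] for lit in clause)
    flipped = {**alpha, x: not alpha[x]}
    return sum(1 for c in formula
               if x in {abs(l) for l in c} and sat(c, alpha) and not sat(c, flipped))

def probsat(formula, n_vars, max_flips, b=2.0, eps=0.9, poly=True):
    """Sketch of the probSAT loop: pick an unsatisfied clause, then flip one
    of its variables with probability proportional to f(x, alpha)."""
    alpha = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
    for _ in range(max_flips):
        unsat = [c for c in formula
                 if not any((l > 0) == alpha[abs(l)] for l in c)]
        if not unsat:
            return alpha                      # satisfying assignment found
        clause = random.choice(unsat)
        vars_in_c = [abs(l) for l in clause]
        if poly:   # break-only-poly, used for 3-SAT
            weights = [(eps + break_count(formula, alpha, x)) ** (-b)
                       for x in vars_in_c]
        else:      # break-only-exp, used for k != 3
            weights = [b ** (-break_count(formula, alpha, x))
                       for x in vars_in_c]
        x = random.choices(vars_in_c, weights=weights)[0]
        alpha[x] = not alpha[x]               # perform the flip
    return None                               # flip budget exhausted
```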

3 The Quality of Learned Clauses

In this section, we investigate the effect logically equivalent formulas have on the SLS solver probSAT. More precisely, we use a formula F as a base and add a set of clauses \(S=\{C_1, \dots , C_t\}\) to F to obtain a new formula \(G := F \cup S\). In general, adding new clauses to a formula F does not yield a logically equivalent new formula G.

Thus, we first consider two artificial models related to the backbone (see, e. g., [31]) and restrict ourselves to the 3-CNF case in the following. The backbone \(\mathcal {B}(F)\) is the set of literals appearing in all satisfying assignments of F. In the first model, each new clause consists of one backbone literal \(x\in \mathcal {B}(F)\) and two literals y, z such that their complements are backbone literals, i. e., y and z do not occur in any solution. We call this the deceptive model. In the second model, each new clause has one backbone literal and two randomly chosen literals. This is the general model.
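Assuming the backbone is known, which is only feasible for such artificial models, the two clause constructions can be sketched as follows. The integer literal encoding from the sketch above is reused, and `backbone` is assumed to contain one literal per backbone variable:

```python
import random

def deceptive_clause(backbone):
    """One backbone literal plus two literals whose complements are backbone
    literals (i.e., two literals occurring in no solution).
    Requires |backbone| >= 3."""
    x = random.choice(sorted(backbone))
    y, z = (-l for l in random.sample(sorted(backbone - {x}), 2))
    return frozenset({x, y, z})

def general_clause(backbone, n_vars):
    """One backbone literal plus two literals chosen at random
    (random variables, random polarity)."""
    x = random.choice(sorted(backbone))
    rand_lits = [random.choice([v, -v])
                 for v in random.sample(range(1, n_vars + 1), 2)]
    return frozenset({x, *rand_lits})
```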

Fig. 1. On the left, the effect of deceptive clauses is displayed on an instance with 100 variables and 423 clauses. On the right, the effect of general clauses is displayed on an instance with 500 variables and 2100 clauses. The x-axes denote the number of additional clauses, and the y-axes denote the average runtime of 100 runs of probSAT as measured in the number of flips. Both y-axes are scaled logarithmically.

Figure 1 displays the effect of both models on probSAT. The left-hand side shows the deceptive model. Generally, a large number of deceptive clauses has a harmful effect on the runtime of probSAT: the average runtime of probSAT increases exponentially with the number of added clauses.

The right-hand side of Fig. 1 shows the general model. Here, we can observe a strong, positive effect on the behavior of probSAT: the average runtime of probSAT improved by two orders of magnitude when 200 new clauses generated by the general model were added. Even though Fig. 1 depicts the data of only one instance, the general shape of the plot is similar on all tested instances.

Clauses generated by the deceptive model seem to give rise to new local minima which are far away from the solutions. Once probSAT is stuck in such a local minimum, the break-value makes it unlikely that probSAT escapes the region of the local minimum. On the other hand, the prevalence of correct literals in the general model seems to guide probSAT towards a solution. Due to this interpretation, we call clauses that have a high number of correct literals w. r. t. a fixed solution high-quality clauses. The view that clauses with few correct literals have a detrimental effect on local search solvers is also supported by the literature [28, 42].
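Under this terminology, the quality of a clause w. r. t. a fixed solution is simply its number of correct literals; a one-function sketch (same literal encoding as above):

```python
def quality(clause, solution):
    """Number of literals in the clause that are satisfied by (i.e., are
    'correct' w.r.t.) the fixed solution, given as a dict var -> bool."""
    return sum(1 for lit in clause if (lit > 0) == solution[abs(lit)])

# Example: w.r.t. the solution x1=1, x2=0, x3=1, the clause
# (x1 v -x2 v -x3) has quality 2.
print(quality(frozenset({1, -2, -3}), {1: True, 2: False, 3: True}))  # 2
```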

From the considerations described above, it should be evident that it is crucial which clauses are added to the formula. Clearly, neither the deceptive nor the general model can be applied to real instances: The solution space would have to be known in advance to generate the clauses. In contrast, approaches like resolution and CDCL can be applied to real instances. All clauses which can be derived by resolution are already implied by the original formula. Accordingly, adding such a clause to the original formula yields a logically equivalent formula. Similarly, clauses learned by a CDCL algorithm can be added to obtain a logically equivalent formula.

In the following, we compare two models based on resolution and one model based on CDCL. In particular, let F be a formula and let \(B, C \in F\) be clauses such that there is a resolvent R. We call R a level 1 resolvent. Secondly, let D, E be clauses such that there is a resolvent S, where D or E (or both) are level 1 resolvents. We call S a level 2 resolvent. As a representative of CDCL solvers, we use Glucose [3].

Let F be a 3-CNF formula with m clauses. New and logically equivalent formulas \(F_1\), \(F_2\), and \(F_C\) are obtained in the following manner.

  • \(F_1\): Randomly select at most m/10 level 1 resolvents of maximum width 4 and add them to F (see the code sketch below).

  • \(F_2\): Randomly select at most m/10 level 2 resolvents of maximum width 4 and add them to F.

  • \(F_C\): Randomly select at most m/10 learned clauses of maximum width 4 from Glucose (with a time limit of 300 s) and add them to F.
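A minimal sketch of the \(F_1\) construction under the clause representation used above; \(F_2\) would apply one further round of resolution involving level 1 resolvents, and \(F_C\) would instead sample from the clauses Glucose learned within 300 s. The function names are ours:

```python
import random

def resolvents(clauses_a, clauses_b, max_width=4):
    """All resolvents of a clause from clauses_a with one from clauses_b,
    keeping only non-tautological resolvents of at most max_width literals."""
    out = set()
    for b in clauses_a:
        for c in clauses_b:
            for lit in b:
                if -lit in c:
                    r = (b - {lit}) | (c - {-lit})
                    if len(r) <= max_width and not any(-l in r for l in r):
                        out.add(frozenset(r))
    return out

def make_f1(formula, max_width=4, seed=0):
    """F_1: add up to m/10 randomly selected level 1 resolvents to F."""
    rng = random.Random(seed)
    level1 = sorted(resolvents(formula, formula, max_width), key=sorted)
    k = min(len(formula) // 10, len(level1))
    return list(formula) + rng.sample(level1, k)

# F_2 would additionally resolve level 1 resolvents against the formula
# (and against each other) before sampling; F_C would sample from the
# learned-clause output of a 300 s Glucose run instead.
```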

The average behavior of probSAT over 1000 runs per instance on the instance types \(F_1\), \(F_2\), and \(F_C\) is observed. We use a small testbed of 23 uniformly generated 3-CNF instances with 5000 to 11 600 variables and a clause-to-variable ratio of 4.267. The instances of type \(F_1\) were the most challenging for probSAT; as a matter of fact, \(F_1\) instances were considerably harder to solve than the original instances. On instances of type \(F_2\), probSAT performed better, and on \(F_C\), it was even more efficient. The t-test [44] confirms these observations: \(F_2\) instances are easier on average than \(F_1\) instances (\(p<0.01\)), and \(F_C\) instances are easier than \(F_2\) instances (\(p<0.05\)). Sections 4 and 5 present an in-depth examination of the effect clauses of type \(F_C\) have on probSAT.

These results lead us to believe that level 1 clauses are of low quality, while level 2 and CDCL clauses are generally of higher quality. It is impractical to confirm this suspicion on uniformly generated instances of the above-mentioned size, since measuring clause quality requires knowing a solution in advance. Hence, we use randomly generated models with hidden solutions [12] and judge the quality of learned clauses based on the hidden solution.

The SAT competition 2018 incorporated three types of models with hidden solutions. All three types are generated in a similar manner; they just differ in the choice of the parameters. Here, we compare the average quality of the new clauses on each of the three models. For each instance, the set of all level 1, level 2, and CDCL clauses is computed, and the quality is measured w. r. t. the hidden solution.

For the most part, the results confirm the observations from the uniformly generated instances: On all three models, level 2 clauses have a statistically significantly higher quality than level 1 clauses (t-test, all \(p < 0.01\)). On two of three domains, CDCL clauses have higher quality than level 2 clauses (t-test, both \(p < 10^{-5}\)), while level 2 clauses have higher quality on the remaining domain (t-test, \(p<10^{-8}\)).

As a side note, CDCL is capable of learning unit and binary clauses. Nevertheless, this did not influence the quality of the clauses in any meaningful way: across the 120 test instances, only a single binary clause and no unit clauses were learned.

In conclusion, we conjecture that level 2 and CDCL clauses have higher quality than level 1 clauses. On the uniform random testbed, CDCL performs better than level 2 clauses; also, CDCL clauses have higher quality than level 2 clauses on two of the three hidden solution domains.

4 Training Experiments

In the previous section, we argued that adding supplementary clauses to an instance can have a positive effect on the behavior of probSAT. The focus of this section is the question of which clauses, and how many, should be added.

Especially for 3-SAT instances, an initial guess might be to add all clauses obtainable by so-called ternary resolution [17]. Informally speaking, ternary resolution is the restriction of the resolution rule to ternary clauses such that the resolvent is either a binary or a ternary clause; it is performed until saturation. In [8], the effect of (amongst other techniques) ternary resolution on another SLS solver, Sparrow (see [6]), is observed. They empirically show that ternary resolution has a negative effect on the performance on satisfiable hard combinatorial instances. Anbulagan et al. [1] study the effect of ternary resolution on uniform random instances. They found that SLS solvers do not benefit from ternary resolution; they even conjecture that ternary resolution has a harmful impact on the runtime of SLS solvers on uniform instances. We performed some experiments of our own and can confirm this suspicion for probSAT: on medium-sized uniform instances, ternary resolution slowed probSAT down by \(0.5\%\) on average. As a consequence, we focus on methods to improve the runtime behavior of probSAT with clauses learned by Glucose for the rest of this work.
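For concreteness, the following sketch shows our reading of ternary resolution performed until saturation: only clauses of width at most three are resolved, and only non-tautological binary or ternary resolvents are kept. This is an illustrative reimplementation, not the code of [17]:

```python
def ternary_resolution_closure(formula):
    """Repeatedly add resolvents of clauses of width <= 3 whose resolvent is
    a non-tautological binary or ternary clause, until saturation."""
    clauses = {frozenset(c) for c in formula if len(c) <= 3}
    changed = True
    while changed:
        changed = False
        for b in list(clauses):
            for c in list(clauses):
                for lit in b:
                    if -lit in c:
                        r = (b - {lit}) | (c - {-lit})
                        if (2 <= len(r) <= 3
                                and not any(-l in r for l in r)
                                and r not in clauses):
                            clauses.add(r)
                            changed = True
    return clauses
```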

The supplementary clauses are all learned by Glucose within a 300-second time window; we only distinguish the learned clauses by their width. The number of supplementary clauses is measured as a percentage of the number of original clauses. To put it differently, we are interested in the maximal length of the new clauses and in the percentage of the modified formula that should consist of new clauses. The results of this section are used to configure GapSAT.

Description of Training Experiments

We split the experiments into two phases. In the preliminary phase, promising intervals for the maximal width and the maximal percentage of new clauses are obtained. In the subsequent phase, the most advantageous parameter combination is sought. Hereafter, we describe the setup of the experiments and their results.

Training Data. We used a set of training instances \(\mathcal {C}\), which was assembled as follows: all instances of the random tracks of the SAT Competitions 2014 to 2017 were gathered. We filtered these instances by proven satisfiability: an instance was added to the training set \(\mathcal {C}\) if and only if at least one participating solver showed satisfiability. Since not enough uniform random 3-SAT instances of medium size were in \(\mathcal {C}\), we added all instances of this kind from the SAT Competition 2013 as well. In total, \(\mathcal {C}\) consists of 377 instances, which can be divided into the following three domains:

  • 120 randomly generated instances with a hidden solution [12],

  • 149 uniformly generated random 3-, 5-, and 7-SAT instances of medium size, with a clause-to-variable ratio close to the satisfiability threshold [38],

  • 108 uniform random 3-, 5-, and 7-SAT instances of huge size, i. e., with over 50 000 variables, where the clause-to-variable ratio of each instance is somewhat far from the satisfiability threshold.

Training Setup. The experiments were performed on the bwUniCluster and a local server. Sputnik [45] helped to parallelize the trials. The setup of the computer systems is heterogeneous; therefore, the runtimes are not directly comparable to one another. Consequently, we do not use the runtimes for these experiments. Instead, the number of variable flips performed by probSAT is used, which is a hardware-independent performance measure.

In this section, we use a timeout of \(10^9\) flips for 3-SAT instances (5-SAT: \(5 \cdot 10^8\); 7-SAT: \(2.5 \cdot 10^8\)). This timeout corresponds to roughly 10 min of runtime on medium-sized instances on our hardware. Each instance from \(\mathcal {C}\) is run 1000 times for each parameter combination. The primary performance indicator in this section is the number of timeouts per instance. Furthermore, the average par2 value is sometimes used as a secondary performance indicator: the par2 value of a run is the number of flips if a solution was found, and twice the timeout otherwise. For the rest of this work, probSAT refers to probSAT version SC13_v2 [5].
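The par2 measure can be stated compactly in code; a small helper sketch, with the flip timeouts from the text (None encodes a timed-out run):

```python
FLIP_TIMEOUT = {3: 10**9, 5: 5 * 10**8, 7: 25 * 10**7}  # per k-SAT class

def par2(flips_or_none, k):
    """par2 value of a single run: the number of flips if a solution was
    found, and twice the flip timeout otherwise."""
    return flips_or_none if flips_or_none is not None else 2 * FLIP_TIMEOUT[k]

def avg_par2(runs, k):
    """Average par2 value over a list of runs (None encodes a timeout)."""
    return sum(par2(r, k) for r in runs) / len(runs)
```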

Results of Training Experiments

We conducted a thorough statistical analysis of the data that was obtained in the training experiments described above. We describe our findings in condensed form below. The main part of the remainder of this section is concerned with uniform, medium-sized instances. The results for uniform, huge instances, and instances with a hidden solution are briefly discussed at the end of this section.

3-SAT. We found that adding all clauses up to width 4 that Glucose could find within 300 s is the most beneficial configuration. Not limiting the number of added clauses stands in stark contrast to the 5-SAT and 7-SAT cases. For 3-SAT, the relationship between the number of added clauses and the performance is explored in Fig. 2. Each blue dot corresponds to one medium-sized 3-SAT instance from \(\mathcal {C}\). We compare the average par2 value on the original instance with the average par2 value on the instance with all clauses up to width 4 added. Whenever a blue dot lies below the zero-baseline, the performance of probSAT on the modified instance was better. The blue line is obtained by linear regression; by its slope, we can tell that, on average, adding more clauses is beneficial. The light blue area denotes the 95% confidence interval, which is calculated by bootstrapping [25]. The confidence interval shows that this relationship is unlikely to be due to chance. We conclude that the number of new clauses should not be limited for 3-SAT instances. On the other hand, the maximal width of the new clauses should be no more than four: our experiments showed that adding longer clauses deteriorates the performance of probSAT.

Fig. 2. In this plot, probSAT is compared on the original 3-SAT instances and the modified instances. The number of added clauses in % is on the x-axis. The y-axis is the logarithm of par2(modified)/par2(orig). The blue line denotes a linear regression fit, and the light blue area is the 95% confidence interval obtained by 1000 bootstrapping steps [25]. (Color figure online)
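The regression line and its confidence band in Fig. 2 follow a standard recipe, which can be sketched with NumPy as follows; the arrays x and y are placeholders for the data plotted in Fig. 2:

```python
import numpy as np

def bootstrap_regression(x, y, n_boot=1000, seed=0):
    """Fit y = a*x + b, then bootstrap the fit to obtain a pointwise 95%
    confidence band for the regression line."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(x.min(), x.max(), 100)
    lines = np.empty((n_boot, grid.size))
    for i in range(n_boot):
        idx = rng.integers(0, len(x), len(x))   # resample with replacement
        a, b = np.polyfit(x[idx], y[idx], deg=1)
        lines[i] = a * grid + b
    lo, hi = np.percentile(lines, [2.5, 97.5], axis=0)
    a, b = np.polyfit(x, y, deg=1)              # fit on the full data
    return grid, a * grid + b, lo, hi

# x: number of added clauses in percent; y: log(par2(modified)/par2(orig)).
```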

The left-hand side of Fig. 3 shows the overall performance of probSAT on instances with all clauses up to width 4 added. The y-axis denotes the difference in timeouts between the modified instance with all width 4 clauses and the original instance. Whenever a dot lies below the zero-baseline, the performance of probSAT was better on the modified instance. The color of the dots encodes the hardness as measured in the average par2 value on the original instance. We can see that adding additional clauses has a positive effect, especially on hard instances with 4000 to 9000 variables. Nonetheless, the effect reverses for more than 9000 variables. Overall, the results on 3-SAT instances are not statistically significant (t-test, \(p=0.0595\)). However, we believe that the main reason for this is the bad performance of probSAT on the modified instances with more than 9000 variables; with a slightly larger sample size, the results might turn out to be statistically significant. We have used these observations in the configuration of GapSAT, as depicted in Fig. 5.

Fig. 3. Scatterplot of the number of variables of the 3-SAT (left) and 5-SAT (right) instances against the difference in timeouts between probSAT on the original instance and the instance with the modified strategy. The color encodes the hardness of the original instance as measured in the logarithmic average par2 value over 1000 runs of probSAT on the original instances. (Color figure online)

5-SAT. In preliminary experiments, we found that the maximal width of the new clauses should be in the interval \(\{7,8,9\}\), and the number of new clauses should be at most \(15\%\) of the original clauses. The effect of adding too many clauses is especially pronounced: adding more than \(15\%\) diminished the performance of probSAT dramatically, in contrast to the 3-SAT case, where more clauses turned out to be beneficial.

In the detailed phase of the experiments, we found that the best configuration is adding clauses up to width 8 with a limit of at most \(5\%\) of the original clauses. The results of this parameter configuration are shown on the right-hand side of Fig. 3. Again, the performance of probSAT was better on the modified instances if the dot lies below the zero-baseline, and the color of the dots encodes the hardness of the instance. Overall, the modification has a favorable impact on the performance of probSAT over the full domain. The effect is statistically significant (t-test, \(p=0.0348\)). Also, it appears to increase as the number of variables increases; however, we did not further investigate this relationship.

7-SAT. The preliminary experiments showed that the maximal width of the new clauses should be in the interval \(\{9,10,11\}\). Moreover, similarly to the 5-SAT case, the number of new clauses should be limited: in the preliminary phase, we found that at most \(3\%\) of the original clauses should be added, otherwise the performance of probSAT decreases.

The detailed phase showed that clauses up to width 9 with a limit of at most \(1\%\) form the most advantageous combination. Figure 4 shows the results of this combination. The modified strategy was better on average if the corresponding dot lies below the zero-baseline. Again, we observe that the performance of probSAT clearly benefits from the modified instances, especially on hard instances (red dots). This observation is also confirmed by the t-test (\(p=0.0062\)). Additionally, similar to the 5-SAT instances, the effect seems to increase as the number of variables increases.

Fig. 4. Scatterplot of the number of variables of the 7-SAT instances against the difference in timeouts between probSAT on the original instance and the instance with the modified strategy. The color encodes the hardness of the original instance as measured in the logarithmic average par2 value over 1000 runs of probSAT on the original instances. (Color figure online)

Hidden Solution. In our training set \(\mathcal {C}\), all instances with a hidden solution are 3-SAT instances with few variables (at most 540). The results are similar in nature to those discussed in the paragraph about uniform medium 3-SAT instances. That is, the addition of new clauses of maximal width 4 and no limit on the number of new clauses is generally beneficial.

Huge Instances. The huge instances in the training set \(\mathcal {C}\) often have several million original clauses. In contrast, only a few new clauses are learned during the preprocessing time. Consequently, the effect of additional clauses is negligible. The preprocessing step should, therefore, be avoided on these instances.

Description of GapSAT

The name GapSAT stands for Glucose assisted probSAT, hinting at the combination of probSAT as the core solver, assisted by a Glucose preprocessing phase. The exact functioning principle of GapSAT is depicted in the flowchart of Fig. 5.

Fig. 5. Flowchart description of GapSAT.

As was noticeable in Fig. 3, if a 3-CNF formula contains more than approx. 9000 variables, the act of adding new clauses slows down probSAT. Furthermore, on huge instances, the preprocessing step yields no advantage. Thus, for over 9000 variables, the strategy of GapSAT falls back to plain probSAT on the original formula. Otherwise, a short run of probSAT, limited by a number of flips, is used to filter out very easy-to-solve instances. If the instance could not be solved, we employ the strategy (depending on the maximal clause width in the formula) that was deemed most promising in the evaluations described in the previous subsection. That is, we first let Glucose extract clauses; its runtime was limited to 300 s in all cases. In the 5-SAT and 7-SAT cases, Glucose could finish earlier if the restrictions on the number of added clauses were met. We again emphasize that Fig. 2 explains the difference between the 3-SAT case and the 5-SAT and 7-SAT cases: not restricting the number of learned clauses in the 3-SAT case turned out to be the superior strategy in our training experiments. One should further observe that Glucose may itself solve the instance during its run. If this was not successful, probSAT is restarted on the formula that was modified by running Glucose and adding the clauses corresponding to the strategy developed in the previous subsection. It is noteworthy that GapSAT does not use any additional preprocessing techniques; we refer to Sect. 6 for a further discussion of that point.
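The control flow of Fig. 5 can be summarized in code. The following is a schematic sketch only: run_probsat and glucose_learn are hypothetical stand-ins for the actual solver invocations, SHORT_RUN_FLIPS is a placeholder for the flip limit of the filtering run, and the width/percentage limits are those derived in Sect. 4:

```python
# Schematic sketch of GapSAT's control flow (Fig. 5). run_probsat and
# glucose_learn are hypothetical stand-ins for the real solver calls.

SHORT_RUN_FLIPS = 10**7   # placeholder flip limit for the filtering run

LIMITS = {3: (4, None),   # 3-SAT: width <= 4, unlimited number of clauses
          5: (8, 0.05),   # 5-SAT: width <= 8, at most 5% of |F| new clauses
          7: (9, 0.01)}   # 7-SAT: width <= 9, at most 1% of |F| new clauses

def gapsat(formula, n_vars, k, run_probsat, glucose_learn):
    # Huge 3-CNF instances: fall back to plain probSAT on the original F.
    if k == 3 and n_vars > 9000:
        return run_probsat(formula, flip_limit=None)

    # Short probSAT run to filter out very easy instances.
    result = run_probsat(formula, flip_limit=SHORT_RUN_FLIPS)
    if result is not None:
        return result

    # Glucose phase (at most 300 s); it may solve the instance itself.
    width, frac = LIMITS[k]
    max_clauses = None if frac is None else int(frac * len(formula))
    learned, solved = glucose_learn(formula, max_width=width,
                                    max_clauses=max_clauses, time_limit=300)
    if solved is not None:
        return solved

    # Restart probSAT on the modified (logically equivalent) formula.
    return run_probsat(formula + learned, flip_limit=None)
```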

5 Experimental Evaluation

In the following, the performance of GapSAT is evaluated. We compare GapSAT with the winner of the random track at the SAT competition 2018, Sparrow2Riss, and with the original version of probSAT.

Table 1. GapSAT, Sparrow2Riss, and probSAT are compared based on the number of solved instances and the corresponding score.

All experiments were executed on a computer with 32 Intel Xeon E5-2698 v3 CPUs running at 2.30 GHz. We set the time limit to 5000 s and used no memory limit. The benchmarks consist of all 255 instances of the random track at the SAT competition 2018. Unlike the experiments in Sect. 4, the performance of each solver is measured based on its par2 value w. r. t. the runtime in seconds. In the following, the score denotes the sum of the par2 values over all instances.

Fig. 6. Cactus plot comparing probSAT, Sparrow2Riss, and GapSAT on the instances of the random track of the SAT Competition 2018. On the left, the plot is linearly scaled; on the right, it is logarithmically scaled.

As can be observed in Table 1, GapSAT solved substantially more instances than probSAT and Sparrow2Riss. The score of GapSAT is nearly half the score of Sparrow2Riss. Figure 6 demonstrates that GapSAT is especially efficient within the first few seconds; that GapSAT solved more instances within 30 s than Sparrow2Riss solved within the standard timeout of 5000 s is a case in point. This behavior can be observed in the logarithmically scaled part of Fig. 6. Furthermore, there are no instances that could be solved by Sparrow2Riss but not by GapSAT within 5000 s.

We used statistical testing to evaluate the performance of GapSAT compared to Sparrow2Riss and probSAT. The t-test [44] shows that the score of GapSAT is better than those of both other solvers. We also used the Wilcoxon signed-rank test [46] to show that the median runtime of GapSAT is superior to those of Sparrow2Riss and probSAT. All results are statistically significant, with p-values less than \(10^{-9}\). Cohen’s d value [20] is 0.39 for the comparison with Sparrow2Riss and 0.73 for the comparison with probSAT.

The instances of the random track of the SAT Competition 2018 can be split into three domains. Some instances are generated uniformly at random with a medium number of variables and a clause-to-variable ratio close to the satisfiability threshold [38]. Similarly, there are uniform random instances with a huge number of variables but with a clause-to-variable ratio not too close to the phase transition. Finally, there are randomly generated instances with a hidden solution [12]. Table 2 shows the performance of all three solvers on each domain. GapSAT was the fastest solver on all three domains. It should be stated that the performances of GapSAT and probSAT are interchangeable on huge, uniform instances, since the differences on this domain are just due to random noise. Lastly, the average time needed to learn the new clauses was 103.17 s; the actual time to perform the clause learning process is much shorter on most instances, however, with a median of just 7.89 s.

Table 2. GapSAT, Sparrow2Riss, and probSAT are compared on three domains based on the score.

We conclude that future generations of local search solvers should incorporate some kind of clause learning mechanism, for example, as a preprocessing step as used by GapSAT.

6 Conclusion and Outlook

In this work, a novel combination of CDCL as a preprocessing step and local search as the main solver is introduced. We empirically show on several domains that short clauses learned by CDCL have a high number of correct literals w. r. t. a fixed solution. Consequently, these new clauses guide local search solvers towards a solution. Using this knowledge, we designed the new SAT solver GapSAT, which uses the CDCL solver Glucose in a preprocessing step to find new clauses and then runs probSAT on the modified formula to find a solution. We showed that GapSAT improves the state of the art on randomly generated instances.

The GapSAT solver can be improved even further. Besides the techniques described in this paper, no preprocessing steps are performed; we believe that further, finely tuned preprocessing may help to increase the performance of GapSAT on instances where it struggled to find a solution. When tuning GapSAT, we used the original settings of probSAT (i. e., the parameters from [5]); the only tuned parameters are the number of new clauses and their length. An interesting direction for further research is to obtain even better performance by tuning these parameters together with the probSAT settings. Furthermore, we argued that the clauses which are added to the formula have a substantial effect on the performance of SLS algorithms. Even though clauses learned by Glucose have good properties on average, it would be beneficial to devise a clause selection heuristic for local search algorithms: if clauses having a negative impact on local search can be avoided, then the overall performance of solvers like GapSAT should improve significantly. Another general question that could be investigated concerns the clauses produced by the MiniSat preprocessing used in Glucose. Clauses learned by Glucose are generated by conflict analysis but may depend on clauses generated by the MiniSat preprocessing. It may be the case that short clauses are missed because they are only considered inside the solver, but never by the learning mechanism (when generated in the preprocessing step).

Supplementary Material. The source code of GapSAT and all evaluations are available under https://doi.org/10.5281/zenodo.3776052.