1 Introduction

Algorithms to solve propositional Boolean Satisfiability (SAT) problems are used extensively in various applications, such as artificial intelligence, circuit design, automatic test pattern generation, automatic theorem proving, and bounded model checking. Since SAT is NP-complete, the scalability of these algorithms is an inherent concern. Simplifying SAT problems prior to solving them has proven its effectiveness in modern conflict-driven clause learning (CDCL) SAT solvers [6, 9], particularly when applied to real-world applications relevant to software and hardware verification [8, 12, 17, 19]. It tends to produce reasonable reductions in acceptable processing time. Many techniques based on, e.g., variable elimination, clause elimination, and equivalence reasoning are used to simplify SAT problems, whether prior to the solving phase (preprocessing) [8, 10, 12, 15, 16, 24] or during the search (inprocessing) [3, 18]. However, applying variable and clause elimination iteratively to large problems (in terms of the number of literals) can be a performance bottleneck in the whole SAT solving procedure, or can even increase the number of literals, negatively impacting the solving time.

Recently, the authors of [2, 11] discussed the current main challenges in parallel SAT solving. One of these challenges concerns the parallelisation of SAT simplification in modern SAT solvers. Massively parallel computing systems such as Graphics Processing Units (GPUs) offer great potential to speed up computations, but to achieve this, it is crucial to engineer new parallel algorithms and data structures from scratch to make optimal use of those architectures. GPU platforms have become attractive for general-purpose computing with the availability of the Compute Unified Device Architecture (CUDA) programming model [20]. CUDA is widely used to accelerate applications that are computationally intensive w.r.t. data processing and memory access. In recent years, for instance, we have applied GPUs to accelerate explicit-state model checking [26, 27, 30], state space decomposition [28, 29] and minimisation [25], metaheuristic SAT solving [31], and SAT-based test generation [22].

In this paper, we introduce the first parallel algorithms for various techniques widely used in SAT simplification and discuss the various performance aspects of the proposed implementations and data structures. Also, we discuss the main challenges in CPU-GPU memory management and how to address them. In a nutshell, we aim to effectively simplify SAT formulas, even if they are extremely large, in only a few seconds using the massive computing capabilities of GPUs.

Contributions. We propose novel parallel algorithms to simplify SAT formulas using GPUs, and experimentally evaluate them, i.e., we measure both their runtime efficiency and their effect on the overall solving time, for a large benchmark set of SAT instances encoding real-world problems. We show how multiple variables can be eliminated simultaneously on a GPU while preserving the original satisfiability of a given formula. We call this technique Bounded Variable-Independent Parallel Elimination (BVIPE). The variables to eliminate are first elected based on specific criteria by our proposed Least-Constrained Variable Elections (LCVE) algorithm. The variable elimination procedure includes both the so-called resolution rule and gate-equivalence reasoning. Furthermore, we propose an algorithm for parallel subsumption elimination (PSE), covering both subsumption elimination and self-subsuming resolution.

The paper is organised as follows: Sect. 2 introduces the preliminaries. The main GPU challenges for SAT simplification are discussed in Sect. 3, and the proposed algorithms are explained in Sect. 4. Section 5 presents our experimental evaluation. Section 6 discusses related work, and Sect. 7 provides a conclusion and suggests future work.

2 Preliminaries

All SAT formulas in this paper are in conjunctive normal form (CNF). A CNF formula is a conjunction of clauses \(\bigwedge _{i=1}^{m} C_i\) where each clause \(C_i\) is a disjunction of literals \(\bigvee _{j=1}^{k} \ell _j\) and a literal is a Boolean variable x or its complement \(\lnot x\), which we refer to as \(\bar{x}\). We represent clauses by sets of literals \(C = \{\ell _1,\dots ,\ell _k\}\), i.e., \(\{\ell _1,\dots ,\ell _k\}\) represents the formula \(\ell _1 \vee \ldots \vee \ell _k\), and a SAT formula by a set of clauses \(\{ C_1, \ldots , C_m \}\), i.e., \(\{ C_1, \ldots , C_m \}\) represents the formula \(C_1 \wedge \ldots \wedge C_m\).

Variable Elimination. Variables can be removed from clauses by either applying the resolution rule or gate-equivalence reasoning. Concerning the former, we represent application of the resolution rule w.r.t. some variable x using a resolving operator \(\otimes _x\) on \(C_1\) and \(C_2\). The result of applying the rule is called the resolvent [24]. It is defined as \(C_1\otimes _x C_2=(C_1 \cup C_2) \setminus \{x,\bar{x}\}\), and can be applied iff \(x \in C_1\) and \(\bar{x} \in C_2\). The \(\otimes _x\) operator can be extended to resolve sets of clauses w.r.t. variable x. For a formula S, let \(S_x \subseteq S\) and \(S_{\bar{x}} \subseteq S\) be the sets of all clauses in S containing x and \(\bar{x}\), respectively. The new resolvents are defined as \(R_x(S) =\{C_1\otimes _x C_2\mid C_1\in S_x \wedge C_2\in S_{\bar{x}} \wedge \lnot \exists y. \{ y, \bar{y} \} \subseteq C_1\otimes _x C_2 \}\). The last condition ensures that a resolvent is not a tautology, i.e., a clause containing both a variable and its negation. Replacing \(S_x \cup S_{\bar{x}}\) in S by the non-tautological resolvents \(R_x(S)\) preserves the satisfiability of S.
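For example, for \(S=\{\{x,a\},\{x,b\},\{\bar{x},c\}\}\) we have \(S_x=\{\{x,a\},\{x,b\}\}\) and \(S_{\bar{x}}=\{\{\bar{x},c\}\}\), so \(R_x(S)=\{\{a,c\},\{b,c\}\}\); replacing \(S_x \cup S_{\bar{x}}\) by these two resolvents eliminates x while preserving satisfiability.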

Gate-Equivalence Reasoning. This technique substitutes eliminated variables with logically equivalent expressions deduced from the formula. In this work, we focus on AND and OR gates, since they are common in SAT-encoded problems and OR gates are likely to be found if AND-equivalence reasoning fails. In general, gate-equivalence reasoning can also be applied using other logical gates. A logical AND gate with k inputs \(\ell _1, \ldots , \ell _k\) and output x can be captured by the two implications \(x \implies \ell _1 \wedge \ldots \wedge \ell _k\) and \(\ell _1 \wedge \ldots \wedge \ell _k \implies x\). In turn, these two implications can be encoded in the SAT clauses \(\{\{\bar{x},\ell _1\}, \ldots , \{\bar{x},\ell _k\}\}\) and \(\{\{x,\bar{\ell _1}, \ldots , \bar{\ell _k}\}\}\), respectively. Similarly, the implications of an OR gate \(x \implies \ell _1 \vee \ldots \vee \ell _k\) and \(\ell _1 \vee \ldots \vee \ell _k \implies x\) are expressed by the SAT clauses \(\{\{x,\bar{\ell _1}\}, \ldots , \{x,\bar{\ell _k}\}\}\) and \(\{\{\bar{x},\ell _1, \ldots , \ell _k\}\}\), respectively.

For instance, consider the following formula:

$$\begin{aligned} S=\{\{x,\bar{a},\bar{b}\},\{\bar{x},a\},\{\bar{x},b\},\{x,c\},\{\bar{x},\bar{b}\},\{y,f\},\{\bar{y},d,e\},\{y,\bar{d}\},\{y,\bar{e}\}\} \end{aligned}$$

The first three clauses in S capture the AND gate \(x = a \wedge b\) and the last three clauses capture the OR gate \(y = d \vee e\). By substituting \(a \wedge b\) for x and \(d \vee e\) for y in the fourth, fifth, and sixth clauses, a new formula \(\{\{a,c\},\{b,c\},\{\bar{a},\bar{b}\},\{d,e,f\}\}\) can be constructed. Combining AND/OR-gate equivalence reasoning with the resolution rule tends to result in smaller formulas compared to only applying the resolution rule [8, 23].

Subsumption Elimination. A clause \(C_2\) is said to subsume clause \(C_1\) iff \(C_2\subseteq C_1\). The subsumed clause \(C_1\) is redundant and can be deleted from the original SAT formula. A special form of subsumption is called self-subsuming resolution. It applies to two clauses \(C_1, C_2\) iff for some variable x, we have \(C_1 = C_1' \cup \{ x \}\), \(C_2 = C_2' \cup \{ \bar{x} \}\), and \(C_2' \subseteq C_1'\). Consider the clauses \(C_1=\{x,a,b\}\) and \(C_2=\{\bar{x},b\}\): \(C_2\) self-subsumes \(C_1\) since \(x \in C_1\), \(\bar{x} \in C_2\) and \(\{ b \} \subseteq \{ a, b \}\). The self-subsuming literal x can be discarded, producing the clause \(\tilde{C_1}=\{a,b\}\). In other words, we say that \(C_1\) is strengthened by \(C_2\).

3 GPU Challenges: Memory and Data

GPU Architecture. CUDA is a programming model developed by NVIDIA [20] to provide a general-purpose programming paradigm and allow using the massive capabilities of GPU resources to accelerate applications. Regarding the processing hardware, a GPU consists of multiple streaming multiprocessors (SMs) and each SM resembles an array of streaming processors (SPs) where every SP can execute multiple threads grouped together in 32-thread scheduling units called warps. On the programming level, a program can launch a kernel (GPU global function) to be executed by thousands of threads packed in thread blocks of up to 1,024 threads or 32 warps. All threads together form a grid. The GPU manages the execution of a launched kernel by evenly distributing the launched blocks to the available SMs through a hardware warp scheduler.

Concerning the memory hierarchy, a GPU has multiple types of memory:

  • Global memory is accessible by all threads with high bandwidth but also high latency. The CPU (host) can access it as an interface to the GPU.

  • Shared memory is on-chip memory shared by the threads in a block; it is smaller in size and has lower latency than global memory. It can be used to efficiently communicate data between threads in a block.

  • Registers provide thread-local storage and are the fastest type of memory.

To make optimal use of global memory bandwidth and hide its latency, using coalesced accesses is one of the best practices in global memory optimisation. When the threads in a warp try to access a consecutive block of 32-bit words, their accesses are combined into a single (coalesced) memory access. Uncoalesced memory accesses can for instance be caused by data sparsity or misalignment.
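As an illustration (ours, not taken from the implementation), the following two kernels contrast a coalesced and an uncoalesced access pattern:

```cuda
// Illustrative sketch only. In coalescedCopy, the 32 threads of a warp read
// 32 consecutive words, which the hardware combines into one transaction.
__global__ void coalescedCopy(const int *in, int *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// In stridedCopy, neighbouring threads read words 'stride' apart, so a
// warp's accesses are split into many separate memory transactions.
__global__ void stridedCopy(const int *in, int *out, int n, int stride) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n / stride) out[i] = in[i * stride];
}
```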

Regarding atomicity, a GPU is capable of executing atomic operations on both global and shared memory. A GPU atomic function typically performs a read-modify-write memory operation on one 32-bit or 64-bit word.

Memory Management. When small data packets need to be accessed frequently, both on the host (CPU) and device (GPU) side (which is the case in the current work), unified memory can play a crucial role in boosting the transfer rates by avoiding excessive memory copies. Unified memory creates a pool of managed memory that is shared between the CPU and GPU. This pool is accessible to both sides using a single pointer. Another advantage of unified memory is that it allows the CPU to allocate multidimensional pointers referencing global memory locations or nested structures. However, if a memory pool is required to be reallocated (resized), one must maintain memory coherency between the CPU-side and GPU-side memories. A reallocation procedure is necessary for our variable elimination algorithm, to make memory available when producing resolvents and reduce the memory use when removing clauses.

To better explain the coherency problem in reallocation, suppose an array A is allocated and loaded with some data X; X is then visible from both the CPU and GPU memories. When A is reallocated from the host side, the memory is not physically allocated until it is first accessed, particularly when using an NVIDIA GPU with the Pascal architecture [20]. Once new data Y is written to A from the device side, both sides will observe a combination of X and Y, leading to memory corruption and page faults. To avoid this problem, A must be reset on the host side directly after memory reallocation, to force the physical allocation. After that, each kernel may store its own data safely in the global memory. In the proposed algorithms, we introduce two types of optimisations, addressing memory space and latency, respectively.
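A minimal sketch of this reallocation pattern (the function name and the flat uint32_t buffer are our assumptions, not the actual implementation):

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstring>

// Grow a unified-memory buffer from oldCap to newCap words. The host-side
// memset directly after allocation forces the physical allocation of the
// managed pages, avoiding the X/Y coherency corruption described above.
uint32_t *reallocUnified(uint32_t *oldBuf, size_t oldCap, size_t newCap) {
  uint32_t *newBuf = nullptr;
  cudaMallocManaged(&newBuf, newCap * sizeof(uint32_t));
  memset(newBuf, 0, newCap * sizeof(uint32_t)); // reset on host: assert physical allocation
  cudaDeviceSynchronize();                      // ensure no kernel still uses oldBuf
  memcpy(newBuf, oldBuf, oldCap * sizeof(uint32_t)); // preserve the existing data
  cudaFree(oldBuf);
  return newBuf;                                // kernels may now write new data safely
}
```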

Regarding memory space optimisation, allocating memory dynamically each time a clause is added is not practical on a GPU while variables are eliminated in parallel. To resolve this, we initially launch a GPU kernel to calculate an upper bound for the number of resolvents to be added before the elimination procedure starts (Sect. 4.1). After this, reallocation is applied to store the new resolvents. Furthermore, a global counter is implemented inside our CNF data structure to keep track of new clauses. This counter is incremented atomically by each thread when adding a clause.

Concerning memory latency optimisation, when thread blocks produce resolvents, these can initially be stored in shared memory. Checking for tautologies can then be done by accessing shared memory, and non-tautologies can be written back to the global memory in a new CNF formula. Also, the definitions of AND-OR gates can be stored in shared memory, to be used later when applying clause substitution (see Sect. 4.1). This has the advantage of reducing the number of global memory accesses. Nevertheless, the size of shared memory in a GPU is very limited (48 KB in most architectures). If the potential size of a resolvent is larger than the amount pre-allocated for a single clause, our BVIPE algorithm automatically switches to the global memory and the resolvent is directly added to the new CNF formula. This mechanism reduces the global memory latency when applicable and deals with the shared memory size limitation dynamically.

Data Structures. The efficiency of state-of-the-art sequential SAT solving and preprocessing is to a large extent due to the meticulously coded data structures. When considering SAT simplification on GPUs, new data structures have to be tailored from scratch. In this work, we need two of them, one for the SAT formula in CNF form (which we refer to as CNF) and another for the literal occurrence table (occurTAB), via which one can efficiently iterate over all clauses containing a particular literal. In CPU implementations, typically, they are created using heaps and auto-resizable vectors, respectively. However, heaps and vectors are not suitable for GPU parallelisation, since data is inserted, reallocated and sorted dynamically. The best GPU alternative is to create a nested data structure with arrays using unified memory (see Fig. 1). The CNF contains a raw pointer (linear array) to store CNF literals and a child structure Clause to store clause info. Each clause has a head pointer referring to its first literal. The occurTAB structure has a raw pointer to store the clause occurrences (array pointers) for each literal in the formula and a child structure occurList. The creation of an occurList instance is done in parallel per literal using atomic operations. For each clause C, a thread is launched to insert the occurrences of C's literals in the associated occurLists. One important remark is that two threads storing the occurrences of different literals do not have to wait for each other. For instance, occurTAB in Fig. 1 shows two different atomic insertions executed at the same time for literals 2 and -1 (if an integer i represents a literal x, then \(-i\) represents \(\bar{x}\)). This minimises the performance penalty of using atomics.

The advantages of the proposed data structures are: as mentioned above, occurTAB instances can be constructed in parallel. Furthermore, coalesced access is guaranteed since pointers are stored consecutively (the gray arrows in Fig. 1), and no explicit memory copying is done (host and device pointers are identical) making it easier to integrate the data structures with any sequential or parallel code.
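The following sketch (ours; simplified, with hypothetical field names) illustrates the flat layout of Fig. 1. The occurs arrays are assumed to be pre-allocated with the capacities given by the literal histogram:

```cuda
#include <cstdint>

typedef int lit_t;                 // literal encoding: i stands for x, -i for x-bar

__host__ __device__ inline uint32_t lit2idx(lit_t l) {  // map literal to array index
  return (l > 0) ? 2u * (uint32_t)l : 2u * (uint32_t)(-l) + 1u;
}

struct Clause {                    // view into the shared literal pool
  lit_t   *head;                   // first literal of this clause
  uint32_t size;                   // number of literals
  bool     deleted;                // lazily marked; physically removed later
};

struct CNF {                       // all fields live in unified memory
  lit_t   *literals;               // raw linear literal pool
  Clause  *clauses;                // clause descriptors
  uint32_t nClauses;               // bumped atomically when resolvents are added
};

struct OccurList { uint32_t *occurs; uint32_t size; };
struct OccurTab  { OccurList *lists; };              // one list per literal

// One thread per clause inserts its occurrences. Threads writing the lists
// of different literals never contend, keeping the atomics cheap.
__global__ void buildOccurTab(CNF cnf, OccurTab ot) {
  uint32_t c = blockIdx.x * blockDim.x + threadIdx.x;
  if (c >= cnf.nClauses) return;
  const Clause cl = cnf.clauses[c];
  for (uint32_t j = 0; j < cl.size; j++) {
    OccurList &l = ot.lists[lit2idx(cl.head[j])];
    uint32_t pos = atomicAdd(&l.size, 1u);           // reserve a slot
    l.occurs[pos] = c;
  }
}
```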

4 Algorithm Design and Implementation

4.1 Parallel Variable Elimination

In order to eliminate Boolean variables simultaneously in SAT formulas without altering the original satisfiability, a set of variables must be selected that contains only variables that are independent of each other, i.e., that do not occur together in any clause. The LCVE algorithm we propose is responsible for electing such a subset from a set of authorised candidates. The remaining candidates, which depend on the elected ones, are frozen.

Fig. 1. An example of the CNF and occurTAB data structures.

(Algorithm 1; listing not reproduced here.)

Definition 1

(Authorised candidates). Given a CNF formula S, we call \(\mathcal {A}\) the set of authorised candidates: \(\mathcal {A}= \{x\mid 1\le h[x]\le \mu \vee 1\le h[\bar{x}]\le \mu \}\), where

  • h is a histogram array (h[x] is the number of occurrences of x in S)

  • \(\mu \) denotes a given maximum number of occurrences allowed for both x and its complement, representing the cut-off point for the LCVE algorithm

Definition 2

(Candidate Dependency Relation). We call a relation \(\mathcal {D}\subseteq \mathcal {A}\times \mathcal {A}\) a candidate dependency relation iff \(\,\forall x,y\in \mathcal {A}\), \(x\, \mathcal {D}\,y\) implies that \(\exists C\in S.\, (x \in C \vee \bar{x} \in C) \wedge (y \in C \vee \bar{y} \in C)\).

Definition 3

(Elected candidates). Given a set of authorised candidates \(\mathcal {A}\), we call a set \(\varphi \subseteq \mathcal {A}\) a set of elected candidates iff \(\,\forall x,y\in \varphi .\,\lnot (x\,\mathcal {D}\,y)\)

Definition 4

(Frozen candidates). Given the sets \(\mathcal {A}\) and \(\varphi \), the set of frozen candidates \(\mathcal {F} \subseteq \mathcal {A}\) is defined as \(\mathcal {F}=\{x\mid x\in \mathcal {A}\wedge \exists y\in \varphi .\,x\,\mathcal {D}\,y\}\)

Before LCVE is executed, a sorted list of the variables in the CNF formula needs to be created, ordered by the number of occurrences in the formula, in ascending order (following the same rule as in [8]). From this list, the authorised candidates \(\mathcal {A}\) can be straightforwardly derived, using \(\mu \) as a cut-off point. Construction of this list can be done efficiently on a GPU using Algorithms 1 and 2. Algorithm 1 is labelled GPU, indicating that each individual step can be launched as a GPU computation. In contrast, algorithms providing kernel code, i.e., that describe the steps each individual GPU thread should perform as part of a GPU computation, such as Algorithm 2, are labelled GPU kernel.

(Algorithm 2; listing not reproduced here.)

As input, Algorithm 1 requires a SAT formula S as an instance of CNF. Furthermore, it requires the number of variables n occurring in S and a cut-off point \(\mu \). At line 1, a histogram array h, providing for each literal the number of occurrences in S, is constructed. This histogram can be constructed on the GPU using the histogram method offered by the Thrust library [4, 5]. At line 2, memory is allocated for the arrays \(\mathcal {A}\) and \(\textit{scores} \). Together with h, these are input for the kernel assignScores (Algorithm 2). Once the kernel execution has terminated, at line 4, the candidates in \(\mathcal {A}\) are sorted on the GPU based on their scores in \(\textit{scores} \), and \(\mu \) is used to prune candidates with too many occurrences. We used the radix-sort algorithm as provided in the Thrust library [5].

In the kernel of Algorithm 2, at line 1, the \(\textit{tid} \) variable refers to the thread identifier in the launched grid, where \(\textit{xThread} \) is the thread index inside a block of size \(\textit{blockDim} \) and \(\textit{xBlock} \) is the block index inside the entire grid of size \(\textit{gridDim} \). The stride variable determines the distance between variables that have to be processed by the same thread. In the subsequent while loop (lines 2–9), the thread index is used as a variable index (variable indices start at 1), jumping ahead stride variables at the end of each iteration. At lines 5–8, a score is computed for the currently considered variable x. This score should be indicative of the number of resolvents produced when eliminating x, which depends on the number of occurrences of both x and \(\bar{x}\), and can be approximated by the formula \(h[x]\times h[\bar{x}]\). To avoid a score of zero in case exactly one of the two literals does not occur in S, we consider that case separately. On average, Algorithm 1 outperforms its sequential counterpart by a factor of 52, when considering our benchmark set of problems [21].
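A sketch of such a kernel, as we read the description above (a grid-stride loop; the histogram layout with h[2x] and h[2x+1] for x and \(\bar{x}\) is our assumption):

```cuda
// Sketch of the assignScores kernel (Algorithm 2). Each thread scores
// several variables, 'stride' apart. When exactly one polarity is absent,
// hp | hn yields the occurrence count of the other, avoiding a zero score.
__global__ void assignScores(const uint32_t *h, uint32_t *A,
                             uint32_t *scores, uint32_t n) {
  uint32_t tid    = blockIdx.x * blockDim.x + threadIdx.x;
  uint32_t stride = blockDim.x * gridDim.x;
  for (uint32_t x = tid + 1; x <= n; x += stride) {  // variable indices start at 1
    uint32_t hp = h[2 * x], hn = h[2 * x + 1];       // occurrences of x and x-bar
    A[x - 1]      = x;
    scores[x - 1] = (hp == 0 || hn == 0) ? (hp | hn) : hp * hn;
  }
}
```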

LCVE Algorithm. Next, Algorithm 3 is executed on the host, given S, \(\mathcal {A}\), h and an instance of occurTAB named OT. This algorithm accesses \(2 \cdot |\mathcal {A}|\) occurList instances and parts of S. The use of unified memory significantly improves the rates of the resulting transfers and avoids explicitly copying entire data structures to the host side. The output is \(\varphi \), implemented as a list. The function abs is defined as follows: \(\mathbf{abs }(x) = x\) and \(\mathbf{abs }(\bar{x}) = x\). The algorithm considers all variables x in \(\mathcal {A}\) (line 2). If x has not yet been frozen (line 3), it adds x to \(\varphi \) (line 4). Next, the algorithm needs to identify all variables that depend on x. For this, it iterates over all clauses containing either x or \(\bar{x}\) (line 5), and each literal \(\ell \) in those clauses is compared to x (lines 6–8). If \(\ell \) refers to a different variable v, and v is an authorised candidate, then v must be frozen (line 9).

(Algorithm 3; listing not reproduced here.)
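A host-side sketch of LCVE, reusing the hypothetical data structures from Sect. 3; frozen plays the role of \(\mathcal {F}\):

```cuda
#include <cstdlib>
#include <vector>

// Host-side LCVE sketch: walk the sorted authorised candidates A; elect a
// variable unless it is frozen, then freeze every authorised variable that
// shares a clause with it. Unified memory lets the host read OT and the
// clauses directly, without explicit copies.
void LCVE(const CNF &cnf, const OccurTab &ot, const std::vector<uint32_t> &A,
          const std::vector<bool> &authorised,
          std::vector<uint32_t> &phi, std::vector<bool> &frozen) {
  for (uint32_t x : A) {
    if (frozen[x]) continue;
    phi.push_back(x);                                // elect x
    for (lit_t p : {(lit_t)x, -(lit_t)x}) {          // clauses with x and x-bar
      const OccurList &l = ot.lists[lit2idx(p)];
      for (uint32_t i = 0; i < l.size; i++) {
        const Clause &c = cnf.clauses[l.occurs[i]];
        for (uint32_t j = 0; j < c.size; j++) {
          uint32_t v = (uint32_t)abs(c.head[j]);     // abs maps both polarities to the variable
          if (v != x && authorised[v]) frozen[v] = true;
        }
      }
    }
  }
}
```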

BVIPE GPU Algorithm. After \(\varphi \) has been constructed, a kernel is launched to compute an upper bound for the number of resolvents (excluding tautologies) that may be produced by eliminating variables in \(\varphi \). This kernel accumulates the number of resolvents of each variable using parallel reduction in shared memory within thread blocks. The resulting values (resident in shared memory) of all blocks are added up by atomic operations, resulting in the final output, stored in global memory (denoted by \(|\tilde{S}|\)). Afterwards, the CNF is reallocated according to the extra memory needed. The parallel variable elimination kernel (Algorithm 4) is now ready to be performed on the GPU, considering both the resolution rule and gate-equivalence reasoning (Sect. 2).
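This counting step can be sketched as a standard block reduction (ours, simplified: the tautology check is omitted, and blockDim.x is assumed to be a power of two):

```cuda
// Upper bound on the number of resolvents: each thread contributes
// h[x] * h[x-bar] for one variable of phi; each block reduces the values in
// shared memory, and one atomicAdd per block accumulates the global total.
__global__ void countResolvents(const uint32_t *h, const uint32_t *phi,
                                uint32_t nPhi, uint32_t *total) {
  extern __shared__ uint32_t sh[];
  uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  uint32_t x   = (tid < nPhi) ? phi[tid] : 0;
  sh[threadIdx.x] = (tid < nPhi) ? h[2 * x] * h[2 * x + 1] : 0;
  __syncthreads();
  for (uint32_t s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
    if (threadIdx.x < s) sh[threadIdx.x] += sh[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) atomicAdd(total, sh[0]);       // one atomic per block
}
```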

In Algorithm 4, first, each thread selects a variable in \(\varphi \), based on \(\textit{tid} \) (line 4). The eliminated array marks the variables that have been eliminated. It is used to distinguish eliminated from non-eliminated variables when executing Algorithm 6.

Each thread checks the control condition at line 5 to determine whether the number of resolvents (\(h[x]\times h[\bar{x}]\)) of x will be less than the number of deleted clauses (\(h[x] + h[\bar{x}]\)). If the condition evaluates to true, a list resolvents is created in shared memory, which is then added to the simplified formula \(\tilde{S}\) in global memory after discarding tautologies (line 8). The markDeleted routine marks resolved clauses as deleted. They are actually deleted on the host side, once the algorithm has terminated.

At line 10, definitions of AND and OR gates are deduced by the gateReasoning routine, and stored in shared memory in the lists \(\alpha \) and \(\beta \), respectively. If at least one gate definition is found, the clauseSubstitution routine substitutes the involved variable with the underlying definition (line 11), creating the resolvents.

In some situations, even if h[x] or \(h[\bar{x}]\) is greater than 1, the number of resolvents can be smaller than the number of deleted clauses, due to the fact that some resolvents may be tautologies that are subsequently discarded. For this reason, we provide a third alternative that looks ahead for tautologies, in order to decide conclusively whether to resolve a variable if the conditions at lines 5 and 10 both evaluate to false. This third option (line 15) has a lower priority than gate-equivalence reasoning (line 10), since the latter in practice tends to achieve more reduction than the former.

(Algorithm 4; listing not reproduced here.)

The sequential running time of Algorithm 4 is \(\mathcal {O}(k \cdot |\varphi |)\), where k is the maximum length of a resolved clause in S. In practice, k often ranges between 2 and 10. Therefore, the worst case is linear w.r.t. \(|\varphi |\). Consequently, the parallel complexity is \(\mathcal {O}(|\varphi |/p)\), where p is the number of threads. Since a GPU is capable of launching thousands of threads, that is, \(p\approx |\varphi |\), the parallel complexity is an amortised constant \(\mathcal {O}(1)\). Our empirical results show an average speedup of 32\(\times \) compared to the sequential alternative [21].

4.2 Parallel Subsumption Elimination

Algorithm 5 presents our PSE algorithm. Notice that the subsumption check (lines 6–7) has a higher priority than self-subsumption (lines 8–12), because the former often results in deleting more clauses. We apply PSE to the most constrained variables, that is, the variables that occur most frequently in S, to maximise parallelism on the GPU. For each literal \(\ell \), the PSE algorithm is launched on a two-dimensional GPU grid, in which threads and blocks have two identifiers. Each thread compares the two clauses in \(\textit{OT} [\ell ]\) that are associated with the thread coordinates in the grid. At line 4, those two clauses are obtained. At line 5, it is checked whether subsumption or self-subsumption may be applicable. For this to be the case, the length of one clause needs to be larger than that of the other. The sig routine compares the identifiers of two clauses. The identifier of a clause is computed by hashing its literals to a 64-bit value [8]. It has the property of refuting many non-subsuming clauses, but not all of them, since hashing collisions may occur. At line 6, the isSubset routine runs a set intersection algorithm in linear time \(\mathcal {O}(k)\), assuming that both clauses are sorted. If one clause is indeed a subset of the other, the latter clause is marked for deletion (line 7).
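The signature test and subset check could look as follows (our sketch, reusing lit2idx and Clause from above; [8] inspires the hashing idea, but the exact hash is our choice, and clauses are assumed sorted on the integer literal encoding):

```cuda
// Clause signature: fold every literal into a 64-bit mask. If C2 is a subset
// of C1, all bits of sig(C2) are also set in sig(C1); the converse can fail
// due to collisions, so the signature can only refute subsumption.
__host__ __device__ inline uint64_t sig(const Clause &c) {
  uint64_t s = 0;
  for (uint32_t j = 0; j < c.size; j++)
    s |= 1ull << (lit2idx(c.head[j]) & 63u);
  return s;
}

// Two-pointer subset test for sorted clauses: is 'small' a subset of 'big'?
__device__ bool isSubset(const Clause &small, const Clause &big) {
  uint32_t i = 0, j = 0;
  while (i < small.size && j < big.size) {
    if      (small.head[i] == big.head[j]) { i++; j++; }
    else if (small.head[i] <  big.head[j]) return false;  // literal missing from 'big'
    else j++;
  }
  return i == small.size;
}

// Possible use in the subsumption check of Algorithm 5:
//   if ((sig(C2) & ~sig(C1)) == 0 && isSubset(C2, C1)) { /* C2 subsumes C1 */ }
```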

(Algorithm 5; listing not reproduced here.)

As an alternative, the applicability of self-subsumption is checked at line 8. If C is not a unit clause, i.e., \(|C| > 1\), then self-subsumption can be considered. We exclude unit clauses because they cannot be reduced in size; instead, they should be propagated over the entire formula before being removed (doing so is planned for future work). If applicable, the routine isSelfSub returns the position in C of the literal to be removed; otherwise -1 is returned. This literal is marked for removal in C, using an atomic operation, to avoid race conditions on writing and reading the literal. Finally, the clause is marked as being altered by self-subsumption (line 12), as after execution of the algorithm, the marked literals need to be removed, and the clause resized accordingly.

Finally, following the concept of thread reuse, each thread jumps ahead \(\textit{stride} \) positions in both dimensions at the end of each while-loop iteration (line 13). Sequential subsumption elimination has time complexity \(\mathcal {O}(k\cdot |\textit{OT} [\ell ]|^2)\). By launching enough threads in both dimensions (i.e., \(|\textit{OT} [\ell ]|^2\approx p\)), the parallel complexity becomes \(\mathcal {O}(k)\). In practice, the average speedup of PSE compared to sequential SE is 9\(\times \) [21].

Correctly Strengthening Clauses in PSE. Strengthening self-subsumed clauses cannot be done directly at line 9 of Algorithm 5. Instead, clauses need to be altered in-place. Consider the case that a clause C can be self-subsumed by two different clauses \(C_1\), \(C_2\) on two different self-subsuming literals, and that this is detected by two different threads \(t_1\) and \(t_2\) at the exact same time. We call this phenomenon the parallel effect of self-subsumption. Notice that the same result can be obtained by applying self-subsumption sequentially in two successive rounds: in the first round, C is checked against \(C_1\), while in the second round, what is left of C is checked against \(C_2\). If C would be removed by \(t_1\) and replaced by a new, strengthened clause, then \(t_2\) may suddenly be accessing unallocated memory. Instead, C is atomically strengthened in-place at line 11; a thread t marks the self-subsumed literal using an atomic compare-and-swap operation. At the end, all clauses marked as self-subsumed have to be shrunk in size, in essence overwriting the reset positions. This is performed in a separate kernel call.
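In-place marking with a compare-and-swap might look like this (our sketch; the zero sentinel is an assumption, valid because 0 encodes no literal):

```cuda
constexpr lit_t REMOVED_LIT = 0;  // sentinel: 0 is not a valid literal encoding

// Atomically mark the literal at position 'pos' of clause c as removed.
// The CAS both writes the sentinel and reports whether this thread won the
// race, so a literal removed concurrently is never counted twice.
__device__ bool markSelfSubsumed(Clause &c, uint32_t pos) {
  lit_t expected = c.head[pos];
  if (expected == REMOVED_LIT) return false;            // already strengthened
  return atomicCAS((int *)&c.head[pos], (int)expected,
                   (int)REMOVED_LIT) == (int)expected;  // true iff we removed it
}
```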

(Algorithm 6; listing not reproduced here.)

Parallel Hybrid Subsumption Elimination (PHSE). PHSE is executed on elected variables that could not be eliminated earlier by variable elimination. (Self-)subsumption elimination tends to reduce the number of occurrences of these non-eliminated variables, as it usually eliminates many literals. After performing PHSE (Algorithm 6), the BVIPE algorithm can be executed again, after which PHSE can be executed, and so on, until no more literals can be removed.

Unlike PSE, the parallelism in the PHSE kernel is achieved on the variable level. In other words, each thread is assigned to a variable when executing PHSE. At line 4, previously eliminated variables are skipped. At line 6, a clause referenced by \(\textit{OT} [x]\) is loaded into shared memory (we call it the shared clause, \({\textit{sh}\_C}\)).

The shared clause is then compared in the loop at lines 7–12 to all clauses referenced by \(\textit{OT} [\bar{x}]\) to check whether x is a self-subsuming literal. If so, both the original clause C, which resides in the global memory, and \(\textit{sh}\_C\) must be strengthened (via the strengthenWT function). Subsequently, the strengthened \(\textit{sh}\_C\) is used for subsumption checking in the loop at lines 13–15.

Regarding the complexity of Algorithm 6, the worst-case is that a variable x occurs in all clauses of S. However, in practice, the number of occurrences of x is bounded by the threshold value \(\mu \) (see Definition 1). The same applies for its complement. Therefore, worst case, a variable and its complement both occur \(\mu \) times. As PHSE considers all variables in \(\varphi \) and worst case has to traverse each loop \(\mu \) times, its sequential worst-case complexity is \(\mathcal {O}(|\varphi |\cdot \mu ^2)\) and its parallel worst-case complexity is \(\mathcal {O}(\mu ^2)\).

5 Benchmarks

We implemented the proposed algorithms in CUDA C++ using CUDA toolkit v9.2. We conducted experiments using an NVIDIA Titan Xp GPU, which has 30 SMs (128 cores each), 12 GB global memory and 48 KB shared memory. The machine equipped with the GPU was running Linux Mint v18. All GPU SAT simplifications were executed in four phases iteratively. Every two phases, the resulting CNF was reallocated to discard removed clauses. In general, the number of iterations can be configured, and the reallocation frequency can also be set to a desired value.

We selected 214 SAT problems from the industrial track of the 2013, 2016 and 2017 SAT competitions [13]. This set consists of almost all problems from those tracks that are more than 1 MB in file size; the largest problem in the set is 1.2 GB. These problems have been encoded from 31 different real-world applications with a whole range of dissimilar logical properties. Before applying any simplifications using the experimental tools, any occurring unit clauses were propagated, as the presence of unit clauses immediately leads to simplification of the formula. By only considering formulas without unit clauses, the benchmark results directly indicate the true impact of our preprocessing algorithms.

In the benchmark experiments, besides the implementations of our new GPU algorithms, we involved the SatELite preprocessor [8] and the MiniSat and Lingeling [6] SAT solvers, and executed these on the compute nodes of the DAS-5 cluster [1]. Each problem was analysed in isolation on a separate computing node. Each computing node had an Intel Xeon E5-2630 CPU running at a core clock speed of 2.4 GHz with 128 GB of system memory, and ran the CentOS 7.4 operating system. We performed the equivalent of 6 months and 22 days of uninterrupted processing on a single node to measure how GPU SAT simplification impacts SAT solving in comparison to using sequential simplification or no simplification. The timeout for solving experiments and simplification experiments was set to 24 h and 5,000 s, respectively. Timeouts are marked 'T' in the tables in this section. Out-of-memory cases are marked 'M'.

Table 1 gives the MiniSat solving performance summed over the CNF families, for both the original problems and the simplified problems produced by GPU SAT simplification and SatELite. The ve+ and ve modes in GPU simplification represent variable elimination with and without PHSE, respectively. The all mode involves both ve+ and PSE. The numbers between brackets in columns 2 to 7 denote the number of solved instances in each family. Bold results in column 6 indicate that the combination of GPU SAT simplification and MiniSat reduced the net solution time (preprocessing + solving), or allowed more problems to be solved, compared to the ve column of SatELite + MiniSat (column 3). Likewise, the all column for our results (column 7) is compared to column 4. The final four columns summarise the number of instances solved faster by MiniSat when using GPU SAT simplification (GPU+M) compared to not using GPU SAT simplification, and compared to using SatELite for simplification (SatELite+M). Numbers in bold indicate that 50% or more of the CNF formulas were solved faster.

The final row accumulates all problems solved faster. The percentage expresses the improvement induced by GPU SAT simplification over plain MiniSat and over SatELite. Similarly, Table 2 gives the performance of solving the problems with Lingeling and with GPU SAT simplification + Lingeling (GPU+L). Since SatELite is the groundwork of the preprocessing in Lingeling, there is no reason to apply it again.

The presented data shows that GPU+L solved many instances faster than Lingeling alone, especially for the ak128, blockpuzzle, and sokoban problems. In addition, GPU SAT simplification allowed Lingeling to solve more instances of the hwmcc, velev, and sncf models. This effect did not occur for MiniSat.

Table 1. MiniSat solving of original and simplified formulas (time in seconds).
Table 2. Lingeling solving of original and simplified formulas (time in seconds).
Table 3. Comprehensive evaluation of simplification impact on original CNF size.

Table 3 provides an evaluation of the achieved simplifications with GPU SAT simplification and SatELite, where V, C, L are the numbers of variables, clauses, and literals (in thousands), respectively, and t(s) is the runtime in seconds. In the case of GPU SAT simplification, t includes the transfer time between host and device. Bold results indicate significant improvements in literal reduction for more than 50% of the instances.

On average, SatELite managed to eliminate more variables and clauses, at the expense of a literal explosion. Regarding scalability, GPU SAT simplification is able to preprocess extremely large problems (e.g., the esawn problem, sokoban-p20.sas.ex.23, and the high-depth sncf model) in only a few seconds, while SatELite failed. For our benchmark set of SAT problems, the GPU SAT simplification modes ve+ and all achieve on average a speedup of 66\(\times \) and 12\(\times \) compared to SatELite ve and all, respectively [21].

Our appendix [21] presents larger tables, providing all our results related to Tables 1, 2 and 3. In addition, it presents a comparison between our GPU simplification algorithms and versions of the algorithms ported directly to the CPU, to provide more insight into the contribution of the GPU parallelisation. On average, subsumption elimination is performed 9\(\times \) faster on the GPU, hybrid subsumption elimination 15\(\times \) faster, and BVE 32\(\times \) faster.

6 Related Work

Subbarayan and Pradhan [24] provided the first Bounded Variable Elimination (BVE) technique, called NiVER, based on the resolution rule of Davis-Putnam-Logemann-Loveland (DPLL) [7]. Eén and Biere [8] extended NiVER with subsumption elimination and clause substitution. However, the subsumption check is only performed on the clauses resulting from variable elimination, hence no reductions are obtained if there are no variables to resolve.

Heule et al. [14, 16] introduced several approaches for clause elimination that can be effective in SAT simplifications, such as Blocked Clause Elimination. Bao et al. [3], on the other hand, have presented an efficient implementation of Common Sub-clause Elimination (CSE) performed periodically during SAT solving. CSE trims the search space, thereby decreasing the solving time, and is stable, i.e., the outcome of CSE does not depend on the chosen clause ordering. Gebhardt and Manthey [10] presented the first attempt to parallelise SAT preprocessing on a multi-core CPU using a locking scheme to prevent threads corrupting the SAT formula. However, they reported only a very limited speedup of on average 1.88\(\times \) when running on eight cores.

All the above methods apply sound simplifications, but none are tailored for GPU parallelisation. They may consume considerable time when processing large problems.

Finally, it should be noted that BVE, as introduced in [8, 10, 16, 24], is not confluent, as observed by the authors of [16]. Due to dependency between variables, altering the elimination order of these variables may result in different simplified formulas. This drawback is circumvented by our LCVE algorithm, which makes it possible to perform variable elimination in parallel while achieving confluence.

7 Conclusion

We have shown that SAT simplifications can be performed efficiently on many-core systems, producing impactful reductions in a fraction of a second, even for large problems consisting of millions of variables and tens of millions of clauses. The proposed BVIPE algorithm provides the first methodology to eliminate multiple variables in parallel while preserving satisfiability. Finally, PSE and PHSE have proven their effectiveness in removing many clauses and literals in a reasonable amount of time.

Concerning future work, the results of this work motivate us to take the capabilities of GPU SAT simplification further by supporting more simplification techniques or balancing the workload on multiple GPUs.