figure a

1 Introduction

Context. Software verification methods have come to rely increasingly on reasoning over logical formulas modulo theory. In particular, the ability to generate models (i.e., find solutions) of a formula is of utmost importance, typically in the context of bug finding or intensive testing—symbolic execution [21] or bounded model checking [7]. Since quantifier-free first-order formulas on well-suited theories are sufficient to represent many reachability properties of interest, the Satisfiability Modulo Theory (SMT) [6, 25] community has primarily dedicated itself to designing solvers able to efficiently handle such problems.

Yet, universal quantifiers are sometimes needed, typically when considering preconditions or code abstraction. Unfortunately, most theories handled by SMT-solvers are undecidable in the presence of universal quantifiers. There exist dedicated methods for a few decidable quantified theories, such as Presburger arithmetic [9] or the array property fragment [8], but there is no general and effective enough approach for the model generation problem over universally quantified formulas. Indeed, generic solutions for quantified formulas involving heuristic instantiation and refutation are best geared to proving the unsatisfiability of a formula (i.e., absence of solution) [13, 20], while recent proposals such as local theory extensions [2], finite instantiation [31, 32] or model-based instantiation [20, 29] either are too narrow in scope, or handle quantifiers on free sorts only, or restrict themselves to finite models, or may get stuck in infinite refinement loops.

Goal and Challenge. Our goal is to propose a generic and efficient approach to the model generation problem over arbitrary quantified formulas with support for theories commonly found in software verification. Due to the huge effort made by the community to produce state-of-the-art solvers for quantifier-free theories ( QF -solvers), it is highly desirable for this solution to be compatible with current leading decision procedures, namely SMT approaches.

Proposal. Our approach turns a quantified formula into a quantifier-free formula with the guarantee that any model of the latter contains a model of the former. The benefits are threefold: the transformed formula is easier to solve, it can be sent to standard QF-solvers, and a model for the initial formula is deducible from a model of the transformed one. The idea is to ignore quantifiers but strengthen the quantifier-free part of the formula with an independence condition constraining models to be independent from the (initially) quantified variables.

Contributions. This paper makes the following contributions:

  • We propose a novel and generic framework for model generation of quantified formula (Sect. 5, Algorithm 1) relying on the inference of sufficient independence condition (Sect. 4). We prove its correctness (Theorem 1, mechanized in Coq) and its efficiency under reasonable assumptions (Propositions 4 and 5). Especially our approach implies only a linear overhead in the formula size. We also briefly study its completeness, related to the notion of weakest independence condition.

  • We define a taint-based procedure for the inference of independence conditions (Sect. 5.2), composed of a theory-independent core (Algorithm 2) together with theory-dependent refinements. We propose such refinements for a large class of operators (Sect. 6.2), encompassing notably arrays and bitvectors.

  • Finally, we present a concrete implementation of our method specialized on arrays and bitvectors (Sect. 7). Experiments on SMT-LIB benchmarks and software verification problems notably demonstrate that we are able not only to very effectively lift quantifier-free decision procedures to the quantified case, but also to supplement recent advances, such as finite or model-based quantifier instantiation [20, 29, 31, 32]. Indeed, we concretely supply SMT solvers with the ability to efficiently address an extended set of software verification questions.

Discussions. Our approach supplements state-of-the-art model generation on quantified formulas by providing a more generic handling of satisfiable problems. We can deal with quantifiers on any sort and we are not restricted to finite models. Moreover, this is a lightweight preprocessing approach requiring a single call to the underlying quantifier-free solver. The method also extends to partial elimination of universal quantifiers, or reduction to quantified-but-decidable formulas (Sect. 5.4).

While techniques a la E-matching allow to lift quantifier-free solvers to the unsatisfiability checking of quantified formulas, this works provides a mechanism to lift them to the satisfiability checking and model generation of quantified formulas, yielding a more symmetric handling of quantified formulas in SMT. This new approach paves the way to future developments such as the definition of more precise inference mechanisms of independence conditions, the identification of interesting subclasses for which inferring weakest independence conditions is feasible, and the combination with other quantifier instantiation techniques.

2 Motivation

Let us take the code sample in Fig. 1 and suppose we want to reach function analyze_me. For this purpose, we need a model (a.k.a., solution) of the reachability condition \(\phi \triangleq ax+b>0\), where a, b and x are symbolic variables associated to the program variables a, b and x. However, while the values of a and b are user-controlled, the value of x is not. Therefore if we want to reach analyze_me in a reproducible manner, we actually need a model of \(\phi _{\forall } \triangleq \forall {x}.ax+b>0\), which involves universal quantification. While this specific formula is simple, model generation for quantified formulas is notoriously difficult: PSPACE-complete for booleans, undecidable for uninterpreted functions or arrays.

Fig. 1.
figure 1

Motivating example

Reduction to the Quantifier-Free Case Through Independence. We propose to ignore the universal quantification over x, but restrict models to those which do not depend on x. For example, model \(\{\mathtt {a}=1, \mathtt {x}=1, \mathtt {b}=0\}\) does depend on x, as taking \(x=0\) invalidates the formula, while model \(\left\{ \mathtt {a}=0, \mathtt {x}=1, \mathtt {b}=1\right\} \) is independent of x. We call constraint \(\psi \triangleq (a=0)\) an independence condition: any interpretation of \(\phi \) satisfying \(\psi \) will be independent of x, and therefore a model of \(\phi \wedge \psi \) will give us a model of \(\phi _{\forall }\).

Inference of Independence Conditions Through Tainting. Figure 1 details in its right part a way to infer such independence conditions. Given a quantified reachability condition (1), we first associate to every variable v a (boolean) taint variable \(v^\bullet \) indicating whether the solution may depend on v (value \(\top \)) or not (value \(\bot \)). Here, \(x^\bullet \) is set to \(\bot \), \(a^\bullet \) and \(b^\bullet \) are set to \(\top \) (2). An independence condition (3)—a formula modulo theory—is then constructed using both initial and taint variables. We extend taint constraints to terms, \(t^\bullet \) indicating here whether t may depend on x or not, and we require the top-level term (i.e., the formula) to be tainted to \(\top \) (i.e., to be indep. from x). Condition (3) reads as follows: in order to enforce that \((ax+b>0)^\bullet \) holds, we enforce that \((ax)^\bullet \) and \(b^\bullet \) hold, and for \((ax)^\bullet \) we require that either \(a^\bullet \) and \(x^\bullet \) hold, or \(a^\bullet \) holds and \(a=0\) (absorbing the value of x), or the symmetric case. We see that \(\cdot ^\bullet \) is defined recursively and combines a systematic part (if \(t^\bullet \) holds then \(f(t)^\bullet \) holds, for any f) with a theory-dependent part (here, based on \(\times \)). After simplifications (4), we obtain \(a=0\) as an independence condition (5) which is adjoined to the reachability condition freed of its universal quantification (6). A QF-solver provides a model of (6) (e.g., \(\left\{ \mathtt {a}=0,\mathtt {b}=1,\mathtt {x}=5 \right\} \)), lifted into a model of (1) by discarding the valuation of x (e.g., \(\left\{ \mathtt {a}=0,\mathtt {b}=1 \right\} \)).

In this specific example the inferred independence condition (5) is the most generic one and (1) and (6) are equisatisfiable. Yet, in general it may be an under-approximation, constraining the variables more than needed and yielding a correct but incomplete decision method: a model of (6) can still be turned into a model of (1), but (6) might not have a model while (1) has.

3 Notations

We consider the framework of many-sorted first-order logic with equality, and we assume standard definitions of sorts, signatures and terms. Given a tuple of variables \(\varvec{x} \triangleq \left( x_1,\dots ,x_n\right) \) and a quantifier \(\mathcal {Q}\) (\(\forall \) or \(\exists \)), we shorten \(\mathcal {Q}{x_1}\dots \mathcal {Q}{x_n}.\varPhi \) as \(\mathcal {Q}\varvec{x}.\varPhi \). A formula is in prenex normal form if it is written as \(\mathcal {Q}_1{\varvec{x}_1}\dots \mathcal {Q}_n{\varvec{x}_n}.\varPhi \) with \(\varPhi \) a quantifier-free formula. A formula is in Skolem normal form if it is in prenex normal form with only universal quantifiers. We write \(\varPhi \left( \varvec{x}\right) \) to denote that the free variables of \(\varPhi \) are in \(\varvec{x}\). Let \(\varvec{t}\triangleq \left( t_1,\dots ,t_n\right) \) be a term tuple, we write \(\varPhi \left( \varvec{t}\right) \) for the formula obtained from \(\varPhi \) by replacing each occurrence of \(x_i\) in \(\varPhi \) by \(t_i\). An interpretation \(\mathcal {I}\) associates a domain to each sort of a signature and a value to each symbol of a formula, and \(\llbracket \varDelta \rrbracket _{\mathcal {I}}\) denotes the evaluation of term \(\varDelta \) over \(\mathcal {I}\). A satisfiability relation \(\models \) between interpretations and formulas is defined inductively as usual. A model of \(\varPhi \) is an interpretation \(\mathcal {I}\) satisfying \(\mathcal {I}\models \varPhi \). We sometimes refer to models as “solutions”. Formula \(\varPsi \) entails formula \(\varPhi \), written \(\varPsi \models \varPhi \), if every interpretation satisfying \(\varPsi \) satisfies \(\varPhi \) as well. Two formulas are equivalent, denoted \(\varPsi \equiv \varPhi \), if they have the same models. A theory \(\mathcal {T}\triangleq \left( \varSigma ,\varvec{\mathcal {I}}\right) \) restricts symbols in \(\varSigma \) to be interpreted in \(\varvec{\mathcal {I}}\). The quantifier-free fragment of \(\mathcal {T}\) is denoted QF-\(\mathcal {T}\).

Convention. Letters \(a,b,c\dots \) denote uninterpreted symbols and variables. Letters \(x,y,z\dots \) denote quantified variables. \(\varvec{a},\varvec{b},\varvec{c}\) denote sets of uninterpreted symbols. \(\varvec{x},\varvec{y},\varvec{z}\dots \) denote sets of quantified variables. Finally, \(\mathtt {a},\mathtt {b},\mathtt {c}\dots \) denote valuations of associated (sets of) symbols.

In the rest of this paper, we assume w.l.o.g. that all formulas are in Skolem normal form. Recall that any formula \(\phi \) in classical logic can be normalized into a formula \(\psi \) in Skolem normal form such that any model of \(\phi \) can be lifted into a model of \(\psi \), and vice versa. This strong relation, much closer to formula equivalence than to formula equisatisfiability, ensures that our correctness and completeness results all along the paper hold for arbitrarily quantified formula.

Companion Technical Report. Additional technical details (proofs, experiments, etc.) are available online at http://benjamin.farinier.org/cav2018/.

4 Musing with Independence

4.1 Independent Interpretations, Terms and Formulas

A solution \((\mathtt {x},\mathtt {a})\) of \(\varPhi \) does not depend on \(\varvec{x}\) if \(\varPhi (\varvec{x},\varvec{a})\) is always true or always false, for all possible valuations of \(\varvec{x}\) as long as \(\varvec{a}\) is set to \(\mathtt {a}\). More formally, we define the independence of an interpretation of \(\varPhi \) w.r.t. \(\varvec{x}\) as follows:

Definition 1

(Independent interpretation)

  • Let \(\varPhi \left( \varvec{x},\varvec{a}\right) \) a formula with free variables \(\varvec{x}\) and \(\varvec{a}\). Then an interpretation \(\mathcal {I}\) of \(\varPhi \left( \varvec{x},\varvec{a}\right) \) is independent of \(\varvec{x}\) if for all interpretations \(\mathcal {J}\) equal to \(\mathcal {I}\) except on \(\varvec{x}\), \(\mathcal {I}\models \varPhi \) if and only if \(\mathcal {J}\models \varPhi \).

  • Let \(\varDelta \left( \varvec{x},\varvec{a}\right) \) a term with free variables \(\varvec{x}\) and \(\varvec{a}\). Then an interpretation \(\mathcal {I}\) of \(\varDelta \left( \varvec{x},\varvec{a}\right) \) is independent of \(\varvec{x}\) if for all interpretations \(\mathcal {J}\) equal to \(\mathcal {I}\) except on \(\varvec{x}\), \(\llbracket \varDelta \left( \varvec{x},\varvec{a}\right) \rrbracket _{\mathcal {I}} = \llbracket \varDelta \left( \varvec{x},\varvec{a}\right) \rrbracket _{\mathcal {J}}\).

Regarding formula \({ax+b>0}\) from Fig. 1, \(\left\{ {a=0}, {b=1}, {x=1} \right\} \) is independent of x while \(\left\{ {a=1}, {b=0}, {x=1} \right\} \) is not. Considering term \(\left( t\left[ {a}\leftarrow {b} \right] \right) [c]\), with t an array written at index a then read at index c, \(\left\{ {a=0}, {b=42}, {c=0}, {t=\left[ \dots \right] } \right\} \) is independent of t (evaluates to 42) while \(\left\{ {a=0}, {b=1}, {c=2}, {t=\left[ \dots \right] } \right\} \) is not (evaluates to \(t\left[ 2\right] \)). We now define independence for formulas and terms.

Definition 2

(Independent formula and term)

  • Let \(\varPhi \left( \varvec{x},\varvec{a}\right) \) a formula with free variables \(\varvec{x}\) and \(\varvec{a}\). Then \(\varPhi \left( \varvec{x},\varvec{a}\right) \) is independent of \(\varvec{x}\) if \(\forall \varvec{x}.\forall \varvec{y}. \left( \varPhi \left( \varvec{x},\varvec{a}\right) \Leftrightarrow \varPhi \left( \varvec{y},\varvec{a}\right) \right) \) is true for any value of \(\varvec{a}\).

  • Let \(\varDelta \left( \varvec{x},\varvec{a}\right) \) a term with free variables \(\varvec{x}\) and \(\varvec{a}\). Then \(\varDelta \left( \varvec{x},\varvec{a}\right) \) is independent of \(\varvec{x}\) if \(\forall \varvec{x}.\forall \varvec{y}. \left( \varDelta \left( \varvec{x},\varvec{a}\right) = \varDelta \left( \varvec{y},\varvec{a}\right) \right) \) is true for any value of \(\varvec{a}\).

Definition 2 of formula and term independence is far stronger than Definition 1 of interpretation independence. Indeed, it can easily be checked that if a formula \(\varPhi \) (resp. a term \(\varDelta \)) is independent of \(\varvec{x}\), then any interpretation of \(\varPhi \) (resp. \(\varDelta \)) is independent of \(\varvec{x}\). However, the converse is false as formula \(ax+b>0\) is not independent of x, but has an interpretation \(\left\{ {a=0},{b=1},{x=1}\right\} \) which is.

4.2 Independence Conditions

Since it is rarely the case that a formula (resp. term) is independent from a set of variables \(\varvec{x}\), we are interested in Sufficient Independence Conditions. These conditions are additional constraints that can be added to a formula (resp. term) in such a way that they make the formula (resp. term) independent of \(\varvec{x}\).

Definition 3

(Sufficient Independence Condition (SIC))

  • A Sufficient Independence Condition for a formula \(\varPhi \left( \varvec{x},\varvec{a}\right) \) with regard to \(\varvec{x}\) is a formula \(\varPsi \left( \varvec{a}\right) \) such that \(\varPsi \left( \varvec{a}\right) \models ({\forall \varvec{x}.\forall \varvec{y}. \varPhi \left( \varvec{x},\varvec{a}\right) \Leftrightarrow \varPhi \left( \varvec{y},\varvec{a}\right) })\).

  • A Sufficient Independence Condition for a term \(\varDelta \left( \varvec{x},\varvec{a}\right) \) with regard to \(\varvec{x}\), is a formula \(\varPsi \left( \varvec{a}\right) \) such that \(\varPsi \left( \varvec{a}\right) \models ({\forall \varvec{x}.\forall \varvec{y}. \varDelta \left( \varvec{x},\varvec{a}\right) = \varDelta \left( \varvec{y},\varvec{a}\right) })\).

We denote by \(\textsc {sic}_{\varPhi ,\varvec{x}}\) (resp. \(\textsc {sic}_{\varDelta ,\varvec{x}}\)) a Sufficient Independence Condition for a formula \(\varPhi \left( \varvec{x},\varvec{a}\right) \) (resp. for a term \(\varDelta \left( \varvec{x},\varvec{a}\right) \)) with regard to \(\varvec{x}\). For example, \(a=0\) is a \(\textsc {sic}_{\varPhi ,x}\) for formula \(\varPhi \triangleq {ax+b>0}\), and \(a=c\) is a \(\textsc {sic}_{\varDelta ,t}\) for term \(\varDelta \triangleq \left( t\left[ {a}\leftarrow {b}\right] \right) [c]\). Note that \(\bot \) is always a sic, and that sic are closed under \(\wedge \) and \(\vee \). Proposition 1 clarifies the interest of sic for model generation.

Proposition 1

(Model generalization). Let \(\varPhi \left( \varvec{x},\varvec{a}\right) \) a formula and \(\varPsi \) a \(\textsc {sic}_{\varPhi ,\varvec{x}}\). If there exists an interpretation \(\left\{ \mathtt {x},\mathtt {a}\right\} \) such that \(\left\{ \mathtt {x},\mathtt {a}\right\} \models \varPsi \left( \varvec{a}\right) \wedge \varPhi \left( \varvec{x},\varvec{a}\right) \), then \(\left\{ \mathtt {a}\right\} \models \forall \varvec{x}.\varPhi \left( \varvec{x},\varvec{a}\right) \).

Proof

(sketch of). Appendix C.1 of the companion technical report.

For the sake of completeness, we introduce now the notion of Weakest Independence Condition for a formula \(\varPhi \left( \varvec{x},\varvec{a}\right) \) with regard to \(\varvec{x}\) (resp. a term \(\varDelta \left( \varvec{x},\varvec{a}\right) \)). We will denote such conditions \(\textsc {wic}_{\varPhi ,\varvec{x}}\) (resp. \(\textsc {wic}_{\varDelta ,\varvec{x}}\)).

Definition 4

(Weakest Independence Condition (WIC))

  • A Weakest Independence Condition for a formula \(\varPhi \left( \varvec{x},\varvec{a}\right) \) with regard to \(\varvec{x}\) is a \(\textsc {sic}_{\varPhi ,\varvec{x}}\) \(\varPi \) such that, for any other \(\textsc {sic}_{\varPhi ,\varvec{x}}\) \(\varPsi \), \(\varPsi \models \varPi \).

  • A Weakest Independence Condition for a term \(\varDelta \left( \varvec{x},\varvec{a}\right) \) with regard to \(\varvec{x}\) is a \(\textsc {sic}_{\varDelta ,\varvec{x}}\) \(\varPi \) such that, for any other \(\textsc {sic}_{\varDelta ,\varvec{x}}\) \(\varPsi \), \(\varPsi \models \varPi \).

Note that \(\varOmega \triangleq \forall \varvec{x}.\forall \varvec{y}. \left( \varPhi \left( \varvec{x},\varvec{a}\right) \Leftrightarrow \varPhi \left( \varvec{y},\varvec{a}\right) \right) \) is always a \(\textsc {wic}_{\varPhi ,\varvec{x}}\), and any formula \(\varPi \) is a \(\textsc {wic}_{\varPhi ,\varvec{x}}\) if and only if \(\varPi \equiv \varOmega \). Therefore all syntactically different wic have the same semantics. As an example, both sic \(a=0\) and \(a=c\) presented earlier are \(\textsc {wic}\). Proposition 2 emphasizes the interest of wic for model generation.

Proposition 2

(Model specialization). Let \(\varPhi \left( \varvec{x},\varvec{a}\right) \) a formula and \(\varPi (\varvec{a})\) a \(\textsc {wic}_{\varPhi ,\varvec{x}}\). If there exists an interpretation \(\left\{ \mathtt {a}\right\} \) such that \(\left\{ \mathtt {a} \right\} \models \forall \varvec{x}. \varPhi \left( \varvec{x},\varvec{a}\right) \), then \(\left\{ \mathtt {x},\mathtt {a} \right\} \models \varPi \left( \varvec{a}\right) \wedge \varPhi \left( \varvec{x},\varvec{a}\right) \) for any valuation \(\mathtt {x}\) of \(\varvec{x}\).

Proof

(sketch of). Appendix C.2 of the companion technical report.

From now on, our goal is to infer from a formula \(\forall \varvec{x}. \varPhi \left( \varvec{x},\varvec{a}\right) \) a \(\textsc {sic}_{\varPhi ,\varvec{x}}\) \(\varPsi \left( \varvec{a}\right) \), find a model for \(\varPsi \left( \varvec{a}\right) \wedge \varPhi \left( \varvec{x},\varvec{a}\right) \) and generalize it. This \(\textsc {sic}_{\varPhi ,\varvec{x}}\) should be as weak—in the sense “less coercive”—as possible, as otherwise \(\bot \) could always be used, which would not be very interesting for our overall purpose.

For the sake of simplicity, previous definitions omit to mention the theory to which the sic belongs. If the theory \(\mathcal {T}\) of the quantified formula is decidable we can always choose \(\forall \varvec{x}.\forall \varvec{y}. \left( \varPhi \left( \varvec{x},\varvec{a}\right) \Leftrightarrow \varPhi \left( \varvec{y},\varvec{a}\right) \right) \) as a sic, but it is simpler to directly use a \(\mathcal {T}\)-solver. The challenge is, for formulas in an undecidable theory \(\mathcal {T}\), to find a non-trivial sic in its quantifier-free fragment QF-\(\mathcal {T}\).

Under this constraint, we cannot expect a systematic construction of \(\textsc {wic}\), as it would allow to decide the satisfiability of any quantified theory with a decidable quantifier-free fragment. Yet informally, the closer a sic is to be a wic, the closer our approach is to completeness. Therefore this notion might be seen as a fair gauge of the quality of a \(\textsc {sic}\). Having said that, we leave a deeper study on the inference of wic as future work.

5 Generic Framework for SIC-Based Model Generation

We describe now our overall approach. Algorithm 1 presents our sic-based generic framework for model generation (Sect. 5.1). Then, Algorithm 2 proposes a taint-based approach for sic inference (Sect. 5.2). Finally, we discuss complexity and efficiency issues (Sect. 5.3) and detail extensions (Sect. 5.4), such as partial elimination.

From now on, we do not distinguish anymore between terms and formulas, their treatment being symmetric, and we call targeted variables the variables we want to be independent of.

5.1 SIC-Based Model Generation

figure b

Our model generation technique is described in Algorithm 1. Function solveQ takes as input a formula \(\forall \varvec{x}. \varPhi \left( \varvec{x},\varvec{a}\right) \) over a theory \(\mathcal {T}\). It first calculates a \(\textsc {sic}_{\varPhi ,\varvec{x}}\) \(\varPsi \left( \varvec{a}\right) \) in QF-\(\mathcal {T}\). Then it solves \(\varPhi \left( \varvec{x},\varvec{a}\right) \wedge \varPsi \left( \varvec{a}\right) \). Finally, depending on the result and whether \(\varPsi \left( \varvec{a}\right) \) is a \(\textsc {wic}_{\varPhi ,\varvec{x}}\) or not, it answers sat, unsat or unknown. solveQ is parametrized by two functions solveQF and inferSIC:

  • solveQF is a decision procedure (typically a SMT solver) for QF-\(\mathcal {T}\). solveQF is said to be correct if each time it answers sat (resp. unsat) the formula is satisfiable (resp. unsatisfiable); it is said to be complete if it always answers sat or unsat, never unknown.

  • inferSIC takes as input a formula \(\varPhi \) in QF-\(\mathcal {T}\) and a set of targeted variables \(\varvec{x}\), and produces a \(\textsc {sic}_{\varPhi ,\varvec{x}}\) in QF-\(\mathcal {T}\). It is said to be correct if it always returns a sic, and complete if all the sic it returns are wic. A possible implementation of inferSIC is described in Algorithm 2 (Sect. 5.2).

Function solveQ enjoys the two following properties, where correctness and completeness are defined as for solveQF.

Theorem 1

(Correctness and completeness)

  • If solveQF and inferSIC are correct, then solveQ is correct.

  • If solveQF and inferSIC are complete, then solveQ is complete.

Proof

(sketch of). Follow directly from Propositions 1 and 2 (Sect. 4.2).

5.2 Taint-Based SIC Inference

figure c

Algorithm 2 presents a taint-based implementation of function inferSIC. It consists of a (syntactic) core calculus described here, refined by a (semantic) theory-dependent calculus theorySIC described in Sect. 6. From formula \(\varPhi \left( \varvec{x},\varvec{a}\right) \) and targeted variables \(\varvec{x}\), inferSIC is defined recursively as follow.

If \(\varPhi \) is a constant it returns \(\top \) as constants are independent of any variable. If \(\varPhi \) is a variable v, it returns \(\top \) if we may depend on v (i.e., \(v \not \in \varvec{x}\)), \(\bot \) otherwise. If \(\varPhi \) is a function \(f\left( \phi _1, \ldots ,\phi _n\right) \), it first recursively computes for every sub-term \(\phi _i\) a \(\textsc {sic}_{\phi _i,\varvec{x}}\) \(\psi _i\). Then these results are sent with \(\varPhi \) to theorySIC which computes a \(\textsc {sic}_{\varPhi ,\varvec{x}}\) \(\varPsi \). The procedure returns the disjunction between \(\varPsi \) and the conjunction of the \(\psi _i\)’s. Note that theorySIC default value \(\bot \) is absorbed by the disjunction.

The intuition is that if the \(\phi _i\)’s are independent of \(\varvec{x}\), then \(f\left( \phi _1, \ldots ,\phi _n\right) \) is. Therefore Algorithm 2 is said to be taint-based as, when theorySIC is left to its default value, it acts as a form of taint tracking [15, 27] inside the formula.

Proposition 3

(Correctness). Given a formula \(\varPhi \left( \varvec{x},\varvec{a}\right) \) and assuming that theorySIC is correct, then \(\texttt {inferSIC}\left( \varPhi ,\varvec{x}\right) \) indeed computes a \(\textsc {sic}_{\varPhi ,\varvec{x}}\).

Proof

(sketch of). This proof has been mechanized in CoqFootnote 1.

Note that on the other hand, completeness does not hold: in general \(\texttt {inferSIC}\) does not compute a \(\textsc {wic}\), cf. discussion in Sect. 5.4.

5.3 Complexity and Efficiency

We now evaluate the overhead induced by Algorithm 1 in terms of formula size and complexity of the resolution—the running time of Algorithm 1 itself being expected to be negligible (preprocessing).

Definition 5

The size of a term is inductively defined as \(\text {size}\left( x\right) \triangleq 1\) for x a variable, and \(\text {size}\left( f\left( t_1, \ldots ,t_n\right) \right) \triangleq 1+\varSigma _i\,\text {size}\left( t_i\right) \) otherwise. We say that theorySIC is bounded in size if there exists K such that, for all terms \(\varDelta \), \(\text {size} \left( \texttt {theorySIC}\left( \varDelta ,\cdot \right) \right) \le {K}\).

Proposition 4

(Size bound). Let N be the maximal arity of symbols defined by theory \(\mathcal {T}\). If theorySIC is bounded in size by K, then for all formula \(\varPhi \) in \(\mathcal {T}\), \(\text {size}\left( \texttt {inferSIC}\left( \varPhi ,\cdot \right) \right) \le \left( K+N\right) \cdot \text {size}\left( \varPhi \right) \).

Proposition 5

(Complexity bound). Let us suppose theorySIC bounded in size, and let \(\varPhi \) be a formula belonging to a theory \(\mathcal {T}\) with polynomial-time checkable solutions. If \(\varPsi \) is a \(\textsc {sic}_{\varPhi ,\cdot }\) produced by inferSIC, then a solution for \(\varPhi \wedge \varPsi \) is checkable in time polynomial in size of \(\varPhi \).

Proof

(sketch of). Appendices C.3 and C.4 of the companion technical report.

These propositions demonstrate that, for formula landing in complex enough theories, our method lifts QF-solvers to the quantified case (in an approximated way) without any significant overhead, as long as theorySIC is bounded in size. This latter constraint can be achieved by systematically binding sub-terms to (constant-size) fresh names and having theorySIC manipulates these binders.

5.4 Discussions

Extension. Let us remark that our framework encompasses partial quantifier elimination as long as the remaining quantifiers are handled by solveQF. For example, we may want to remove quantifications over arrays but keep those on bitvectors. In this setting, inferSIC can also allow some level of quantification, providing that solveQF handles them.

About WIC. As already stated, inferSIC does not propagate wic in general. For example, considering formulas \(t_1 \triangleq (x<0)\) and \(t_2 \triangleq (x\ge 0)\), then \(\textsc {wic}_{t_1,x} = \bot \) and \(\textsc {wic}_{t_2,x} = \bot \). Hence inferSIC returns \(\bot \) as sic for \(t_1 \vee t_2\), while actually \(\textsc {wic}_{t_1 \vee t_2,x} = \top \).

Nevertheless, we can already highlight a few cases where wic can be computed. (1) inferSIC does propagate wic on one-to-one uninterpreted functions. (2) If no variable of \(\varvec{x}\) appears in any sub-term of \(f(t,t')\), then the associated wic is \(\top \). While a priori naive, this case becomes interesting when combined with simplifications (Sect. 7.1) that may eliminate \(\varvec{x}\). (3) If a sub-term falls in a sub-theory admitting quantifier elimination, then the associated wic is computed by eliminating quantifiers in \((\forall .\varvec{x}.\varvec{y}. \varPhi (\varvec{x},\varvec{a}) \Leftrightarrow \varPhi (\varvec{y},\varvec{a}))\). (4) We may also think of dedicated patterns: regarding bitvectors, the wic for \(x \le a \Rightarrow x \le x+k\) is \(a \le \texttt {Max}-k\). Identifying under which condition wic propagation holds is a strong direction for future work.

6 Theory-Dependent SIC Refinements

We now present theory-dependent sic refinements for theories relevant to program analysis: booleans, fixed-size bitvectors and arrays—recall that uninterpreted functions are already handled by Algorithm 2. We then propose a generalization of these refinements together with a correctness proof for a larger class of operators.

6.1 Refinement on Theories

We recall theorySIC takes four parameters: a function symbol f, its arguments \(\left( t_1, \ldots ,t_n\right) \), their associated sic \(\left( t_1^\bullet , \ldots ,t_n^\bullet \right) \), and targeted variables \(\varvec{x}\). theorySIC pattern-matches the function symbol and returns the associated sic according to rules in Fig. 2. If a function symbol is not supported, we return the default value \(\bot \). Constants and variables are handled by inferSIC. For the sake of simplicity, rules in Fig. 2 are defined recursively, but can easily fit the interface required for theorySIC in Algorithm 2 by turning recursive calls into parameters.

Booleans and Ite. Rules for the boolean theory (Fig. 2a) handles \(\Rightarrow \), \(\wedge \), \(\vee \) and \(\text {ite}\) (if-then-else). For binary operators, the sic is the conjunction of the sic associated to one of the two sub-terms and a constraint on this sub-term that forces the result of the operator to be constant—e.g., to be equal to \(\bot \) (resp. \(\top \)) for the antecedent (resp. consequent) of an implication. These equality constraints are based on absorbing elements of operators.

Inference for the \(\text {ite}\) operator is more subtle. Intuitively, if its condition is independent of some \(\varvec{x}\), we use it to select the \(\textsc {sic}_{\varvec{x}}\) of the sub-term that will be selected by the \(\text {ite}\) operator. If the condition is dependent of \(\varvec{x}\), then we cannot use it anymore to select a \(\textsc {sic}_{\varvec{x}}\). In this case, we return the conjunction of the \(\textsc {sic}_{\varvec{x}}\) of both sub-terms and the constraint that the two sub-terms are equal.

Fig. 2.
figure 2

Examples of refinements for theorySIC

Bitvectors and Arrays. Rules for bitvectors (Fig. 2b) follow similar ideas, with constant \(\top \) (resp. \(\bot \)) substituted by \(1_n\) (resp. \(0_n\)), the bitvector of size n full of ones (resp. zeros). Rules for arrays (Fig. 2c) are derived from the theory axioms. The definition is recursive: rules need be applied until reaching either a store at the position where the select occurs, or the initial array variable.

As a rule of thumb, good sic can be derived from function axioms in the form of rewriting rules, as done for arrays. Similar constructions can be obtained for example for stacks or queues.

6.2 \(\mathcal {R}\)-Absorbing Functions

We propose a generalization of the previous theory-dependent sic refinements to a larger class of functions, and prove its correctness.

Intuitively, if a function has an absorbing element, constraining one of its operands to be equal to this element will ensure that the result of the function is independent of the other operands. However, it is not enough when a relation between some elements is needed, such as with \(\left( t[a\leftarrow b]\right) [c]\) where constraint \(a=c\) ensures the independence with regards to t. We thus generalize the notion of absorption to \(\mathcal {R}\)-absorption, where \(\mathcal {R}\) is a relation between function arguments.

Definition 6

Let \(f:\tau _1\times \cdots \times \tau _n\rightarrow \tau \) a function. f is \(\mathcal {R}\)-absorbing if there exists \(\mathcal {I}_\mathcal {R}\subset \left\{ 1,\cdots ,n\right\} \) and \(\mathcal {R}\) a relation between \(\alpha _i:\tau _i,\,i\in \mathcal {I}_\mathcal {R}\) such that, for all \(b \triangleq \left( b_1,\dots ,b_n\right) \) and \(c\triangleq \left( c_1,\dots ,c_n\right) \in \tau _1\times \dots \times \tau _n\), if \(\mathcal {R}(\left. b\right| _{\mathcal {I}_{\mathcal {R}}})\) and \(\left. b\right| _{\mathcal {I}_{\mathcal {R}}} = \left. c\right| _{\mathcal {I}_{\mathcal {R}}}\) where \(\left. \cdot \right| _{\mathcal {I}_{\mathcal {R}}}\) is the projection on \(\mathcal {I}_{\mathcal {R}}\), then \(f(b)=f(c)\).

\(\mathcal {I}_\mathcal {R}\) is called the support of the relation of absorption \(\mathcal {R}\).

For example, \(\left( a,b\right) \mapsto {{a}\vee {b}}\) has two pairs \(\left\langle \mathcal {R},\,\mathcal {I}_\mathcal {R}\right\rangle \) coinciding with the usual notion of absorption, \(\left\langle {a\!=\!\top },\,\left\{ 1_a\right\} \right\rangle \) and \(\left\langle {b\!=\!\top },\,\left\{ 2_b\right\} \right\rangle \). Function \(\left( x,y,z\right) \mapsto {xy+z}\) has among others the pair \(\left\langle {x\!=\!0},\,\left\{ 1_x,3_z\right\} \right\rangle \), while \(\left( a,b,c,t\right) \mapsto \left( t[a \leftarrow b]\right) [c]\) has the pair \(\left\langle {a\!=\!c},\,\left\{ 1_a,3_c\right\} \right\rangle \). We can now state the following proposition:

Proposition 6

Let \(f\left( t_1,\dots ,t_n\right) \) be a \(\mathcal {R}\)-absorbing function of support \(\mathcal {I}_\mathcal {R}\), and let \(t_i^\bullet \) be a \(\textsc {sic}_{t_i,\varvec{x}}\) for some \(\varvec{x}\). Then \(\mathcal {R}\left( t_{i\in \mathcal {\mathcal {I}_\mathcal {R}}}\right) \bigwedge _{i\in \mathcal {\mathcal {I}_\mathcal {R}}}t_i^\bullet \) is a \(\textsc {sic}_{f,\varvec{x}}\).

Proof (sketch of)

Appendix C.5 of the companion technical report.

Previous examples (Sect. 6.1) can be recast in term of \(\mathcal {R}\)-absorbing function, proving their correctness (cf. companion technical report). Note that regarding our end-goal, we should accept only \(\mathcal {R}\)-absorbing functions in QF-\(\mathcal {T}\).

7 Experimental Evaluation

This section describes the implementation of our method (Sect. 7.1) for bitvectors and arrays (ABV), together with experimental evaluation (Sect. 7.2).

7.1 Implementation

Our prototype Tfml (Taint engine for ForMuLa)Footnote 2 comprises 7 klocs of OCaml. Given an input formula in the SMT-LIB format [5] (ABV theory), Tfml performs several normalizations before adding taint information following Algorithm 1. The process ends with simplifications as taint usually introduces many constant values, and a new SMT-LIB formula is output.

Sharing with Let-Binding. This stage is crucial as it allows to avoid term duplication in theorySIC (Algorithm 2, Sect. 5.3, and Proposition 4). We introduce new names for relevant sub-terms in order to easily share them.

Simplifications. We perform constant propagation and rewriting (standard rules, e.g. \(x-x \mapsto 0\) or \(x \times 1 \mapsto x\)) on both initial and transformed formulas – equality is soundly approximated by syntactic equality.

Shadow Arrays. We encode taint constraints over arrays through shadow arrays. For each array declared in the formula, we declare a (taint) shadow array. The default value for all cells of the shadow array is the taint of the original array, and for each value stored (resp. read) in the original array, we store (resp. read) the taint of the value in the shadow array. As logical arrays are infinite, we cannot constrain all the values contained in the initial shadow array. Instead, we rely on a common trick in array theory: we constrain only cells corresponding to a relevant read index in the formula.

Iterative Skolemization. While we have supposed along the paper to work on skolemized formulas, we have to be more careful in practice. Indeed, skolemization introduce dependencies between a skolemized variable and all its preceding universally quantified variables, blurring our analysis and likely resulting in considering the whole formula as dependent. Instead, we follow an iterative process: 1. Skolemize the first block of existentially quantified variables; 2. Compute the independence condition for any targeted variable in the first block of universal quantifiers and remove these quantifiers; 3. Repeat. This results in full Skolemization together with the construction of an independence condition, while avoiding many unnecessary dependencies.

7.2 Evaluation

Objective. We experimentally evaluate the following research questions: RQ1 How does our approach perform with regard to state-of-the-art approaches for model generation of quantified formulas? RQ2 How effective is it at lifting quantifier-free solvers into (sat-only) quantified solvers? RQ3 How efficient is it in terms of preprocessing time and formula size overhead? We evaluate our method on a set of formulas combining arrays and bitvectors (paramount in software verification), against state-of-the-art solvers for these theories.

Table 1. Answers and resolution time (in seconds, include timeout)

Protocol. The experimental setup below runs on an Intel(R) Xeon(R) E5-2660 v3 @ 2.60 GHz, 4 GB RAM per process, and a timeout of 1000 s per formula.

  • Metrics. For RQ1 we compare the number of sat and unknown answers between solvers supporting quantification, with and without our approach. For RQ2, we compare the number of sat and unknown answers between quantifier-free solvers enhanced by our approach and solvers supporting quantification. For RQ3, we measure preprocessing time and formulas size overhead.

  • Benchmarks. We consider two sets of ABV formulas. First, a set of 1421 formulas from (a modified version of) the symbolic execution tool Binsec [12] representing quantified reachability queries (cf. Sect. 2) over Binsec benchmark programs (security challenges, e.g. crackme or vulnerability finding). The initial (array) memory is quantified so that models depend only on user input. Second, a set of 1269 ABV formulas generated from formulas of the QF-ABV category of SMT-LIB [5] – sub-categories brummayerbiere, dwp formulas and klee selected. The generation process consists in universally quantifying some of the initial array variables, mimicking quantified reachability problems.

  • Competitors. For RQ1, we compete against the two state-of-the-art SMT solvers for quantified formulas CVC4 [4] (finite model instantiation [31]) and Z3 [14] (model-based instantiation [20]). We also consider degraded versions CVC4\(_{E}\) and Z3\(_{E}\) that roughly represent standard E-matching [16]. For RQ2 we use Boolector [10], one of the very best QF-ABV solvers.

Table 2. Complementarity of our approach with existing solvers (sat instances)

Results. Tables 1 and 2 and Fig. 3 sum up our experimental results, which have all been cross-checked for consistency. Table 1 reports the number of successes (sat or unsat) and failures (unknown), plus total solving times. The sign indicates formulas preprocessed with our approach. In that case it is impossible to correctly answer unsat (no wic checking), the unsat line is thus N/A. Since Boolector does not support quantified ABV formulas, we only give results with our approach enabled. Table 1 reads as follow: of the 1269 SMT-LIB formulas, standalone Z3 solves 426 formulas (261 sat, 165 unsat), and 366 (all sat) if preprocessed. Interestingly, our approach always improves the underlying solver in terms of solved (sat) instances, either in a significant way (SMT-LIB) or in a modest way (Binsec). Yet, recall that in a software verification setting every win matters (possibly new bug found or new assertion proved). For , it also strongly reduces computation time. Last but not least, (a pure QF-solver) turns out to have the best performance on sat-instances, beating state-of-the-art approaches both in terms of solved instances and computation time.

Fig. 3.
figure 3

Overhead in formula size

Table 2 substantiates the complementarity of the different methods, and reads as follow: for SMT-LIB, solves 224 (sat) formulas missed by Z3, while Z3 solves 86 (sat) formulas missed by , and 485 (sat) formulas are solved by either one of them.

Figure 3 shows formula size averaging a 9-fold increase (min 3, max 12): yet they are easier to solve because they are more constrained. Regarding performance and overhead of the tainting process, taint time is almost always less than 1s in our experiments (not shown here), 4 min for worst case, clearly dominated by resolution time. The worst case is due to a pass of linearithmic complexity which can be optimized to be logarithmic.

Table 3. GRUB example

Pearls. We show hereafter two particular applications of our method. Table 3 reports results of another symbolic execution experiment, on the grub example.On this example, completely outperforms existing approaches. As a second application, while the main drawback of our method is that it precludes proving unsat, this is easily mitigated by complementing the approach with another one geared (or able) to proving unsat, yielding efficient solvers for quantified formulas, as shown in Table 4.

Table 4. Best approaches

Conclusion. Experiments demonstrate the relevance of our taint-based technique for model generation. (RQ1) Results in Table 1 shows that our approach greatly facilitates the resolution process. On these examples, our method performs better than state-of-the-art solvers but also strongly complements them (Table 2). (RQ2) Moreover, Table 1 demonstrates that our technique is highly effective at lifting quantifier-free solvers to quantified formulas, in both number of sat answers and computation time. Indeed, once lifted, Boolector performs better (for sat-only) than Z3 or CVC4 with full quantifier support. Finally (RQ3) our tainting method itself is very efficient both in time and space, making it perfect either for a preprocessing step or for a deeper integration into a solver. In our current prototype implementation, we consider the cost to be low. The companion technical report contains a few additional experiments on bitvectors and integer arithmetic, including the example from Fig. 1.

8 Related Work

Traditional approaches to solving quantified formulas essentially involve either generic methods geared to proving unsatisfiability and validity [16], or complete but dedicated approaches for particular theories [8, 36]. Besides, some recent methods [20, 22, 31] aim to be correct and complete for larger classes of theories.

Generic Method for Unsatisfiability. Broadly speaking, these methods iteratively instantiate axioms until a contradiction is found. They are generic w.r.t. the underlying theory and allow to reuse standard theory solvers, but termination is not guaranteed. Also, they are more suited to prove unsatisfiability than to find models. In this family, E-matching [13, 16] shows reasonable cost when combined with conflict-based instantiation [30] or semantic triggers [17, 18]. In pure first-order logic (without theories), quantifiers are mainly handled through resolution and superposition [1, 26] as done in Vampire [24, 33] and E [34].

Complete Methods for Specific Theories. Much work has been done on designing complete decision procedures for quantified theories of interest, notably array properties [8], quantified theory of bitvectors [23, 36], Presburger arithmetic or Real Linear Arithmetic [9, 19]. Yet, they usually come at a high cost.

Generic Methods for Model Generation. Some recent works detail attempts at more general approaches to model generation.

Local theory extensions [2, 22] provide means to extend some decidable theories with free symbols and quantifications, retaining decidability. The approach identifies specific forms of formulas and quantifications (bounded), such that these theory extensions can be solved using finite instantiation of quantifiers together with a decision procedure for the original theory. The main drawback is that the formula size can increase a lot.

Model-based quantifier instantiation is an active area of research notably developed in Z3 and CVC4. The basic line is to consider the partial model under construction in order to find the right quantifier instantiations, typically in a try-and-refine manner. Depending on the variants, these methods favors either satisfiability or unsatisfiability. They build on the underlying quantifier-free solver and can be mixed with E-matching techniques, yet each refinement yields a solver call and the refinement process may not terminate. Ge and de Moura [20] study decidable fragments of first-order logic modulo theories for which model-based quantifier instantiation yields soundness and refutational completeness. Reynolds et al. [30], Barbosa [3] and Preiner et al. [28] use models to guide the instantiation process towards instances refuting the current model. Finite model quantifier instantiation [31, 32] reduces the search to finite models, and is indeed geared toward model generation rather than unsatisfiability. Similar techniques have been used in program synthesis [29].

We drop support for the unsatisfiable case but get more flexibility: we deal with quantifiers on any sort, the approach terminates and is lightweight, in the sense that it requires a single call to the underlying quantifier-free solver.

Other. Our method can be seen as taking inspiration from program taint analysis [15, 27] developed for checking the non-interference [35] of public and secrete input in security-sensitive programs. As far as the analogy goes, our approach should not be seen as checking non-interference, but rather as inferring preconditions of non-interference. Moreover, our formula-tainting technique is closer to dynamic program-tainting than to static program-tainting, in the sense that precise dependency conditions are statically inserted at preprocess-time, then precisely explored at solving-time.

Finally, Darvas et al. [11] presents a bottom-up formula strengthening method. Their goal differ from ours, as they are interested in formula well-definedness (rather than independence) and validity (rather than model generation).

9 Conclusion

This paper addresses the problem of generating models of quantified first-order formulas over built-in theories. We propose a correct and generic approach based on a reduction to the quantifier-free case through the inference of independence conditions. The technique is applicable to any theory with a decidable quantifier-free case and allows to reuse all the work done on quantifier-free solvers. The method significantly enhances the performances of state-of-the-art SMT solvers for the quantified case, and supplements the latest advances in the field.

Future developments aim to tackle the definition of more precise inference mechanisms of independence conditions, the identification of interesting subclasses for which inferring weakest independence conditions is feasible, and the combination with other quantifier instantiation techniques.