HighLevel Abstractions for Simplifying Extended String Constraints in SMT
Abstract
Satisfiability Modulo Theories (SMT) solvers with support for the theory of strings have recently emerged as powerful tools for reasoning about stringmanipulating programs. However, due to the complex semantics of extended string functions, it is challenging to develop scalable solvers for the string constraints produced by program analysis tools. We identify several classes of simplification techniques that are critical for the efficient processing of string constraints in SMT solvers. These techniques can reduce the size and complexity of input constraints by reasoning about arithmetic entailment, multisets, and string containment relationships over input terms. We provide experimental evidence that implementing them results in significant improvements over the performance of stateoftheart SMT solvers for extended string constraints.
1 Introduction
Most programming languages support strings natively and a considerable number of programs perform some form of string manipulation. Automated reasoning about stringmanipulating programs for verification and test case generation purposes is then highly relevant for these languages and programs. Applications to security, such as finding SQL injection and XSS vulnerabilities in web applications [16, 18, 23] or proving their absence, are of critical importance. String constraints have also been used to generate relational database tables from SQL queries for unit testing purposes [21]. These applications require modeling all of the string operations that appear in real programs. This is challenging since some of those operations are complex and often realized by iterative applications of simpler operations. Additionally, since strings in many programming languages have variable length, reasoning accurately about them cannot be done by a reduction to bounded types such as bitvectors, and requires instead the development of solvers for unbounded strings. To make this type of reasoning more scalable, the use of dedicated theory solvers natively supporting common string operations has been proposed [5, 9]. Some string solvers are fully integrated within Satisfiability Modulo Theories (SMT) solvers [4, 12]; some are built (externally) on top of such solvers [9, 16, 19]; and others are independent of SMT solvers [23].
A major challenge in developing solvers for unbounded string constraints is the complex semantics of extended string functions beyond the basic operations of string concatenation and equality. Extended functions include \(\mathsf {replace}\), which replaces a string in another string, and Open image in new window , which returns the position of a string in another string. Another challenge is that constraints using extended functions are often combined with constraints over other theories, e.g. integer constraints over string lengths or applications of Open image in new window , which requires the involvement of solvers for those theories. Current string solvers address these challenges by reducing constraints with extended string functions to typically more verbose constraints over basic functions. As with every reduction, some of the higher level structure of the problem may be lost, with negative repercussions on the performance and scalability.
To address this issue, we have developed new techniques that reason about constraints with extended string operators before they are reduced to simpler ones. This analysis of complex terms can often eliminate the need for expensive reductions. The techniques are based on reasoning about relationships over strings with highlevel abstractions, such as their arithmetic relationships (e.g., reasoning about their length), their string containment relationships, and their relationships as multisets of characters. We have implemented these techniques in cvc4, an SMT solver with native support for string reasoning. An experimental evaluation with benchmarks from various applications shows that our new techniques allows cvc4 to significantly outperform other stateoftheart solvers that target extended string constraints.

A novel procedure for proving entailments over arithmetic predicates built from the theory of strings and linear integer arithmetic.

Extensions of this technique for showing containment relationships between strings.

A novel simplification technique based on abstracting strings as multisets.

Experimental evidence that the simplification techniques provide significant performance improvements over current stateoftheart solvers.
In the remainder of this section, we discuss related work. In Sect. 2, we provide some background on the theory of strings and how solvers reduce extended functions. In Sects. 3, 4 and 5, we describe, respectively, our arithmeticbased, containmentbased, and multisetbased simplification techniques. Section 6 describes our implementation of those techniques, and Sect. 7 presents our evaluation.
Related Work. Various approaches to solving constraints over extended string functions have been proposed. Saxena et al. [16] showed that constraints from the symbolic execution of JavaScript code contain a significant number of extended string functions, which underlines their importance. Their approach translates string constraints to bitvector constraints, similar to other approaches based on bounded strings such as HAMPI [9]. Bjørner et al. [5] proposed native support for extended string operators in string solvers for scaling symbolic execution of .NET code. They reduce extended string functions to basic ones after getting bounds for string lengths from an integer solver. They also showed that constraints involving unbounded strings and \(\mathsf {replace}\) are undecidable. PASS [11] reduces string constraints over extended functions to arrays. Z3str and its successors [4, 24, 25] reduce extended string functions to basic functions eagerly during preprocessing. S3 [18] reduces recursive functions such as \(\mathsf {replace}\) incrementally by splitting and unfolding. Its successor S3P [19] refines this reduction by pruning the resulting subproblems for better performance. cvc4 [3] reduces constraints with extended functions lazily and leverages contextdependent simplifications to simplify the reductions [15]. Trau [1] reduces certain extended functions, such as \(\mathsf {replace}\), to contextfree membership constraints. Ostrich [7] implements a decision procedure for a subset of constraints that include extended string functions. The simplification techniques presented in this paper are agnostic to the underlying solving procedure, so they can be combined with all of these approaches.
2 Preliminaries
We work in the context of manysorted firstorder logic with equality and assume the reader is familiar with the notions of signature, term, literal, formula, and formal interpretation of formulas. We review a few relevant definitions in the following. A theory is a pair Open image in new window where \(\varSigma \) is a signature and \(\mathbf {I}\) is a class of \(\varSigma \)interpretations, the models of T. We assume \(\varSigma \) contains the equality predicate \(\approx \), interpreted as the identity relation, and the predicates Open image in new window (for true) and Open image in new window (for false). A \(\varSigma \)formula \(\varphi \) is satisfiable (resp., unsatisfiable) in T if it is satisfied by some (resp., no) interpretation in \(\mathbf {I}\). We write Open image in new window to denote that the \(\varSigma \)formula \(\varphi \) is T valid, i.e., is satisfied in every model of T. Two \(\varSigma \)terms \(t_1\) and \(t_2\) are equivalent in T if Open image in new window .
We refer to the function symbols in the bottom half of the figure as extended functions and refer to terms containing them as extended terms. A position in a string Open image in new window is a nonnegative integer n smaller than the length of l that identifies the \((n+1)^{th}\) character of l—with 0 identifying the first character, 1 the second, and so on. For all models \(\mathcal {I}\) of \(T_\mathsf {S}\), all Open image in new window , and Open image in new window , \(\mathsf {substr}^\mathcal {I}(l, n, m)\) (the interpretation of \(\mathsf {substr}\) in \(\mathcal {I}\) applied to l, n, m) is the longest substring of l starting at position n with length at most m, or \(\mathsf {\epsilon }\) if n is an invalid position or m is not positive; \(\mathsf {contains}^\mathcal {I}(l_1, l_2)\) is true if and only if \(l_2\) is a substring of \(l_1\), with \(\mathsf {\epsilon }\) being a substring of every string; Open image in new window is the position of the first occurrence of \(l_2\) in \(l_1\) at or after position n, n if \(l_2\) is empty and Open image in new window , and \(1\) if n is an invalid position, or if no such occurrence exists; Open image in new window is the result of replacing the first occurrence of \(l_1\) in l by \(l_2\), l if l does not contain \(l_1\), or the result of prepending \(l_2\) to l if \(l_1\) is empty; \(\mathsf {str.to.int}^\mathcal {I}(l)\) is the nonnegative integer represented by l in decimal notation or \(1\) if the string contains nondigit characters; \(\mathsf {int.to.str}^\mathcal {I}(n)\) is the result of converting n to the corresponding string in decimal notation if n is nonnegative, or \(\mathsf {\epsilon }\) otherwise. We write \(\mathsf {substr}(t,u)\) as shorthand for the term \(\mathsf {substr}(t,u,t)\), i.e. the suffix of t starting at position u.
Note that the semantics for \(\mathsf {replace}\) and Open image in new window correspond to the semantics in the current draft of the SMTLIB standard for the theory of strings [17]; they are slightly different from the ones described in previous work [4, 15, 20].
2.1 Solving Extended String Constraints (with Simplification)
Various efficient solvers have been designed for the satisfiability problem for quantifierfree \(T_\mathsf {S}\)constraints, including cvc4 [3], s3# [20] and z3str3 [4]. In this section, we give an overview of how these solvers process extended functions in practice.
Generally speaking, constraints involving extended functions are converted to basic ones through a series of reductions performed in an incremental fashion by the solver. Operators whose reduction requires universal quantification are dealt with by guessing upper bounds on the lengths of input strings or by lazily adding constraints that block models that do not satisfy extended string constraints.
Example 1
The reduction in Example 1 introduces \(5\cdot n\) theory literals over basic string functions and \(3\cdot n\) string variables. A full reduction accounting for all corner cases of \(\mathsf {substr}\) is even more complex and thus more expensive to process, even for small values of n. These performance challenges can be addressed by aggressive simplifications that eliminate extended functions using highlevel reasoning, as shown in the next example.
Example 2
Consider an instance of the previous example where Open image in new window and Open image in new window . A full reduction of Open image in new window that eliminates all applications of \(\mathsf {substr}\), including those in t, introduces \(5\cdot n+5\) new theory literals and \(3\cdot n+3\) string variables. However, based on the semantics of \(\mathsf {contains}\) it is easy to see that Open image in new window is \(T_\mathsf {S}\)valid: if t were to contain s, then s would have to occur in the portion of t after its first character \(\mathsf {b} \), since the first character of s is \(\mathsf {a} \). However, \(\mathsf {con}(\mathsf {a} , x)\) cannot be contained in \(\mathsf {substr}(x,0,n)\), since the length of the former is at least \(x+1\), while the length of the latter is at most \(x\). A solver which recognizes that Open image in new window can be simplified to Open image in new window in this case can avoid the reduction altogether.
We advocate for aggressive simplification techniques to improve the performance of string solvers for extended functions. In the next sections, we describe several classes of such techniques that can be applied to inputs as a preprocessing step or during solving as part of a contextdependent solving strategy [15]. We present them as sets R of rewrite rules of the form Open image in new window , where s is a (simplified) term equivalent to t in \(T_\mathsf {S}\). We assume a deterministic application strategy for these rules, such that each term t rewrites to a unique simplified form, denoted by Open image in new window , which is irreducible by the rules. We split our simplifications into four categories, presented in Figs. 4, 6, 7 and 8.^{3}
3 ArithmeticBased String Simplification
To simplify string terms, it is useful to establish relationships between quantities such as the lengths of strings. For example, \(\mathsf {contains}(t,s)\) can be simplified to Open image in new window for a particular s and t if it can be inferred that \(s\) is strictly greater than \(t\). This section defines an inference system for such arithmetic relationships and the simplifications that it enables.
We are interested in proving the \(T_\mathsf {S}\)validity of formulas of the form Open image in new window , where u is a \(\varSigma _\mathsf {S}\)term of integer type. We describe an inference system as a set of rules for deriving judgments of the form Open image in new window and a specific rule application strategy we have implemented. The inference system is sound in the sense that Open image in new window whenever Open image in new window is derivable in it. It is, however, incomplete as it may fail to derive Open image in new window in some cases when Open image in new window . This incompleteness is by design, since proving the \(T_\mathsf {S}\)validity of inequalities is generally expensive due to the NPhardness of linear integer arithmetic. Without loss of generality, we require that the term u be in a simplified form, where terms of the form \(l\) with l a string literal of n characters are rewritten to n, terms of the form \(\mathsf {con}(t_1, \ldots , t_n)\) are rewritten to \(t_1 + \cdots + t_n\), and like monomials in arithmetic terms are combined in the usual way (e.g., \(2 \cdot x + x\) is rewritten to \(3 \cdot x\)).
Definition 1
 1.
an integer variable,
 2.
an application of length to a string variable, e.g. \(x\),
 3.
an application of length to an extended function, e.g. \(\mathsf {substr}(t,v,w)\), or
 4.
an application of an extended function of integer type, e.g. Open image in new window .
A majority of the rewrite rules have side conditions requiring the derivability of certain judgments in the same inference system. To improve their readability we take some liberties with the notation and write Open image in new window , say, instead of Open image in new window . For example, \(\mathsf {substr}(t,v,w)\) is underapproximated by w if it can be inferred that the interval from v to \(v+w\) is a valid range of positions in string t, which is expressed by the side conditions Open image in new window and Open image in new window . Note that some arithmetic terms, such as \(\mathsf {substr}(t,v,w)\), can be approximated in multiple ways—hence the need for a strategy for choosing the best approximation for arithmetic string terms, described later. The rules for polynomials are written modulo associativity of \(+\) and state that a monomial \(m \cdot v\) in them can be over or underapproximated based on the sign of the coefficient m. For simplicity, we silently assume in the figure that basic arithmetic simplifications are applied after each rewrite step to put the righthand side in polynomial form.
Example 3
Let u be \(\mathsf {replace}(x,\mathsf {aa} ,\mathsf {b} )\). Because Open image in new window , the first case of the overapproximation rule for \(\mathsf {replace}\) applies, and we get that \(u \rightarrow _{O}x\). This reflects that the result of replacing the first occurrence, if any, of \(\mathsf {aa} \) in x with \(\mathsf {b} \) is no longer than x.
Example 4
Let u be the same as in the previous example and let v be \(1 \cdot u + 2 \cdot x\). Since \(u \rightarrow _{O}x\) and the coefficient of u in v is negative, we have that \(v \rightarrow _{U}1 \cdot x + 2 \cdot x\), which simplifies to \(x\); moreover, \(x \rightarrow _{U}0\). Thus, \(v \rightarrow _{U}^*0\) and so Open image in new window . In other words, we can use the approximations to show that u is at most \(2\cdot x\).
3.1 A Strategy for Approximation
Recall that, given an arithmetic inequality Open image in new window , our goal is to find a reduction \(u \rightarrow _{U}^{*} n\) where n is a nonnegative constant. Our strategy for choosing which rule of \(\rightarrow _{U}\) to apply to u is given in Fig. 3. We decompose u into three parts: the portion \(u_x\) consisting of a sum of integer variables, the portion \(u_\ell \) consisting of a sum of lengths of string variables, and the remaining portion \(u_s\) which is a sum of monomials involving extended terms \(v_1, \ldots , v_q\) as defined in Definition 1.
Since there are multiple choices for how terms in \(u_s\) are approximated, the strategy focuses primarily on this portion. In particular, we apply an approximation for one of the terms \(v_i\), underapproximating or overapproximating depending on the sign of its coefficient, and replace the monomial in t by its corresponding approximation. The choice of \(v_i\) and \(v^a_i\) is based on maximizing the likelihood that the overall derivation will produce a nonnegative constant.
For a term u in polynomial form, let Open image in new window be a set of integer terms whose coefficient is negative in u, e.g. Open image in new window . Terms in this set can be seen as obligations for proving entailments in our derivations since if Open image in new window , it must be the case that our derivation applies a rule that introduces a term with a positive coefficient for \(y_2\). In Fig. 3, we say that our choice of \(v_i \rightarrow _{U}v^a_i\) avoids new terms if it does not have the effect of adding any new terms to Open image in new window , and cancels existing terms if it has the effect of removing terms from this set. If the portion \(u_s\) is empty, we apply the rule \(x_j \rightarrow _{U}0\) if there exists a monomial \(m^\ell _j \cdot x_j\) where \(m^\ell _j\) is positive. This rule is applied with lowest priority because these monomials may help to cancel negative terms introduced by the other steps.
Step 1 depends on knowing the set of possible onestep approximations \(v_i \rightarrow _{U}v^a_i\) and \(v_i \rightarrow _{O}v^a_i\) for terms from u. These are determined using the rules of Fig. 2. Whenever applicable, we break ties between rewrites in Step 1 by considering a fixed arbitrary ordering over extended terms.
Example 5
Let u be \(1+t_1 + t_2  x_1\), where \(t_1\) is \(\mathsf {substr}(x_2,1,x_2+x_4)\) and \(t_2\) is \(\mathsf {replace}(x_1,x_2,x_3)\). Step 1 of StrArithApprox considers the possible approximations \(t_1 \rightarrow _{U}x_21\) and \(t_2 \rightarrow _{U}x_1  x_2\). Note that underapproximations are needed because the coefficients of \(t_1\) and \(t_2\) are positive. The first approximation is an instance of the third rule in Fig. 2, noting that both Open image in new window and Open image in new window are derivable by a basic strategy that, wherever applicable, underapproximates string length terms as zero. Our strategy chooses the first approximation since it introduces no new negative coefficient terms, thus obtaining: \(u \rightarrow _{U}x_2 + t_2  x_1\). We now choose the approximation \(t_2 \rightarrow _{U}x_1  x_2\), noting that it introduces no new negative coefficient terms and cancels an existing one, \(x_1\). After arithmetic simplification, we have derived \(u \rightarrow _{U}^{*} 0\), and hence Open image in new window .
3.2 Simplification Rules with Arithmetic Side Conditions
We use the inference system from the previous section for simplifications of string terms with arithmetic side conditions. Figure 4 summarizes those simplifications.
The first rule rewrites a string equality to Open image in new window if one of the two sides can be inferred to be strictly longer than the other. In the second rule, if one side of an equality, \(\mathsf {con}(s, r, q)\), is such that the sum of lengths of s and q alone can be shown to be greater than or equal to the length of the other side, then r must be empty. The third rule recognizes that string containment reduces to string equality when it can be inferred that string s is at least as long as the string t that must contain it. The next rule captures the fact that substring simplifies to the empty string if it can be inferred that its position v is not within bounds, or its length w is not positive. In the figure, we write that rule with a disjunctive side condition; this is a shorthand to denote that we can pick any disjunct and show that it holds assuming the negation of the other disjuncts. We can use those assumptions to perform substitutions to simplify the derivation. Concretely, to show Open image in new window it is sufficient to infer Open image in new window . We demonstrate this with an example.
Example 6
Consider the term \(\mathsf {substr}(t,t+w,w)\). Our rules may simplify this term to \(\mathsf {\epsilon }\) by inferring that its start position (\(t+w\)) is not within the bounds of t if we assume that its size (w) is positive. In detail, assume that \(w > 0\) (the negation of the last disjunct in the side condition of the fourth rule), which is equivalent to \(w \approx x+1\) where x is a fresh string variable and \(x\) denotes an unknown nonnegative quantity. It is sufficient to derive the formula obtained by replacing all occurrences of w by \(x+1\) in the disjunct Open image in new window to show that the start position of our term is out of bounds. After simplification, we obtain Open image in new window , which is trivial to derive.
The next two rules in Fig. 4 apply if we can infer respectively that the start position of the substring comes strictly after a prefix t or that the end position of the substring comes strictly before a suffix t of the first argument string. In either case, t can be dropped.
Example 7
Let t be \(\mathsf {substr}(\mathsf {con}(x_1,\mathsf {replace}(x_2,x_3,x_4)),0,w)\), where w is \(x_1x_2\). We have that Open image in new window , noting that Open image in new window . In other words, only the first component \(x_1\) of the string concatenation is relevant to the substring since its end point must occur before the end of \(x_1\).
The final rule for \(\mathsf {substr}\) shows that a prefix of a substring can be pulled upwards if the start position is zero and we can infer that the substring is guaranteed to include at least a prefix string t. Finally, if we can infer that the last position of s in t starting from position v is at or beyond the end of t, then the Open image in new window term can be rewritten as an ifthenelse (\(\mathsf {ite}\)) term that checks whether s is a suffix of t.
4 ContainmentBased String Simplification
Rules for inferring judgments of these forms are given in Fig. 5. Like our rules for arithmetic, these rules are solely based on the syntactic structure of terms, so inferences in this system can be computed statically. Both the assumptions and conclusions of the rules assume associativity of string concatenation with identity element \(\mathsf {\epsilon }\), that is, \(\mathsf {con}(t,s)\) may refer to a term of the form Open image in new window or alternatively to Open image in new window . Most of the rules are straightforward. The inference system has special rules for substring terms \(\mathsf {substr}(t,v,w)\), using arithmetic entailments from Sect. 3 to show prefix and suffix relationships with the base string t. For negative containment, the rules of the inference system together can show a (possibly nonconstant) string cannot occur in a constant string by reasoning that its characters cannot appear in order in that string. We write \(l_1\,\setminus \,l_2\) to denote the empty string if \(l_1\) does not contain \(l_2\), or the result of removing the smallest prefix of \(l_1\) that contains \(l_2\) from \(l_1\) otherwise.
Example 8
Let t be \(\mathsf {abcab} \) and let s be \(\mathsf {con}(\mathsf {b} , x, \mathsf {a} , y, \mathsf {c} )\). String s is not contained in t for any value of x, y. We derive Open image in new window using two applications of the rightmost rule for negative containment in Fig. 5, noting Open image in new window , Open image in new window , and \(\mathsf {b} \) does not contain \(\mathsf {c} \). In other words, the containment does not hold since the characters \(\mathsf {b} \), \(\mathsf {a} \) and \(\mathsf {c} \) cannot be found in order in the constant \(\mathsf {abcad} \).
4.1 Simplification Rules Based on String Containment
Figure 6 gives rules for simplifying extended function terms based on the aforementioned judgments pertaining to string containment. First, equalities can be rewritten to false and applications of \(\mathsf {contains}\) can be rewritten to a constant based on the appropriate judgment of our inference system. Applications of Open image in new window can be simplified to \(1\) if it can be shown that the second argument does not appear in the suffix of the first argument starting at the position given by the third argument. The next two rules reason about cases where the second argument s definitely occurs in the first argument starting from position v. In this case, if we additionally know that s occurs within (beyond) a prefix t of the first argument, then the suffix r (prefix t) can be dropped, where the start position and the return value of the result are modified accordingly. If we know s is a prefix of the first argument at position v, then the result is v if indeed v is in the bounds of t. Notice that the latter condition is necessary to handle the case where s is the empty string. The three rules for \(\mathsf {replace}\) are analogous. First, the replace rewrites to the first argument if we know it does not contain the second argument s. If we know s is definitely contained in a prefix of the first argument, then we can pull the remainder of that string upwards. Finally, if we know s is a prefix of the first argument, then we can replace that prefix with r while concatenating the remainder. We use the term \(\mathsf {substr}(t,s)\) to denote the remainder after the replacement for the sake of brevity, although this term typically does not involve extended functions after simplification, e.g. Open image in new window noting that Open image in new window , or Open image in new window noting that Open image in new window .
4.2 Simplifications Based on Equivalence of String Containment
We further refine our approach based on inferring when one containment is equivalent to another one. For example, \(\mathsf {con}(\mathsf {a} , x)\) is contained in \(\mathsf {con}(\mathsf {b} ,y)\) if and only if \(\mathsf {con}(\mathsf {a} , x)\) is contained in y alone. We introduce simplifications for such equivalences by reasoning about the maximal overlap between two strings.
The rules in Fig. 7 simplify extended terms by considering string overlaps. The first two rules drop parts of string literals from the suffix or prefix of their first arguments. The two rules for Open image in new window are similar: a suffix of the first argument can be dropped if it does not contribute to whether it contains the second argument. A prefix of an Open image in new window term can be dropped if it does not contribute to containment, but only in the case where we know the second argument is definitely contained in the first argument. This is to guard against the case where the entire Open image in new window term returns \(1\). The rules for \(\mathsf {replace}\) are similar to those for \(\mathsf {contains}\), except that the suffix (resp., prefix) of the first argument is pulled upwards instead of being dropped.
5 MultisetBased String Simplification
Next, we introduce simplifications based on reasoning about strings as multisets, i.e. collections of unordered characters. Such reasoning is sufficient for showing that equalities like \(\mathsf {con}(\mathsf {a} ,x) \approx \mathsf {con}(x,\mathsf {b} )\) are equivalent to Open image in new window , since the left side of the equality contains exactly one more occurrence of character \(\mathsf {a} \) than the righthand side. Similar to arithmetic reasoning from Sect. 3, we use approximations when reasoning about strings as multisets. We define the multiset abstraction of t, written \(\mathcal {M}_{t}\), as the multiset \(\{t_1, \ldots , t_n\}\) where t is equivalent to \(\mathsf {con}(t_1, \ldots , t_n)\) and all constants in this set are characters. For example, Open image in new window . We define a rewrite system Open image in new window over strings where a rewritten string overapproximates the original string in the following sense: if \(t \), then for all models of \(T_\mathsf {S}\) and any character c, the number of occurrences of c in the strings in \(\mathcal {M}_{s}\) is greater than or equal to the number of occurrences in the strings in \(\mathcal {M}_{t}\).
Example 9
We have that Open image in new window by noting that Open image in new window , Open image in new window and Open image in new window . The difference of the latter with the former is \(\{ \mathsf {b} \}\), and the former with the latter is \(\{ \mathsf {a} , \mathsf {a} , \mathsf {a} \}\). Thus, the right side of the equality contains at least one more occurrence of \(\mathsf {b} \) than the left side; hence, the equality is equivalent to false.
6 Implementation
We implemented the above simplification rules and others in the DPLLbased SMT solver cvc4, which implements a theory solver for a basic fragment of word equations with length, several other theory solvers, and reduction techniques for extended string functions as described in Sect. 2.1. Our simplification rules are run in a preprocessing pass as well as an inprocessing pass during solving. For the latter, we use a contextdependent simplification strategy that infers when an extended string constraint, e.g., Open image in new window , simplifies to Open image in new window based on other assertions, e.g., \(s \approx \mathsf {\epsilon }\). Our simplification techniques do not affect the core procedure for the theory of strings, nor the compatibility of the string solver with other theories. In total, our implementation is about 3,500 lines of Open image in new window code. We cache the results of the simplifications and the approximationbased arithmetic entailments to amortize their costs.
Additional Simplification Rules. The simplification rules in this paper are a subset of the rules in the implementation. We omit other uncategorized rules for lack of space. Many of these apply to specific term patterns, such as cases where two nested applications of \(\mathsf {substr}\) can be combined; cases where an application of \(\mathsf {replace}\) can be eliminated by case splitting; and other cases like Open image in new window . An example of such rules is Open image in new window if \(w_3\) does not overlap with either \(w_1\) or \(w_2\), because the \(\mathsf {replace}\) does not change whether t contains \(w_3\) or not. Another class of rules only applies to strings of length one because they cannot span multiple components of a concatenations, e.g. Open image in new window where c is a character. Finally, there are rewrites that benefit from multiple techniques presented in this paper. For example, we have a rewrite that splits string equations into multiple smaller equations if it can determine that prefixes must have the same length: Open image in new window .
Validating Simplification Rules. The correctness of our simplification techniques is critical to the soundness of the overall solver. Due to the sophistication and breadth of those techniques, it is challenging to formally verify our implementation. As a pragmatic alternative, we periodically test our implementation using a testing infrastructure we developed for this purpose. We found this to be critical in our development process. Our testing infrastructure allows the developer to specify a contextfree grammar in the syntaxguided synthesis format [2]. We generate all terms t in this grammar up to a fixed size and test the equivalence of t and its simplified form Open image in new window on a set of randomly generated points. The most recent run of this system on two grammars (one for extended string terms and another for string predicates) up to a term size of three, validated 319, 867 simplifications of string terms and 188, 428 simplifications of string predicates on 1, 000 sample points. This run took 924 s for string terms and 971 s for the string predicates using the same hardware as in Sect. 7.
7 Evaluation
We evaluate the impact of each simplification technique as implemented in cvc4 on three benchmark sets that use extended string operators: CMU, a dataset obtained from symbolic execution of Python code [15]; TermEq, a benchmark set consisting of the verification of term equivalences over strings [14]; and Slog, a benchmark set extracted from vulnerability testing of web applications [22]. The Slog set uses the \(\mathsf {replace}\) function extensively but does not contain other extended functions. We also evaluate the impact on Aplas, a set of handcrafted benchmarks involving looping word equations [10] (string equalities whose left and right sides have variables in common).
Number of solved problems per benchmark set. Best results are in bold. Gray cells indicate benchmark sets not supported by a solver. “R%” indicates the reduction of extended string functions during preprocessing. All benchmarks ran with a timeout of 600 s.
We ran all benchmarks on a cluster equipped with Intel E52637 v4 CPUs running Ubuntu 16.04 and dedicated one core, 8 GB RAM, and 600 s for each job. Table 1 summarizes the number of solved instances for each configuration and the baseline solvers grouped by benchmark sets. We remark that the average reduction of extended string functions (with all simplification techniques enabled) shown in column “R%” is significant on all benchmark sets. The scatter plots in Fig. 9 detail the effects of disabling each family of simplifications. They distinguish between satisfiable and unsatisfiable instances. To emphasize nontrivial benchmarks, we omit the benchmarks that are solved in less than a second by all solvers.
The arithmeticbased simplification techniques have the most significant performance impact on the symbolic execution benchmarks CMU. The number of solved benchmarks is significantly lower when disabling those techniques. The scatter plot shows that for longer running satisfiable queries there is a large portion of the benchmarks that are solved up to an order of magnitude faster with the simplifications. These improvements in runtime on the CMU set are particularly compelling because they come from a symbolic execution application, which involves a large number of queries with a short timeout. The improvements are more pronounced for unsatisfiable benchmarks, where our results show that simplifications often give the solver the ability to derive a refutation in a matter of seconds, something that is infeasible with configurations without these techniques. The Aplas set contains no extended string operators and hence our arithmeticbased simplification techniques have little impact on this set.
In contrast, both containment and multisetbased rewrites have a high impact on the Aplas set, as contain and msets both solve 121 fewer benchmarks. Additionally, contain has a high impact on the TermEq set, where the simplifications enable the best configuration to solve 61 out of 80 benchmarks. Since these techniques apply most frequently to looping word equations, they are less important for the CMU set, which does not have such equations. The containmentbased and multisetbased techniques primarily help on unsatisfiable benchmarks, as shown in the scatter plots. On TermEq benchmarks, it tends to be easier to find counterexamples, i.e. to solve the satisfiable ones, so there is more to gain on unsatisfiable benchmarks.
On Slog, Ostrich solves two more instances than cvc4 but cvc4 is over 50 times faster on commonly solved instances while supporting a richer set of string operators. On all benchmark sets, cvc4 solves at least as many benchmarks as z3 and cvc4 has \(12{}\times \) fewer timeouts than z3. On the simplified benchmarks, z3 performs significantly better. On the CMU and the Aplas benchmarks, \(\textsc {z}{\small 3} _b\) outperforms \(\textsc {z}{\small 3} \) by a large margin. Additionally simplifying the benchmarks with the techniques presented in this paper improves performance further on most benchmark sets and allows \(\textsc {z}{\small 3} _f\) to solve the most unsatisfiable benchmarks overall. These results indicate that z3 could benefit from additional simplifications, and they underscore the importance of curating and publishing simplification techniques in order to improve the stateoftheart.
8 Conclusion
We have presented a set of aggressive simplification techniques for reasoning about extended string constraints. Our results suggest that such techniques are key to advancing the state of the art in SMT string solving. Arithmeticbased simplifications lead to significant speedups in benchmarks from a symbolic execution application, while containment and multisetbased simplifications improve the performance on problems consisting of difficult term equivalences and looping word equations. Our approach is not limited to cvc4 and can be adapted to other solvers.
Given the encouraging results for each of the simplification techniques in our evaluation, we plan to extend them to other types of abstraction and make them contextaware. The latter extension involves taking into account other assertions when checking whether a side condition of a rule is fulfilled.
Footnotes
 1.
Our implementation supports a larger set of symbols, but for brevity, we only show the subset of the symbols used throughout this paper.
 2.
This formula is a simplified form of the general reduction. The general reduction also expresses that i is a valid position in t and that the third argument of \(\mathsf {substr}\) is nonnegative [15].
 3.
Some specialized rules have been omitted for space reasons.
 4.
9cb1a0f is newer than the current release 4.8.4 and includes several fixes for critical issues.
Notes
Acknowledgements
This work was partially supported by the National Science Foundation under award 1656926, the Defense Advanced Research Projects Agency under award FA86501827854, and Amazon Web Services.
References
 1.Abdulla, P.A., et al.: TRAU: SMT solver for string constraints. In: Bjørner, N., Gurfinkel, A. (eds.) 2018 Formal Methods in Computer Aided Design, FMCAD 2018, Austin, TX, USA, 30 October–2 November 2018, pp. 1–5. IEEE (2018)Google Scholar
 2.Alur, R., et al.: Syntaxguided synthesis. In: Irlbeck, M., Peled, D.A., Pretschner, A. (eds.) Dependable Software Systems Engineering. NATO Sciencefor Peace and Security Series, D: Information and Communication Security, vol. 40, pp. 1–25. IOS Press (2015)Google Scholar
 3.Barrett, C., et al.: CVC4. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 171–177. Springer, Heidelberg (2011). https://doi.org/10.1007/9783642221101_14CrossRefGoogle Scholar
 4.Berzish, M., Ganesh, V., Zheng, Y.: Z3str3: a string solver with theoryaware heuristics. In: Stewart, D., Weissenbacher, G. (eds.) 2017 Formal Methods in Computer Aided Design, FMCAD 2017, Vienna, Austria, 2–6 October 2017, pp. 55–59. IEEE (2017)Google Scholar
 5.Bjørner, N., Tillmann, N., Voronkov, A.: Path feasibility analysis for stringmanipulating programs. In: Kowalewski, S., Philippou, A. (eds.) TACAS 2009. LNCS, vol. 5505, pp. 307–321. Springer, Heidelberg (2009). https://doi.org/10.1007/9783642007682_27CrossRefzbMATHGoogle Scholar
 6.Chaudhuri, S., Farzan, A. (eds.): Computer Aided Verification  28th International Conference, CAV 2016, Toronto, ON, Canada, July 1723, 2016, Proceedings, Part I. Lecture Notes in Computer Science, vol. 9779. Springer, Switzerland (2016). https://doi.org/10.1007/9783319415284Google Scholar
 7.Chen, T., Hague, M., Lin, A.W., Rümmer, P., Wu, Z.: Decision procedures for path feasibility of stringmanipulating programs with complex operations. PACMPL 3(POPL), 49:1–49:30 (2019)Google Scholar
 8.de Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 337–340. Springer, Heidelberg (2008). https://doi.org/10.1007/9783540788003_24CrossRefGoogle Scholar
 9.Kiezun, A., Ganesh, V., Artzi, S., Guo, P.J., Hooimeijer, P., Ernst, M.D.: HAMPI: a solver for word equations over strings, regular expressions, and contextfree grammars. ACM Trans. Softw. Eng. Methodol. 21(4), 25:1–25:28 (2012)CrossRefGoogle Scholar
 10.Le, Q.L., He, M.: A decision procedure for string logic with quadratic equations, regular expressions and length constraints. In: Ryu, S. (ed.) APLAS 2018. LNCS, vol. 11275, pp. 350–372. Springer, Cham (2018). https://doi.org/10.1007/9783030027681_19CrossRefGoogle Scholar
 11.Li, G., Ghosh, I.: PASS: string solving with parameterized array and interval automaton. In: Bertacco, V., Legay, A. (eds.) HVC 2013. LNCS, vol. 8244, pp. 15–31. Springer, Cham (2013). https://doi.org/10.1007/9783319030777_2CrossRefGoogle Scholar
 12.Liang, T., Reynolds, A., Tinelli, C., Barrett, C., Deters, M.: A DPLL(T) theory solver for a theory of strings and regular expressions. In: Biere, A., Bloem, R. (eds.) CAV 2014. LNCS, vol. 8559, pp. 646–662. Springer, Cham (2014). https://doi.org/10.1007/9783319088679_43CrossRefGoogle Scholar
 13.Majumdar, R., Kuncak, V. (eds.): Computer Aided Verification  29th International Conference, CAV 2017, Heidelberg, Germany, July 2428, 2017, Proceedings, Part II. Lecture Notes in Computer Science, vol. 10427. Springer, Heidelberg (2017). https://doi.org/10.1007/9783319633879Google Scholar
 14.Reynolds, A., et al.: Rewrites for SMT solvers using syntaxguided enumeration. SMT (2018)Google Scholar
 15.Reynolds, A., Woo, M., Barrett, C., Brumley, D., Liang, T., Tinelli, C.: Scaling up DPLL(T) string solvers using contextdependent simplification. In: Majumdar and Kuncak [13], pp. 453–474CrossRefGoogle Scholar
 16.Saxena, P., Akhawe, D., Hanna, S., Mao, F., McCamant, S., Song, D.: A symbolic execution framework for Javascript. In: 31st IEEE Symposium on Security and Privacy, S&P 2010, 16–19 May 2010, Berleley/Oakland, California, USA, pp. 513–528. IEEE Computer Society (2010)Google Scholar
 17.Tinelli, C., Barrett, C., Fontaine, P.: Unicode Strings (Draft 1.0) (2018). http://smtlib.cs.uiowa.edu/theoriesUnicodeStrings.shtml
 18.Trinh, M.T., Chu, D.H., Jaffar, J.: S3: a symbolic string solver for vulnerability detection in webapplications. In: Ahn, G., Yung, M., Li, N. (eds.) Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 3–7 November 2014, pp. 1232–1243. ACM (2014)Google Scholar
 19.Trinh, M.T., Chu, D.H., Jaffar, J.: Progressive reasoning over recursivelydefined strings. In: Chaudhuri and Farzan [6], pp. 218–240CrossRefGoogle Scholar
 20.Trinh, M.T., Chu, D.H., Jaffar, J.: Model counting for recursivelydefined strings. In: Majumdar and Kuncak [13], pp. 399–418CrossRefGoogle Scholar
 21.Veanes, M., Tillmann, N., de Halleux, J.: Qex: symbolic SQL query explorer. In: Clarke, E.M., Voronkov, A. (eds.) LPAR 2010. LNCS (LNAI), vol. 6355, pp. 425–446. Springer, Heidelberg (2010). https://doi.org/10.1007/9783642175114_24CrossRefGoogle Scholar
 22.Wang, H.E., Tsai, T.L., Lin, C.H., Yu, F., Jiang, J.H.R.: String analysis via automata manipulation with logic circuit representation. In: Chaudhuri and Farzan [6], pp. 241–260Google Scholar
 23.Yu, F., Alkhalaf, M., Bultan, T.: Stranger: an automatabased string analysis tool for PHP. In: Esparza, J., Majumdar, R. (eds.) TACAS 2010. LNCS, vol. 6015, pp. 154–157. Springer, Heidelberg (2010). https://doi.org/10.1007/9783642120022_13CrossRefGoogle Scholar
 24.Zheng, Y.: Z3str2: an efficient solver for strings, regular expressions, and length constraints. Form. Methods Syst. Des. 50(2–3), 249–288 (2017)CrossRefGoogle Scholar
 25.Zheng, Y., Zhang, X., Ganesh, V.: Z3str: a z3based string solver for web application analysis. In: Meyer, B., Baresi, L., Mezini, M. (eds.) Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE 2013, Saint Petersburg, Russian Federation, 18–26 August 2013, pp. 114–124. ACM (2013)Google Scholar
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.