The goal of string analysis is to determine the values that a string expression can take during program execution at a given program point. Like many program analysis problems, this turns out to be a difficult problem. We can demonstrate the difficulty of string analysis using a simple language for string manipulation.

2.1 A Simple String Manipulation Language

We define the syntax of our simple string manipulation language in Fig. 2.1. Each program consists of a sequence of labeled statements. Statements can be assignment statements, conditional or unconditional branch statements, input and output statements, and assert statements. We have only one string operation (concatenation “.”) and only one predicate (equality “ = ”) in this simple language.

Fig. 2.1
figure 1

Grammar for a very simple string manipulation language

In Fig. 2.2 we show a simple program in the simple string manipulation language we described above. This program consists of two read statements followed by two assignment statements that use string concatenation followed by a conditional branch statement that uses the equality predicate followed by print and halt statements.

Fig. 2.2
figure 2

A simple string manipulating program

Assume that we want to check if an assertion in a program written in this language can ever fail. If we can do precise string analysis on programs written in this language, then we can determine all possible values for all string expressions in the program. So, then, we can check if it is possible to have an assertion violation. Note that, we can easily transform assertion violation checks to reachability checks where we check if a program statement is reachable. Given an assertion statement

assert exp; we can easily transform it to a reachability check as follows:

if exp goto 2;1: print "assertion violation";2: print "assertion holds"; Checking the reachability of the statement 1 is equivalent to checking if it is possible to have an assertion violation during some execution of the program.

2.2 Automated and Precise Verification of String Programs Is Not Possible

Let us define the halting problem for string programs as the problem of deciding, given a string program P, where initially the string variables are initialized to the null string, 𝜖, whether P will halt (i.e., reach the halt statement) on some execution. More generally, the reachability problem for string programs (which need not halt) is the problem of deciding, given a string program P and a program state s (where a program state s is defined with the instruction label of an instruction in the program and the values of all the variables), whether at some point during a computation, the program state s will be reached. Note that, we define the halting and the reachability conditions using existential quantification over the execution paths, i.e., the halting and the reachability conditions hold if there exists an execution path that halts or reaches the target state, respectively. Hence, if the halting problem is undecidable, then the reachability problem is undecidable.

It can be shown that the halting program for string programs is undecidable [134]. In fact, halting problem for string programs is undecidable even if we restrict the number of string variables to 3. It is possible to show this by demonstrating that string programs can simulate counter machines.

Counter machines are a simple and powerful computational model that can simulate Turing Machines. A counter machine consists of a finite number of counters (unbounded integer variables) and a finite set of instructions. Counter machines have a very small instruction set that includes an increment instruction, a decrement instruction, a conditional branch instruction that tests if a counter value is equal to zero, and a halt instruction. So, during the execution of a counter machine, at each step, a counter can be incremented by 1, decremented by 1, and tested for zero. The counters can only assume nonnegative values. It is well-known that the halting problem for two-counter machines, where both counters are initialized to 0, is undecidable [79]. In fact, two counter machines can simulate Turing Machines.

We show that a two-counter machine M can be simulated by a string program P with only three string variables X 1, X 2, X 3. The program counter of M can be represented as labels in the string program P. The instructions where the counter-machine M halts will be represented with the halt instruction in the string program P. We will use the lengths of the strings X 1, X 2 and X 3 to simulate the values of the counters C 1 and C 2. The value of C 1 will be simulated by |X 1|−|X 3|, and the value of C 2 will be simulated by |X 2|−|X 3|.

The counter machine M starts from the initial configuration (q 0, 0, 0) where q 0 denotes the initial instruction and the two integer values represent the initial values of counters C 1 and C 2, respectively. The initial state of the string program P will be (q 0, 𝜖, 𝜖, 𝜖) where q 0 is the label of the first instruction, and the strings 𝜖, 𝜖, 𝜖 are the initial values of the string variables X 1, X 2 and X 3, respectively. The instructions of the counter-machine C can be simulated by instructions of a string program as shown in Fig. 2.3 where each statement is followed by a goto statement that transitions to the next instruction.

Fig. 2.3
figure 3

Simulation of instructions of a counter machine with string program instructions

Note that although this transformation allows the simulated counter values to possibly take negative values, this can be fixed by adding a conditional branch instruction before each decrement that checks that the simulated counter value is not zero before the instructions simulating the decrement instruction is executed. The string program P constructed from the counter machine M based on these rules will simulate M. Hence, solving the halting problem for string programs would mean that we can solve the halting problem for counter machines. Since halting problem for counter machines is undecidable, we conclude that halting problem for string programs is also undecidable. Since 2-counter machines can simulate Turing Machines we also observe that string programs with only 3 string variables can simulate Turing Machines.

This discussion demonstrates that automatically and precisely checking reachability properties of string manipulating programs is not possible. Reachability problem for string manipulating programs in undecidable, hence it cannot be precisely solved like many other program analysis problems. However, it is possible to check reachability properties of string manipulating programs approximately. For example, it would be possible to develop automated and approximate techniques that guarantee there are no assertion violations (i.e., that there a no bugs) for some programs. However, such sound verification techniques could sometimes report false positives, i.e., report assertion violations although no execution of the program may have assertion violations. Similarly, it would be possible to develop automated and approximate techniques that guarantee that there are assertion violations in some programs. However, such complete verification techniques could sometimes report false negatives, i.e., they may not find an assertion violation even though in some execution of the program there may be an assertion violation.

When a program analysis problem is undecidable, it means that there does not exist a sound and complete technique to solve it. Since we showed that reachability for string programs is an undecidable problem, we conclude that there is no automated verification technique for string programs that can report assertion violations for all string programs that contain assertion violations, and that can report the absence of assertion violations for all string programs that do not contain assertion violations.

2.3 A Richer String Manipulation Language

We have seen that, even with 3 string variables and a small instruction set, string programs can simulate Turing Machines. Adding more string operations does not increase the expressiveness of string manipulating programs theoretically. However, in practice, string analysis has to deal with many other string manipulation operations in addition to string concatenation. In Fig. 2.4 we give the syntax for an extended string manipulation language.

Fig. 2.4
figure 4

Grammar for an extended string manipulation language

Let us briefly describe the semantics of extended string operations below. We assume that Σ denotes the set of all characters (i.e., the alphabet) in our string manipulation language, and Σ denotes the set of all strings (including the empty string 𝜖). The operations “.”, “ ∗”, and “|” correspond to regular expression operations concatenation, Kleene closure and alternation.

  • match(s, r) means that string s matches the string expression r (which could be a regular expression). Let \(\mathcal {L}(r)\) denote the set of strings (i.e., the language) defined by the string expression r, then we define the semantics of match as:

    $$\displaystyle \begin{aligned} \mathtt{{ }match}(s, r) \Leftrightarrow s \in \mathcal{L}(r) \end{aligned}$$
  • contains(s, t) means that string t is a substring of string s:

    $$\displaystyle \begin{aligned} \mathtt{{ }contains}(s, t) \Leftrightarrow \exists s_1, s_2 \in \varSigma^* \ : \ s = s_1 t s_2 \end{aligned}$$
  • begins(s, t) means that string s begins with string t:

    $$\displaystyle \begin{aligned} \mathtt{{ }begins}(s, t) \Leftrightarrow \exists s_1 \in \varSigma^* \ : \ s = t s_1 \end{aligned}$$
  • ends(s, t) means that string s ends with string t:

    $$\displaystyle \begin{aligned} \mathtt{{ }ends}(s, t) \Leftrightarrow \exists s_1 \in \varSigma^* \ : \ s = s_1 t \end{aligned}$$
  • length(s) denotes the length of the string s (for empty string, length (𝜖) = 0):

    $$\displaystyle \begin{aligned} (\mathtt{{ }length}(s) = 0 \Leftrightarrow s = \epsilon) \ \wedge\ (\mathtt{{ }length}(s) = n \Leftrightarrow \exists c_1, c_2, \ldots, c_n \in \varSigma \ : \ s = c_1c_2 \ldots c_n) \end{aligned}$$

    We also use |s| to denote the length of string s, i.e., length(s) = |s|.

  • indexof(s, t) denotes the smallest starting location of substring t in string s. If x is not a substring of s (i.e, ¬contains(s, t)) then indexof(s, t) = −1.

    $$\displaystyle \begin{aligned} \begin{array}{lll} (\mathtt{{ }indexof}(s,t) = -1 & \Leftrightarrow & \neg \mathtt{{ }contains}(s, t)) \ \wedge\ \\ (\mathtt{{ }indexof}(s,t) = n & \Leftrightarrow & \ (\exists s_1, s_2 \in \varSigma^* \ : \ s = s_1 t s_2 \wedge |s_1| = n) \\ & & \wedge (\forall i < n \ : \ \neg (\exists s_1, s_2 \in \varSigma^* \ : \ s = s_1 t s_2 \wedge |s_1| = i))) \end{array} \end{aligned}$$
  • replace(s, p, t) replaces the pattern string p in s with the target string t and returns the result.

    $$\displaystyle \begin{aligned} \begin{array}{l} r = \mathtt{{ }replace}(s, p, t) \Leftrightarrow ((\neg \mathtt{{ }contains}(s, p) \ \wedge\ r = s) \ \vee \ \\ (\exists s_3, s_4, s_5 \in \varSigma^* : s = s_3 p s_4 \ \wedge\ r = s_3 t s_5 \ \wedge\ s_5 = \mathtt{{ }replace}(s_4, p, t) \ \wedge\ \\ (\forall s_6, s_7 \in \varSigma^* : s=s_6 p s_7 \Rightarrow |s_6| \geq |s_3|))) \end{array} \end{aligned}$$

    Note that, this description of the replace function semantics assumes that pattern string p is a string value. In general, p could be a regular expression and more than one substring of the source string s can match the regular expression p. In that case, the semantics has to specify which matching substring will be chosen for replacement. Typically, the longest matching substring is chosen for replacement.

  • substring(s, i, j) returns the substring of string s that starts at index i (inclusive) and ends at index j (exclusive).

    $$\displaystyle \begin{aligned} t = \mathtt{{ }substring}(s,i,j) \Leftrightarrow \exists s_1, s_2 \in \varSigma^* \ : \ s = s_1 t s_2 \ \wedge\ |s_1| = i \ \wedge\ |t| = j - i \end{aligned}$$
  • charat(s, i) returns the character that appears at the index i of the string s. The semantics of charat is defined as follows:

    $$\displaystyle \begin{aligned} t = {\tt {charat}}(s,i) \Leftrightarrow \exists c_0,c_1,\ldots,c_n \in \varSigma \ : \ s = c_0c_1\ldots c_n\ \wedge\ 0 \le i \le n\ \wedge\ t = c_i\end{aligned} $$
  • reverse(s) returns the reverse of the string v.

    $$\displaystyle \begin{aligned} t = {\tt {reverse}}(s) \Leftrightarrow \exists c_0,c_1,\ldots,c_n \in \varSigma \ : \ s = c_0c_1\ldots c_n\ \wedge\ t = c_n\ldots c_1c_0 \end{aligned}$$

It is possible to extend the string manipulation language defined in Fig. 2.4 even further by adding more variations of string operations we defined. For example, the replace operation we defined replaces all occurrences of substrings that match the pattern. A variant of the replace operation replaces only the first appearance of the substring that matches the given pattern. Yet another variant replaces only the last appearance of the substring that matches the given pattern. As another example, the indexof operation we defined returns the index of the first appearance of the given substring. A variant of this operation is the last-index-of operation that returns the index of the last appearance of the given substring. The techniques we discuss in this monograph can be extended to handle such extensions, but to keep our discussions concise we restrict our scope to the operations defined in Fig. 2.4.

Semantics of string manipulating programs can be defined based on the semantics of string manipulation operations we gave above.

Consider the PHP program segment that we showed in Chap. 1, Fig. 1.5. We can model this function using the string manipulation language we introduced as shown in Fig. 2.5. Note that the lines that correspond to each other are labeled with the same line number. Since we did not introduce a complex regular expression syntax in our string manipulation language, we did not provide the translation for the regular expression used in the line 4 of the PHP code snippet. However, it is possible to translate this PHP regular expression to a simple regular expression by just taking disjunction of all the characters that correspond to the PHP regular expression.

Fig. 2.5
figure 5

A code snippet (a) written in PHP and the corresponding code snippet (b) in the string manipulation language

String analysis techniques developed for the string manipulation language we introduced in this chapter can be adopted to string analysis problems in modern programming languages by either extracting a string program from a given program by identifying the string expressions, or by implementing the string analysis techniques in a static analysis tool for the given programming language.

2.4 Summary

In this chapter we first presented a basic set of string manipulating instructions and then provided a larger set of instructions with more complex string operations. We discussed the difficulty of analyzing string manipulating programs. If we assume that string variables can take arbitrarily large string values, then even with a basic set of instructions, verification of string manipulating programs is an undecidable problem. Hence, verification of string manipulating programs cannot be fully automated and precise at the same time. In the following chapters we discuss automated techniques that use approximations and abstractions for string analysis.