
1 Introduction

Regular expressions provide a versatile mechanism for parsing and validating input data. Due to their flexibility, many developers use regular expressions to validate passwords or to extract substrings that match a given pattern. Hence, many languages provide extensive support for regular expression matching.

While there are several algorithms for determining membership in a regular language, a common technique is to construct a non-deterministic finite automaton (NFA) and perform backtracking search over all possible runs of this NFA. Although simple and flexible, this strategy has super-linear (in fact, exponential) complexity and is prone to a class of algorithmic complexity attacks [14]. For some regular expressions (e.g., (a|b)*(a|c)*), it is possible to craft input strings that could cause the matching algorithm to take quadratic time (or worse) in the size of the input. For some regular expressions (e.g., (a+)+), one can even generate input strings that could cause the matching algorithm to take exponential time. Hence, attackers exploit the presence of vulnerable regular expressions to launch so-called regular expression denial-of-service (ReDoS) attacks.
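To make this concrete, the following small Java program (our illustration, not part of the paper's examples) measures java.util.regex's backtracking matcher on the hyper-vulnerable pattern (a+)+b; the input sizes are illustrative:

```java
import java.util.regex.Pattern;

public class ReDoSDemo {
    public static void main(String[] args) {
        // (a+)+b is hyper-vulnerable: a string of a's with no trailing b
        // forces the backtracking matcher to explore exponentially many
        // ways of splitting the a's between the two nested quantifiers.
        Pattern p = Pattern.compile("(a+)+b");
        for (int n = 10; n <= 26; n += 4) {
            String attack = "a".repeat(n);            // attack core repeated n times
            long start = System.nanoTime();
            boolean matched = p.matcher(attack).matches();  // always false
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("n=%d matched=%b time=%dms%n", n, matched, ms);
        }
    }
}
```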

ReDoS attacks have been shown to severely impact the responsiveness and availability of applications. For example, the .NET framework was shown to be vulnerable to a ReDoS attack that paralyzed applications using .NET’s default validation mechanism [2]. Furthermore, unlike other DoS attacks that require thousands of machines to bring down critical infrastructure, ReDoS attacks can be triggered by a single malicious user input. Consequently, developers are responsible for protecting their code against such attacks, either by avoiding the use of vulnerable regular expressions or by sanitizing user input.

Unfortunately, protecting an application against ReDoS attacks can be non-trivial in practice. Often, developers do not know which regular expressions are vulnerable or how to rewrite them in a way that avoids super-linear complexity. In addition, it is difficult to implement a suitable sanitizer without understanding the class of input strings that trigger worst-case behavior. Even though some libraries (e.g., the .NET framework) allow developers to set a time limit for regular expression matching, existing solutions do not address the root cause of the problem. As a result, ReDoS vulnerabilities are still being uncovered in many important applications. For instance, according to the National Vulnerability Database (NVD), there are over 150 acknowledged ReDoS vulnerabilities, some of which are caused by exponential matching complexity (e.g., [2, 3]) and some of which are characterized by super-linear behavior (e.g., [1, 4, 5]).

In this paper, we propose a static technique for automatically uncovering DoS vulnerabilities in programs that use regular expressions. There are two main technical challenges that make this problem difficult: First, given a regular expression \(\mathcal {E}\), we need to statically determine the worst-case complexity of matching \(\mathcal {E}\) against an arbitrary input string. Second, given an application A that contains a vulnerable regular expression \(\mathcal {E}\), we must statically determine whether there can exist an execution of A in which \(\mathcal {E}\) can be matched against an input string that could cause super-linear behavior.

We solve these challenges by developing a two-tier algorithm that combines (a) static analysis of regular expressions with (b) sanitization-aware taint analysis at the source code level. Our technique can identify both vulnerable regular expressions that have super-linear complexity (quadratic or worse), as well as hyper-vulnerable ones that have exponential complexity. In addition, and most importantly, our technique can also construct an attack automaton that captures all possible attack strings. The construction of attack automata is crucial for reasoning about input sanitization at the source-code level.

To summarize, this paper makes the following contributions:

  • We present algorithms for reasoning about worst-case complexity of NFAs. Given an NFA \(\mathcal {A}\), our algorithm can identify whether \(\mathcal {A}\) has linear, super-linear, or exponential time complexity and can construct an attack automaton that accepts input strings that could cause worst-case behavior for \(\mathcal {A}\).

  • We describe a program analysis to automatically identify ReDoS vulnerabilities. Our technique uses the results of the regular expression analysis to identify sinks and reason about input sanitization using attack automata.

  • We use these ideas to build an end-to-end tool called Rexploiter for finding vulnerabilities in Java. In our evaluation, we find 41 security vulnerabilities in 150 Java programs collected from GitHub with an 11% false positive rate.

Fig. 1. Motivating example containing ReDoS vulnerabilities

2 Overview

We illustrate our technique using the code snippet shown in Fig. 1, which shows two relevant classes, namely RegExValidator, which is used to validate that certain strings match a given regular expression, and CommentFormValidator, which checks the validity of a comment form filled out by a user. In particular, the comment form submitted by the user includes the user’s email address, the URL of the product about which the user wishes to submit a comment, and the text containing the comment itself. We now explain how our technique can determine whether this program contains a denial-of-service vulnerability.

Regular Expression Analysis. For each regular expression in the program, we construct its corresponding NFA and statically analyze it to determine whether its worst-case complexity is linear, super-linear, or exponential. For our running example, the NFA complexity analysis finds instances of each category. In particular, the regular expression used at line 5 has linear matching complexity, while the one from line 4 has exponential complexity. The regular expressions from lines 2 and 7 have super-linear (but not exponential) complexity. Figure 2 plots input size against running time for the regular expressions from lines 2 and 4 respectively. For the super-linear and exponential regular expressions, our technique also constructs an attack automaton that recognizes all strings that cause worst-case behavior. In addition, for each regular expression, we determine a lower bound on the length of any possible attack string using dynamic analysis.

Fig. 2. Matching time against malicious string size for vulnerable (left) and hyper-vulnerable (right) regular expressions from Fig. 1.

Program Analysis. The presence of a vulnerable regular expression does not necessarily mean that the program itself is vulnerable. For instance, the vulnerable regular expression may not be matched against an attacker-controlled string, or the program may take measures to prevent the user from supplying a string that is an instance of the attack pattern. Hence, we also perform static analysis at the source code level to determine if the program is actually vulnerable.

Going back to our example, the validate procedure (lines 11–22) calls validEmail to check whether the website administrator’s email address is valid. Even though validEmail contains a super-linear regular expression, line 15 does not contain a vulnerability because the administrator’s email is not supplied by the user. Since our analysis tracks taint information, it does not report line 15 as being vulnerable. Now, consider the second call to validEmail at line 17, which matches the vulnerable regular expression against user input. However, since the program bounds the size of the input string to be at most 254 (which is smaller than the lower bound identified by our analysis), line 17 is also not vulnerable.

Next, consider the call to validUrl at line 19, where productUrl is a user input. At first glance, this appears to be a vulnerability because the matching time of the regular expression from line 4 against a malicious input string grows quite rapidly with input size (see Fig. 2). However, the check at line 18 actually prevents calling validUrl with an attack string: Specifically, our analysis determines that attack strings must be of the form \(\texttt {www.shoppers.com} \cdot \texttt {/}^{b} \cdot \texttt {/}^{+} \cdot x\), where x denotes any character and b is a constant inferred by our analysis (in this case, much greater than 5). Since our program analysis also reasons about input sanitization, it can establish that line 19 is safe.

Finally, consider the call to validComment at line 21, where comment is again a user input and is matched against a regular expression with exponential complexity. Now, the question is whether the condition at line 20 prevents comment from conforming to the attack pattern. Since it does not, line 21 actually contains a serious DoS vulnerability.

Summary of Challenges. This example illustrates several challenges we must address: First, given a regular expression \(\mathcal {E}\), we must reason about the worst-case time complexity of its corresponding NFA. Second, given a vulnerable regular expression \(\mathcal {E}\), we must determine whether the program allows \(\mathcal {E}\) to be matched against a string that (a) is controlled by the user, (b) is an instance of the attack pattern for regular expression \(\mathcal {E}\), and (c) is large enough to cause the matching algorithm to take significant time.

Our approach solves these challenges by combining complexity analysis of NFAs with sanitization-aware taint analysis. The key idea that makes this combination possible is to produce an attack automaton for each vulnerable NFA. Without such an attack automaton, the program analyzer cannot effectively determine whether an input string can correspond to an attack string.

Fig. 3. Overview of our approach

As shown in Fig. 3, the Rexploiter toolchain incorporates both static and dynamic regular expression analysis. The static analysis creates attack patterns \(s_0 \cdot s^k \cdot s_1\) and dynamic analysis infers a lower bound b on the number of occurrences of s in order to exceed a minimum runtime threshold. The program analysis uses both the attack automaton and the lower bound b to reason about input sanitization.

3 Preliminaries

This section presents some useful background and terminology.

Definition 1

(NFA) An NFA \(\mathcal {A}\) is a 5-tuple \((Q, \varSigma , \varDelta , q_0, F)\) where Q is a finite set of states, \(\varSigma \) is a finite alphabet of symbols, and \(\varDelta : Q \times \varSigma \rightarrow 2^Q\) is the transition function. Here, \(q_0 \in Q\) is the initial state, and \(F \subseteq Q\) is the set of accepting states. We say that \((q, l, q')\) is a transition via label l if \(q' \in \varDelta (q, l)\).

An NFA \(\mathcal {A}\) accepts a string \(s = a_0a_1\ldots a_n\) iff there exists a sequence of states \(q_0, q_1, \ldots , q_{n+1}\) such that \(q_{n+1} \in F\) and \(q_{i+1} \in \varDelta (q_i, a_i)\) for all \(0 \le i \le n\). The language of \(\mathcal {A}\), denoted \(\mathcal L(\mathcal {A})\), is the set of all strings that are accepted by \(\mathcal {A}\). Conversion from a regular expression to an NFA is sometimes referred to as compilation and can be achieved using well-known techniques, such as Thompson’s algorithm [25].

In this paper, we assume that membership in a regular language \(\mathcal L(\mathcal {E})\) is decided through a worst-case exponential algorithm that performs backtracking search over possible runs of the NFA representing \(\mathcal {E}\). While there exist linear-time matching algorithms (e.g., based on DFAs), many real-world libraries employ backtracking search for two key reasons: First, the compilation of a regular expression is much faster using NFAs and uses much less memory (DFAs can be exponentially larger). Second, the backtracking search approach can handle regular expressions containing extra features like backreferences and lookarounds. Thus, many widely-used libraries (e.g., java.util.regex, Python’s standard library) employ backtracking search for regular expression matching.
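For illustration, the following minimal Java sketch (our addition; the Nfa class is a toy representation, not Rexploiter's implementation) implements exactly this kind of backtracking search over NFA runs:

```java
import java.util.*;

// A minimal NFA with a backtracking membership test (illustrative sketch).
class Nfa {
    // transitions.get(q).get(c) = set of successor states of q on symbol c
    final Map<Integer, Map<Character, Set<Integer>>> transitions = new HashMap<>();
    final int initialState;
    final Set<Integer> acceptingStates;

    Nfa(int initialState, Set<Integer> acceptingStates) {
        this.initialState = initialState;
        this.acceptingStates = acceptingStates;
    }

    void addTransition(int from, char label, int to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>())
                   .computeIfAbsent(label, k -> new HashSet<>())
                   .add(to);
    }

    // Backtracking search over all runs: worst-case exponential in s.length().
    boolean matches(String s) { return run(initialState, s, 0); }

    private boolean run(int q, String s, int i) {
        if (i == s.length()) return acceptingStates.contains(q);
        for (int q2 : transitions.getOrDefault(q, Map.of())
                                 .getOrDefault(s.charAt(i), Set.of())) {
            if (run(q2, s, i + 1)) return true;  // backtrack on failure
        }
        return false;
    }
}
```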

In the remainder of this paper, we will use the notation \(\mathcal {A}^*\) and \(\mathcal {A}^\emptyset \) to denote the NFA that accepts \(\varSigma ^*\) and the empty language respectively. Given two NFAs \(\mathcal {A}_1\) and \(\mathcal {A}_2\), we write \(\mathcal {A}_1 \cap \mathcal {A}_2\), \(\mathcal {A}_1 \cup \mathcal {A}_2\), and \(\mathcal {A}_1 \cdot \mathcal {A}_2\) to denote automata intersection, union, and concatenation. Finally, given an automaton \(\mathcal {A}\), we write \(\overline{\mathcal {A}}\) to represent its complement, and we use the notation \(\mathcal {A}^+\) to represent the NFA that recognizes exactly the language \(\{s^k \ | \ k \ge 1 \wedge s \in \mathcal L(\mathcal {A})\}\).

Definition 2

(Path). Given an NFA \(\mathcal {A}= (Q, \varSigma , \varDelta , q_0, F)\), a path \(\pi \) of \(\mathcal {A}\) is a sequence of transitions \( (q_1, \ell _1, q_2), \ldots , (q_{m-1}, \ell _{m-1}, q_m) \) where \(q_i \in Q\), \(\ell _i \in \varSigma \), and \(q_{i+1} \in \varDelta (q_{i}, \ell _i )\). We say that \(\pi \) starts in \(q_1\) and ends at \(q_{m}\), and we write \(labels (\pi )\) to denote the sequence of labels \((\ell _1, \ldots , \ell _{m-1})\).

4 Detecting Hyper-Vulnerable NFAs

In this section, we explain our technique for determining if an NFA is hyper-vulnerable and show how to generate an attack automaton that recognizes exactly the set of attack strings.

Definition 3

(Hyper-Vulnerable NFA). An NFA \(\mathcal {A}= (Q, \varSigma , \varDelta , q_0, F)\) is hyper-vulnerable iff there exists a backtracking search algorithm \(\textsc {Match}\) over the paths of \(\mathcal {A}\) such that the worst-case complexity of \(\textsc {Match}\) is exponential in the length of the input string.

We will demonstrate that an NFA \(\mathcal {A}\) is hyper-vulnerable by showing that there exists a string s such that the number of distinct matching paths \(\pi _i\) from state \(q_{0}\) to a rejecting state \(q_r\) with \(labels(\pi _i) = s\) is exponential in the length of s. Clearly, if s is rejected by \(\mathcal {A}\), then \(\textsc {Match}\) will need to explore each of these exponentially many paths. Furthermore, even if s is accepted by \(\mathcal {A}\), there exists a backtracking search algorithm (namely, the one that explores all rejecting paths first) that results in exponential worst-case behavior.

Theorem 1

An NFA \(\mathcal {A}= (Q, \varSigma , \varDelta , q_0, F)\) is hyper-vulnerable iff there exists a pivot state \(q \in Q\) and two distinct paths \(\pi _1, \pi _2\) such that (i) both \(\pi _1, \pi _2\) start and end at q, (ii) \(labels (\pi _1) = labels (\pi _2)\), (iii) there is a path \(\pi _p\) from the initial state \(q_0\) to q, and (iv) there is a path \(\pi _s\) from q to a state \(q_r \not \in F\).

Proof

The sufficiency argument is laid out below, and the necessity argument can be found in the extended version of this paper [31].

Fig. 4. Hyper-vulnerable NFA pattern

To gain intuition about hyper-vulnerable NFAs, consider Fig. 4 illustrating the conditions of Theorem 1. First, a hyper-vulnerable NFA must contain a pivot state q, such that, starting at q, there are two different ways (namely, \(\pi _1, \pi _2\)) of getting back to q on the same input string s (i.e., \(labels (\pi _1)\)). Second, the pivot state q should be reachable from the initial state \(q_0\), and there must be a way of reaching a rejecting state \(q_r\) from q.

To understand why these conditions cause exponential behavior, consider a string of the form \(s_0 \cdot s^k \cdot s_1\), where \(s_0\) is the attack prefix given by \(labels (\pi _p)\), \(s_1\) is the attack suffix given by \(labels (\pi _s)\), and s is the attack core given by \(labels (\pi _1)\). Clearly, there is an execution path of \(\mathcal {A}\) in which the string \(s_0 \cdot s^k \cdot s_1\) will be rejected. For example, \(\pi _p \cdot \pi _1^k \cdot \pi _s\) is exactly such a path.

Algorithm 1. Attack automaton construction for hyper-vulnerable NFAs (procedures AttackAutomaton, AttackForPivot, and LoopBack)

Now, consider a string \(s_0 \cdot s^{k+1} \cdot s_1\) that has an additional instance of the attack core s in the middle, and suppose that there are n possible executions of \(\mathcal {A}\) on the prefix \(s_0 \cdot s^k\) that end in q. Now, for each of these n executions, there are two ways to get back to q after reading s: one that takes path \(\pi _1\) and another that takes path \(\pi _2\). Therefore, there are 2n possible executions of \(\mathcal {A}\) that end in q. Furthermore, the matching algorithm will (in the worst case) end up exploring all of these 2n executions since there is a way to reach the rejecting state \(q_r\). Hence, we end up doubling the running time of the algorithm every time we add an instance of the attack core s to the middle of the input string.

Fig. 5. A hyper-vulnerable NFA (left) and an attack automaton (right).

Example 1

The NFA in Fig. 5 (left) is hyper-vulnerable because there exist two different paths \(\pi _1 = (q, a, q), (q, a, q)\) and \(\pi _2 = (q, a, q_0), (q_0, a, q)\) that contain the same labels and that start and end in q. Also, q is reachable from \(q_0\), and the rejecting state \(q_r\) is reachable from q. Attack strings for this NFA are of the form \(a \cdot (a \cdot a)^k \cdot b\), and the attack automaton is shown in Fig. 5 (right).
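Using the toy Nfa class from Sect. 3, the automaton of Example 1 can be reproduced and its exponential blow-up observed (the state numbering, accepting set, and iteration bounds are our illustrative choices):

```java
// Fig. 5 (left) with the toy Nfa class; q0 = 0, q = 1, q_r = 2,
// accepting set {q}.
Nfa a = new Nfa(0, Set.of(1));
a.addTransition(0, 'a', 1);   // path pi_p from q0 to the pivot q
a.addTransition(1, 'a', 1);   // pi_1 = (q, a, q), (q, a, q)
a.addTransition(1, 'a', 0);   // pi_2 = (q, a, q0), (q0, a, q)
a.addTransition(1, 'b', 2);   // pi_s from q to the rejecting state q_r

// Attack strings a (aa)^k b: matching time grows exponentially in k.
for (int k = 10; k <= 20; k++) {
    String s = "a" + "aa".repeat(k) + "b";
    long start = System.nanoTime();
    a.matches(s);   // rejected only after exploring exponentially many paths
    System.out.println(k + ": " + (System.nanoTime() - start) / 1_000_000 + " ms");
}
```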

We now use Theorem 1 to devise Algorithm 1 for constructing the attack automaton for a given NFA. The key idea of our algorithm is to search for all possible pivot states \(q_i\) and construct an attack automaton \(\mathcal {A}_{q_i}\) for each of them. The full attack automaton \(\mathcal {A}_{att}\) is then obtained as the union of all \(\mathcal {A}_{q_i}\)'s. Note that Algorithm 1 can be used to determine if automaton \(\mathcal {A}\) is hyper-vulnerable: \(\mathcal {A}\) exhibits worst-case exponential behavior iff the language accepted by \(\mathcal {A}_{att}\) is non-empty.

In Algorithm 1, most of the real work is done by the AttackForPivot procedure, which constructs the attack automaton for a specific state q: Given a pivot state q, we want to find two different paths \(\pi _1\), \(\pi _2\) that loop back to q and that have the same set of labels. Towards this goal, line 11 of Algorithm 1 considers all pairs of transitions from q that have the same label (since we must have \(labels (\pi _1) = labels (\pi _2)\)).

Now, let us consider a pair of transitions \(\tau _1 = (q, l, q_1)\) and \(\tau _2 = (q, l, q_2)\). For each \(q_i\) (\(i \in \{1,2\}\)), we want to find all strings that start in q, take transition \(\tau _i\), and then loop back to q. In order to find all such strings \(\mathcal S\), Algorithm 1 invokes the LoopBack function (lines 18–22), which constructs an automaton \(\mathcal {A}'\) that recognizes exactly \(\mathcal S\). Specifically, the final state of \(\mathcal {A}'\) is q because we want to loop back to state q. Furthermore, \(\mathcal {A}'\) contains a new initial state \(q^*\) (where \(q^* \not \in Q\)) and a single outgoing transition \((q^*, l, q_i)\) out of \(q^*\) because we only want to consider paths that take the transition to \(q_i\) first. Hence, each \(\mathcal {A}_i\) in lines 12–13 of the AttackForPivot procedure corresponds to a set of paths that loop back to q through state \(q_i\). Observe that, if a string s is accepted by \(\mathcal {A}_1 \cap \mathcal {A}_2\), then s is an attack core for pivot state q.

We now turn to the problem of computing the set of all attack prefixes and suffixes for pivot state q: In line 14 of Algorithm 1, \(\mathcal {A}_p\) is the same as the original NFA \(\mathcal {A}\) except that its only accepting state is q. Hence, \(\mathcal {A}_p\) accepts all attack prefixes for pivot q. Similarly, \(\mathcal {A}_s\) is the same as \(\mathcal {A}\) except that its initial state is q instead of \(q_0\); thus, \(\overline{\mathcal {A}_s}\) accepts all attack suffixes for q.

Finally, let us consider how to construct the full attack automaton for q. As explained earlier, all attack strings are of the form \(s_0 \cdot s^k \cdot s_1\), where \(s_0\) is the attack prefix, s is the attack core, and \(s_1\) is the attack suffix. Since \(\mathcal {A}_p\), \(\mathcal {A}_1 \cap \mathcal {A}_2\), and \(\overline{\mathcal {A}_s}\) recognize attack prefixes, cores, and suffixes respectively, any string that is accepted by \(\mathcal {A}_p \cdot (\mathcal {A}_1 \cap \mathcal {A}_2)^+ \cdot \overline{\mathcal {A}_s}\) is an attack string for the original NFA \(\mathcal {A}\).
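To make the construction concrete, the following Java-style sketch (our reconstruction from the description above, not the paper's listing) mirrors Algorithm 1; the Nfa, State, and Transition types and all of their operations (emptyLanguage, union, intersect, concat, complement, plus, and the state-surgery helpers) are assumed rather than a real automaton library:

```java
// Our reconstruction of Algorithm 1; all helper types and methods are assumed.
class HyperVulnerableAnalysis {

    Nfa attackAutomaton(Nfa a) {
        Nfa attack = Nfa.emptyLanguage();
        for (State q : a.states()) {
            attack = attack.union(attackForPivot(a, q));  // union over all pivots
        }
        return attack;
    }

    Nfa attackForPivot(Nfa a, State q) {
        Nfa attack = Nfa.emptyLanguage();
        // Consider all pairs of distinct transitions out of q with the same label.
        for (Transition t1 : a.transitionsFrom(q)) {
            for (Transition t2 : a.transitionsFrom(q)) {
                if (t1 != t2 && t1.label() == t2.label()) {
                    Nfa a1 = loopBack(a, q, t1);           // loops back to q via t1
                    Nfa a2 = loopBack(a, q, t2);           // loops back to q via t2
                    Nfa ap = a.withOnlyAcceptingState(q);  // attack prefixes
                    Nfa as = a.withInitialState(q);        // accepted suffixes
                    // A_p . (A_1 n A_2)^+ . complement(A_s)
                    attack = attack.union(
                        ap.concat(a1.intersect(a2).plus())
                          .concat(as.complement()));
                }
            }
        }
        return attack;
    }

    // Strings that start at q, take transition t first, and loop back to q.
    Nfa loopBack(Nfa a, State q, Transition t) {
        Nfa result = a.withOnlyAcceptingState(q);
        State qStar = result.addFreshState();          // new initial state q*
        result.addTransition(qStar, t.label(), t.target());
        result.setInitialState(qStar);
        return result;
    }
}
```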

Theorem 2

(Correctness of Algorithm 1). Let \(\mathcal {A}_{att}\) be the result of calling \(\textsc {AttackAutomaton}(\mathcal {A})\) for NFA \(\mathcal {A}= (Q, \varSigma , \varDelta , q_0, F)\). For every \(s \in \mathcal L(\mathcal {A}_{att})\), there exists a rejecting state \(q_r \in Q \setminus F\) s.t. the number of distinct paths \(\pi _i\) from \(q_0\) to \(q_r\) with \(labels(\pi _i) = s\) is exponential in the number of repetitions of the attack core in s.

5 Detecting Vulnerable NFAs

So far, we only considered the problem of identifying NFAs whose worst-case running time is exponential. However, in practice, even NFAs with super-linear complexity can cause catastrophic backtracking. In fact, many acknowledged ReDoS vulnerabilities (e.g., [1, 4, 5]) involve regular expressions whose matching complexity is “only” quadratic. Based on this observation, we extend the techniques from the previous section to statically detect NFAs with super-linear time complexity. Our solution builds on insights from Sect. 4 to construct an attack automaton for this larger class of vulnerable regular expressions.

5.1 Understanding Super-Linear NFAs

Before we present the algorithm for detecting super-linear NFAs, we provide a theorem that explains the correctness of our solution.

Definition 4

(Vulnerable NFA). An NFA \(\mathcal {A}= (Q, \varSigma , \varDelta , q_0, F)\) is vulnerable iff there exists a backtracking search algorithm \(\textsc {Match}\) over the paths of \(\mathcal {A}\) such that the worst-case complexity of \(\textsc {Match}\) is at least quadratic in the length of the input string.

Theorem 3

An NFA \(\mathcal {A}= (Q, \varSigma , \varDelta , q_0, F)\) is vulnerable iff there exist two states \(q \in Q\) (the pivot), \(q' \in Q\), and three paths \(\pi _1\), \(\pi _2\), and \(\pi _3\) (where \(\pi _1 \ne \pi _2\)) such that (i) \(\pi _1\) starts and ends at q, (ii) \(\pi _2\) starts at q and ends at \(q'\), (iii) \(\pi _3\) starts and ends at \(q'\), (iv) \(labels (\pi _1) = labels (\pi _2) = labels (\pi _3)\), (v) there is a path \(\pi _p\) from \(q_0\) to q, and (vi) there is a path \(\pi _s\) from \(q'\) to a state \(q_r \not \in F\).

Proof

The sufficiency argument is laid out below, and the necessity argument can be found in the extended version of this paper [31].

Figure 6 illustrates the intuition behind the conditions above. The distinguishing characteristic of a super-linear NFA is that it contains two states \(q, q'\) such that \(q'\) is reachable from q on input string s, and both q and \(q'\) can loop back to themselves on that same string s. In addition, just like in Theorem 1, the pivot state q needs to be reachable from the initial state, and a rejecting state \(q_r\) must be reachable from \(q'\). Observe that any automaton that is hyper-vulnerable according to Theorem 1 is also vulnerable according to Theorem 3. Specifically, consider an automaton \(\mathcal {A}\) with two distinct paths \(\pi _1, \pi _2\) that loop around q. In this case, if we take \(q'\) to be q and \(\pi _3\) to be \(\pi _1\), we immediately see that \(\mathcal {A}\) also satisfies the conditions of Theorem 3.

Fig. 6. General pattern characterizing vulnerable NFAs

To understand why the conditions of Theorem 3 imply super-linear time complexity, let us consider a string of the form \(s_0 \cdot s^k \cdot s_1\) where \(s_0\) is the attack prefix given by \(labels (\pi _p)\), \(s_1\) is the attack suffix given by \(labels (\pi _s)\), and s is the attack core given by \(labels (\pi _1)\). Just like in the previous section, the path \(\pi _p \, \pi _1^{k-1} \, \pi _2 \, \pi _s\) describes an execution for rejecting the string \(s_0 \cdot s^k \cdot s_1\) in automaton \(\mathcal {A}\). Now, let \(T_q(k)\) represent the running time of rejecting the string \(s^k s_1\) starting from q, and suppose that it takes 1 unit of time to read string s. We can write the following recurrence relation for \(T_q(k)\):

$$\begin{aligned} T_q(k) = (1 + T_q(k-1)) + (1 + T_{q'}(k-1)) \end{aligned}$$

To understand where this recurrence is coming from, observe that there are two ways to process the first occurrence of s:

  • Take path \(\pi _1\) and come back to q, consuming 1 unit of time to process string s. Since we are back at q, we still have \(T_q(k-1)\) units of work to perform.

  • Take path \(\pi _2\) and proceed to \(q'\), also consuming 1 unit of time to process string s. Since we are now at \(q'\), we have \(T_{q'}(k-1)\) units of work to perform.

Now, observe that a lower bound on \(T_{q'}(k)\) is k since one way to reach \(q_r\) is \(\pi _3^k \pi _s\), which requires us to read the entire input string. This observation allows us to obtain the following recurrence relation:

$$\begin{aligned} T_q(k) \ge T_q(k-1) + k + 1 \end{aligned}$$

Thus, the running time of \(\mathcal {A}\) on the input string \(s_0 \cdot s^k \cdot s_1\) grows at least quadratically in k.
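Unrolling this recurrence (with \(T_q(0) \ge 0\)) makes the quadratic lower bound explicit:

$$\begin{aligned} T_q(k) \ \ge \ \sum _{i=1}^{k} (i+1) \ = \ \frac{k(k+3)}{2} \ \in \ \varOmega (k^2) \end{aligned}$$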

Fig. 7. A vulnerable NFA (left) and its attack automaton (right).

Example 2

The NFA shown in Fig. 7 (left) exhibits super-linear complexity because we can get from q to \(q'\) on input string ab, and both q and \(q'\) loop back to themselves when reading input string ab. Specifically, we have:

$$\begin{aligned} \begin{array}{lll} \pi _1: (q, a, q_1), (q_1, b, q) \quad&\pi _2: (q, a, q_2), (q_2, b, q') \quad&\pi _3: (q', a, q_2), (q_2, b, q') \end{array} \end{aligned}$$

Furthermore, q is reachable from \(q_0\), and there exists a rejecting state, namely \(q'\) itself, that is reachable from \(q'\). The attack strings are of the form \(c (ab)^k\), and Fig. 7 (right) shows the attack automaton.

Algorithm 2. Attack automaton construction for super-linear NFAs (procedures AttackForPivot and AnyLoopBack)

5.2 Algorithm for Detecting Vulnerable NFAs

Based on the observations from the previous subsection, we can now formulate an algorithm that constructs an attack automaton for a given automaton \(\mathcal {A}\). Just like in Algorithm 1, we construct an attack automaton \(\mathcal {A}_q\) for each state q in \(\mathcal {A}\) by invoking the AttackForPivot procedure. We then take the union of all such \(\mathcal {A}_q\)'s to obtain an automaton whose language consists of strings that cause super-linear running time for \(\mathcal {A}\).

Algorithm 2 describes the AttackForPivot procedure for the super-linear case. Just like in Algorithm 1, we consider all pairs of transitions from q with the same label (line 11). Furthermore, as in Algorithm 1, we construct an automaton \(\mathcal {A}_p\) that recognizes attack prefixes for q (line 13) as well as an automaton \(\mathcal {A}_1\) that recognizes non-empty strings that start and end at q (line 12).

The key difference of Algorithm 2 is that we also need to consider all states that could be instantiated as \(q'\) from Fig. 6 (lines 15–19). For each of these candidate \(q'\)’s, we construct automata \(\mathcal {A}_2, \mathcal {A}_3\) that correspond to paths \(\pi _2, \pi _3\) from Fig. 6 (lines 16–17). Specifically, we construct \(\mathcal {A}_2\) by introducing a new initial state \(q_i\) with transition \((q_i, l, q_2)\) and making its accepting state \(q'\). Hence, \(\mathcal {A}_2\) accepts strings that start in q, transition to \(q_2\), and end in \(q'\).

The construction of automaton \(\mathcal {A}_3\), which should accept all non-empty words that start and end in \(q'\), is described in the AnyLoopBack procedure. First, since we do not want \(\mathcal {A}_3\) to accept empty strings, we introduce a new initial state \(q^\star \) and, for every transition \((q', l, q_i)\) of \(q'\), add a transition \((q^\star , l, q_i)\). Second, the final state of \(\mathcal {A}_3\) is \(q'\) since we want to consider paths that loop back to \(q'\).

The final missing piece of the algorithm is the construction of \(\mathcal {A}_s\) (line 19), whose complement accepts all attack suffixes for state \(q'\). As expected, \(\mathcal {A}_s\) is the same as the original automaton \(\mathcal {A}\), except that its initial state is \(q'\). Finally, similar to Algorithm 1, the attack automaton for states \(q, q'\) is obtained as \(\mathcal {A}_p \cdot (\mathcal {A}_1 \cap \mathcal {A}_2 \cap \mathcal {A}_3)^+ \cdot \overline{\mathcal {A}_s}\).
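Analogously, the following sketch (again our reconstruction, under the same assumed Nfa API and reusing loopBack from the Algorithm 1 sketch) summarizes Algorithm 2, including the loop over candidate \(q'\) states and AnyLoopBack:

```java
// Our reconstruction of Algorithm 2 (super-linear case); helper API assumed.
class SuperLinearAnalysis {

    Nfa attackForPivot(Nfa a, State q) {
        Nfa attack = Nfa.emptyLanguage();
        for (Transition t1 : a.transitionsFrom(q)) {
            for (Transition t2 : a.transitionsFrom(q)) {
                if (t1 != t2 && t1.label() == t2.label()) {
                    Nfa a1 = loopBack(a, q, t1);            // pi_1: loops on q
                    Nfa ap = a.withOnlyAcceptingState(q);   // attack prefixes
                    for (State q2 : a.states()) {           // candidate q'
                        // A_2: start at q, take t2 first, end at q'.
                        Nfa a2 = a.withOnlyAcceptingState(q2);
                        State qi = a2.addFreshState();
                        a2.addTransition(qi, t2.label(), t2.target());
                        a2.setInitialState(qi);
                        Nfa a3 = anyLoopBack(a, q2);        // pi_3: loops on q'
                        Nfa as = a.withInitialState(q2);    // suffixes from q'
                        attack = attack.union(
                            ap.concat(a1.intersect(a2).intersect(a3).plus())
                              .concat(as.complement()));
                    }
                }
            }
        }
        return attack;
    }

    // Non-empty strings that start and end at q2.
    Nfa anyLoopBack(Nfa a, State q2) {
        Nfa result = a.withOnlyAcceptingState(q2);
        State qStar = result.addFreshState();   // fresh initial state q*
        for (Transition t : result.transitionsFrom(q2)) {
            result.addTransition(qStar, t.label(), t.target());
        }
        result.setInitialState(qStar);
        return result;
    }
}
```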

Theorem 4

(Correctness of Algorithm 2). Let \(\mathcal {A}= (Q, \varSigma , \varDelta , q_0, F)\) be an NFA and let \(\mathcal {A}_{att}\) be the result of calling \(\textsc {AttackAutomaton}(\mathcal {A})\). For every \(s \in \mathcal L(\mathcal {A}_{att})\), there exists a rejecting state \(q_r \in Q \setminus F\) s.t. the number of distinct paths \(\pi _i\) from \(q_0\) to \(q_r\) with \(labels(\pi _i) = s\) is super-linear in the number of repetitions of the attack core in s.

6 Dynamic Regular Expression Analysis

Algorithms 1 and 2 allow us to determine whether a given NFA is vulnerable. Even though our static analyses are sound and complete at the NFA level, different regular expression matching algorithms construct NFAs in different ways and use different backtracking search algorithms. Furthermore, some matching algorithms may determinize the NFA (either lazily or eagerly) in order to guarantee linear complexity. Since our analysis does not perform such partial determinization of the NFA for a given regular expression, it can, in practice, generate false positives. In addition, even if a regular expression is indeed vulnerable, the input string must still exceed a certain minimum size to cause denial-of-service.

In order to overcome these challenges in practice, we also perform dynamic analysis to (a) confirm that a regular expression \(\mathcal {E}\) is indeed vulnerable for Java’s matching algorithm, and (b) infer a minimum bound on the size of the input string. Given the original regular expression \(\mathcal {E}\), a user-provided time limit t, and the attack automaton \(\mathcal {A}_{att}\) (computed by static regular expression analysis), our dynamic analysis produces a refined attack automaton as well as a number b such that there exists an input string of length greater than b for which Java’s matching algorithm takes more than t seconds. Note that, as usual, this dynamic analysis sacrifices completeness in exchange for fewer false positives.

In more detail, given an attack automaton of the form \(\mathcal {A}_p \cdot \mathcal {A}_c^+ \cdot \mathcal {A}_s\), the dynamic analysis finds the smallest k where the shortest string \(s \in \mathcal L(\mathcal {A}_p \cdot \mathcal {A}_c^k \cdot \mathcal {A}_s)\) exceeds the time limit t. In practice, this process does not require more than a few iterations because we use the complexity of the NFA to predict the number of repetitions that should be necessary based on previous runs. The minimum required input length b is determined based on the length of the found string s. In addition, the value k is used to refine the attack automaton: in particular, given the original attack automaton \(\mathcal {A}_p \cdot \mathcal {A}_c^+ \cdot \mathcal {A}_s\), the dynamic analysis refines it to be \(\mathcal {A}_p \cdot \mathcal {A}_c^k \cdot \mathcal {A}_c^* \cdot \mathcal {A}_s\).
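A minimal sketch of this search (our illustration, using a simple doubling strategy instead of the complexity-guided prediction described above; the prefix, core, and suffix strings are assumed to be shortest members of the corresponding automata languages, and their extraction is omitted):

```java
import java.util.regex.Pattern;

class DynamicAnalysis {
    // Find a repetition count k for which matching the attack string exceeds
    // the time limit t; the search bound is illustrative.
    static int findMinRepetitions(Pattern regex, String prefix, String core,
                                  String suffix, long timeLimitMillis) {
        for (int k = 1; k <= (1 << 20); k *= 2) {
            String attack = prefix + core.repeat(k) + suffix;
            long start = System.nanoTime();
            regex.matcher(attack).matches();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs > timeLimitMillis) return k;
        }
        return -1;  // no blow-up observed within the search bound
    }
}
```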

7 Static Program Analysis

As explained in Sect. 2, the presence of a vulnerable regular expression does not necessarily mean that the program is vulnerable. In particular, there are three necessary conditions for the program to contain a ReDoS vulnerability: First, a variable x that stores user input must be matched against a vulnerable regular expression \(\mathcal {E}\). Second, it must be possible for x to store an attack string that triggers worst-case behavior for \(\mathcal {E}\); and, third, the length of the string stored in x must exceed the minimum threshold determined using dynamic analysis.

To determine if the program actually contains a ReDoS vulnerability, our approach also performs static analysis of source code. Specifically, our program analysis employs the Cartesian product [7] of the following abstract domains:

  • The taint abstract domain [6, 26] tracks taint information for each variable. In particular, a variable is considered tainted if it may store user input.

  • The automaton abstract domain [12, 33, 34] overapproximates the contents of string variables using finite automata. In particular, if string s is in the language of automaton \(\mathcal {A}\) representing x’s contents, then x may store string s.

  • The interval domain [13] is used to reason about string lengths. Specifically, we introduce a ghost variable \(l_x\) representing the length of string x and use the interval abstract domain to infer upper and lower bounds for each \(l_x\).

Since these abstract domains are fairly standard, we only explain how to use this information to detect ReDoS vulnerabilities. Consider a statement \(\mathrm{match}(x, \mathcal {E})\) that checks if string variable x matches regular expression \(\mathcal {E}\), and suppose that the attack automaton for \(\mathcal {E}\) is \(\mathcal {A}_{att}\). Now, our program analysis considers the statement \(\mathrm{match}(x, \mathcal {E})\) to be vulnerable if the following three conditions hold (see the sketch after the list):

  1. \(\mathcal {E}\) is vulnerable and variable x is tainted;

  2. The intersection of \(\mathcal {A}_{att}\) and the automaton abstraction of x is non-empty;

  3. The upper bound on ghost variable \(l_x\) representing x’s length exceeds the minimum bound b computed using dynamic analysis for \(\mathcal {A}_{att}\) and a user-provided time limit t.
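The following sketch illustrates how the three conditions fit together (our illustration; the abstract-domain objects taintDomain, automatonDomain, and intervalDomain and their query methods are hypothetical placeholders):

```java
// Sketch of the ReDoS check for a statement match(x, E); the abstract-domain
// types and methods below are hypothetical, not Rexploiter's actual API.
class ReDoSCheck {
    TaintDomain taintDomain;          // tracks which variables may hold user input
    AutomatonDomain automatonDomain;  // automaton abstraction of string contents
    IntervalDomain intervalDomain;    // bounds on ghost length variables l_x

    boolean isVulnerable(Var x, Nfa attackAutomaton, long minAttackLength) {
        // (1) E is vulnerable and x may store user input.
        if (attackAutomaton.isEmptyLanguage() || !taintDomain.isTainted(x))
            return false;
        // (2) Some string that x may store is also an attack string.
        Nfa xAbstraction = automatonDomain.abstractionOf(x);
        if (attackAutomaton.intersect(xAbstraction).isEmptyLanguage())
            return false;
        // (3) x's length may exceed the bound b found by dynamic analysis.
        return intervalDomain.upperBound(x) >= minAttackLength;
    }
}
```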

The extended version of this paper [31] offers a more rigorous formalization of the analysis.

8 Experimental Evaluation

To assess the usefulness of the techniques presented in this paper, we performed an evaluation in which our goal is to answer the following questions:

Q1: Do real-world Java web applications use vulnerable regular expressions?

Q2: Can Rexploiter detect ReDoS vulnerabilities in web applications and how serious are these vulnerabilities?

Results for Q1. In order to assess if real-world Java programs contain vulnerabilities, we scraped the top 150 Java web applications (by number of stars) that contain at least one regular expression from GitHub repositories (all projects have between 10 and 2,000 stars and at least 50 commits) and collected a total of 2,864 regular expressions. In this pool of regular expressions, Rexploiter found 37 that have worst-case exponential complexity and 522 that have super-linear (but not exponential) complexity. Thus, we observe that approximately \(20\%\) of the regular expressions in the analyzed programs are vulnerable. We believe this statistic highlights the need for more tools like Rexploiter that can help programmers reason about the complexity of regular expression matching.

Results for Q2. To evaluate the effectiveness of Rexploiter in finding ReDoS vulnerabilities, we used Rexploiter to statically analyze all Java applications that contain at least one vulnerable regular expression. These programs include both web applications and frameworks, and cover a broad range of application domains. The average running time of Rexploiter is approximately 14 min per program, including the time to dynamically analyze regular expressions. The average size of analyzed programs is about 58,000 lines of code.

Our main result is that Rexploiter found exploitable vulnerabilities in 27 applications (including popular projects, such as the Google Web Toolkit and Apache Wicket) and reported a total of 46 warnings. We manually inspected each warning and confirmed that 41 out of the 46 reports are exploitable vulnerabilities, with 5 of the exploitable vulnerabilities involving hyper-vulnerable regular expressions and the rest being super-linear ones. Furthermore, for each of these 41 vulnerabilities (including the super-linear ones), we were able to construct a full, end-to-end exploit that causes the server to hang for more than 10 min.

Fig. 8. Running times for exponential vulnerabilities (left) and super-linear vulnerabilities (right) for different input sizes.

In Fig. 8, we explore a subset of the vulnerabilities uncovered by Rexploiter in more detail. Specifically, Fig. 8 (left) plots input size against running time for the exponential vulnerabilities, and Fig. 8 (right) shows the same information for a subset of the super-linear vulnerabilities.

Possible Fixes. We now briefly discuss some possible ways to fix the vulnerabilities uncovered by Rexploiter. The most direct fix is to rewrite the regular expression so that it no longer exhibits super-linear complexity. Alternatively, the problem can also be fixed by ensuring that the user input cannot contain instances of the attack core. Since our technique provides the full attack automaton, we believe Rexploiter can be helpful for implementing suitable sanitizers, as sketched below. Another possible fix (which typically only works for super-linear regular expressions) is to bound input size. However, for most vulnerabilities found by Rexploiter, the input string can legitimately be very large (e.g., a product review). Hence, there may not be an obvious upper bound, or the bound may still be too large to prevent a ReDoS attack. For example, Amazon imposes an upper bound of 5000 words (\(\sim \)25,000 characters) on product reviews, but matching a super-linear regular expression against a string of that size may still take significant time.
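For instance, a sanitizer derived from the attack automaton could take the following shape (a sketch under the assumption that an Nfa object for the attack automaton is available; its accepts method denotes a linear-time membership test, e.g., via on-the-fly subset construction):

```java
class Sanitizer {
    // Reject inputs that are instances of the attack pattern before they ever
    // reach the vulnerable regular expression. attackAutomaton is assumed to
    // come from the regular expression analysis; accepts() is a hypothetical
    // linear-time membership test, not a backtracking match.
    static String sanitize(String userInput, Nfa attackAutomaton) {
        if (attackAutomaton.accepts(userInput)) {
            throw new IllegalArgumentException("potential ReDoS input rejected");
        }
        return userInput;
    }
}
```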

9 Related Work

To the best of our knowledge, we are the first to present an end-to-end solution for detecting ReDoS vulnerabilities by combining regular expression and program analysis. However, there is prior work on static analysis of regular expressions and, separately, on program analysis for finding security vulnerabilities.

Static Analysis of Regular Expressions. Since vulnerable regular expressions are known to be a significant problem, previous work has studied static analysis techniques for identifying regular expressions with worst-case exponential complexity [9, 18, 22, 24]. Recent work by Weideman et al. [30] has also proposed an analysis for identifying super-linear regular expressions. However, no previous technique can construct attack automata that capture all malicious strings. Since attack automata are crucial for reasoning about sanitization, the algorithms we propose in this paper are necessary for performing sanitization-aware program analysis. Furthermore, we believe that the attack automata produced by our tool can help programmers write suitable sanitizers (especially in cases where the regular expression is difficult to rewrite).

Program Analysis for Vulnerability Detection. There is a large body of work on statically detecting security vulnerabilities in programs. Many of these techniques focus on detecting cross-site scripting (XSS) or code injection vulnerabilities [8, 11, 12, 15, 17, 19, 20, 23, 27–29, 32–35]. There has also been recent work on static detection of specific classes of denial-of-service vulnerabilities. For instance, Chang et al. [10] and Huang et al. [16] statically detect attacker-controlled loop bounds, and Olivo et al. [21] detect so-called second-order DoS vulnerabilities, in which the size of a database query result is controlled by the attacker. However, as far as we know, there is no prior work that uses program analysis for detecting DoS vulnerabilities due to regular expression matching.

Time-Outs to Prevent ReDoS. As mentioned earlier, some libraries (e.g., the .NET framework) allow developers to set a time limit for regular expression matching. While such libraries may help mitigate the problem through a band-aid solution, they do not address the root cause of the problem. For instance, they neither prevent stack overflows nor DoS attacks in which the attacker triggers the regular expression matcher many times.

10 Conclusions and Future Work

We have presented an end-to-end solution for statically detecting regular expression denial-of-service vulnerabilities in programs. Our key idea is to combine complexity analysis of regular expressions with safety analysis of programs. Specifically, our regular expression analysis constructs an attack automaton that recognizes all strings that trigger worst-case super-linear or exponential behavior. The program analysis component takes this information as input and performs a combination of taint and string analysis to determine whether an attack string could be matched against a vulnerable regular expression.

We have used our tool to analyze thousands of regular expressions in the wild and showed that roughly 20% of the regular expressions in the analyzed programs are vulnerable. We also used Rexploiter to analyze Java web applications collected from GitHub repositories and found 41 exploitable security vulnerabilities in 27 applications. Each of these vulnerabilities can be exploited to make the web server unresponsive for more than 10 min.

There are two main directions that we would like to explore in future work: First, we are interested in the problem of automatically repairing vulnerable regular expressions. Since it is often difficult for humans to reason about the complexity of regular expression matching, we believe there is a real need for techniques that can automatically synthesize equivalent regular expressions with linear complexity. Second, we also plan to investigate the problem of automatically generating sanitizers from the attack automata produced by our regular expression analysis.