# A Speculative Parallel DFA Membership Test for Multicore, SIMD and Cloud Computing Environments

## Authors

- Ko, Y., Jung, M., Han, Y. et al.

DOI: 10.1007/s10766-013-0258-5

Cite this article as: Ko, Y., Jung, M., Han, Y. et al. Int J Parallel Prog (2014) 42: 456. doi:10.1007/s10766-013-0258-5

## Abstract

We present techniques to parallelize membership tests for Deterministic Finite Automata (DFAs). Our method searches arbitrary regular expressions by matching multiple bytes in parallel using speculation. We partition the input string into chunks, match chunks in parallel, and combine the matching results. Our parallel matching algorithm exploits structural DFA properties to minimize the speculative overhead. Unlike previous approaches, our speculation is *failure-free*, i.e., (1) sequential semantics are maintained, and (2) speed-downs are avoided altogether. On architectures with a SIMD gather-operation for indexed memory loads, our matching operation is fully vectorized. The proposed load-balancing scheme uses an off-line profiling step to determine the matching capacity of each participating processor. Based on matching capacities, DFA matches are load-balanced on inhomogeneous parallel architectures such as cloud computing environments. We evaluated our speculative DFA membership test for a representative set of benchmarks from the Perl-compatible Regular Expression (PCRE) library and the PROSITE protein database. Evaluation was conducted on a 4 CPU (40 cores) shared-memory node of the Intel Academic Program Manycore Testing Lab (Intel MTL), on the Intel AVX2 SDE simulator for 8-way fully vectorized SIMD execution, and on a 20-node (288 cores) cluster on the Amazon EC2 computing cloud. Obtained speedups are on the order of \(\mathcal O \left( 1+\frac{|P|-1}{|Q|\cdot \gamma }\right) \), where \(|P|\) denotes the number of processors or SIMD units, \(|Q|\) denotes the number of DFA states, and \(0<\gamma \le 1\) represents a statically computed DFA property. For all observed cases, we found that \(0.02<\gamma <0.47\). Actual speedups range from 2.3\(\times \) to 38.8\(\times \) for up to 512 DFA states for PCRE, and between 1.3\(\times \) and 19.9\(\times \) for up to 1,288 DFA states for PROSITE on a 40-core MTL node. 
Speedups on the EC2 computing cloud range from 5.0\(\times \) to 65.8\(\times \) for PCRE, and from 5.0\(\times \) to 138.5\(\times \) for PROSITE. Speedups of our C-based DFA matcher over the Perl-based ScanProsite scan tool range from 559.3\(\times \) to 15079.7\(\times \) on a 40-core MTL node. We show the scalability of our approach for input-sizes of up to 10 GB.

### Keywords

- DFA membership test
- Parallel pattern matching
- Parallel regular expression matching
- Speculative parallelization
- Multicores

## 1 Introduction

Locating a string within a larger text has applications in text editing, compiler front-ends, web browsers, scripting languages, file search (grep), command processors, databases, Internet search engines, computer security, and DNA sequence analysis. Regular expressions allow the specification of a potentially infinite set of strings (or patterns) to search for. A standard technique to perform regular expression matching is to convert a regular expression to a DFA and run the DFA on the input text. DFA-based regular expression matching has robust performance, linear in the size of the input. However, practical DFA implementations are inherently sequential because the matching result for an input character depends on the matching results of the previous characters. Considerable research effort has recently been devoted to DFA matching on parallel architectures [19, 23, 28, 29, 40, 47].

To speed up DFA matching on parallel architectures, we propose to use speculation. With our method, the input string is divided into chunks. Chunks are processed in parallel using sequential DFA matching. For all but the first chunk, the starting state is unknown.

The core contribution of our method is to exploit structural properties of DFAs to bound the set of initial states the DFA may assume at the beginning of each chunk. Each chunk will be matched for its reduced set of possible initial states. By introducing such a limited amount of redundant matching computation for all but the first chunk, our DFA matching algorithm avoids speed-downs altogether (i.e., the speculation is failure-free [30]). To achieve load-balancing, the input string is partitioned non-uniformly according to processor capacity and the work to be performed for each chunk. These properties open up the opportunity for an entirely new class of parallel DFA matching algorithms.

We present the time complexity of our matching algorithms, and we conduct an extensive experimental evaluation on SIMD, shared-memory multicore and cloud computing environments. For experiments, we employ regular expressions from the PCRE Library [34] and from the PROSITE protein pattern database [35]. We show the scalability of our approach for input-sizes of up to 10 GB.

## 2 Background

### 2.1 Finite Automata

Let \(\Sigma \) denote a finite alphabet of characters and \(\Sigma ^*\) denote the set of all strings over \(\Sigma \). Cardinality \(\vert \Sigma \vert \) denotes the number of characters in \(\Sigma \). A language over \(\Sigma \) is any subset of \(\Sigma ^*\). The symbol \(\emptyset \) denotes the empty language and the symbol \(\lambda \) denotes the null string. A finite automaton \(A\) is specified by a tuple \((Q,\Sigma ,\delta ,q_{0},F)\), where \(Q\) is a finite set of states, \(\Sigma \) is an input alphabet, \(\delta : Q\times \Sigma \rightarrow 2^Q\) is a transition function, \(q_{0}\in Q\) is the start state and \(F \subseteq Q\) is a set of final states. We define \(A\) to be a DFA if \(\delta (q, a)\) is a singleton set for every \(q\in Q\) and \(a\in \Sigma \); in this case we regard \(\delta \) as a function \(Q\times \Sigma \rightarrow Q\). Let \(|Q|\) be the number of states in \(Q\). We extend the transition function \(\delta \) to \(\delta ^*\): \(\delta ^*(q, ua)=p\Leftrightarrow \delta ^*(q,u)=q',\,\delta (q',a)=p,\,a\in \Sigma ,\,u\in \Sigma ^*\). We assume that a DFA has a unique error (or sink) state \(q_{e}\).

An input string \(\textit{Str}\) over \(\Sigma \) is accepted by DFA \(A\) if the DFA contains a labeled path from \(q_{0}\) to a final state such that this path reads \(\textit{Str}\). We call this path an accepting path. Then, the language \(L(A)\) of \(A\) is the set of all strings spelled out by accepting paths in \(A\).

The DFA membership test determines whether a string is contained in the language of a DFA. The DFA membership test is conducted by computing \(\delta ^*(q_{0},\textit{Str})\) and checking whether the result is a final state. Algorithm 1 shows the sequential DFA matching algorithm. As a notational convention, we denote the symbol in the \(i\)th position of the input string by \(\textit{Str}[i]\).
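Algorithm 1 itself is not reproduced in this excerpt; the following is a minimal C sketch of the sequential membership test it describes, assuming a dense transition table and a toy three-state DFA size (all names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

enum { NSTATES = 3, NSYMS = 256 };

/* Sequential DFA membership test: start in q0, follow one transition
 * per input symbol, and accept iff the last active state is final. */
static int dfa_accepts(int delta[NSTATES][NSYMS], const int *final,
                       int q0, const char *str)
{
    int q = q0;
    for (size_t i = 0; str[i] != '\0'; i++)
        q = delta[q][(unsigned char)str[i]];
    return final[q];
}
```

This is the loop that the speculative variants below parallelize; its per-character data dependence (`q` feeds the next lookup) is what makes naive parallelization impossible.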

### 2.2 Amazon EC2 Infrastructure

The Amazon Elastic Compute Cloud (EC2) allows users to rent virtual computing nodes on which to run applications. EC2 is very popular among researchers and companies in need of instant and scalable computing power. Amazon EC2 provides resizable compute capacity where users pay on an hourly basis for launched (i.e., up-and-running) virtual nodes. By using virtualized resources, a computing cloud can serve a much broader user base with the same set of physical resources. Amazon EC2 virtual computing nodes are virtual machines running on top of a variant of the Xen hypervisor. To create a virtual machine, EC2 provides machine images which contain a pre-configured operating system plus application software. Users can adapt machine images prior to deployment. The launch of a machine image creates a so-called instance, which is a copy of the machine image executing as a virtual server in the cloud. To provide a unit of measure for the compute capacities of instances, Amazon introduced so-called EC2 Compute Units (CUs), which are claimed to provide the equivalent CPU capacity of a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon processor [3]. Because there exist many such CPU models in the market, the exact processor capacity equivalent to one CU is not entirely clear. Instance types are grouped into nine families, which differ in their processor, I/O, memory and network capacities. Instances are described in [3]; the instances employed in this paper are outlined in Sect. 5. To create a cluster of EC2 instances, the user requests the launch of one or more instances, for which the instance type and the machine image must be specified. The user can specify any machine image that has been registered with Amazon, including Amazon’s or the user’s own images. Once instances are booted, they are accessible as computing nodes via ssh. A maximum of 20 instances can be used concurrently, but this limit may be increased upon user request [4].

## 3 Overview

Because the DFA will initially be in start state \(q_{0}\), the first chunk (\(c_{0}\)) needs to be matched for \(q_{0}\) only. For all subsequent chunks, the DFA state at the beginning of the chunk is initially unknown. Hence, we use speculative computations to match subsequent chunks for all states the DFA may assume. We will discuss in Sect. 4 how the amount of speculative computations can be kept to a minimum. For our motivating example, we assume the DFA to be in either state \(q_{0}\) or \(q_{1}\) at the beginning of chunks \(c_{1}\) and \(c_{2}\). As depicted by the partition from Fig. 3, processor \(p_{0}\) will match chunk \(c_{0}\) for state \(q_{0}\), whereas processors \(p_{1}\) and \(p_{2}\) will match their assigned chunks for both \(q_{0}\) and \(q_{1}\). To match a chunk for a given state, a variation of the matching loop (lines 1–3) of Algorithm 1 is employed.

After processors \(p_{0}\), \(p_{1}\) and \(p_{2}\) have processed their assigned chunks in parallel, the results from the individual chunks need to be combined to derive the overall result of the matching computation. Combining proceeds from the first to the last chunk by propagating the resulting DFA state from the previous chunk as the initial state for the following chunk. According to Fig. 1, the DFA from our motivating example will be in state \(q_{0}\) after matching chunk \(c_{0}\). State \(q_{0}\) is propagated as the initial state for chunk \(c_{1}\). Processor \(p_{1}\) has matched chunk \(c_{1}\) for both possible initial states, i.e., \(q_{0}\) and \(q_{1}\), from which we obtain that state \(q_{0}\) at the beginning of chunk \(c_{1}\) takes the DFA to state \(q_{1}\) at the end of chunk \(c_{1}\). Likewise, the matching result for chunk \(c_{2}\) is now applied to derive state \(q_{1}\) as the final DFA state.

An input partition that accounts for the work imbalance between the initial and all subsequent chunks is depicted in Fig. 4. Because processors \(p_{1}\) and \(p_{2}\) match chunks for two states each, their chunks are only half the size of the chunk assigned to processor \(p_{0}\). All processors now process 6 characters each, resulting in a balanced load and a \(2\times \) speedup over sequential matching.

If a DFA contains only one non-error state with an incoming transition labeled \(x\), then after reading character \(x\) the DFA is known *a priori* to be either the error state or the state with the incoming transition labeled \(x\).

A processor can exploit this structural DFA property by performing a *reverse lookahead* to determine the last character from the previous chunk. From this character the DFA state at the beginning of the current chunk can be derived. In Fig. 5, the reverse lookahead for our motivating example is shown. Reverse lookahead characters are shaded in gray. Character \(a\) is the lookahead character in chunk \(c_{0}\); only DFA state \(q_{0}\) from Fig. 1 has an incoming transition labeled \(a\), thus the DFA must be in state \(q_{0}\) at the beginning of chunk \(c_{1}\). Likewise, the DFA must be in state \(q_{1}\) at the beginning of chunk \(c_{2}\), because state \(q_{1}\) is the only DFA state with an incoming transition labeled \(b\) (the lookahead character of chunk \(c_{1}\)). Note that for these considerations the error state \(q_{e}\) can be ignored, because once a DFA has reached the error state, it will stay there (e.g., see Fig. 1). Thus, to compute the DFA matching result it is unnecessary to process the remaining input characters once the error state has been reached.
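The reverse-lookahead rule can be sketched as a small scan over the transition table: the possible initial states of a chunk are exactly the non-error states targeted by a transition labeled with the lookahead symbol. Sizes and names below are assumptions:

```c
#include <assert.h>

enum { NSTATES = 3, NSYMS = 256, QERR = 2 };

/* Possible initial states of a chunk whose predecessor ends in the
 * reverse lookahead symbol `sigma`: the non-error states targeted by
 * a transition labeled `sigma`.  Returns their count; the states
 * themselves are written to out[]. */
static int initial_states(int delta[NSTATES][NSYMS],
                          unsigned char sigma, int out[NSTATES])
{
    int seen[NSTATES] = {0};
    int n = 0;
    for (int q = 0; q < NSTATES; q++) {
        int t = delta[q][sigma];
        if (t != QERR && !seen[t]) {
            seen[t] = 1;
            out[n++] = t;
        }
    }
    return n;
}
```

The error state is skipped deliberately: as noted above, once the DFA reaches \(q_{e}\) it stays there, so \(q_{e}\) never needs to be matched speculatively.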

Because now all processors have to match only a single state per chunk, the chunks are of equal size. For three processors, we achieve a speedup of 3\(\times \) over sequential matching for the motivating example.

It should be noted that in the general case the structure of DFAs will be less ideal, i.e., there will be more than one state with incoming transitions labeled by a particular input character. Consequently, each chunk will have to be matched for more than one DFA state. We will develop a measure for the suitability of a DFA for this type of speculative parallelization in Sect. 4. Our analysis of the time-complexity of this method shows that for \(|P|>1\), a speedup is achievable in general. This has been confirmed by our experimental evaluations on SIMD, shared-memory multicore, and the Amazon EC2 cloud-computing environments. We will discuss the trade-offs that come with multi-character reverse lookahead, and we will incorporate inhomogeneous compute capacities of processors to resolve load imbalances. This is essential to effectively utilize heterogeneous multicore architectures, and to overcome the performance variability of nodes reported with cloud computing environments [5, 41].

## 4 Speculative DFA Matching

Our speculative DFA matching approach is a general method, which allows a variety of algorithms that differ with respect to the underlying hardware platform and the incorporation of structural DFA properties. We start this section with the formalization of our basic speculative DFA matching example from Sect. 3. We then present our approach to exploit structural DFA properties to speed up parallel, speculative DFA matching. Section 5 contains variants tailored for SIMD, shared memory multicores and cloud computing environments.

### 4.1 The Basic Speculative DFA Matching Algorithm

Our basic speculative DFA matching algorithm consists of four steps:

1. offline profiling to determine the DFA matching capacity of each participating processor,
2. partitioning the input string into chunks such that the utilization of the parallel architecture is maximized,
3. performing the matching process on chunks in parallel such that redundant computations are minimized, and
4. merging partial results across chunks to derive the overall result of the matching computation.

#### 4.1.1 Offline Profiling

For environments with inhomogeneous compute capacities, our offline profiling step determines the DFA matching capacities of all participating processors. This information is required to partition work equally among processors and thus balance the load. With heterogeneous multicore hardware architectures such as the Cell BE [16], offline profiling must be conducted only once to determine the performance of all types of processor cores provided by the architecture. With cloud computing environments such as the Amazon EC2 cloud [3], users only have limited control over the allocation of cloud computing nodes. Moreover, the performance of cloud computing nodes has been found to differ significantly, which is to a large extent attributed to variations in the employed hardware platforms [5, 41]. To compensate for the performance variations between cloud computing nodes, offline profiling will be conducted at cluster startup time. Profiling cluster nodes in parallel takes only on the order of milliseconds, which makes the overhead from profiling negligible compared to the substantial cluster startup times on EC2 (on the order of minutes by our own experience and also reported in [33]).

Computation of chunk sizes for Fig. 6 and three processors of non-uniform processing capacities:

| Processor | \(m_{k}\) | \(w_{k}\) | \(L_{0}\cdot w_{k}\) | Input character range |
|---|---|---|---|---|
| \(p_{0}\) | \(50\) | \(1.5\) | \(28.8\) | \(0\)–\(27\) |
| \(p_{1}\) | \(25\) | \(0.75\) | \(3.6\) | \(28\)–\(31\) |
| \(p_{2}\) | \(25\) | \(0.75\) | \(3.6\) | \(32\)–\(35\) |

#### 4.1.2 Input Partitioning

We observed already with our motivating example from Fig. 3 that partitioning the input into equal-sized chunks will result in load-imbalance: because for the first chunk the initial DFA state is known to be \(q_{0}\), the first chunk needs to be matched only once. All other chunks must be matched for all possible initial states of the chunk, i.e., \(|Q|\) times, in the worst case. In what follows, we will derive a partition of the input *Str* into \(|P|\) chunks, assuming that all except the first chunk need to be matched for \(|Q|\) states. In Sect. 4.2, we will exploit structural DFA properties to reduce the number of states to be matched per chunk.

Intuitively, because processor \(p_{0}\) has to match chunk \(c_{0}\) only once, it can process a larger portion of the input *Str* than the processors assigned to subsequent chunks. (This was observed already in Fig. 4, where chunk sizes were adjusted such that all processors processed the same number of characters from the input). The objective of our optimization is to determine chunk sizes in such a way that the processing times for all chunks are equal. The purpose of the following equations is to compute a partition of the input into chunks \(c_{i}\), \(0\le i <|P|\), where chunk \(c_{i}\) is a sequence of symbols from the input allocated to processor \(p_{i}\).

Consider the lengths of the chunks of the input *Str*, and assume that matching a character from the input takes constant time. Processor \(p_{0}\) matches chunk \(c_{0}\) from starting state \(q_{0}\). All other chunks need to be matched for all possible initial states. To keep work among processors balanced, chunk \(c_{0}\) must be \(|Q|\) times longer than the other chunks, i.e., it must hold that \(|c_{0}| = |Q|\cdot |c_{i}|\) for all \(i>0\).
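For processors of equal capacity this balance condition yields subsequent chunks of length \(|\textit{Str}|/(|Q|+|P|-1)\) and an initial chunk \(|Q|\) times longer. A sketch of this uniform-capacity reading of Eqs. (5)–(7), with hypothetical names:

```c
#include <assert.h>

/* Chunk boundaries for |P| equal-capacity processors: the initial
 * chunk is matched once, every subsequent chunk |Q| times, so c0 is
 * |Q| times longer and every processor matches the same number of
 * symbols.  The last chunk absorbs rounding remainders. */
static void chunk_bounds(long n, int nproc, int nstates,
                         long start[], long end[])
{
    long len = n / (nstates + nproc - 1); /* subsequent chunk length */
    start[0] = 0;
    end[0] = (long)nstates * len;         /* initial chunk */
    for (int i = 1; i < nproc; i++) {
        start[i] = end[i - 1];
        end[i] = (i == nproc - 1) ? n : start[i] + len;
    }
}
```

For the motivating example (\(|\textit{Str}|=12\), \(|P|=3\), \(|Q|=2\)) this reproduces the partition of Fig. 4: a 6-character initial chunk and two 3-character chunks, each matched twice.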

#### 4.1.3 Matching of Chunks

Matching a chunk computes a mapping from the *possible initial states* to the *possible last active states* of the chunk. This mapping is required to store a chunk’s matching results for all possible initial states. After matching chunks in parallel, the computed mappings will be used to derive the overall DFA matching result. Formally, this mapping is defined as a vector \({\mathcal{L }_{i}}\) whose \(k\)th component is the last active state the DFA reaches from the \(k\)th possible initial state of chunk \(c_{i}\).

As an example, we consider chunk \(c_{2}\) from Fig. 3 and the DFA from Fig. 1. Chunk \(c_{2}\) will be matched for the possible initial states \(q_{0}\) and \(q_{1}\), with the resulting last active states \(q_{e}\) and \(q_{1}\) and the result vector \({\mathcal{L }_{2}}=[q_{e},q_{1}]\). The meaning of vector \({\mathcal{L }_{2}}\) is that if the DFA assumes state \(q_{0}\) at the beginning of chunk \(c_{2}\), then it will be in state \(q_{e}\) after matching chunk \(c_{2}\). If the DFA assumes state \(q_{1}\) at the beginning of chunk \(c_{2}\), then it will be in state \(q_{1}\) after matching chunk \(c_{2}\).
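A sketch of how such an \({\mathcal{L }_{}}\)-vector might be computed for one chunk; the early exit on the error state follows the discussion in Sect. 3, and sizes and names are assumptions:

```c
#include <assert.h>

enum { NSTATES = 3, NSYMS = 256, QERR = 2 };

/* Match the chunk str[lo..hi) once for each possible initial state
 * init[0..ninit), storing the resulting last active state in the
 * chunk's L-vector.  Matching a speculation stops early once the
 * error (sink) state is reached. */
static void match_chunk(int delta[NSTATES][NSYMS], const char *str,
                        long lo, long hi, const int init[], int ninit,
                        int lvec[])
{
    for (int k = 0; k < ninit; k++) {
        int q = init[k];
        for (long i = lo; i < hi && q != QERR; i++)
            q = delta[q][(unsigned char)str[i]];
        lvec[k] = q;
    }
}
```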

#### 4.1.4 Merging of Partial Results

After all chunks have been matched in parallel, the per-chunk \({\mathcal{L }_{}}\)-vectors are merged from the first to the last chunk to obtain the overall matching result for the input *Str*.

The work in [19] does not provide an evaluation of the relative merits of sequential versus parallel merging of \({\mathcal{L }_{}}\)-vectors. In particular, the details of the employed parallel reduction algorithm are not specified. We conducted experiments on a 40-core shared memory node of the Intel MTL using a binary tree for the parallel reduction and found that the computation associated with merging \({\mathcal{L }_{}}\)-vectors is not large enough to justify the overhead of a parallel reduction. In particular, the synchronization required between each of the \(\mathcal O (\log _2(|P|))\) reduction steps is costly.

Moreover, the overhead becomes significant once inter-node communication costs are introduced, as in cloud computing environments. We describe our findings on the overheads of intra-node and inter-node communication on the EC2 computing cloud in detail in Sect. 5. Section 5 also introduces a new \({\mathcal{L }_{}}\)-vector merging technique to cope with this overhead on cloud computers.

In short, we applied the sequential merging from Eq. (8) with shared-memory multicore architectures and a new hierarchical merging technique for cloud computing architectures, which will be explained in Sect. 5.
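The sequential merging step can be sketched as follows; this is a simplified reading of Eq. (8), and the array layout and names are assumptions:

```c
#include <assert.h>

enum { NSTATES = 3, QERR = 2 };

/* Sequential merge of per-chunk L-vectors: propagate the DFA state
 * from the first to the last chunk.  init[i] lists the possible
 * initial states of chunk i, and lvec[i][k] is the last active state
 * reached from init[i][k].  By failure-freedom the propagated state
 * is always among the speculated initial states. */
static int merge_lvectors(int nchunks, const int ninit[],
                          int init[][NSTATES], int lvec[][NSTATES],
                          int q0)
{
    int q = q0;
    for (int i = 0; i < nchunks; i++) {
        int k = 0;
        while (k < ninit[i] && init[i][k] != q)
            k++;
        q = (k < ninit[i]) ? lvec[i][k] : QERR;
    }
    return q;
}
```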

### 4.2 Optimizations Based on Structural DFA Properties

The amount of work associated with a given chunk is determined by (1) the length of the chunk, and (2) the number of DFA states for which the chunk needs to be matched. In the following, we will distinguish between the initial chunk \(c_{0}\), and *subsequent* chunks \(c_{i}\), \(i>0\). Before matching the initial chunk \(c_{0}\), the DFA will be in the starting state \(q_{0}\), thus chunk \(c_{0}\) only needs to be matched for \(q_{0}\). Prior to the matching of subsequent chunks, the DFA may assume any state in the general case, thus subsequent chunks need to be matched \(|Q|\) times (see, e.g., the motivating example in Fig. 3). In this section we will exploit structural properties of DFAs to deduce a potentially smaller number \({\mathcal{I }_{\text {max}}}\le |Q|\) of states which is the upper bound of initial states for all subsequent chunks.

The best case, i.e., \({\mathcal{I }_{\text {max}}}=1\), has already been observed with our motivating example DFA from Fig. 1. For each character \(\sigma \in \Sigma \) of this DFA, it holds that there is only one state targeted by a transition labeled \(\sigma \). Irrespective of the particular input character \(\sigma \), the DFA can only assume a single state after matching character \(\sigma \). (As mentioned previously, for these considerations we may safely disregard the error state \(q_{e}\), because from the error state no other state is reachable; thus, a DFA that reached the error state will stay there.) If there is only one possible DFA state after matching an input character, it follows that the DFA can only be in one state after matching the last character prior to each subsequent chunk. Thus the DFA can only be in one possible state at the beginning of each subsequent chunk, and we have \({\mathcal{I }_{\text {max}}}=1\).

In the general case, values for \({\mathcal{I }_{\text {max}}}\) can range between \(1\) and \(|Q|\). In the remainder of this section, we will investigate how to deduce this \({\mathcal{I }_{\text {max}}}\) value for a particular DFA, and how this information can be incorporated with our speculative DFA matching algorithm. We will consider real-world DFAs from PCRE and PROSITE to find that for all considered DFAs it holds that \({\mathcal{I }_{\text {max}}}<|Q|\), and that this property can be used to improve DFA matching performance. We have already observed with the input partition in Fig. 5 that reducing the number of initial states of subsequent chunks enables us to increase the sizes of subsequent chunks. Larger subsequent chunks will reduce the size of the initial chunk \(c_{0}\) in turn. Because we adjust chunk sizes such that all chunks will be processed in the same amount of time, reducing the size of the initial chunk \(c_{0}\) will reduce the overall execution time of the matching process. The overarching reason for this performance improvement is that the reduction of potential initial states reduces the total number of symbols that have to be matched per chunk.

We call the last character of the chunk preceding the current chunk the *reverse lookahead symbol*. The number of states to be matched for a reverse lookahead symbol \(\sigma \in \Sigma \) is a static property of a DFA. It will range between \(1\) and \(|Q|\). The maximum number of states to be matched over any reverse lookahead symbol constitutes an upper bound on \({\mathcal{I }_{\text {max}}}\), i.e., an upper bound on the number of states to be matched for any subsequent chunk. Because \({\mathcal{I }_{\text {max}}}\) is a static DFA property, we can use it to partition the input into chunks according to Eq. (10). At run-time, a processor will use the reverse lookahead symbol to determine the initial states to be matched for its assigned chunk.
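Computing \({\mathcal{I }_{\text {max}}}\) from the transition table might look as follows; a sketch under the same toy-DFA sizes as before, with hypothetical names:

```c
#include <assert.h>

enum { NSTATES = 3, NSYMS = 256, QERR = 2 };

/* I_max: over all reverse lookahead symbols sigma, the maximum
 * number of distinct non-error states targeted by a transition
 * labeled sigma -- an upper bound on the possible initial states of
 * any subsequent chunk, computable statically from the table. */
static int compute_imax(int delta[NSTATES][NSYMS])
{
    int imax = 0;
    for (int c = 0; c < NSYMS; c++) {
        int seen[NSTATES] = {0};
        int n = 0;
        for (int q = 0; q < NSTATES; q++) {
            int t = delta[q][c];
            if (t != QERR && !seen[t]) {
                seen[t] = 1;
                n++;
            }
        }
        if (n > imax)
            imax = n;
    }
    return imax;
}
```

The scan is \(\mathcal O (|Q|\cdot |\Sigma |)\), which is why it can also be done online per matching run without noticeable overhead, as noted below.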

Algorithm 3 applies initial state sets with the DFA matching procedure. Lines 1–7 compute initial state sets \({\mathcal{I }_{}}_\sigma \) from Eq. (11) and \({\mathcal{I }_{\text {max}}}\) from Eq. (12). Unlike Algorithm 2, the partitioning is now based on the maximum number of possible initial states, \({\mathcal{I }_{\text {max}}}\), instead of \(|Q|\). The \({{\mathrm{StartPos}}}\) and \({{\mathrm{EndPos}}}\) functions that compute the start and end position of each chunk now receive \({\mathcal{I }_{\text {max}}}\) as the second argument (lines 11–12 in Algorithm 3). We updated Eqs. (6) and (7) to include an additional parameter to pass \({\mathcal{I }_{\text {max}}}\). In Eqs. (5)–(7), instead of \(|Q|\) we then use the provided argument value to partition the input string and to compute the start and end position of each chunk.

Because the maximum number of initial states \({\mathcal{I }_{\text {max}}}\) is a static property of a DFA, it can be computed off-line. The overhead to compute \({\mathcal{I }_{\text {max}}}\) can thus be avoided with DFAs that are matched multiple times. For example, with protein patterns maintained in databases, corresponding DFAs can be expected to be matched on several DNA sequences. However, with all our experiments, we computed \({\mathcal{I }_{\text {max}}}\) online for every matching run (as stated in Algorithm 3), to account for the general case where a DFA is matched only once.

Another possible optimization of Algorithm 3 concerns the distribution of cardinalities of initial state sets \({\mathcal{I }_{}}_\sigma \). If the maximum value \({\mathcal{I }_{\text {max}}}\) is significantly larger than the average, then it is desirable to divide the input at boundaries with reverse lookahead symbols that have a small initial state set. This would further decrease the number of possible initial states of subsequent chunks. However, searching the input for the occurrence of particular characters constitutes an effort similar to the matching process itself. Moreover, relying on statistical properties of the input string (i.e., the occurrence of particular characters in the input) may violate the failure-freedom of our speculation: if a reverse lookahead symbol with a small initial state set cannot be found, then additional states need to be matched, resulting in a possible speed-down. In contrast, by considering \({\mathcal{I }_{\text {max}}}\) states, our optimization always shows equal or better performance than the non-optimized matching procedure that has to match all states in \(Q\).

### 4.3 Multiple Reverse Lookahead Symbols

The following lemma establishes that when increasing the amount of reverse lookahead symbols, the maximum number of possible initial states \({\mathcal{I }_{\text {max}}}_{,r}\) of a DFA is bounded above by \({\mathcal{I }_{\text {max}}}\).

**Lemma 1**

Given a DFA, it holds that \({\mathcal{I }_{\text {max}}}={\mathcal{I }_{\text {max}}}_{,1}\ge {\mathcal{I }_{\text {max}}}_{,2}\ge \ldots \ge {\mathcal{I }_{\text {max}}}_{,\omega }\), where \(\omega \) denotes the length of the longest accepting path through the DFA.

*Proof*

By contradiction. Without loss of generality we assume a DFA with exactly one of its transitions labeled by a symbol \(\sigma \in \Sigma \), and state \(q\) being the target state of this transition. For this DFA, \(|{\mathcal{I }_{\sigma }}|=1\). Given another symbol \(\sigma '\in \Sigma \), we assume that \(|{\mathcal{I }_{\sigma '\sigma }}|=2\). Then by the definition of \({\mathcal{I }_{}}_{\!\sigma '\sigma }\) in Eq. (13), this DFA must have two distinct states that are the target of a path labeled by a string with postfix \(\sigma '\sigma \). However, this implies that these two target states have an incoming transition labeled \(\sigma \), which contradicts our initial assumption that \(|{\mathcal{I }_{\sigma }}|=1\). Thus for any two symbols \(\sigma \) and \(\sigma '\), it holds that \(|{\mathcal{I }_{\sigma }}|\ge |{\mathcal{I }_{\sigma '\sigma }}|\). The extension to the general case \(|{\mathcal{I }_{\sigma _1\ldots \sigma _k}}|\ge |{\mathcal{I }_{\sigma '\sigma _1\ldots \sigma _{k}}}|\) is straightforward and the lemma follows. \(\square \)

### 4.4 Time Complexity

## 5 Implementation

Hardware specifications:

| Name | CPU model | CPUs | Cores/CPU | Clock freq. | Note |
|---|---|---|---|---|---|
| Intel MTL | Intel Xeon E7-4860 | 4 | 10 | 2.27 GHz | n/a |
| SDE emulator on local server | AVX2/Haswell on Intel Xeon E5405 host | n/a | n/a | n/a | n/a |
| Amazon EC2 (m2.4\(\times \)large) | Intel Xeon X5550 | 2 | 4 | 2.67 GHz | 26 EC2 CUs |
| Amazon EC2 (cc2.8\(\times \)large) | Intel Xeon E5-2670 Sandy Bridge | 2 | 8 | 2.60 GHz | 88 EC2 CUs |

We tailored our DFA data-structures to maximize performance and to utilize the AVX2 instruction set, in particular the novel AVX2 32-bit gather operations. To generate minimal DFAs from regular expressions, we use Grail+ [15, 37], which is a formal language toolset for the manipulation and application of regular expressions and automata. Our DFA matching framework reads DFAs and input strings in Grail+ format and converts them to our framework’s internal representation.

Listing 1 shows how a chunk is matched for one possible initial state on multicore architectures. It should be noted that by encoding the transition table’s DFA states as offsets relative to the SBase base address, the 2-dimensional table lookups of conventional DFA representations are simplified to a 1-dimensional lookup that avoids the row-times-column multiplication of 2-dimensional arrays: with our representation, we only add the current state’s offset to the current input symbol (line 8 of Listing 1). We employ pointers to access the input and to detect loop termination, thereby avoiding a separate loop counter variable. When compiled to x86-64, this matching loop consists of only two add operations, one comparison, one indexed load and one conditional jump, which compares favorably to Grail+’s matching loop implemented in C++, which requires more than an order of magnitude more instructions for the same purpose. We used a variant of Listing 1 for sequential DFA matching. This sequential matching routine served as an efficient yardstick for the comparison with our parallelized matching algorithms, because we found it to be more efficient than Grail+, and the closest approach from the related work, i.e., [19], incurred slowdowns over sequential matching (see Fig. 11 and Sect. 7).
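Listing 1 is not reproduced in this excerpt; a sketch in its spirit, with states encoded as row offsets into the flattened table as described above (sizes and names are assumptions):

```c
#include <assert.h>

enum { NSTATES = 3, NSYMS = 256 };

/* Offset-encoded matching loop: states are stored as row offsets
 * (state * NSYMS) into the flattened table SBase, so one step is a
 * single 1-D indexed load plus an addition, and the input pointer
 * doubles as the loop counter. */
static int match_offsets(const int *SBase, int s0,
                         const unsigned char *p, const unsigned char *end)
{
    int s = s0;              /* current state as a row offset */
    while (p != end)
        s = SBase[s + *p++];
    return s / NSYMS;        /* decode offset back to a state id */
}
```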

### 5.1 Vectorized DFA Matching Using AVX2 Instruction Set Extensions

Listing 2 shows our core matching loop with eightfold vectorization employing AVX2 vector instruction intrinsics [21]. Data type __m256i represents an 8-way vector containing 8 32-bit int variables. Variables States and InpIdx contain the indices into the state transition table SBase and the input array IBase. They are initialized to precomputed starting-positions of chunks in lines 5 and 6. We use the _mm256_i32gather_epi32 intrinsic to perform vectorized, indexed loads from the SBase and IBase arrays. For example, in line 8, 8 input characters are loaded from IBase. Note that the offsets in vector InpIdx are scaled by a factor of 4 (the third argument of the intrinsic), to account for the 32-bit size of type int. For further details on the used intrinsics, we refer to [21]. We count the loop index variable down instead of up because the decrement instruction already sets the x86 CPU’s sign flag when the index crosses zero. This saves a cmp instruction, which yields an additional 12 % performance improvement. Neither GCC nor Intel’s ICC managed to generate optimal assembly code from Listing 2, which required us to use inline assembly instead. Auto-vectorization of sequential DFA matching is out of reach for compilers because of the dependencies between the current and the next DFA state.
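Listing 2 is likewise not reproduced; the following sketch illustrates the gather-based 8-way matching step with the intrinsics named above. The offset-encoded states, the int-sized input encoding, and all identifiers are assumptions, and real code would also handle per-lane chunk lengths:

```c
#include <assert.h>
#include <immintrin.h>

/* 8-way vectorized matching step in the spirit of Listing 2: one
 * _mm256_i32gather_epi32 loads 8 input symbols, a second gathers the
 * 8 next states (stored as row offsets); the loop counts down. */
__attribute__((target("avx2")))
static void match8(const int *SBase, const int *IBase,
                   int States[8], const int InpIdx[8], long len)
{
    __m256i st = _mm256_loadu_si256((const __m256i *)States);
    __m256i ix = _mm256_loadu_si256((const __m256i *)InpIdx);
    const __m256i one = _mm256_set1_epi32(1);
    for (long i = len; i > 0; i--) {
        __m256i sym = _mm256_i32gather_epi32(IBase, ix, 4); /* 8 symbols */
        __m256i off = _mm256_add_epi32(st, sym);            /* row + symbol */
        st = _mm256_i32gather_epi32(SBase, off, 4);         /* 8 next states */
        ix = _mm256_add_epi32(ix, one);
    }
    _mm256_storeu_si256((__m256i *)States, st);
}
```

The scale argument of 4 converts the 32-bit element indices into byte offsets, matching the description above.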

### 5.2 DFA Matching on Cloud Computing Architectures

To account for message delays and their variation on EC2, we devised a variant of parallel reduction that is hierarchical with respect to intra-node and inter-node communication. This 2-tier merging approach is based on the observation that intra-node messages showed substantially lower transfer times and variations than inter-node communication. Our reduction proceeds in two steps, as depicted in Fig. 9. In the first step, \({\mathcal{L }_{}}\)-vectors are merged locally by a designated node leader. In the second step, node leaders send their \({\mathcal{L }_{}}\)-vectors to the master process, which combines them to compute the overall matching result. This 2-step merging scheme requires that on each EC2 node, DFA-matching worker processes are allocated to adjacent chunks. Our worker-to-node allocation scheme is parameterized by the number of cores to utilize per node, denoted by \(|C|\). For reasons explained below, we leave one core unallocated per EC2 node. Figure 9 depicts the computation of \({\mathcal{L }_{}}\)-vectors by workers (for one chunk), node leaders (the combined map over all chunks matched on a node) and the master (the overall map from the first to the last chunk). Unallocated cores are denoted by the symbol "\(\circ \)".
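The merge performed by node leaders and by the master can be sketched as the composition of state maps. This is a minimal C sketch under the assumption that an \({\mathcal{L }_{}}\)-vector is represented as an array mapping each possible initial state to the state reached after matching a chunk; the paper's vectors additionally carry matching information:

```c
#include <assert.h>

/* Compose two L-vectors: left maps a start state to the state reached
 * after chunk i, right does the same for chunk i+1. The merged vector
 * maps a start state across both chunks. Merging is associative, so
 * node leaders can combine their local chunks first, and the master
 * then combines the per-node results. */
enum { NQ = 4 }; /* illustrative DFA state count */

static void merge_lvectors(int out[NQ], const int left[NQ], const int right[NQ])
{
    for (int q = 0; q < NQ; q++)
        out[q] = right[left[q]]; /* run chunk i, then chunk i+1 */
}
```

Associativity of this composition is what permits the intra-node/inter-node split: the result is the same regardless of how adjacent chunks are grouped.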

Our two-tier merging scheme outperformed parallel binary reduction and sequential merging even for the largest EC2 clusters (i.e., up to 20 nodes, the maximum possible EC2 cluster size [4]). We found MPI messages among processes on the same node to show both low latency and low variability. We conjecture that MPICH2 applies shared-memory message-passing optimizations similar to [24] for node-local communication. Moreover, node-local communication is free from the delay variations induced by the network that connects nodes. With our merging scheme, the only communication step subject to EC2's message variability is therefore the merging step conducted by the master. This compares favorably to any parallel reduction scheme with more than one reduction step involving inter-node communication, because each such step may suffer from message delays caused by the underlying network.

As mentioned above, we deliberately left one core per EC2 node unallocated. We observed that when all cores of a node were allocated, there was a high probability that one of the workers on each node would exhibit matching performance an order of magnitude lower than the workers on the remaining cores. This performance degradation did not affect the offline profiling step, for which we took the median of a series of partial matching runs. It did, however, show up at random during DFA matching. Because we could not reproduce this problem on a local cluster of Linux computers, we attribute it to EC2 hypervisor activity that occasionally preempted the execution of one arbitrary worker thread per node. Leaving one core unallocated on EC2 eliminated the problem. Given the increasing number of cores per CPU, leaving one core unallocated is an increasingly small sacrifice (e.g., our experiments were conducted with EC2 nodes providing 8 and 16 cores, respectively).

## 6 Experimental Results

Figure 10 shows the results of our speculative parallel DFA membership test with and without four symbol reverse lookahead, for the PROSITE and PCRE benchmark suites conducted on the Intel MTL. We used GCC 4.5.1 on RedHat RHEL 5.4 (x86_64 kernel version 2.6.18-164.el5). The \(x\)-axes denote the number of states \(|Q|\), and the \(y\)-axes denote the speedup over sequential matching. We note the following observations: (1) Our algorithms always outperform sequential matching, despite the overhead from redundant computations incurred by speculative parallelization. Redundant computations consist of matching subsequent chunks for multiple DFA states, in contrast to sequential DFA matching, where the input is only matched for the start state \(q_{0}\). (The red horizontal lines denote the break-even point where the speedup over sequential matching is 1.) The absence of speed-downs validates the failure-freedom of our speculative parallelization. (2) Speedups are always proportional to \(|P|\), as predicted by the complexity analysis in Sect. 4.4. This confirms our basic assumption that the number of symbols to be processed per processor determines the overall matching time, despite the parallelization overhead. The performance improvements due to our \({\mathcal{I }_{\text {max}}}\) optimization are shown in Fig. 10b, d.

Another experiment conducted on the MTL is the comparison to ScanProsite [14, 39], the reference implementation from the PROSITE protein database. ScanProsite is used to detect signature matches in protein sequences. The tool is implemented in Perl; it can be used to find all substrings that match a certain PROSITE pattern. We parameterized ScanProsite to find only one match, to allow comparison with our optimized DFA matching algorithm, which determines whether an input string contains a certain pattern or not. For a second comparison, we employed the UNIX grep utility. Grep constructs a DFA and uses the Boyer-Moore algorithm for matching [17]; it is faster than Perl, which uses backtracking [13]. As shown in Fig. 12, our algorithm using four symbol reverse lookahead is 559.3–15079.7 times faster than ScanProsite, and 62.1–23572.0 times faster than the UNIX grep utility.

### 6.1 Performance of Vectorized DFA Matching Using AVX2 Instruction Set Extensions

The AVX2 experiments of this section were conducted on the Intel Software Development Emulator (SDE), which emulates AVX2 instructions rather than executing them on AVX2 hardware. *Speedup* throughout Sect. 6.1 thus denotes a ratio of executed machine instructions, as opposed to observed execution time. Performance on real hardware, e.g., Intel's Haswell microarchitecture, must be expected to vary to the extent of variations in the cycles per instruction (CPI) between scalar and vectorized code. For compilation of code with AVX2 intrinsics, we used ICC version 12.1.4.

Figure 13 compares the speedups of scalar and vectorized DFA membership tests applying the \({\mathcal{I }_{\text {max}}}\) optimizations for one symbol lookahead. Eightfold vectorization using AVX2 instructions achieved a 4.45\(\times \) improvement over scalar code. Furthermore, we observed that the expected speedups on an emulated 8-core machine with AVX2 are of the same order of magnitude as on a 40-core node of the MTL. The speedups range from 1.2\(\times \) to 35.7\(\times \) for PCRE and from 0.7\(\times \) to 13.2\(\times \) for PROSITE. Speedup is again proportional to \(|P|\), showing that vectorization is in line with our complexity analysis from Sect. 4.4. We observed a 16.0 % speed-down on average (maximum 31.5 %) with very large DFAs, due to the overhead of our parallelization for SIMD operations. This speed-down is not innate to the algorithm, but due to our implementation, in particular the way chunks are allocated to SIMD vector units. It can be overcome by increasing the problem size (which we refrained from doing, to keep experiments consistent).

### 6.2 DFA Matching Performance on Cloud Computing Architectures

We conducted experiments on the Amazon EC2 elastic computing cloud to determine the performance of our speculative DFA matching algorithms on distributed-memory architectures, employing up to 20 nodes and 288 cores. We explored the adaptation of our load-balancing approach to EC2 nodes of varying processing capacities. For the convenience of operating a cluster of EC2 nodes, we used StarCluster [44] version 0.93.3, an open source cluster-computing toolkit for EC2.

Effectiveness of the load-balancing scheme on six configurations of inhomogeneous clusters consisting of two types of Amazon EC2 instances, m2.4xlarge and cc2.8xlarge

| Fast | Slow | PROSITE Min. | PROSITE Avg. | PROSITE Max. | PCRE Min. | PCRE Avg. | PCRE Max. |
|---|---|---|---|---|---|---|---|
| 0 | 5 | 0.0036 | 0.0102 | 0.0298 | 0.0046 | 0.0149 | 0.0696 |
| 1 | 4 | 0.0031 | 0.0086 | 0.0360 | 0.0036 | 0.0108 | 0.0355 |
| 2 | 3 | 0.0033 | 0.0090 | 0.0275 | 0.0062 | 0.0121 | 0.0427 |
| 3 | 2 | 0.0051 | 0.0116 | 0.0248 | 0.0083 | 0.0186 | 0.0707 |
| 4 | 1 | 0.0060 | 0.0130 | 0.0700 | 0.0093 | 0.0194 | 0.0707 |
| 5 | 0 | 0.0056 | 0.0119 | 0.0305 | 0.0095 | 0.0188 | 0.0412 |

### 6.3 Performance Impact of Structural DFA Properties and Scalability to Large Input Sizes

We investigated the sizes of possible initial state sets for the PCRE and PROSITE benchmark suites for 1, 2, 3 and 4 reverse lookahead symbols. Figure 16 depicts the number of states \(|Q|\) and the initial state reduction rates for 299 PCRE benchmark DFAs and 110 PROSITE protein patterns. (For DFAs with the same number of states, the possible initial state set sizes were averaged). For example, the rightmost, largest DFA in Fig. 16b consists of \(|Q|\)= 1,288 states. One-symbol reverse lookahead eliminates 65 % of \(|Q|\). Two-symbol, three-symbol and four-symbol lookahead remove 83, 94 and 97 % of all states. It follows from Fig. 16 that DFAs exhibit variations in their initial state reduction rate, which has an impact on matching performance. For example, the PCRE DFA with 43 states (Fig. 16a) shows \({\mathcal{I }_{\text {max}}}\) reduction rates below 31 %. The resulting impact on performance can be observed in Fig. 10c with the Intel MTL node and in Fig. 14c for the EC2 cloud.

Average size of \({\mathcal{I }_{\text {max}}}_{,r}\) compared to \(|Q|\), for \(r\) reverse lookahead symbols

| \(r\) | 0 (%) | 1 (%) | 2 (%) | 3 (%) | 4 (%) |
|---|---|---|---|---|---|
| PCRE | 100 | 33.7 | 26.4 | 23.7 | 21.7 |
| PROSITE | 100 | 47.2 | 29.2 | 20.5 | 16.0 |

Because of the exponential time complexity of computing \({\mathcal{I }_{\text {max}}}_{,r}\), there is a trade-off between the overhead of the reverse lookahead computation and the obtainable performance gains. To quantify this overhead, we investigated the cost of reverse lookahead computations on an Intel Xeon 5120 CPU. Figure 17a shows the overhead in microseconds to compute \({\mathcal{I }_{\text {max}}}_{,r}\) for an example DFA of \(|Q|=5\) for up to three reverse lookahead characters. As expected, the overhead grows as \(|\Sigma |^{r}\). Figure 17b depicts the overhead for increasing numbers of states. Because \({\mathcal{I }_{\text {max}}}_{,r}\) is a static property of a DFA, it can be computed off-line and loaded when the matching operation is performed. This way the overhead can be avoided for DFAs that are matched many times (e.g., protein patterns from databases).
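One way such a precomputation can be organized is sketched below. This is illustrative C code with a hypothetical 3-state example DFA, not the paper's implementation; it enumerates all \(|\Sigma |^{r}\) lookahead words and reports the largest possible initial state set over them, which makes the exponential cost explicit:

```c
#include <assert.h>

/* Sketch of the reverse-lookahead precomputation: for every word w of
 * length r, compute the image { delta*(q, w) : q in Q }, i.e., the set
 * of states the DFA can be in after reading w from any state. The
 * smaller this set, the less speculation a chunk needs. Enumerating
 * all |Sigma|^r words is the source of the exponential cost. */
enum { NQ = 3, NSYM = 2 };

/* Hypothetical example DFA: delta[state][symbol]. */
static const int delta[NQ][NSYM] = { {1, 0}, {2, 0}, {2, 1} };

/* Largest image size over all words of length r. */
static int max_image_size(int r)
{
    int nwords = 1, max = 0;
    for (int i = 0; i < r; i++)
        nwords *= NSYM;                 /* |Sigma|^r candidate words */
    for (int w = 0; w < nwords; w++) {
        int seen[NQ] = {0}, count = 0;
        for (int q = 0; q < NQ; q++) {
            int s = q, c = w;           /* decode word w symbol by symbol */
            for (int i = 0; i < r; i++) {
                s = delta[s][c % NSYM];
                c /= NSYM;
            }
            if (!seen[s]) { seen[s] = 1; count++; }
        }
        if (count > max)
            max = count;
    }
    return max;
}
```

For this example DFA, even a single lookahead symbol shrinks the possible initial state set from all of \(Q\) to at most two states, mirroring the reduction rates reported in the table above.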

Figure 19a depicts the performance for input sizes of 10 MB, 100 MB, 1 GB and 10 GB for the PROSITE patterns executed on EC2 with 288 cores. Unlike on shared-memory architectures, our DFA matching algorithm achieves higher performance for long (10 GB) inputs. This is due to the communication costs, which depend on the DFA size (for transmitting \({\mathcal{L }_{}}\)-vectors in the final state reduction) but are independent of the input size. Larger inputs incur longer overall chunk matching times, which de-emphasize the high (but constant) communication costs for a given DFA. Figure 19b depicts the proportional communication costs measured for varying input sizes. For 1 and 10 GB input sizes, the proportion of communication overhead with respect to the overall execution time is close to zero. With input sizes of 100 MB and especially 10 MB, however, the time spent on communication constitutes a large part of the overall execution time. As shown in Fig. 19a, our DFA matching approach scaled well for input sizes of up to 10 GB on the EC2 computing cloud.

For the sake of completeness, we state the execution time costs for computing DFAs from PROSITE protein patterns. We used Grail+ to create nondeterministic finite automata (NFAs) from regular expressions, convert NFAs to DFAs, and minimize DFAs. We did not apply parallel versions of the algorithms for DFA creation and minimization [12, 45]. On average, it took 8 min 21.5 s to convert a PROSITE pattern to a minimal DFA, and 4 min 35.8 s to convert a pattern to a non-minimal DFA. Our speculative DFA matching approach is applicable to both minimal and non-minimal DFAs.

## 7 Related Work

Locating a string in a larger text has applications with text editing, compiler front-ends and web browsers, Internet search engines, computer security, and DNA sequence analysis. Early string searching algorithms such as Aho–Corasick [2], Boyer–Moore [9] and Rabin–Karp [25] efficiently match a finite set of input strings against an input text.

Regular expressions allow the specification of infinite sets of input strings. Converting a regular expression to a DFA for DFA membership tests is a standard technique to perform regular expression matching. The specification of virus signatures in intrusion prevention systems [10, 38, 43] and the specification of DNA sequences [8, 42] constitute recent applications of regular expression matching with DFAs.

Considerable research effort has been spent on parallel algorithms for DFA membership tests. Ladner et al. [26] applied parallel prefix computation to DFA membership tests with Mealy machines. Hillis and Steele [18] applied parallel prefix computations to DFA membership tests on the 65,536-processor Connection Machine. Ravikumar's survey [36] shows how DFA membership tests can be stated as a chained product of matrices. Because of the underlying parallel prefix computation, all three approaches perform a DFA membership test on input of size \(n\) in \(\mathcal O (\log (n))\) steps, requiring \(n\) processors. These algorithms handle arbitrary regular expressions, but the underlying assumption of a massive number of available processors can hardly be met in most practical settings. Misra [31] derived another \(\mathcal O (\log (n))\) string matching algorithm; the number of required processors is on the order of the product of the two string lengths and hence impractical.

A straightforward way to exploit parallelism with DFA membership tests is to run a single DFA on multiple input streams in parallel, or to run multiple DFAs in parallel. This approach has been taken by Scarpazza et al. [40] with a DFA-based string matching system for network security on the IBM Cell BE processor. Similarly, Wang et al. [47] investigated parallel architectures for packet inspection based on DFAs. Both approaches assume multiple input streams and a vast number of patterns (i.e., virus signatures), which is common with network security applications. However, neither approach parallelizes the DFA membership algorithm itself, which is required to improve applications with single, long-running membership tests such as DNA sequence analysis.

Scarpazza et al. [40] utilize the SIMD units of the Cell BE’s synergistic processing units to match multiple input streams in parallel. However, their vectorized DFA matching algorithm contains several SISD instructions and the reported speedup from 16-way vectorization is only a factor of 2.51. In contrast, our proposed 8-way vectorized DFA membership test avoids SISD instructions, achieving a speedup of 4.45 over the sequential version.

Recent research efforts focused on speculative computations to parallelize DFA membership tests. Holub and Štekr [19] were the first to split the input string into chunks and distribute chunks among available processors. Their speculation introduces a substantial amount of redundant computation, which restricts the obtainable speedup for general DFAs to \(\mathcal O (\frac{|P|}{|Q|})\), where \(|P|\) is the number of processors, and \(|Q|\) is the number of DFA states. Their algorithm degenerates to a speed-down when \(|Q|\) exceeds the number of processors (see also Sect. 6, Fig. 11). To overcome this problem, Holub and Štekr specialized their algorithm for \(k\)-local DFAs. A DFA is \(k\)-local if for every word \(w\) of length \(k\) and for all states \(p,q\in Q\) it holds that \(\delta ^*(p, w)=\delta ^*(q,w)\). Starting the matching operation \(k\) symbols ahead of a given chunk will synchronize the DFA into the correct initial state by the time matching reaches the beginning of the chunk, which eliminates all speculative computation. Holub and Štekr achieve a linear speedup of \(\mathcal O (|P|)\) for \(k\)-local automata. Unlike Holub and Štekr's approach, our DFA parallelization avoids speed-downs altogether. We use structural properties of general DFAs to limit the amount of speculation. In particular, the restriction to \(k\)-local automata is not required. We have vectorized our speculative matching routine, and we have extensively evaluated our approach on a 40-core shared memory architecture, for AVX2 vector instructions, and on the Amazon EC2 cloud infrastructure.

Jones et al. [23] reported that with the IE 8 and Firefox web browsers 3–40 % of the execution time is spent parsing HTML documents. To speed up browsing, Jones et al. employ speculation to parallelize token detection (lexing) of HTML language front-ends. Similar to Holub and Štekr's \(k\)-local automata, they use the preceding \(k\) characters of a chunk to synchronize a DFA to a particular state. Unlike \(k\)-locality, which is a static DFA property, Jones et al. speculate the DFA to be in a particular, frequently occurring DFA state at the beginning of a chunk. Speculation fails if the DFA turns out to be in a different state, in which case the chunk needs to be re-matched. Lexing HTML documents results in frequent matches, and the structure of the regular expressions is reported to be simpler than, e.g., virus signatures [29]. Speculation is facilitated by the fact that the state at the beginning of a token is always the same, regardless of where lexing started. A prototype implementation is reported to scale up to six of the eight synergistic processing units of the Cell BE.

The speculative parallel pattern matching (SPPM) approach by Luchaup et al. [28, 29] uses speculation to match the increasing network line-speeds faced by intrusion prevention systems. SPPM DFAs represent virus signatures. Like Jones et al., DFAs are speculated to be in a particular, frequently occurring DFA state at the beginning of a chunk. SPPM starts the speculative matching at the beginning of each chunk. With every input character, a speculative matching process stores the encountered DFA state for subsequent reference. Speculation fails if the DFA turns out to be in a different state at the beginning of a speculatively matched chunk. In this case re-matching continues until the DFA synchronizes with the saved history state (in the worst case, the whole chunk needs to be re-matched). A single-threaded SPPM version is proposed to improve performance by issuing multiple independent memory accesses in parallel. Such pipelining (or interleaving) of DFA matches is orthogonal to our approach, which focuses on latency rather than throughput.

SPPM assumes all regular expressions to be suffix-closed, which is the common scenario with intrusion prevention systems: a regular expression is suffix-closed if matching a given string \(x\) implies that \(x\) followed by any suffix is matched, too. Formally, a suffix-closed regular language \(L\) satisfies \(x\in L\Rightarrow \forall w\in \Sigma ^*:xw\in L\).

Unlike SPPM and the approach by Jones et al., our speculative DFA matching approach does not rely on a heavily biased distribution of DFA state frequencies. Instead, we use static DFA properties to minimize speculative matching overhead. Our approach is not restricted to suffix-closed regular expressions, and our speculation does not rely on the common case being a match (Jones et al.), or the common case being a non-match (SPPM). To the best of our knowledge, we are the first to employ SIMD gather-operations to fully vectorize the DFA matching process. Our DFA membership test provides a load-balancing mechanism for clusters and cloud computing environments. Unlike previous approaches, our speculative matching algorithm cannot result in a speed-down. We conducted an extensive experimental evaluation on a 40-core shared memory architecture, on a simulator for AVX2 vector instructions, and on the Amazon EC2 cloud infrastructure. Our benchmarks consist of 299 regular expressions from the PCRE library [34], and of 110 patterns from the PROSITE protein pattern database [42]. We analyzed the complexity of our speculative matching algorithm, and we provide insight into the achievable scalability on shared-memory and cloud-computing environments. This paper is the extended, journal version of an informal one-page abstract presented at the 4th annual meeting of the Asian Association for Algorithms and Computation [6], and a preliminary technical report [7].

## 8 Conclusions

We have presented a speculative DFA pattern matching method for shared-memory, SIMD and cloud computing environments. Our parallel matching algorithm exploits structural DFA properties to minimize the speculative overhead. To the best of our knowledge, this is the first speculative DFA matching approach that is *failure-free*, i.e., (1) it maintains sequential semantics, and (2) it avoids speed-downs altogether. On architectures with a SIMD gather-operation for indexed memory loads, our matching operation is fully vectorized. We provide communication patterns tailored to the characteristics of cloud computing environments. The proposed load-balancing scheme uses an off-line profiling step to determine the matching capacity of each participating processor. Based on matching capacities, DFA matches are load-balanced on inhomogeneous parallel architectures. We have shown that our algorithms have a better time complexity than previous work. We conducted an extensive experimental evaluation of PCRE and PROSITE benchmarks on a 4 CPU (40 cores) shared-memory node of the Intel Academic Program Manycore Testing Lab (Intel MTL), on the Intel AVX2 SDE simulator for 8-way fully vectorized SIMD execution, and on a 20-node (288 cores) cluster of the Amazon EC2 computing cloud. We showed the scalability of our approach for DFAs of up to 1,288 states and input strings of up to 10 GB. Our results predict that speculative parallel DFA matching can produce substantial speedups. Unlike previous methods, our technique does not impose any restriction on the matched regular expressions.

## Acknowledgments

Research partially supported by the National Research Foundation of Korea (NRF) grants funded by the Korean government (MEST) (Grant No. 2010-0005234, 2012R1A1A2044562 and 2012K2A1A9054713), through the Global Ph.D. Fellowship Program 2011 of the NRF (Grant No. 2010-0008582), and by the Intel Academic Program Manycore Testing Lab.